Migrate a MySQL Database Preserving Special Characters 98
TomSlick writes "Michael Chu's blog provides a good solution for people migrating their MySQL databases and finding that special characters (like smart quotes) get mangled. He presents two practical solutions to migrating the database properly."
Unicode integration woes (Score:5, Interesting)
That bit me one time when one of my live servers crashed and I had to load the data on the backup onto a different server. I remember wondering to myself, when was the good old days when a database was a dumb (smart, depending your POV) engine that only worries about a string of bytes (chars). Seriously, it only should become smarter and start talking in unicode only when I want it to.
Issues with using unicode is not just limited to MySQL, as Python have similar issues. However they are almost always caused by poor programmers who mixes usage between the two, or not doing type checking on the proper type (basestring).
mysqldump is too think (Score:3, Interesting)
Re:How is this news? (Score:3, Interesting)
As well as a chance of posting an arcane method of database transition involving MySQL to start an ACID war.
As well as on the original subject of the article - the best way to migrate an application is to load all of the data from one datasource and dump it into another datasource. If the application fails this trivial test its database access libraries are broken. If the app sticks strictly to dynamic SQL, high level DBI functions and does no manual escaping - it just works. The escaping portion of the SQL libs take care of it and ensure it is mapped correctly both ways. If the app tries to escape by hand, sticks data into teh SQL statement itself, etc - it fails. Same for utf/latin transitions and the like.
Re:Useful? This is damned awesome! (Score:3, Interesting)
Re:What's with the sudden influx of gnubies? (Score:3, Interesting)
Small teapot in Big China. UCS-2 slices and dices. (Score:4, Interesting)
My one criticism of Joel was passing himself the "get out of jail free" card. Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. This is a fair disclaimer, but it makes it impossible to judge where Joel was simplifying deliberately and where he simplified because he didn't know any better. The correction would be for Joel to state "I'm going to simplify issues X, Y, and Zed". Then mistakes in the middle of the alphabet would be entirely his own. Just as there is no such thing as a string without a coding system, there is no such thing as a useful disclaimer that doesn't specify precisely what it disclaims. It amused me to see Joel invoke the ASCII standard of accountability.
Concerning the claim that Joel has made the same mistake [over again], this same claim comes up all the time concerning address arithmetic. How much existing code is portable to a 128 bit address size? We're sure to need this by 2050. Or perhaps not. People tend to neglect the observation that we're talking about a doubly exponential progression in codespace: (2^2^3)^2^N, with the values N=0,1,2,3,4 plausible in photolithographic solid state. On the current progression, for N=5 transistors would need to be subatomic. As far as the present transition from 32 bits to 64 bits of address space, it makes sense that operating systems and file systems are 64-bit native, while 99% of user space applications continue to run in less time and space compiled for 32 bits. Among the growing sliver of applications that do run better in 64-bits are a few applications of especially high importance.
I worked extensively with CJK languages in the early 1990s, and my opinion at the time was that UCS-4 primarily catered to the SETI crowd, and potentially, belligerent Mandarins in mainland China. I recall more argument at the time about Korean, which is a syllabic script masquerading as ideographic blocks.
http://en.wikipedia.org/wiki/Hangul [wikipedia.org]
I've always had a lot of trouble understanding the opposition to Han unification. Many distinctions in the origins of the English language were lost in the adoption of ASCII, such as the ae ligature and the old-English thorn (which causes many Hollywood sets to feature "Ye old saloon").
http://en.wikipedia.org/wiki/Han_unification [wikipedia.org]
http://en.wikipedia.org/wiki/Thorn_(letter) [wikipedia.org]
And all this fits quite nicely in UCS-2 as advocated by Joel.
Re:Big Trouble in Little China. Don't use UCS-2. (Score:3, Interesting)
That's no different from "you need a bunch of code to take the incoming text and convert it to UCS-2 (or UTF-8, or UTF-16, or UCS-4)".
it's possible to create legal combinations which have no single unichar replacement.
Are they meaningful as well as legal, or ar they like the "n with an umlaut" in "Spinal Tap"?
I'm of the opinion that there should be *no* precomposed characters, or they should *all* be precomposed.
In any case, you can always use a guaranteed unused code and use a lookup table on input and output.