Slashdot Log In
Migrate a MySQL Database Preserving Special Characters
Posted by
kdawson
on Mon May 07, 2007 12:30 AM
from the encoding-juju dept.
from the encoding-juju dept.
TomSlick writes "Michael Chu's blog provides a good solution for people migrating their MySQL databases and finding that special characters (like smart quotes) get mangled. He presents two practical solutions to migrating the database properly."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Full
Abbreviated
Hidden
Loading... please wait.
Migration (Score:4, Informative)
Re: (Score:2, Informative)
Useful? This is damned awesome! (Score:2)
Re:Useful? This is damned awesome! (Score:5, Informative)
http://www.joelonsoftware.com/articles/Unicode.ht
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Parent
Re: (Score:2)
But some of them Just Don't Get It [tm]. "We're Americans, we don't use fancy foreign letters, so we just want ASCII." Sigh.
--
The best cure for seasickness is to go sit under a tree. [Spike Milligan]
Joel is often awesome. (Score:2)
Re: (Score:3, Interesting)
Re: (Score:2)
Big Trouble in Little China. Don't use UCS-2. (Score:5, Informative)
In other words, Joel has made the same mistake as the people who wrote software that only works in 7-bit ASCII or 8-bit UTF-8 or the IBM or Apple or Adobe 8-bit extended ASCII sets or the 9-bit extended ASCII set that ITS used, or...
And it's already too late to try and cram everything into 2 bytes. After the Han Unification mess (the attempt to force Chinese and Japanese and everything else that used some variant of Chinese ideograms (Kanji, etc...) into a common subset of similar characters that fit in the 65535 available codes in the BMP) the People Republic of China decided to require their computers to support their national encoding anyway. As of 2000.
So you have to support the full UCS encoding anyway.
There's three storage formats that it's practical to use: UCS-4 (4 bytes per character, with the same byte-ordering problems as UCS-2), UTF-16 (2-4 bytes per character, same as UCS-2 for the BMP) or UTF-8 (1-4 bytes per character). Internally: you can use UCS-4 as your wide character type, and translate on the fly; use UTF-8 and use care to avoid breaking strings in the middle of glyphs or use UTF-16 and translate on the fly and use care to avoid breaking strings in the middle of glyphs.
If Joel is lucky the libraries he's using are actually operating on UTF-16 strings instead of UCS-2 strings. If he's *really* lucky they're designed to avoid breaking up codes outside the BMP. If he's *super* lucky he's managed to avoid creating any code that just operates on strings as a stream of wchar_t anywhere.
Personally, I think that UTF-16 gets you the worst of both worlds: your data is almost certainly less compact than if you use UTF-8; you still have to deal with multi-wchar_t strings so your code is no easier to write than if you used UTF-8... you're just less likely to find bugs in testing; and you get byte order issues in files just like you would with UCS-4. Unless you think UCS-2 is "good enough" and you just ignore everything outside the BMP and discover that people in China are suddenly getting hash when they use your program.
Parent
Re: (Score:2)
Re: (Score:3, Informative)
First, you need to be religious about it. But if you are, then the choice of internal encoding is really a performance issue only, and the choice of external encoding is a matter of following the principle of least astonishment. Your code shouldn't know nor care what encoding the string APIs use internally. The program should work the same whether wchar_t is (unsigned char), (unsigned short), (unsigned long), or even (double).
Second, there's a lot of overhead in
Small teapot in Big China. UCS-2 slices and dices. (Score:4, Interesting)
My one criticism of Joel was passing himself the "get out of jail free" card. Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. This is a fair disclaimer, but it makes it impossible to judge where Joel was simplifying deliberately and where he simplified because he didn't know any better. The correction would be for Joel to state "I'm going to simplify issues X, Y, and Zed". Then mistakes in the middle of the alphabet would be entirely his own. Just as there is no such thing as a string without a coding system, there is no such thing as a useful disclaimer that doesn't specify precisely what it disclaims. It amused me to see Joel invoke the ASCII standard of accountability.
Concerning the claim that Joel has made the same mistake [over again], this same claim comes up all the time concerning address arithmetic. How much existing code is portable to a 128 bit address size? We're sure to need this by 2050. Or perhaps not. People tend to neglect the observation that we're talking about a doubly exponential progression in codespace: (2^2^3)^2^N, with the values N=0,1,2,3,4 plausible in photolithographic solid state. On the current progression, for N=5 transistors would need to be subatomic. As far as the present transition from 32 bits to 64 bits of address space, it makes sense that operating systems and file systems are 64-bit native, while 99% of user space applications continue to run in less time and space compiled for 32 bits. Among the growing sliver of applications that do run better in 64-bits are a few applications of especially high importance.
I worked extensively with CJK languages in the early 1990s, and my opinion at the time was that UCS-4 primarily catered to the SETI crowd, and potentially, belligerent Mandarins in mainland China. I recall more argument at the time about Korean, which is a syllabic script masquerading as ideographic blocks.
http://en.wikipedia.org/wiki/Hangul [wikipedia.org]
I've always had a lot of trouble understanding the opposition to Han unification. Many distinctions in the origins of the English language were lost in the adoption of ASCII, such as the ae ligature and the old-English thorn (which causes many Hollywood sets to feature "Ye old saloon").
http://en.wikipedia.org/wiki/Han_unification [wikipedia.org]
http://en.wikipedia.org/wiki/Thorn_(letter) [wikipedia.org]
And all this fits quite nicely in UCS-2 as advocated by Joel.
Parent
Re:Small teapot in Big China. UCS-2 slices and dic (Score:2)
What he should have said is "use a wide character library that supports Unicode-2, no matter what it uses internally, and make sure your code still works if they change the encoding behind your back".
I don't know why you're going on about "I have a tough time accepting the premise that the British Commonwealth was well served by ASCII". I didn't say that anyone was well served by
Re: (Score:2)
Re: (Score:2, Informative)
Re: (Score:3, Interesting)
That's no different from "you need a bunch of code to take the incoming text and convert it to UCS-2 (or UTF-8, or UTF-16, or UCS-4)".
it's possible to create legal combinations which have no single unichar replacement.
Are they meaningful as well as legal, or ar they like the "n with an umlaut" in "Spinal Tap"?
I'm of the opinion that there should be *no* precom
Re: (Score:2)
Pffft Easy... (Score:5, Funny)
Then you create these tables in the new one. Just so there are no problems with data types you should probably just make every field varchar(100) in the new one.
Then you fire up MS Access, the older the better. I try to stick to Access 95.
Then you create two ODBC links, one to your old one and one to the new one.
Then you use the linked table manager to link each table to ms access.
Then you open both the new table and the old table and select all, copy and paste the data into the new table.
It's so simple even a child could do it!
Re: (Score:3, Funny)
that only a child can do it!
Re:Pffft Easy... (Score:4, Funny)
Parent
Re: (Score:2)
Unicode integration woes (Score:5, Interesting)
That bit me one time when one of my live servers crashed and I had to load the data on the backup onto a different server. I remember wondering to myself, when was the good old days when a database was a dumb (smart, depending your POV) engine that only worries about a string of bytes (chars). Seriously, it only should become smarter and start talking in unicode only when I want it to.
Issues with using unicode is not just limited to MySQL, as Python have similar issues. However they are almost always caused by poor programmers who mixes usage between the two, or not doing type checking on the proper type (basestring).
Re: (Score:2)
Of course, since Dreamhost refuses to install a half-decent database on their servers, I'm stuck using it. Does anybody know how to install Firebird on a Dreamhost account and make it work? Is it even possible?
Re: (Score:2)
Databases use data types for the same reason data types are used in programming languages.
Relational databases offer a lot more as well that I won't go into. But if you don't care about any of those things and just want to store bytes, there are plenty of ways to do that, and there have been for a long time (files are the most obvious example).
Re: (Score:2)
As for my dumb/smart comment, I only want my database to be smart at what it was supposed to do, and only do what I want it to do, not guessing what I might want to do also and mess u
Re: (Score:2)
Fair enough.
not guessing what I might want to do also and mess up the output and resulting database dump.
That's the result of a very bad implementation of "smart"
I think correct use of character encodings and locales are the way to go with software, but only if done correctly. And you certainly can't count on MySQL AB to do something correctly.
mysqldump is too think (Score:3, Interesting)
Re: (Score:2)
mysqldump --opt --default-character-set=latin1 database
Something like this might be more interesting
mysqldump --opt --default-character-set=latin1 database1 | sed "s/SET NAMES latin1/SET NAMES utf8/" | mysql database2
And then simply switch the application to use database2 instead of database1 (or rename the databases)
What's with the sudden influx of gnubies? (Score:5, Insightful)
If you do insert unicode data into a latin1 table, you will get unexpected results.
What you do is make sure that your:
a) database(s) are set to utf8 by default
b) table(s) are set to utf8 by default
c) column(s) are set to utf8 by default
d) connection defaults to utf8
(provided, of course, that it's utf8 you're after)
That way, it'll "Just Work"(tm)
We've gone through upgrades from 3.23 -> 4.0 -> 4.1 -> 5.0 and never had a problem; and yes, our tables were all latin1 from the beginning.
Re: (Score:3, Insightful)
If you do insert unicode data into a latin1 table, you will get unexpected results.
Re: (Score:3, Interesting)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Or does that not work in MySQL?
Awesome! (Score:2)
Unbelievable.
Tagged slashdigg (Score:2)
Re: (Score:2)
"Excel & the power of CSV files"
"Bottleneck your CPU: How to gain MASSIVE performance improvements in any computer: defrag.exe"
"Make your data manageable: separating content from style the CSS way!!"
"FREE SOFTWARE! The best 10...er...20...er...100...free (and open source) software products"
Re: (Score:3, Informative)
Re: (Score:2)
Worked beautifully going from 4.1 to 5.0.
Different arch's too. Intel 32bit to AMD64. I wouldnt do the reverse though.
Re: (Score:2)
Re: (Score:2)
I'd always use mysqldump for smaller datasets though.
Re:How is this news? (Score:5, Funny)
Parent
Re: (Score:3, Interesting)
As well as a chance of posting an arcane method of database transition involving MySQL to start an ACID war.
As well as on the original subject of the article - the best way to migrate an application is to load all of the data from one datasource and dump it into another datasource. If the application fails this trivial test its database access libraries are broken. If the app sticks strictly to dynamic SQL, high level DBI functions and does no manual escaping - it just works. The escaping portion of th
Re: (Score:2)
Does anyone actually use PostgreSQL? I mean, I know it's the defacto database that we wave around when we want to bash MySQL, but that doesn't mean anyone uses it. I've yet to run into anyone who used PostgreSQL except as a rapid-prototype for an Oracle environment. Anyone have data points here? Does anyone know the rough sizes of the user bases? Are we really just waving PostgreSQL like a fl
Re: (Score:2)
Re: (Score:2)
Re:smart quotes? more like stupid fucked up quotes (Score:2)
Re: (Score:3, Funny)
Re: (Score:2)