Migrate a MySQL Database Preserving Special Characters

TomSlick writes "Michael Chu's blog provides a good solution for people migrating their MySQL databases and finding that special characters (like smart quotes) get mangled. He presents two practical solutions for migrating the database properly."
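
The linked writeup isn't reproduced here, but the usual cause of mangled smart quotes is a double-encoding round trip: UTF-8 bytes read back as latin1 (or vice versa) somewhere between dump and restore. Below is a minimal Python sketch of that failure and its inverse repair; it is not necessarily the blog's method, and the sample string is made up:

    original = "It\u2019s a \u201csmart quote\u201d test"  # curly quotes, non-ASCII

    # What a bad migration does: UTF-8 bytes get re-read as latin1.
    # (MySQL's "latin1" is really cp1252, but Python's latin-1 codec
    # decodes any byte, which keeps this sketch safe for arbitrary input.)
    mangled = original.encode("utf-8").decode("latin-1")

    # The repair is the inverse round trip.
    repaired = mangled.encode("latin-1").decode("utf-8")
    assert repaired == original
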
  • Migration (Score:4, Informative)

    by dfetter ( 2035 ) <david@fetter.org> on Monday May 07, 2007 @01:38AM (#19017095) Homepage Journal
    Better still, install DBI-Link http://pgfoundry.org/projects/dbi-link/ [pgfoundry.org] inside PostgreSQL, migrate once and have done ;)
  • Re:how about ... (Score:3, Informative)

    by Majik Sheff ( 930627 ) on Monday May 07, 2007 @02:11AM (#19017265) Journal
    It usually works as long as you're staying on the same architecture. I successfully pulled this off when a client's DB server died horribly with no functional backups in sight. I salvaged the vast majority of the binary tables and dropped them into a fresh install of MySQL. After the migration of the binaries I renamed the tables to *_bak and told MySQL to dump the contents into freshly created data files. Then I set my client on the task of assessing damage to the data. I would only recommend this tactic if you're doing crash recovery on a borked system, as there are risks. P.S. Have you hugged your backups today?
  • by jamshid ( 140925 ) on Monday May 07, 2007 @02:38AM (#19017415)
    Then send the wordpress developers this link:

    http://www.joelonsoftware.com/articles/Unicode.html [joelonsoftware.com]
    The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  • UCS-2 only covers plane zero (the Basic Multilingual Plane, or BMP). It doesn't cover code points outside that. Unicode actually supports the entire UCS: 17 planes, 1,114,112 code points, with more of them assigned in every version.
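
    To put numbers on that, a quick sketch (Python is used here and in the sketches below purely as an illustrative stand-in; the thread itself is about C and wchar_t):

        # The Unicode code space: 17 planes of 65,536 code points each.
        print(17 * 0x10000)              # 1114112, i.e. U+0000..U+10FFFF

        # UCS-2 stops at the BMP; anything above U+FFFF is out of reach.
        g_clef = "\U0001D11E"            # MUSICAL SYMBOL G CLEF, plane 1
        print(ord(g_clef) > 0xFFFF)      # True: not representable in UCS-2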

    In other words, Joel has made the same mistake as the people who wrote software that only works with 7-bit ASCII, or that treats UTF-8 as just another 8-bit character set, or with the IBM or Apple or Adobe 8-bit extended-ASCII sets, or the 9-bit extended ASCII that ITS used, or...

    And it's already too late to try to cram everything into 2 bytes. After the Han unification mess (the attempt to force Chinese, Japanese, and everything else that used some variant of Chinese ideograms (Kanji, etc.) into a common subset of similar characters that would fit in the 65,536 codes of the BMP), the People's Republic of China decided to require its computers to support the national encoding, GB18030, anyway. As of 2000.
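
    A sketch of why that mandate forces full-UCS support: GB18030 (assuming that is the national encoding meant, which the 2000 date matches) is itself a full Unicode transformation format and round-trips code points outside the BMP:

        g_clef = "\U0001D11E"               # a plane-1 character
        encoded = g_clef.encode("gb18030")
        print(len(encoded))                 # 4: GB18030's four-byte form
        assert encoded.decode("gb18030") == g_clef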

    So you have to support the full UCS encoding anyway.

    There are three storage formats it's practical to use: UCS-4 (4 bytes per character, with the same byte-ordering problems as UCS-2), UTF-16 (2-4 bytes per character, identical to UCS-2 within the BMP), or UTF-8 (1-4 bytes per character). Internally, you can use UCS-4 as your wide character type and translate on the fly; use UTF-8, taking care not to break strings in the middle of a multi-byte character; or use UTF-16, translating on the fly and taking the same care not to split a surrogate pair.
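
    The per-character widths and the byte-order issue, concretely (same illustrative Python as above):

        ch = "\U0001D11E"                    # one plane-1 code point

        print(len(ch.encode("utf-32-le")))   # 4: UCS-4/UTF-32 is fixed width
        print(len(ch.encode("utf-16-le")))   # 4: a surrogate pair outside the BMP
        print(len("A".encode("utf-16-le")))  # 2: same as UCS-2 inside the BMP
        print(len(ch.encode("utf-8")))       # 4: UTF-8 runs 1-4 bytes

        # The byte-ordering problem for the wide formats:
        print("A".encode("utf-16-le"))       # b'A\x00'
        print("A".encode("utf-16-be"))       # b'\x00A'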

    If Joel is lucky the libraries he's using are actually operating on UTF-16 strings instead of UCS-2 strings. If he's *really* lucky they're designed to avoid breaking up codes outside the BMP. If he's *super* lucky he's managed to avoid creating any code that just operates on strings as a stream of wchar_t anywhere.
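
    The "breaking up codes outside the BMP" failure looks like this in miniature (hypothetical truncation code, not anyone's real library):

        text = "G clef: \U0001D11E"
        units = text.encode("utf-16-le")     # two bytes per UTF-16 code unit

        # "Truncate to 9 characters", done naively at the code-unit level,
        # slices between the two halves of the surrogate pair:
        try:
            units[: 9 * 2].decode("utf-16-le")
        except UnicodeDecodeError as e:
            print("broke a surrogate pair:", e)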

    Personally, I think UTF-16 gets you the worst of both worlds: your data is almost certainly less compact than with UTF-8; you still have to deal with characters that span multiple wchar_t units, so your code is no easier to write than with UTF-8 (you're just less likely to find the bugs in testing); and you get byte-order issues in files just as you would with UCS-4. Unless, that is, you decide UCS-2 is "good enough", ignore everything outside the BMP, and discover that people in China are suddenly getting hash when they use your program.
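
    The compactness claim is easy to check (illustrative figures; the ratios vary with the mix of scripts):

        ascii_text = "plain ASCII text" * 100
        cjk_text = "\u6f22\u5b57" * 100              # BMP ideographs

        print(len(ascii_text.encode("utf-8")),       # 1600
              len(ascii_text.encode("utf-16-le")))   # 3200: UTF-8 wins 2:1
        print(len(cjk_text.encode("utf-8")),         # 600
              len(cjk_text.encode("utf-16-le")))     # 400: UTF-16 wins 3:2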
  • Re:Migration (Score:2, Informative)

    by alexandreracine ( 859693 ) <alexandreracine@gmail.com> on Monday May 07, 2007 @12:24PM (#19022421) Homepage Journal
    Yeah, but he's talking about MySQL, not pg. Use my solution: http://gallery.menalto.com/node/61073#comment-223808 [menalto.com]
  • Assuming he's ONLY using Windows string APIs.

    First, you need to be religious about it. But if you are, then the choice of internal encoding is really a performance issue only, and the choice of external encoding is a matter of following the principle of least astonishment. Your code shouldn't know nor care what encoding the string APIs use internally. The program should work the same whether wchar_t is (unsigned char), (unsigned short), (unsigned long), or even (double).
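
    "Being religious about it" is easier to show than to define; here is a minimal sketch of the discipline (illustrative Python again, with the runtime's string type standing in for whatever canned library you use):

        # Bytes are decoded exactly once at the input boundary, handled
        # internally as an opaque string type, and encoded exactly once on
        # the way out. Nothing in between knows the internal encoding.
        def read_input(raw: bytes) -> str:
            return raw.decode("utf-8")       # boundary: bytes -> text

        def business_logic(text: str) -> str:
            return text.upper()              # characters, never bytes

        def write_output(text: str) -> bytes:
            return text.encode("utf-8")      # boundary: text -> bytes

        out = write_output(business_logic(read_input(b"caf\xc3\xa9")))
        assert out.decode("utf-8") == "CAFÉ"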

    Second, there's a lot of overhead in canned string code. Depending on the problem space, that may be OK: for short strings, simple operations, or cases where there's significant per-token overhead anyway, the cost of dipping into the API for each character isn't significant. When performance matters, though, even inlined code can slow the critical path down by orders of magnitude. Getting decent performance requires that the parser be character-set aware.
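
    A toy measurement of that per-character overhead (the numbers are machine-dependent; only the gap matters):

        import timeit

        text = "some moderately long string " * 1000

        per_char = lambda: "".join(c.upper() for c in text)  # dip in per character
        bulk = lambda: text.upper()                          # one bulk pass

        print(timeit.timeit(per_char, number=100))
        print(timeit.timeit(bulk, number=100))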

    So the advice should be:

    1. "Use a canned string handling library that's unicode-aware". That means using at least the ISO C wide character libraries and NOT depending on implementation details like whether they're UCS-2 or UTF-16 or UCS-4 or UTF-8 or for god's sake UTF-1 internally.
    2. "Only export data in UTF-8".
    3. "When performance matters, figure out what's most efficient for your problem space, and use that." And if you're not sure, performance probably doesn't matter as much as you think it does. And don't forget that cache matters and branches cost dozens of instructions.
  • Why can't you maintain characters in composed form internally?
    Because the set of composite characters that have a composed form differs from Unicode version to Unicode version.
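
    For concreteness, composed vs. decomposed forms and the per-version normalization data (illustrative Python one last time):

        import unicodedata

        composed = "\u00e9"      # é as a single code point
        decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

        print(composed == decomposed)                                # False
        print(unicodedata.normalize("NFC", decomposed) == composed)  # True
        print(unicodedata.normalize("NFD", composed) == decomposed)  # True

        # Which characters have a composed form is defined per Unicode
        # version; this reports the version this runtime implements:
        print(unicodedata.unidata_version)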
