Software

Unicode Consortium Releases Unicode 8.0.0

An anonymous reader writes: The newest version of the Unicode standard adds 7,716 new characters to the existing 21,499 – that's more than 35% growth! Most of them are Chinese, Japanese, and Korean ideographs, but among the changes Unicode also adds support for new languages like Ik, used in Uganda.
This discussion has been archived. No new comments can be posted.
  • I thought (Score:5, Funny)

    by rossdee ( 243626 ) on Saturday June 20, 2015 @12:41AM (#49950565)

    That Slashdot didn't support Unicode

    • That Slashdot didn't support Unicode

      You thought right: Slashdot does not support Unicode. This story is just news for nerds, reported by accident, as stuff that matters to G[r]eeks only!

      Note: I now continue my comment with a very interesting paragraph, but it is in Greek, so you cannot read it, not even if you want to translate it:

    • Re: (Score:2, Informative)

      by Anonymous Coward

      > That Slashdot didn't support Unicode

      However, Soylent News has had full Unicode support since last year.
      Here is a recent thread with lots of Greek. [soylentnews.org]

    • Re:I thought (Score:5, Funny)

      by hcs_$reboot ( 1536101 ) on Saturday June 20, 2015 @01:52AM (#49950689)
      Slashdot supports Unicode / UTF-8 from 0x20 to 0x7F.
    • Re:I thought (Score:5, Interesting)

      by tlhIngan ( 30335 ) <slashdot AT worf DOT net> on Saturday June 20, 2015 @03:46AM (#49950917)

      That Slashdot didn't support Unicode

      It does. It's actually fully Unicode-compliant internally. It's just that the input side, and more recently (as of a couple of years ago) the output side, passes everything through a Unicode whitelist.

      You see, a Unicode codepoint is not necessarily a character. It can be a character modifier, so you can be handling a string containing multiple codepoints that resolves to only one character on screen. Some of these are right-to-left overrides (which alter the flow of text on the screen, so you can write a string and the display agent will reverse it). Other modifiers add flourishes, and Unicode 8 adds "skin tone modifiers" for emoji as well. As in: by default, if you display a face, the font should use a "non-human shading" (Apple chose a Simpsons-like yellow, Microsoft a pale zombie-ish hue), but a skin-tone/diversity modifier, combined with the emoji codepoint, gives you a variety of skin tones.
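
      To make the "several codepoints, one on-screen character" point concrete, here is a minimal Python sketch (it assumes a Python build whose Unicode database is at least version 8.0, e.g. Python 3.5+, and a renderer that can draw the combined emoji):

          import unicodedata

          base = "\U0001F44B"   # WAVING HAND SIGN
          tone = "\U0001F3FD"   # EMOJI MODIFIER FITZPATRICK TYPE-4, new in Unicode 8.0

          waving = base + tone
          print(len(waving))    # 2 -- two codepoints in the string...
          print(waving)         # ...but a capable renderer draws a single hand
          for ch in waving:
              print(hex(ord(ch)), unicodedata.name(ch, "UNKNOWN"))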

      And it's also what screwed up iOS: the string you send is full of modifiers, which makes it extremely hard to decide where to break the line. (Arabic has lots of them, because a character can appear differently based on the characters before and after it.)

      And what does this have to do with /.? Easy: a lot of commenters abused the modifiers to screw with the website, and unless you know how to handle Unicode, it's really hard to properly reset the parser state. /. used to display the screwed-up comments; if you Google for the oddball string n"5:erocS" you can still find them (because Google ignores modifiers). If you wonder, that's the string "Score:5", which commenters used to fake-moderate their posts. But since /. strips Unicode on display now, you get to see the messed-up post as it was typed out.
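
      The right-to-left override trick is easy to reproduce. A small Python sketch (the spoofed string only looks reversed in a bidi-aware viewer, such as a browser or a modern terminal):

          # U+202E RIGHT-TO-LEFT OVERRIDE reverses the *display* order of what
          # follows, without changing the underlying codepoints.
          spoof = "\u202E" + "5:erocS"

          print(repr(spoof))   # the raw codepoints are plainly reversed...
          print(spoof)         # ...but a bidi-aware renderer shows "Score:5"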

      • Re:I thought (Score:5, Insightful)

        by KiloByte ( 825081 ) on Saturday June 20, 2015 @05:16AM (#49951049)

        That Slashdot didn't support Unicode

        It does. It's actually fully Unicode-compliant

        No, Slashdot's database works in ISO-8859-1. You're confusing Slashcode, which can do Unicode, with Slashdot, which still hasn't deployed it.

      • Thank you for ninjaing me. I often chime in about this issue when someone complains about Slashdot's lack of support for Unicode. Most of the time, after I explain the codepoint whitelist and the reason for it, someone complains that a blacklist of dangerous codepoints would work better. My usual reply is that new versions of Unicode may introduce new control codepoints that get activated before the Slashdot admins have the chance to add them to the blacklist. And besides, many characters outside the current...
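
        To make the whitelist-versus-blacklist point concrete, here is a toy Python sketch (hypothetical ranges, not Slashdot's actual list): a whitelist stays safe as Unicode grows, because anything new is rejected by default.

            # Whitelist: keep printable ASCII plus newline; everything else is
            # dropped, including any control codepoint a future Unicode adds.
            ALLOWED = set(range(0x20, 0x7F)) | {0x0A}

            def filter_comment(text: str) -> str:
                return "".join(ch for ch in text if ord(ch) in ALLOWED)

            print(filter_comment("Score:5 \u202Eevil"))   # bidi override silently removed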

  • by Anonymous Coward on Saturday June 20, 2015 @01:04AM (#49950615)

    CJK in Unicode really kills me. I once had to write an application that generated PDF documents with both Japanese and Chinese text. When you do this with, say, English and Russian, you just need to pick a font set that covers both alphabets and that's it. Not so with Chinese/Japanese. There are a number of glyphs that share a common historic root in these languages, and the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages. Yet, the glyphs are substantially different when rendered. So you don't know what the glyph really represents until you know what font set is being applied to the string.

    What I ended up doing was processing each character individually, using a "look around" algorithm that tried to find clues in the surrounding context as to what language the glyph was in and render it with the right font. It never worked very well, but well enough that the client decided not to refactor the controller that was generating the mixed-language strings.
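
    The cheap core of that heuristic exploits the fact that kana occurs only in Japanese. A hypothetical Python sketch (not the actual code, which is long gone):

        def guess_cjk_language(text: str) -> str:
            # Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) appear only
            # in Japanese, so any kana nearby is a strong hint for a Japanese font.
            if any("\u3040" <= ch <= "\u30ff" for ch in text):
                return "ja"
            return "zh"   # crude fallback, much like the original approach

        print(guess_cjk_language("直接のメール"))   # 'ja' -- kana present
        print(guess_cjk_language("直接联系"))       # 'zh' -- no kana found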

    But I learned two valuable lessons that day: Unicode isn't that great after all, and stay away from CJK contracts.

    • by fisted ( 2295862 )

      the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages.

      For example? I thought exactly this was /not/ being done by Unicode.

    • by SirSlud ( 67381 )

      Unicode trips up shitty programmers. It sounds a little more like it was one of those "out of your league" problems. Probably best to stay away from CJK contracts, though. I've shipped shitloads of software with CJK localizations, and frankly, from that word soup you've produced, I don't think you have any idea what you're talking about.

      • by gustygolf ( 3979423 ) on Saturday June 20, 2015 @03:45AM (#49950915) Homepage

        In short:
        To render text properly in Japanese, you need a Japanese font. To render text properly in Chinese, you need a Chinese font. It's not just a matter of character coverage, but of a thing called Han unification [wikipedia.org] that the consortium did.

        The Unicode consortium decided to map similar characters to the same codepoint. Personally, I'm not particularly bothered by this, but it leads to the technical problem that each text must be supplied with a language tag to select the correct font.
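
        You can see the unification directly in Python's Unicode database (a commonly cited example; any Python 3 should do):

            import unicodedata

            # 直 -- e.g. the first kanji of Japanese "chokusetsu" -- shares one
            # codepoint with its Chinese counterpart, even though Japanese and
            # Chinese fonts draw it noticeably differently.
            ch = "\u76f4"
            print(hex(ord(ch)), unicodedata.name(ch))
            # 0x76f4 CJK UNIFIED IDEOGRAPH-76F4 -- the data doesn't say which
            # language it is; only a font or a language tag can decide that.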

        And this is problematic when two CJK languages are mixed in the same document -- in the GP's case, Chinese and Japanese -- or when a program must automatically decide which font to render things in.

        Take a web browser for example. It reaches a random Chinese web page, encoded in UTF-8. The page's author never bothered adding a language tag. Now the web browser must guess whether to render the page in a Chinese font or a Japanese one. And a "guess" is really all that it can do.

        (Typically, software used to base the guess on the user's locale. It's pretty accurate -- Chinese users tend to view Chinese documents, Japanese users Japanese ones. But the problems start when someone tries viewing a 'foreign' document...)

        It's really quite ironic that the consortium decided on codepoint unification for the three languages that would most benefit from Unicode.

        • Personally, I'm not particularly bothered by this, but it leads to the technical problem that each text must be supplied with a language tag to select the correct font.

          That's roughly like saying that, in Latin text, you need to render the words "automaton", "Tsipras", and "Varoufakis" in Greek characters, and "Putin" and "Gorbachev" in Cyrillic: it serves little purpose, and it would make the text unreadable for many readers.

          Most of the time, you want to show Japanese glyph variants to Japanese readers and Chinese variants to Chinese readers...

          • by chad_r ( 79875 )

            That's roughly like saying that, in Latin text, you need to render the words "automaton", "Tsipras", and "Varoufakis" in Greek characters, and "Putin" and "Gorbachev" in Cyrillic: it serves little purpose, and it would make the text unreadable for many readers.

            Yet, that was the spec the GP was trying to write to: a single PDF needing to render both Chinese and Japanese, each in their own font, yet with no language tagging on any of the text. I give him credit for trying to meet the requirements, but they were crap requirements.

            • Yes, I agree that's what the GP was writing for. But it isn't Unicode's fault that the original text lacked the language tags or font information his task needed; Unicode didn't prevent the original authors from putting that in, either with Unicode's own language-tag characters or (preferably) XML/HTML markup and/or metadata. However, most people really don't want the behavior he wants, which is why people don't usually do this.

    • There are a number of glyphs that share a common historic root in these languages, and the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages.

      This is not substantially different from what happened with Latin and Greek characters, both pre-Unicode and in Unicode.

      Yet, the glyphs are substantially different when rendered.

      Most of those glyph variants are similar enough that people have no trouble figuring them out. Those are the ideographs that...

      • I really can't explain this properly because I can't show you the symbols, but I'm sick of seeing the Chinese variant of the first kanji in "chokusetsu" whenever I type an email in Japanese. Of the huge corpus of Chinese ideograms, the ones with a different stroke order should have been encoded separately. It's really weird behavior, and to me it's a bad enough oversight.

        These problems are intrinsic to the languages; they are not problems with Unicode. The real solution is political and cultural: if using strings across languages is a frequent use case, that use case can only be addressed by harmonizing the writing systems themselves and adapting real-world usage; it's not something that the encoding can solve.

        I don't follow what you are trying to say. Are you saying the Japanese and the Chinese should unify their writing systems? Because that's as dis...

        • I really can't explain this properly because I can't show you the symbols,

          I know the symbols; I can read Japanese.

          I'm sick of seeing the Chinese variant of the first kanji in "chokusetsu" whenever I type an email in Japanese

          Why would you be seeing the Chinese variant if you're writing an E-mail in Japanese, presumably using a Japanese font? Note that even Google Translate manages to show you the correct local variants:

          https://translate.google.com/#... [google.com]

          Are you saying the Japanese and the Chinese should unify their...

  • by Anonymous Coward

    Is Unicode supposed to separate characters that look the same but are semantically different?

    Looks like the answer is yes...
    'LATIN CAPITAL LETTER A' (U+0041)
    'GREEK CAPITAL LETTER ALPHA' (U+0391)

    Looks like the answer is no...
    'RIGHT SINGLE QUOTATION MARK' (U+2019) -- this is the preferred character to use for apostrophe.
    (An apostrophe and closing a quotation are two very different things.)
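
    Both answers are easy to check from Python's standard library:

        import unicodedata

        # Separated: same visual shape, three distinct codepoints and names...
        for ch in ("A", "\u0391", "\u0410"):   # Latin, Greek, Cyrillic capital A
            print(hex(ord(ch)), unicodedata.name(ch))

        # ...but overloaded here: one codepoint serving two different roles.
        print(hex(ord("\u2019")), unicodedata.name("\u2019"))
        # 0x2019 RIGHT SINGLE QUOTATION MARK -- also the preferred apostrophe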

    • by Anonymous Coward

      For you, they *look the same*; for philologists and typographers (the ones who must learn and create them), they ain't.

      They not only differ in shape (though to your eyes they *look the same*) but also in kerning and spacing (both vertical and horizontal separation from other characters).

  • Unicode now has a set for pre-Latin Hungarian runes!
    Hanging out for the keyboard....

  • Unicode adds support for new languages like Ik, used in Uganda

    Now Uganda needs computers to see what Unicode looks like.

  • by divec ( 48748 ) on Saturday June 20, 2015 @01:53AM (#49950691) Homepage

    "...adds 7,716 new characters to the existing 21,499 – that's more than 35% growth!"

    Unicode 7.0 already had about 113K characters, which is more than 2^16, so remember (see the sketch after this list):

    • 1. UTF-16 is *not* two bytes per character [blogspot.com]
    • 2. Therefore a "character" in Java [oracle.com], C# [microsoft.com], or JavaScript [mathiasbynens.be] sometimes holds only half a Unicode character
    • 3. Even a whole Unicode character may be only part of a grapheme cluster [unicode.org], which means that taking arbitrary substrings may not result in readable text.
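
    A short Python illustration of all three points (Python's str counts codepoints, which makes the contrast easy to see):

        # 1 & 2: U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16
        # needs a surrogate pair -- a Java/C#/JavaScript "char" holds only half.
        clef = "\U0001D11E"
        print(len(clef.encode("utf-16-le")))   # 4 bytes = two 16-bit code units
        print(len(clef))                       # 1 codepoint in Python

        # 3: one codepoint still may not be one "character" to a reader:
        # "e" + COMBINING ACUTE ACCENT is two codepoints but one grapheme.
        print(len("e\u0301"))                  # 2
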
    • by fnj ( 64210 )

      1. Thank you! I KNEW that 21,499 figure was wrong

      2. Why does ANYBODY still use the mind-numbingly stupid UTF-16?

      • 2. Why does ANYBODY still use ........... UTF-16?

        Programmers use it because the programming environments they work in use it -- notably Windows, .NET, and Java.

        the mind-numbingly stupid

        I wouldn't call it stupid. It was a way to add support for more characters to existing 16-bit Unicode systems with minimal breakage.

    • "...adds 7,716 new characters to the existing 21,499 – that's more than 35% growth!"

      Unicode 7.0 already had about 113K characters, which is more than 2^16, so remember:

      • 1. UTF-16 is *not* two bytes per character [blogspot.com]
      • 2. Therefore a "character" in Java [oracle.com], C# [microsoft.com], or JavaScript [mathiasbynens.be] sometimes holds only half a Unicode character
      • 3. Even a whole Unicode character may be only part of a grapheme cluster [unicode.org], which means that taking arbitrary substrings may not result in readable text.

      But wasn't UTF-16 supposed to cover all the practical languages (I'm not talking about Klingon or other languages created for movies)? In that case, the 65K should have covered it. Why does Unicode need weirdass characters for playing cards or stuff of that nature? Just stick to the original role -- supporting the implementation of written & spoken languages in computers -- and leave it at that.

      • UTF-16 is an encoding which explains how to map bytes to codepoints (what you call characters), just like UTF-8. UTF-16 encodes data in chunks of 16 bits, while UTF-8 encodes data in chunks of 8 bits. UCS-2 was an encoding where only the first 2^16 codepoints could be encoded, in the same way that ASCII is an encoding where only the first 2^7 codepoints can be expressed, and ISO Latin-1 only encodes the first 2^8 codepoints. UCS-2 was an attempt to encode the "most common case" as you describe it. The problem is that 2^16 codepoints turned out not to be enough.
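
        The size difference is easy to demonstrate in Python (UCS-2 simply has no representation at all for the last example):

            # Bytes needed per codepoint under the two variable-width encodings.
            for s in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
                print(f"U+{ord(s):04X}: "
                      f"utf-8={len(s.encode('utf-8'))} bytes, "
                      f"utf-16={len(s.encode('utf-16-le'))} bytes")
            # U+0041: utf-8=1, utf-16=2
            # U+00E9: utf-8=2, utf-16=2
            # U+4E2D: utf-8=3, utf-16=2
            # U+1F600: utf-8=4, utf-16=4 (a surrogate pair; impossible in UCS-2)
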
  • by Anonymous Coward

    They lost any and all respectability when they let the emoji cancer in. To hell with them.

    • There was a reason for it: to allow for interop with Japanese text-messaging systems.
      https://www.youtube.com/watch?v=tITwM5GDIAI [youtube.com]
  • I know bits are cheap, but... really? [unicode.org] Font designers have to actually implement the characters; specifying hundreds of clipart characters seems kind of ridiculous. Design by committee, where no one ever says "no".

    Unicode is beginning to remind me too much of CSS3, where they let the specification blow up beyond all reason - making it essentially impossible for anyone to ever have a fully compliant implementation.

  • ...even more "dominoes" to show up on my screen because the OS/applications can't render/display Unicode properly to save their lives.

  • So yet another major version number and they still haven't bothered to add the many arrow (and other directional) symbols that have been missing...
