Unicode Consortium Releases Unicode 8.0.0

Follow Slashdot stories on Twitter

Unicode Consortium Releases Unicode 8.0.0 164

Posted by timothy on Saturday June 20, 2015 @01:39AM from the hobo-symbols-and-the-black-tongue-of-mordor dept.

An anonymous reader writes: The newest version of the Unicode standard adds 7,716 new characters to the existing 21,499 – that's more than 35% growth! Most of them are Chinese, Japan and Korean ideographs, but among those changes Unicode adds support for new languages like Ik, used in Uganda.

This discussion has been archived. No new comments can be posted.

Unicode Consortium Releases Unicode 8.0.0

Load All Comments

Search 164 Comments Log In/Create an Account

Comments Filter:

Ithought (Score:5, Funny)

by rossdee ( 243626 ) writes: on Saturday June 20, 2015 @01:41AM (#49950565)

That slashdot didn't support unicode

Share
twitter facebook
- Re:I thought (Score:1, Offtopic)
  
  by antiperimetaparalogo ( 4091871 ) writes:
  
  That slashdot didn't support unicode
  You thought right, Slashdot does not support unicode, this story is just news for nerds that is reported by accident, as stuff that matters for G[r]eeks only!
  note: i now continue my comment with a very interesting paragraph, but it is in Greek, so you can not read it, not even if you want to translate it:
- Re: (Score:2, Informative)
  
  by Anonymous Coward writes:
  
  > That slashdot didn't support unicode
  However Soylent News has had full unicode support since last year.
  Here is a recent thread with lots of greek. [soylentnews.org]
  - - Re: (Score:2)
      
      by unixisc ( 2429386 ) writes:
      
      Also, /. even has an IPv6 address, from what I read
    - - Re: (Score:2)
        
        by Hognoxious ( 631665 ) writes:
        
        Good idea. As a bonus it might keep the fucking windbag quiet for a while.
- Re:Ithought (Score:5, Funny)
  
  by hcs_$reboot ( 1536101 ) writes: on Saturday June 20, 2015 @02:52AM (#49950689)
  
  Slashdot supports Unicode / UTF8 from 0x20 to 0x7F.
  
  Parent Share
  twitter facebook
  - - Re:Ithought (Score:4, Informative)
      
      by sound+vision ( 884283 ) writes: on Saturday June 20, 2015 @04:17AM (#49950849) Journal
      
      Your post... has a Facebook icon next to it.
      I knew Slashdot was going in a different direction, but... Facebook? The alt-text says "From Facebook". I'm not even completely sure what that means, but I don't want anything "from Facebook" in here. I hope you know your post has fucked up my world. I'm going to have a hard time sleeping now. Bro... your post has a Facebook icon on it! However you managed to get that to appear, don't do it ever again! And tell all your friends not to. Together, we can make Slashdot sane again...
      
      Parent Share
      twitter facebook
      - Re: (Score:3)
        
        by Smallpond ( 221300 ) writes:
        
        You're just mad because your post didn't get an AOL icon.
      - Re: (Score:2)
        
        by phantomfive ( 622387 ) writes:
        
        It means he logged in using a facebook account instead of a slashdot account.
        
        Re: (Score:2)
        
        by daveime ( 1253762 ) writes:
        
        when the G+ authentication user posts.
        
        FTFY
- Re:Ithought (Score:5, Interesting)
  
  by tlhIngan ( 30335 ) writes: <slashdot.worf@net> on Saturday June 20, 2015 @04:46AM (#49950917)
  
  That slashdot didn't support unicode
  It does. It's actually fully Unicode-compliant. It's just on the input and recently (as of a couple of years ago) the output side passes through a Unicode whitelist.
  You see a Unicode codepoint is not necessarily a character.. It can be a character modifier. So you can be handling a string containing multiple codepoints, and yet on screen it only resolves to one character. Some of these include right-to-left overrides (which alter the flow of text on the screen so you can write a string and the display agent will reverse it). There are other modifiers that include flourishes, and Unicode 8 adds "skin type modifiers" as well for emoji. As in, if you display a face, the font should use a "non-human shading" (Apple chose a Simpsons-like yellow, Microsoft chose a pale zombie-ish hue). But with the addition of a skintone/diversity modifier, when combined with the emoji codepoint, can give you a variety of skin tones.
  And it's also what screwed up iOS - the string you send is full of modifiers which makes it extremely hard to decide where to break the line. (Arabic is one where there are lots of modifiers because a character can appear differently based on the characters that appear before and after it).
  And what does this have to do with /.? Easy - a lot of commenters abused the modifiers to screw with the website. And unless you know how to handle Unicode, it's really hard to properly reset the parser state. /. used to be able to display the screwed up the comments - if you Google for the oddball string n"5:erocS" it would show it (because Google ignores modifiers). If you wonder, that's the string "Score:5" as commenters use to fake-moderate their posts. But since /. strips unicode on display now, you get to see the messed up post as it was typed out
  
  Parent Share
  twitter facebook
  - Re:Ithought (Score:5, Insightful)
    
    by KiloByte ( 825081 ) writes: on Saturday June 20, 2015 @06:16AM (#49951049)
    
    That slashdot didn't support unicode
    It does. It's actually fully Unicode-compliant
    No, Slashdot's database works in ISO-8859-1. You're confusing Slashcode which can do Unicode with Slashdot which still hasn't deployed it.
    
    Parent Share
    twitter facebook
  - The (5:erocS) problem (Score:2)
    
    by tepples ( 727027 ) writes:
    
    Thank you for ninjaing me. I often chime in about this issue when someone complains about Slashdot's lack of support for Unicode. Most of the time, after I explain the code point whitelist and the reason for it, someone complains that a blacklist of dangerous code points would work better. My usual reply is that new versions of Unicode may insert new control code points that get activated before the Slashdot admins have the chance to add them to the blacklist. And besides, many characters outside the curren
CJK is Unicode's big failing (Score:5, Interesting)

by Anonymous Coward writes: on Saturday June 20, 2015 @02:04AM (#49950615)

CJK in Unicode really kills me. I once had to write an appointment that generated PDF documents with both Japanese and Chinese text. When you do this with, say, English and Russian, you just need to pick a font set that covers both alphabets and basta. Not Chinese/Japanese. There are a number of glyphs that share a common historic root in these languages, and the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages. Yet, the glyphs are substantially different when rendered. So you don't know what the glyph really represents until you know what font set is being applied to the string.
What I ended up doing was processing each character individually and using a "look around" algorithm that would try to find clues in the context as to what language the glyph was in and render it with the right font. It never worked very well, but it worked well enough that the client decided not to redactor the controller that was generating the mixed language strings.
But I learned two valuable lessons that day: Unicode isn't that great after all and stay away from CJK contracts.

Share
twitter facebook
- Re: (Score:2)
  
  by fisted ( 2295862 ) writes:
  
  the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages.
  For example? I thought exactly this was /not/ being done by unicode.
  - Re: (Score:3)
    
    by mwvdlee ( 775178 ) writes:
    
    There are certainly plenty of "repeat" characters in different contexts.
    For example the math alphanumerics: http://unicode.org/charts/PDF/... [unicode.org]
- Re: (Score:1)
  
  by SirSlud ( 67381 ) writes:
  
  Unicode trips up shitty programmers. It sounds a little more like it was one of those "out of your league" problems. Probably best to say away from CJK contracts tho. I've shipped shitloads of software with CJK localizations, and frankly, from that word soup you've produced, I don't think you have any idea what you're talking about.
  - Re:CJK is Unicode's big failing (Score:4, Informative)
    
    by gustygolf ( 3979423 ) writes: on Saturday June 20, 2015 @04:45AM (#49950915) Homepage
    
    In short:
    To render text properly in Japanese, you need a Japanese font. To render text properly in Chinese, you need a Chinese font. It's not just because of character coverage, but because of a thing called Han unification [wikipedia.org] the consortium did.
    The Unicode consortium decided to map similar characters to the same code-point. Personally, I'm not particularly bothered by this. but it leads to the technical problem that each text must be supplied with a language tag to select a correct font.
    And this is problematic when there are two CJK languages mixed in the same document -- in the GP's case, Chinese and Japanese --, or when a program must automatically decide which font to render things in.
    Take a web browser for example. It reaches a random Chinese web page, encoded in UTF-8. The page's author never bothered adding a language tag. Now the web browser must guess whether to render the page in a Chinese font or a Japanese one. And a "guess" is really all that it can do.
    (Typically, software used base the guesses on the user's locale. It's pretty accurate -- Chinese users tend to view Chinese documents, Japanese Japanese ones. But the problems start when someone tries viewing a 'foreign' document...)
    It's really quite ironic that the consortium decided on codepoint unification for the three languages that would most benefit from Unicode.
    
    Parent Share
    twitter facebook
    - Re: (Score:2)
      
      by NostalgiaForInfinity ( 4001831 ) writes:
      
      Personally, I'm not particularly bothered by this. but it leads to the technical problem that each text must be supplied with a language tag to select a correct font.
      That's roughly like saying that you need to render the words "automaton", "Tsirpas", and "Varoufakis" in Greek characters, and "Putin" and "Gorbachev" using Cyrillic characters, in Latin text: it serves little purpose and it would make the text unreadable for many readers.
      Most of the time, you should to show Japanese glyph variants to Japanese
      - Re: (Score:2)
        
        by chad_r ( 79875 ) writes:
        
        That's roughly like saying that you need to render the words "automaton", "Tsirpas", and "Varoufakis" in Greek characters, and "Putin" and "Gorbachev" using Cyrillic characters, in Latin text: it serves little purpose and it would make the text unreadable for many readers.
        Yet, that was the spec the GP was trying to write to: a single PDF needing to render both Chinese and Japanese, each in their own font, yet with no language tagging on any of the text. I give him credit for trying to meet the requirements, but they were crap requirements.
        
        Re: (Score:2)
        
        by NostalgiaForInfinity ( 4001831 ) writes:
        
        Yes, I agree that that's what the GP was writing for. But it isn't Unicode's fault that the original text lacked the language tags or font information for his needs; Unicode didn't prevent the original authors from putting that in, either using Unicode's own language tag characters or (preferred) XML/HTML and/or metadata. However, most people really don't want the behavior he wants, which is why people don't usually do this.
      - Re: (Score:2)
        
        by NostalgiaForInfinity ( 4001831 ) writes:
        
        Latin, greek and cyrillic scripts have their own code points for (historically) common characters.
        My point was about usage, not about the history of the alphabets: readers simply aren't served well by seeing characters whose meaning (phonetic or ideographic) they understand but that they don't recognize because they are rendered in a font that uses conventions they don't know.
        This makes it easy to embed snippets of greek script in an english text, for example, since a greek font will automatically be used f
        
        Re: (Score:2)
        
        by NostalgiaForInfinity ( 4001831 ) writes:
        
        But in reality it's actually causing problems since the same symbol is expected to look in one way for Chinese and slightly different for Japanese.
        Well, a large number of people (including myself) believe it's the right thing to do. People like you lost that argument, that's why Unicode is the way it is. I'm simply explaining it, and I'm telling you that the justification isn't Western imperialism or American ignorance or whatever other cultural b.s. people like to attach to it.
        By the way, there are plenty
- Re: (Score:2)
  
  by NostalgiaForInfinity ( 4001831 ) writes:
  
  There are a number of glyphs that share a common historic root in these languages, and the Unicode folks decided to consecrate this historical relationship by recycling the character codes between the languages.
  This is not substantially different for what happened with Latin and Greek characters, both pre-Unicode and in Unicode.
  Yet, the glyphs are substantially different when rendered.
  Most of those glyph variants are similar enough that people have no trouble figuring them out. Those are the ideographs that
  - Re: (Score:2)
    
    by baka_toroi ( 1194359 ) writes:
    
    I really can't explain this properly because I can't show you the symbols, but I'm sick of seeing the Chinese variant of the first kanji in "chokusetsu" whenever I type an email in Japanese. Of all the huge corpus of Chinese ideograms, the ones with different stroke order should've been separately encoded. It's really weird behavior and to me it's a bad enough oversight.
    These problems are intrinsic to the languages; they are not problems with Unicode. The real solution is political and cultural: if using strings across languages is a frequent use case, that use case can only be addressed by harmonizing the writing systems themselves and adapting real-world usage; it's not something that the encoding can solve.
    I don't follow what you are trying to say. Are you saying the Japs and the Chinks should unify their writing systems? Because that's as dis
    - Re: (Score:2)
      
      by NostalgiaForInfinity ( 4001831 ) writes:
      
      I really can't explain this properly because I can't show you the symbols,
      I know the symbols; I can read Japanese.
      I'm sick of seeing the Chinese variant of the first kanji in "chokusetsu" whenever I type an email in Japanese
      Why would you be seeing the Chinese variant if you're writing an E-mail in Japanese, presumably using a Japanese font? Note that even Google Translate manages to show you the correct local variants:
      https://translate.google.com/#... [google.com]
      Are you saying the Japs and the Chinks should unify thei
Unicode is badly designed (Score:2, Interesting)

by Anonymous Coward writes:

Is Unicode supposed to separate characters that look the same but are semantically different?
Looks like the answer is yes...
'LATIN CAPITAL LETTER A' (U+0041)
'GREEK CAPITAL LETTER ALPHA' (U+0391)
Looks like the answer is no...
'RIGHT SINGLE QUOTATION MARK' (U+2019) -- this is the preferred character to use for apostrophe.
(An apostrophe and closing a quotation are two very different things.)
- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  For you, they *look the same*, for a philologists and typographers (the ones who must learn and create them) they ain't.
  They not only differ in shape (though to your eyes they *look the same*) but also in kerning and spacing (both vertical and horizontal separation from other characters).
  - - Re: (Score:2)
      
      by arth1 ( 260657 ) writes:
      
      That depends on the font. Not on Unicode.
      You can have fonts where a Greek capital alpha looks very different from a Latin capital A, but that doesn't mean anything. There are fonts where zero and capital O look identical too, but that doesn't mean they are the same character, just because they appear identical looking in one particular font.
- - Re: (Score:2)
    
    by arth1 ( 260657 ) writes:
    
    or between a German Umlaut and an English dieresis
    Not to forget languages like Swedish, where Ã is a letter in its own right, and neither an umlaut nor a dieresis.
    That means that technically, when written in a language that uses diereses, the dots can be stacked. A Swedish word like "nÃÃ" (nah-ah) rendered in a language with diereses would be written with two extra dots on the second letter to show that the second Ã should also be pronounced.
    That's where Unicode fails - instead of having the diereses as a separate marker only, it has a
Runes (Score:1)

by Whiteox ( 919863 ) writes:

Unicode now has a set for pre-Latin Hungarian runes!
Hanging out for the keyboard....
Good for Uganda (Score:1)

by hcs_$reboot ( 1536101 ) writes:

Unicode adds support for new languages like Ik, used in Uganda
Now Uganda needs computers to see what Unicode looks like.
- - Re: (Score:2)
    
    by hcs_$reboot ( 1536101 ) writes:
    
    And literacy...
    Not if you use "OK Google" Voice search.
Already = 65K characters (Score:4, Informative)

by divec ( 48748 ) writes: on Saturday June 20, 2015 @02:53AM (#49950691) Homepage
"...adds 7,716 new characters to the existing 21,499 – that's more than 35% growth!"
There were already 113K characters in Unicode version 7.0. Which is more than 2^16 characters, so remember:
- 1. UTF-16 is *not* two bytes per character [blogspot.com]
- 2. Therefore a "character" in Java [oracle.com], C# [microsoft.com], Javascript [mathiasbynens.be] sometimes only holds half a Unicode character
- 3. Even a whole unicode character may be only part of a grapheme cluster [unicode.org], which means that taking arbitrary substrings may not result in readable text.
Share
twitter facebook
- Re: (Score:2)
  
  by fnj ( 64210 ) writes:
  
  1. Thank you! I KNEW that 21,499 figure was wrong
  2. Why does ANYBODY still use the mind-numbingly stupid UTF-16?
  - Re: (Score:2)
    
    by petermgreen ( 876956 ) writes:
    
    2. Why does ANYBODY still use ........... UTF-16?
    Programmers use it because the programming environments they work in use it. Notably Windows, .net and Java.
    the mind-numbingly stupid
    I wouldn't call it stupid. It was a way to add support for more characters to existing 16 bit unicode systems with minimal breakage.
- Re: (Score:2)
  
  by unixisc ( 2429386 ) writes:
  "...adds 7,716 new characters to the existing 21,499 – that's more than 35% growth!"
  There were already 113K characters in Unicode version 7.0. Which is more than 2^16 characters, so remember:
  
  1. UTF-16 is *not* two bytes per character [blogspot.com]
  2. Therefore a "character" in Java [oracle.com], C# [microsoft.com], Javascript [mathiasbynens.be] sometimes only holds half a Unicode character
  3. Even a whole unicode character may be only part of a grapheme cluster [unicode.org], which means that taking arbitrary substrings may not result in readable text.
  But wasn't UTF-16 supposed to cover all the practical languages (I'm not talking about Klingon or other languages created out of movies). In which case, the 65k should have covered it. Why does Unicode need weirdass characters for playing cards or stuff of that nature? Just stick to their original roles - supporting the implementation of written & spoken languages in computers, and leave it at that.
  - Re: (Score:2)
    
    by Matthias Wiesmann ( 221411 ) writes:
    
    UTF-16 is an encoding which explains how to map bytes to code-points (what you call characters), like UTF-8. UTF-16 encodes data in chunks of 16 bits, while UTF-8 encodes the data in chunks of 8 bits. UCS-2 was an encoding where only the 2^16 first code-points could be encoded, in the same way that ASCII is an encoding where only the first 2^7 code-points can be expressed, and ISO-latin only encodes the 2^8 first code-points. UCS-2 was an attempt to encode the "most common case" as you describe it. The prob
Unicode can go fuck themselves (Score:1)

by Anonymous Coward writes:

They lost any and all respectability when they let the emoji cancer in. To hell with them.
- Re: (Score:2)
  
  by oggiejnr ( 999258 ) writes:
  
  There was a reason for it. It was to allow for interop with Japanese text messaging systems.
  https://www.youtube.com/watch?v=tITwM5GDIAI [youtube.com]
- - Re: (Score:2)
    
    by unixisc ( 2429386 ) writes:
    
    Emoji is the equivalent of ICANN's TLDs - too many of them just smothering a limited resouce. There is no way TLDs can be adequately supported within IPv4, and IPv6 is by no means as widely adapted to justify being able to support this. Emojis are different b/w the platforms - iOS emojis can't be read on Android, whose Emojis can't be read on Window Phone.... And I agree on the skin tone emojis - that's a really stupid one. And sports - why are there symbols for just some sports (soccer, baseball,...)
    - Re: (Score:2)
      
      by pjt33 ( 739471 ) writes:
      
      You may or may not be pleased to know that this latest release of Unicode adds glyphs for volleyball and cricket [unicode.org].
Getting carried away? (Score:2)

by bradley13 ( 1118935 ) writes:

I know bits are cheap, but...really? [unicode.org]. Font designers have to actually implement the characters - specifying hundreds of clipart characters seems kind of ridiculous. Design by committee, where no one ever says "no".
Unicode is beginning to remind me too much of CSS3, where they let the specification blow up beyond all reason - making it essentially impossible for anyone to ever have a fully compliant implementation.
Cool... (Score:2)

by Red_Chaos1 ( 95148 ) writes:

...even more "dominoes" to show up on my screen because the OS/applications can't render/display Unicode properly to save their lives.
Lack of arrows, STILL (Score:2)

by petteyg359 ( 1847514 ) writes:

So yet another major version number and they still haven't bothered to add the many arrow (and other directional) symbols that have been missing...
- Re: (Score:3)
  
  by Zontar The Mindless ( 9002 ) writes:
  
  Most languages can be written with English characters (ie. plain latin).
  Name a language written with Latin characters (other than English) that does not use any special characters or diacritical marks whatsoever.
  (Even English requires extensions to the Latin character set. which originally had no "U", "J", or "W". )
  - Touché (Score:2)
    
    by pjt33 ( 739471 ) writes:
    
    I would have said that even English requires diacritics to support some of its loanwords.
    grep -E "[éóèâêûäöñç]" /usr/share/dict/words | grep -v "[A-Z]" | wc 174 174 1720
    - Re:Touché (Score:2)
      
      by fnj ( 64210 ) writes:
      
      Odd; that command only works for me if I replace the second grep with egrep. I wonder why.
      - Re: (Score:2)
        
        by fnj ( 64210 ) writes:
        
        Hilarious; I had grep aliased to "grep -i --color=auto" because the idiots have deprecated GREP_OPTIONS. A lesson in unexpected interaction (-i and -v in this case).
        I fixed my alias. Thank you for leading to me finding my bad practice.
    - - Re: (Score:2)
        
        by Hognoxious ( 631665 ) writes:
        
        That's because he missed out some characters from his pattern - there's no u with an umlaut either. As to u with a circumflex, I don't think I've ever seen that.
        
        Re: (Score:2)
        
        by pjt33 ( 739471 ) writes:
        
        No, it's because those words aren't in /usr/share/dict/words. I started with a much larger list of characters, and filtered it down before posting to just those which actually matched something. The û is for croûton, croûton's, croûtons.
  - - Re: (Score:2)
      
      by SirSlud ( 67381 ) writes:
      
      Dutch has an extra character not in the Latin character set.
      - Re: (Score:3)
        
        by ciaran2014 ( 3815793 ) writes:
        
        You mean "ij"? The unified ij character isn't used by anyone. Not sure if it's even recommended by any body.
        But Dutch does have accents (één, vóór, ...). News headlines this morning:
        "Verstekeling valt boven Londen uit vliegtuig na 11u lange vlucht, één overleeft"
        "Grieken demonstreren ook vóór de euro"
        
        Re: (Score:2)
        
        by smallfries ( 601545 ) writes:
        
        Does anybody else read this as the Librarian?
  - - Re: (Score:2)
      
      by ciaran2014 ( 3815793 ) writes:
      
      Accents in Tagalog are optional and rarely used, but they are there.
      I've never seen them used on websites, but they're used in most or all dictionaries.
    - Re: (Score:2)
      
      by Zontar The Mindless ( 9002 ) writes:
      
      Tagalog and Bahasa Malaysia/Indonesia aren't "most languages". :)
      That being said, I should have thought of the latter myself.
- Re: bloatware (Score:2)
  
  by Ilgaz ( 86384 ) writes:
  
  This is exactly why it took decades and crazy hacks for people to write their own language electronically.
  Thank God virtually failed (but won) Plan 9 (UNIX2) came by with idealistic developers who respects other cultures came up with Unicode and companies like IBM/Microsoft/Adobe along with Free software supported it.
  Who knows if the software/hardware/network combination you use had a line coded by a person who is from those "computer illiterate" regions?
  - Re: (Score:2)
    
    by Hognoxious ( 631665 ) writes:
    
    This is exactly why it took decades and crazy hacks for people to write their own language electronically.
    You're confusing languages and alphabets. Ever heard of pinyin?
- Re: (Score:2)
  
  by sexconker ( 1179573 ) writes:
  
  I'd be happy with a reduced set of 64 characters.
  A-Z
  0-9 ,./?;":[]\ =-+)(*&^%$#@!~
  EOF
  Return/Newline {we don't use typewriters, let's use a single character)
  Drop ', `, _, &;t;, >, {, }, tab, and |
  Yes, there are all in use, but fuck it.
  - Re: (Score:2)
    
    by OrangeTide ( 124937 ) writes:
    
    what are you using @ ~ ^ # for other than Rogue/Nethack ?
    ITA2 without the letter/figure shift (so 6-bit instead of 5-bit) would be fine.
    A single alphabet for every computer user seems like an efficient use of resources and enables wider communication. (and an alphabet that isn't constantly expanding in size at a seemingly exponential rate). Latin alphabet as used in English seems OK, Cyrillic is better in many ways.
    - Seems like it, but doesn't (Score:1)
      
      by Anonymous Coward writes:
      
      You can look at the size of the required but insufficient supporting libraries to get an indication (but only an indication, mind) of the cost of unicode. It's quite high. It even has a capturing effect for English, since lots of devs believe "it is the standard" or "it is the future" or somesuch nonsense, enabling the thing by default and adding even more code to "nicen up" any and all output even for text where pure ASCII would have been sufficient. This actually reduces interoperability for reasons of "m
      - Re:Seems like it, but doesn't (Score:5, Insightful)
        
        by Dutch Gun ( 899105 ) writes: on Saturday June 20, 2015 @05:21AM (#49950959)
        
        Well, no, it's not, they're still working on that bit meaning that you get to keep upgrading all your programs to use newer and ever bigger libraries supporting more complex rules regularly. It's not stable.
        Nonsense. The Unicode encoding formats are stable, and have been for a very long time. New character are added all the time, but the underlying OS and it's fonts are typically upgraded to support these, and so most programs need to do absolutely nothing once their support is in place. The vast, vast majority of applications that support Unicode don't actually explicitly need to use those "official" Unicode libraries (which are monstrously complex), because all modern operating systems provide most of the support they need. For simple conversions, there are a number of excellent free and simple-to-use libraries (many languages have standard libraries available), or you can just use OS-specific versions, or a number of very easy-to-use free and open-source libraries.
        If you're concerned about size, just use UTF-8. There's no need to "switch encodings on the fly", because that's what variable-width encodings already do for you. And the vast majority of common encodings, even in Asian languages, are only 16-bits, not 24 or 32. The issue of inefficiency of text size with Asian languages is greatly exaggerated, and becoming less and less relevant anyhow with our machines with gigabytes of RAM and processors efficient enough to compress and decompress text on the fly. BTW, you can do that just fine even in Microsoft and Apple environments. It just means you need to transcode from UTF-8 to UTF-16 or back again at any API boundary that takes text, and this is fairly simple to do. I've written my own cross-platform code this way because UTF-8 is a much easier encoding to work with internally IMO.
        I don't think anyone would try to argue that Unicode is a perfect solution, but it's a damn sight better than what we used to have. Your comparison to USB is pretty good, in fact. Ask just about any PC user what they'd prefer - modern USB devices or the old system of parallel, serial, PS/2, and joystick ports. Whatever faults USB has, it's a hell of an improvement over the old system.
        
        Parent Share
        twitter facebook
        
        Re: (Score:2)
        
        by AmiMoJo ( 196126 ) writes:
        
        Variable width encodings are a bad solution. UTF8 was a reasonable hack to ease the transition to Unicode, but the standard encoding should have been 32 bits with no modifiers it multi-word encodings at all. Just give every character and every variation/modification a code and let the font rendering system worry about compounds and stuff like that.
        Then string manipulation is easy. No need to try to interpret the characters and understand every language.
        Instead we are stuck with UTF16 as the default, and eve
        
        Re: (Score:3)
        
        by Dutch Gun ( 899105 ) writes:
        
        Instead we are stuck with UTF16 as the default, and even the larger encodings use modifiers etc.
        Who's "we"? Windows and Mac use UTF-16, while Linux and the web use the vastly superior UTF-8 [utf8everywhere.org]. Internally, assuming you're in a language that supports it like C++, you can actually use any encoding you want - it just means you need to transcode strings at API boundaries. You'd have to do this for one or more of your target platforms anyhow if you're writing cross-platform code (all three major PC OSes).
        A lot of Windows programmers think "Unicode == UTF-16", which is not the case at all. In my own applic
      - Re: (Score:2)
        
        by NostalgiaForInfinity ( 4001831 ) writes:
        
        In short, unicode is about as universal as USB, including the built-in crappiness.
        And like USB, it's a lot better than what we had before. There are a few things wrong with Unicode, but nothing major. And where Unicode has problems, they are usually just problems that specific writing system users did to themselves and don't hurt anybody else (e.g., Chinese and Japanese screwed up a bit, but that really doesn't matter to the rest of the world).
        There are many more problems with unicode, including security pr
    - Re: (Score:2)
      
      by pjt33 ( 739471 ) writes:
      
      what are you using @ for other than Rogue/Nethack ?
      Back in days of yore, before Facebook messaging and Whatsapp but after bang paths, @ was an essential part of a quaint communications system called "e-mail". Some old fogies still use it. Now get off my lawn and go ask the person who (I presume) sold you that 6-digit uid for a history lesson.
    - - Re: (Score:3)
        
        by Hognoxious ( 631665 ) writes:
        
        What are you doing these days? When Doves Cry was part of my childhood.
        
        Prince Rogers Nelson (Score:2)
        
        by tepples ( 727027 ) writes:
        
        Both of this musician's names can be represented in ASCII: "Prince Rogers Nelson" and "O(+>".
      - Re: (Score:2)
        
        by OrangeTide ( 124937 ) writes:
        
        You may have to adjust your expectations as to the correct form of your name.
        I would be OK defining a few standards for transcription. This is different than just using UTF-8 because the undecoded form is still human readable.
        ps - people pronounce my name wrong half the time but it's only 4 letters long and is a common noun in most parts of the US.
- Re:I'm going back to ASCII (Score:5, Interesting)
  
  by Dutch Gun ( 899105 ) writes: on Saturday June 20, 2015 @04:33AM (#49950881)
  
  Don't let the door hit you in the ass on the way out to the pasture of obsolescence. The rest of us will continue to use Unicode, which, despite some flaws, such as their mess-up with Han Unification [wikipedia.org], does a pretty good job at solving the problem of language intercommunication. If anyone thinks they can do a better job (not counting reverting back to English-only ASCII), have at it.
  So go ahead and use ASCII, don't type in any of those dern foreign charcturs, and pretend you're back in the happy past where we had a mess of incompatible standards, and no way to easily discern which of the many possible encodings was actually used, resulting in the scrambled text we always used to see (notice you *don't* actually see that much anymore?). And most software just ignored the rest of the non-English-speaking world anyhow because of that mess.
  Personally, I'm thankful people are willing to take on largely thankless (and mind-numbing to most of us) tasks such as these.
  
  Parent Share
  twitter facebook
  - Re: (Score:2)
    
    by AmiMoJo ( 196126 ) writes:
    
    It's a shame Unicode has become the standard. It's numerous flaws can't be overlooked, but we seem to be stuck with it now. Maybe the consortium will eventually fix those things, especial CJK support, but so far there is little sign that they care.
    It's terrible that we screwed up something so important.
    - - Re: (Score:2)
        
        by AmiMoJo ( 196126 ) writes:
        
        Han unification.
        Variable with encodings.
      - Re: (Score:2)
        
        by Dutch Gun ( 899105 ) writes:
        
        CJK is supported, but the early Unicode committee unfortunately decided to "unify" the codepoints of characters with shared ancestry [wikipedia.org], even though they may be rendered differently in different language [tofugu.com].
        Technically speaking, Unicode represents characters, not glyphs, so it's a matter of whether you consider the different language-specific visual representations of those characters to be distinct characters themselves, not simply visual differences of the same character (which most native speakers would, of co
  - - Re: (Score:2)
      
      by Smallpond ( 221300 ) writes:
      
      C isn't a good example. It doesn't even do a good job of handling strings of 1-byte characters.
- - Re: (Score:3, Insightful)
    
    by OrangeTide ( 124937 ) writes:
    
    Sorry, why do we need multiple languages again?
    - Why we need multiple languages (Score:2)
      
      by prefec2 ( 875483 ) writes:
      
      Humans developed different languages in different regions. Now we have different languages with different features and different cultural ties. While it is often possible to translate the semantics of one language is an equivalent in another language, you have more trouble doing so with pragmatics. And in addition the result does not "taste" as good as the original. It is a little bit like food. You could just consume a nutritious supplement to sustain life. However, all the culture and tastes and emotions
      - Re: (Score:2)
        
        by OrangeTide ( 124937 ) writes:
        
        Cyberspace is a single region.
        I'm willing to learn a new language if a reasonable proposal was put forward. Chinese is not a reasonable language for a cyberspace culture though. Russian might be OK, as would Greek, but Swahili seems like a good option, Korean would probably work fine as well. Japanese would probably be a terrible choice, and they would likely not appreciate a lot of foreigners contributing to alterations of their language.
    - Re: (Score:2)
      
      by Smallpond ( 221300 ) writes:
      
      Sorry, why do we need multiple languages again?
      http://www.scientificamerican.... [scientificamerican.com]
    - Re: (Score:2)
      
      by Cassini2 ( 956052 ) writes:
      
      Sorry, why do we need multiple languages again?
      Have you read the latest C++ spec? That's what happens when a single language does everything.
      The same effect happens in people languages too.
    - ENGLISH uber alles (Score:2)
      
      by unixisc ( 2429386 ) writes:
      
      Not a bad idea - abolishing every other language in the world - Chinese, Spanish, Arabic, Russian, Hindi, Urdu, Bengali, Swahili, Portugese and the whole bunch of them. Just have ENGLISH - that too, the US one, and nothing else!!! Let everyone, including the Brits and Kanucks, have to adjust - some more than others.
    - Tower of Babel (Score:2)
      
      by tepples ( 727027 ) writes:
      
      Sorry, why do we need multiple languages again?
      Originally, to punish ancient Babylonians for trying to build a dangerously tall ziggurat [rationalwiki.org]. Since then, to preserve access to oral tradition.
    - - Re: (Score:3)
        
        by prefec2 ( 875483 ) writes:
        
        No we would not do well with only one language, we would loose a lot of culture. It would be like one standard food for everyone. Furthermore, your proposition is ludicrous, as language changes all the time. That's why new street languages pop up and then evolve in something different. Language is reflection of culture. It is not like a programming language. If you want to communicate with other people you should learn additional languages. And while you are at it, also try to learn something about their cu
        
        Re: (Score:2)
        
        by Hognoxious ( 631665 ) writes:
        
        we would loose a lot of culture.
        You don't have much to spare, it seems.
        
        Re: I'm going back to ASCII (Score:2)
        
        by prefec2 ( 875483 ) writes:
        
        This is an orthographic mistake which may happen even to native speakers. So please forgive me when I make some mistakes. Hopefully you still got the message.
        
        Re: (Score:2)
        
        by OrangeTide ( 124937 ) writes:
        
        There is a difference between forcing everyone to speak a certain language at home, and having one language that we use online.
        I'm also pretty skeptical of the value of "culture".
        Language does impact your brain in profound ways, there is real science behind that. Unlike most(all?) claims to the value of culture. (often it implies one culture is more valuable than another, which smells a bit like stuff old racist white guys say)
        
        Re: I'm going back to ASCII (Score:2)
        
        by prefec2 ( 875483 ) writes:
        
        Culture is the ways people live together, their music and art, the way they address problems in life etc. There is no better culture. Only because racists think of their culture (which is often only a subculture as in a partial culture in a wider culture) as superior does not mean that it is that way or that we should use it in that way. I personally think culture is dynamic changing thing and it helps to learn from other cultures as it enriches me and my fellow humans around me.
        
        Re: (Score:2)
        
        by daveime ( 1253762 ) writes:
        
        > Otherwise could would not be able to understand them. So even with only one language, you would not be able to understand them.
        
        Funny that, I can get by in 3 and am fluent in 2 more, and I still don't know what that first sentence was supposed to mean.
        
        Re: I'm going back to ASCII (Score:2)
        
        by prefec2 ( 875483 ) writes:
        
        Wonderful for you. I cannot understand it either. Most likely I should stop using my smartphone when writing comments online.
      - Re: (Score:2)
        
        by alexgieg ( 948359 ) writes:
        
        Replying to undo incorrect moderation.
        
        Re: (Score:2)
        
        by Hognoxious ( 631665 ) writes:
        
        We're shedding languages like crazy, that with lots of small languages going extinct and all.
        
        Don't tell HR. They'll be adding "fluent in Kernowek and Ainu" to every job description.
        
        Its for historical purposes (Score:2)
        
        by drnb ( 2434720 ) writes:
        
        ... for practical purposes over 9000 languages is a bit much ...
        But for historical purposes it makes sense. Shouldn't we be digitizing as much of antiquity and vanishing cultures/languages as possible. Note that its a pretty bad time for the physical preservation of antiquities in the cradle of civilization right now.
        
        It would be helpful to academics to have such languages in a textual format not merely an image format.
      - Re: (Score:2)
        
        by alexgieg ( 948359 ) writes:
        
        Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?
        Monarchism (+Communism in Russia), Fascism+Capitalism+Communism, Communism+Capitalism, Communism+Capitalism, Communism+Capitalism, Merchantilism. You forgot: all the current Middle East wars (Fascism+Capitalism), all the current African wars (Tribalism+Communism+Capitalism), and all the many single-country revolutions of the 20th century (mostly Communism, with a few Fascism and Capitalism thrown into).
        
        Re: (Score:2)
        
        by drnb ( 2434720 ) writes:
        
        Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?
        Monarchism (+Communism in Russia), Fascism+Capitalism+Communism, Communism+Capitalism, Communism+Capitalism, Communism+Capitalism, Merchantilism. You forgot: all the current Middle East wars (Fascism+Capitalism), all the current African wars (Tribalism+Communism+Capitalism), and all the many single-country revolutions of the 20th century (mostly Communism, with a few Fascism and Capitalism thrown into).
        You almost had a point until you suggested that current wars in the middle east and africa don't have a major religious component. Then you lost credibility. Then I started to rethink the earlier ones and began to note that religion existed their too. For example the Nazis in-fact were trying to create an alternative religion with Naziism, complete with its own tenants of faith, saints, mythologies, etc. Imperial Japan's military justified everything done as service to the living-god emperor. So yeah, relig
        
        Re: (Score:2)
        
        by alexgieg ( 948359 ) writes:
        
        You've confusing causation with correlation. Religion is a good way to make people do thing for you, but your actual reasons are different. From the current conflicting parties, only ISIS is really driven by religion first and foremost, so I'll concede on that one. As for the others, nope, the driving impetus is non-religious even though religion is used for gluing purposes.
        
        Re: (Score:2)
        
        by drnb ( 2434720 ) writes:
        
        You need to concede a lot of the fighting going on in Africa too.
        
        As for Imperial Japan, the emperor worship was far more than a sales technique. The leadership were true believers. The god status of their emperor, their head of state, made all other heads of state vastly inferior, all other peoples vastly inferior, the wishes of all others vastly inferior. Imperial Japan's "superiority" was firmly based in their religion, it inspired their "divine destiny" to rule vast parts of Asia.
        
        As for Naziism, it
        
        Re: (Score:2)
        
        by alexgieg ( 948359 ) writes:
        
        That's called ideology, which is a useful distinction. You can have religion as ideology, and a-religion as ideology. Conversely, you can have non-ideological religions and a-religions.
        About Japan, not really. The Japanese religion was forcefully changed by the state for the purposes of ideological indoctrination. Temples were closed, split, merged, priests reallocated and replaced, official doctrines for the specific purpose of mass submission developed, non-related philosophies (such as Bushido) reinterpr
        
        Religions & war (Score:2)
        
        by unixisc ( 2429386 ) writes:
        
        Which religions gave us WW1, WW2, Vietnam, the Cold War, the Korean war, and the Opium Wars again?
        While those may be the biggest recent wars, they are by no means the only wars in history. There was the Muslim conquests of everything from Spain to India b/w the 7th to 10th centuries, which obliterated Christianity, Zoroastrianism, Animism, Buddhism and Hinduism from a lot of the territories it conquered. There were the Conquistadoras, who overran the Aztec, Mayan & Inca empires and replaced it w/ the Spanish inquisition. There was the Thirty Years War, fought to determine whether Central Europe s
  - Re: (Score:2)
    
    by Hognoxious ( 631665 ) writes:
    
    It's garbage until you have to read or write one of those languages.
    
    Don't do it then.
- Slashdot Glyphs obscure the titles. Please fix. (Score:2)
  
  by ciaran2014 ( 3815793 ) writes:
  
  I'm seeing this problem too.
- Re: (Score:2)
  
  by Hognoxious ( 631665 ) writes:
  
  the last version has 113021 encoded characters.
  Not too bad, only about 112765 too many.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Ithought (Score:5, Funny)

Re:I thought (Score:1, Offtopic)

Re: (Score:2, Informative)

Re: (Score:2)

Re: (Score:2)

Re:Ithought (Score:5, Funny)

Re:Ithought (Score:4, Informative)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re:Ithought (Score:5, Interesting)

Re:Ithought (Score:5, Insightful)

The (5:erocS) problem (Score:2)

CJK is Unicode's big failing (Score:5, Interesting)

Re: (Score:2)

Re: (Score:3)

Re: (Score:1)

Re:CJK is Unicode's big failing (Score:4, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Unicode is badly designed (Score:2, Interesting)

Re: (Score:1)

Re: (Score:2)

Re: (Score:2)

Runes (Score:1)

Good for Uganda (Score:1)

Re: (Score:2)

Already = 65K characters (Score:4, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Unicode can go fuck themselves (Score:1)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Getting carried away? (Score:2)

Cool... (Score:2)

Lack of arrows, STILL (Score:2)

Re: (Score:3)

Touché (Score:2)

Re:Touché (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: bloatware (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Seems like it, but doesn't (Score:1)

Re:Seems like it, but doesn't (Score:5, Insightful)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3)

Prince Rogers Nelson (Score:2)

Re: (Score:2)

Re:I'm going back to ASCII (Score:5, Interesting)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Insightful)

Why we need multiple languages (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

ENGLISH uber alles (Score:2)