
New Unicode Bug Discovered For Common Japanese Character "No"

AmiMoJo writes: Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own), is being rendered in a different font from the surrounding text in certain applications. The Unicode standard has apparently marked the character as sometimes being used in mathematical formulae, causing some software to substitute a math font for it. Similar but more widespread issues have plagued Unicode for decades due to the decision to unify dissimilar characters in Chinese, Japanese and Korean.
This discussion has been archived. No new comments can be posted.


  • I tried to RTFA, but it was in Japanese! I thought Japanese didn't have a word for "no":

    Japanese also lacks words for yes and no. [wikipedia.org] The words "hai" and "iie" are mistaken by English speakers for equivalents to yes and no, but they actually signify agreement or disagreement with the proposition put by the question: "That's right." or "That's not right."

  • What bug? (Score:5, Informative)

    by Ark42 ( 522144 ) <slashdot AT morpheussoftware DOT net> on Saturday July 18, 2015 @06:53AM (#50134621) Homepage

    The character in question is Hiragana "No", codepoint U+306E. As far as I can tell, this has existed since Unicode 1.1 and there are no differences in the Unicode metadata when compared to any other Hiragana glyph. It is marked as IsAlphabetic=True, Category=Other Letter, and NumericType=None, for example. So are all the other common Hiragana glyphs. If there is a bug, it's clearly with some specific application, and not Unicode or Unicode metadata. Compare http://www.fileformat.info/inf... [fileformat.info] with any other Hiragana glyph, like http://www.fileformat.info/inf... [fileformat.info] (Hiragana "Ha").
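    For anyone who wants to reproduce that check, here is a minimal sketch using Python's standard unicodedata module (it exposes the category and numeric properties; there is no separate "IsAlphabetic" flag, but category Lo, "Letter, other", is the relevant signal):

        import unicodedata

        no = "\u306E"  # HIRAGANA LETTER NO
        ha = "\u306F"  # HIRAGANA LETTER HA, for comparison

        for ch in (no, ha):
            # Both print category "Lo" (Letter, other) -- no math property in sight
            print(unicodedata.name(ch), unicodedata.category(ch))

        try:
            unicodedata.numeric(no)  # NumericType is None, so this raises
        except ValueError:
            print("U+306E has no numeric value")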

    • Re: (Score:3, Interesting)

      by AmiMoJo ( 196126 )

      The bug is that the Japanese, Chinese, Korean and mathematical versions of this character all share a common code point. There is no reliable way for an application to select the right character and render it properly.

      You can't mix C/J/K and mathematics in Unicode, which is a new failure on top of the existing inability to mix the C/J/K languages with each other.

      • “-” looks very different in text and formulas, too. I don't get why people assume that you can get nice rendering without additional markup.

        • by Ark42 ( 522144 )

          Except that while that one is called "Hyphen-Minus" and can be used for two things, Unicode does try to solve that problem by having:
          00AD Soft Hyphen
          2010 Hyphen
          2011 Non-Breaking Hyphen
          2012 Figure Dash
          2013 En Dash
          2014 Em Dash
          2015 Horizontal Bar
          2212 Minus Sign
          2796 Heavy Minus Sign

          There is no "Mathematical Hiragana No" glyph defined by Unicode, and as such, it should never be rendered in a different font just because somebody *might* use it in a formula. The application is wrong, and there is no bug in Unicode.

      • by Ark42 ( 522144 )

        I'm aware of the problems with the Han unification and certain Kanji being displayed "wrong" because the Chinese equivalent is drawn significantly differently from the Japanese Kanji, but this doesn't seem to be anything close to that kind of problem. I'm also aware of the Unicode block U+1D400 "Mathematical Alphanumeric Symbols", which is what should be used for formulas. Any application that is rendering one particular character in the Hiragana block in a different font than the rest of the Hiragana block is simply broken.
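        To illustrate how Unicode actually handles "math letters" when it wants to, here is a small sketch mapping Latin capitals into the Mathematical Alphanumeric Symbols block mentioned above (the bold capitals occupy a contiguous, gap-free range starting at U+1D400, which keeps the arithmetic simple):

            MATH_BOLD_CAPITAL_A = 0x1D400  # start of the bold capitals, U+1D400..U+1D419

            def to_math_bold(text: str) -> str:
                """Map A-Z onto MATHEMATICAL BOLD CAPITAL A-Z; leave other characters alone."""
                return "".join(
                    chr(MATH_BOLD_CAPITAL_A + ord(c) - ord("A")) if "A" <= c <= "Z" else c
                    for c in text
                )

            print(to_math_bold("AREA"))  # distinct code points, so a distinct math rendering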

      • by amake ( 673443 )

        There are no Chinese or Korean versions of this Japan-specific character. This is the first time I've ever heard of a "mathematical use" of this character, and I suspect the vast majority of users would be surprised at this as well.

      • The bug is that the Japanese, Chinese, Korean and mathematical versions of this character all share a common code point. There is no reliable way for an application to select the right character and render it properly.

        What you probably mean is that an application can't select the right glyph based on the Unicode string. That is correct, but nothing specific to CJK. Without markup or metadata, Unicode often won't render as expected by readers even in Western languages. Unicode used to have its own system for language tagging (the Plane 14 tag characters), but it was deprecated.

    • by Megane ( 129182 )
      My guess is that it can be used in certain numerical contexts, sort of like "No." ("number") in English. It can mean a quantity as in "n no x" (ippiki no neko), and maybe some other contexts. So something, probably an application, was coded to think of it as used in numerical contexts. The specific instance is about LaTeX, which is one of those ancient apps like emacs that is so old it had to create everything from scratch, so it's possibly specific to LaTeX or some port thereof.
  • Nitpick (Score:5, Informative)

    by msobkow ( 48369 ) on Saturday July 18, 2015 @07:21AM (#50134695) Homepage Journal

    This is not a "Unicode bug". It is a rendering bug exhibited by some applications.

    • by AmiMoJo ( 196126 )

      How is an application supposed to know if a random character is Japanese, Chinese, Korean or mathematical? It would need some kind of strong AI to interpret and understand the text. It's a Unicode bug: merged characters are impossible to render correctly all the time, because apps are forced to guess which font to use.

      • by msobkow ( 48369 )

        Ask the people who wrote the software that doesn't exhibit the bug. Obviously it can be done.

        • by AmiMoJo ( 196126 )

          Software that doesn't have this bug only avoids it by not supporting mathematical symbols. So far there is no known software that avoids the CJK confusion problem either.

          Most software doesn't even try. How many programmers are even aware of the issue? No Unicode library is immune. It's a problem with the standard that can only be fixed by starting fresh with about 150,000 new CJK characters, and then updating all fonts and libraries to handle translation and equivalence.

        • In other news, a new bug has been discovered in which some mathematical programs substitute a Japanese character into formulae.

          The problem is it can't be done. Not without intelligent user/designer input (such as signifying that the Unicode to be displayed is Japanese and not a maths formula). If an application is correct in determining one context, it will be incorrect in determining the other.

      • Re:Nitpick (Score:4, Informative)

        by Kjella ( 173770 ) on Saturday July 18, 2015 @09:18AM (#50135097) Homepage

        How is an application supposed to know if a random character is Japanese, Chinese, Korean or mathematical? It would need some kind of strong AI to interpret and understand the text. It's a Unicode bug: merged characters are impossible to render correctly all the time, because apps are forced to guess which font to use.

        Except font encoding has never been part of the character encoding. You might want your English text in Arial, your French in Times New Roman and the formula in Courier, but Unicode doesn't encode that. You might argue that this is not a bug, that it's simply out of scope and should be solved by a higher-level markup like <font="some japanese font">konnichiwa</font><font="some chinese font">ni hao</font> rather than plaintext Unicode. That's what the Unicode consortium says [unicode.org], and if you frame it as simply a style issue, it actually sounds plausible.

        On the other hand, you might argue that there's no reasonable way to map a "unihan" character to a glyph except as a band-aid, since the CJK styles are distinctly different and any comprehensive font would need three variations; it shouldn't take three fonts to make a mixed CJK document look correct, just one. On that view, the information belongs at the lowest level and should be passed along as you copy-paste CJK snippets or pass them around in whatever interface or protocol you have; otherwise everything needs a document structure, not just a string.

        I don't think they should "unmerge" and duplicate all the Han characters; that would be silly. What they should do is add CJK indicators - say HANC, HANJ, HANK - like for bi-directional text, only simpler: no nesting, just one indicator applying until superseded by another. Like (HANJ) konnichiwa (HANC) ni hao, where the former renders as Japanese Han and the latter as Chinese. If there's no indicator, well, take a guess. Am I missing something blindingly obvious, or would this trivially solve the problem?
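        To make the proposal concrete, here is a toy rendering pass for such indicators. Everything in it is hypothetical: the indicator code points are arbitrary private-use stand-ins and the font names are invented, since nothing like this exists in Unicode today:

            HANC, HANJ, HANK = "\uE000", "\uE001", "\uE002"  # made-up private-use stand-ins
            FONTS = {HANC: "SomeChineseFont", HANJ: "SomeJapaneseFont", HANK: "SomeKoreanFont"}

            def font_runs(text, default_font="DefaultFont"):
                """Split text into (font, run) pairs; an indicator applies until superseded."""
                runs, font, buf = [], default_font, []
                for ch in text:
                    if ch in FONTS:
                        if buf:
                            runs.append((font, "".join(buf)))
                            buf = []
                        font = FONTS[ch]  # sticky: stays in effect until the next indicator
                    else:
                        buf.append(ch)
                if buf:
                    runs.append((font, "".join(buf)))
                return runs

            print(font_runs(HANJ + "konnichiwa " + HANC + "ni hao"))
            # [('SomeJapaneseFont', 'konnichiwa '), ('SomeChineseFont', 'ni hao')]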

        • by AmiMoJo ( 196126 )

          I agree, font encoding should not be part of the character encoding. Unicode even screws that up though, because there are things like text direction marks in it. Anyway, the problem is that often you have text without metadata. A file name, audio file metadata, a plain text database entry etc. You have to pick a font to render it, and the choice depends on the language because thanks to Unicode it's impossible to have a universal all-language font.

          You could have meta characters as you suggest, but that isn't what Unicode is supposed to be for. It's a character encoding scheme, not a metadata encoding scheme.

          • by Kjella ( 173770 )

            You could have meta characters as you suggest, but that isn't what Unicode is supposed to be for. It's a character encoding scheme, not a metadata encoding scheme.

            Actually I was thinking of it more like a "sticky" composite character: just as a + combining ring = å, you'd have unihan + HAN(C|J|K) = the "right" glyph, while:

            a) Extending existing single-language CJK documents with just one character
            b) Preserving backwards compatibility with all current CJK systems
            c) Avoiding any complex CJK conversion functions
            d) Creating a simple way to override with "show as C/J/K"

            It would require adding a bit of intelligence to copy-paste for preservation, like:

            (HANC)abcde -> (HANC)c when copying just the "c", so the indicator travels with the selection.
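            A sketch of that copy-paste intelligence, continuing the hypothetical indicators from the earlier example (again, the indicator characters are invented stand-ins, not real Unicode):

                HANC, HANJ, HANK = "\uE000", "\uE001", "\uE002"
                INDICATORS = {HANC, HANJ, HANK}

                def copy_range(text, start, end):
                    """Copy text[start:end], prepending whichever indicator was active there."""
                    active = ""
                    for ch in text[:start]:
                        if ch in INDICATORS:
                            active = ch  # the last indicator before the selection wins
                    return active + text[start:end]

                clip = copy_range(HANC + "abcde", 3, 4)  # copy just the "c"
                print(clip == HANC + "c")  # True: the sticky tag travels with the selection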

            • by AmiMoJo ( 196126 )

              That wouldn't really improve things IMHO, because you would still be reliant on the application knowing how to handle the character. In practice what would you do, add it to the start of file names? Then on all current software your filename would start with a little box representing an unknown character. The whole concept of composite characters is ridiculous as well; they should all get their own code points and let the font system handle saving some memory by re-using parts of glyphs. Otherwise your simple character count suddenly requires a massive look-up table of composite characters.

              • by Kjella ( 173770 )

                That wouldn't really improve things IMHO, because you would still be reliant on the application knowing how to handle the character. In practice what would you do, add it to the start of file names? Then on all current software your filename would start with a little box representing an unknown character.

                Yes, until the software got updated to treat it as a non-printing character. But it wouldn't make everything unreadable; there's bad and there's much, much worse.

                The whole concept of composite characters is ridiculous as well, they should all get their own code points and let the font system handle saving some memory by re-using parts of glyphs. Otherwise your simple character count suddenly requires a massive look-up table of composite characters.

                It already does, for a huge number of reasons. Oh, and if you thought giving every character a code point would mean a 1:1 mapping to glyphs, that's still wrong: many characters map to alternate glyphs depending on the context. For example, Arabic and Latin cursive characters substitute different glyphs to connect to their neighbours depending on whether the character falls at the start, middle or end of a word.

      • by Megol ( 3135005 )

        The problem is outside the problem domain Unicode attempts to solve, so it isn't strange that it doesn't solve it. For some of the other problems Unicode does try to solve, the result is a mess (example: bidirectional text), so that is probably a good thing.

      • by GuB-42 ( 2483988 )

        First of all, the hiragana "no" is always Japanese: not Chinese, not Korean. The CJK unification is only about Han characters (in Japanese, kanji).
        As for maths, there are usually markers to indicate we are in an equation, which makes sense because Unicode is not powerful enough for this: fractions, integrals, matrices, etc. cannot be rendered with just code points. So in this case Unicode provides the characters (Roman and Greek letters, numbers, mathematical symbols, the hiragana "no", etc.) and the surrounding markup provides the structure.

  • The character in the Unicode table looks like a mashup of the hiragana (grammar-forming) version of the character and the katakana form (used much the way English uses italics).

  • by ciaran2014 ( 3815793 ) on Saturday July 18, 2015 @09:25AM (#50135123) Homepage

    A lot of people complain about the idea of unification without understanding it. I can't judge if Unicode's unification is great or awful. The English-speaking media constantly says it's awful, but it's usually clear the authors don't know what unification is, who's driving it, or how Unicode's work compares to what existed beforehand, so they can only be ignored. (They're sometimes trying to spin up some clickbait about ignorant westerners imposing blah blah blah on Asia, which just shows they know nothing about the topic.)

    The issue:

    There's a certain number of symbols which have been copied from one East Asian language to another. They're the same symbol, so Unicode has one slot for that symbol. Then there's a second category where the symbol has been copied, but one group draws it a little differently (the Japanese might like to put a little flick at the end of one line, or the Chinese draw the line a little slantier). And a third category where one group has developed a simplified symbol, which means again the traditional and the simplified symbols are the same thing but drawn differently. The two symbols are equivalent; the new one is just a new suggestion for how to draw it.

    Unification is about having one slot for the symbols in categories two and three and leaving it to the font to decide how to display it.

    (Unicode uses more precise terms, but I'm calling them "symbols" and "slots" for simplicity.)

    A disadvantage of this approach is that there can't be a font which displays a symbol both the way a Japanese reader would draw it and the way a Chinese reader would draw it. Fonts have to choose one style for each unified symbol.

    An advantage of this approach is that new languages and dialects can be added and supported without needing another 100,000 slots per language or dialect (we do all know there are more than three East Asian languages, don't we?), and it's much easier for fonts to add support for all the East Asian languages, because once they've done Chinese, Japanese is automatically almost finished.

    Here are some example symbols:

    https://en.wikipedia.org/wiki/... [wikipedia.org]

    unicode.org's FAQ also has clarifications:

    If the character shapes are different in different parts of East Asia, why were the characters unified?
    http://www.unicode.org/faq/han... [unicode.org]

    Isn't it true that some Japanese can't write their own names in Unicode?
    http://www.unicode.org/faq/han... [unicode.org]

    (All that said, it's been years since I looked into this so there's a chance I've gotten some detail wrong, but I'm confident it's a good summary of the issue.)

    • by AmiMoJo ( 196126 )

      An advantage of this approach is that new languages and dialects can be added and supported without needing another 100,000 slots per language or dialect (we do all know there are more than three East Asian languages, don't we?), and it's much easier for fonts to add support for all the East Asian languages, because once they've done Chinese, Japanese is automatically almost finished.

      The first one isn't really an advantage, since there is no shortage of code points. There are massive disadvantages though.

      From a software point of view it would be good to have universal fonts that can render any Unicode character correctly for anyone in the world. The Unicode consortium has tried to support this by splitting some of the more distinct symbols into separate code points for each language, but it's far from complete and every new version adds many more. The FAQ is a joke - when people point o

      • Thanks for this reply!

        Can you give me an example of a Japanese name that can't be written in unicode? I keep hearing English speakers mention this problem but I've never seen exactly what the problem is.

      • > it would be good to have universal fonts that can render
        > any Unicode character correctly for anyone in the world

        But a line has to be drawn between substance and style. There are two (main) ways to draw the number 4: one has a slanty line and is closed at the top, the other is made of straight lines and is open at the top. Or the number 7: for English speakers it's two lines, but French writers also add a horizontal bar across the middle. Should Unicode have two 4's and two 7's, or should that variation be left to the font?

  • Some slashdot editors have failed to notice that incomplete sentences, which are less and less common in the first sentence of slashdot summaries.

    Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language

  • by BlueMonk ( 101716 ) <BlueMonkMN@gmail.com> on Saturday July 18, 2015 @09:50AM (#50135233) Homepage
    I have been reading the comments for 20 minutes because I don't understand Japanese, but I still don't understand the problem. There's a Japanese character called no, it looks very much like a lowercase English/Latin "e" rotated clockwise about 80 degrees and then flipped over the vertical axis. Is this being mixed up with something else or rendered wrongly? Can anybody provide examples of what it's getting mixed up with or how or where it's being rendered improperly?
    • Here's a picture [twitter.com]. Notice that the character at the end is rendered in a different font than the rest of the characters. It's not a critical bug; the text is still legible. It's just an annoying cosmetic issue.
    • I can give an example, if you don't mind me resorting to Greek. Imagine some program renders mathematical symbols differently from text. Imagine that someone writes out, using Unicode, the formula for the area of a circle. No problem, right? The pi is clearly a math symbol. But imagine the same thing happening while you were reading Greek. And beyond that, imagine if all the Greek you read rendered pi as though it were being used in a mathematical sense.

      • What I still don't understand is, if there's only one code point for this character, where are the multiple renderings coming from? Multiple fonts? Is the source of the problem that Japanese fonts are providing a bad glyph/rendering for this character that doesn't match the style of the rest of the font, or is it that they are unable to provide both glyphs because there's only one code point? Would there still be a problem if they just changed their glyph to the other style; could this just be considered a font problem?
    • by AmiMoJo ( 196126 )

      It's rendered in a way that a Japanese person could read, but it looks ugly because software can't tell if it is Japanese, Chinese or mathematical. It's rather jarring in the middle of a sentence and makes the output unsuitable for publishing without manual editing.

      This is due to Unicode assigning the same code to the Japanese, Chinese and mathematical versions. It would be as if they had tried to merge the Latin "o" and Cyrillic "o". Imagine if every "o" character you wrote was rendered in a different font from all the text around it.
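      Notably, Unicode did keep Latin "o" and Cyrillic "о" separate, which is easy to verify with Python's standard unicodedata module; unlike unified Han, these look-alikes carry their own code points:

          import unicodedata

          for ch in ("o", "\u043E"):
              print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
          # U+006F LATIN SMALL LETTER O
          # U+043E CYRILLIC SMALL LETTER O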

  • Some users have noticed that the Japanese character "no", which is extremely common in the Japanese language (forming parts of many words, or meaning something similar to the English word "of" on its own).

    That isn't even a sentence in English. It is extremely grating to read crap like this, and it does not convey much about the story.
