Unicode 6.1 Released
An anonymous reader writes "The latest version of the Unicode standard (v. 6.1.0) was officially released January 31. The latest version includes 732 new characters, including seven brand new scripts. It also adds support for distinguishing emoji-style and text-style symbols and emoticons with variation selectors, updates to the line-breaking algorithm to more accurately reflect Japanese and Hebrew texts, and updates other algorithms and technical notes to reflect new characters and newly documented text behaviors."
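As a rough illustration of the variation-selector mechanism mentioned in the summary: a base character can be followed by U+FE0E (text presentation) or U+FE0F (emoji presentation). The helper names below are made up for this sketch, U+2764 is used only as an example base character, and whether the two forms actually render differently depends on the font and platform.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15: request text-style presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: request emoji-style presentation

def text_style(base):
    """Return base followed by the text-presentation selector (hypothetical helper)."""
    return base + VS15

def emoji_style(base):
    """Return base followed by the emoji-presentation selector (hypothetical helper)."""
    return base + VS16

heart = "\u2764"  # HEAVY BLACK HEART, chosen only as an example
for label, s in (("text", text_style(heart)), ("emoji", emoji_style(heart))):
    # Show the code points and character names that make up each sequence.
    print(label, [f"U+{ord(c):04X}" for c in s], [unicodedata.name(c) for c in s])
```

Both results are two code points long; the selector itself has no glyph of its own.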
27cb appearing in HTML in 5.4.3.2.1... (Score:3)
Take a good look at glyph 27cb aka \diagup part of the Misc Math Symbols. People are gonna try embedding that in html now. Can't wait.
Re: (Score:2)
That's a good one, but I'm also worried about HTML parsers needing to understand half a dozen variants of the "closing slash".
Re: (Score:3)
Favourite unicode character (Score:3, Interesting)
has got to be the Love Hotel [fileformat.info].
Does anyone know why this is even there?
Re: (Score:3)
As if http://www.fileformat.info/info/unicode/char/1f4be/index.htm [fileformat.info] makes sense to anyone under age 30. I demand the addition of a punchcard glyph...
Re: (Score:2)
Re: (Score:3)
The "don't bother me with those implementation details"-icon?
Re: (Score:2)
Re: (Score:2)
What exactly did you mean by this statement? What are you calling an implementation detail with which the user shouldn't be bothered?
The location where the data is stored (RAM vs. hard drive). There are some effects that play against each other here:
As computers get better, the latter effect becomes negligible. This means that when this is done automatically in the background (w
Re: (Score:2)
...Copying from RAM to harddrive (aka "saving") takes time. As computers get better, the latter effect becomes negligible.
Continuous autosave is possible with current technology, but it requires wasting battery power on spinning a hard drive's platter at all times while the user continues to edit the document. I agree that it's an implementation issue, but the underlying technical reason for the implementation issue is still present in 2012 technology. I don't see the distinction between fast temporary storage and large nonvolatile storage "becom[ing] negligible" until large SSDs and cellular data become a lot cheaper. In add
Re: (Score:2)
Continuous autosave is possible with current technology, but it requires wasting battery power on spinning a hard drive's platter at all times while the user continues to edit the document. I agree that it's an implementation issue, but the underlying technical reason for the implementation issue is still present in 2012 technology. I don't see the distinction between fast temporary storage and large nonvolatile storage "becom[ing] negligible" until large SSDs and cellular data become a lot cheaper. In addition, one ordinarily doesn't want to create a new numbered revision of the document in a revision control system after each keypress; there has to be some way to mark one's changes as suitable for being viewed by other editors of the document, not unlike the SQL keyword COMMIT.
Yes, you shouldn't save after every single keypress, but a timer for saving every minute or so (if there are any changes) should suffice. Committing for others to see is a different thing, that's something a user can be expected to understand.
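A minimal sketch of the "timer plus dirty flag" idea, assuming the application calls `mark_dirty()` on each edit and `tick()` from some periodic timer. The class and method names are invented for illustration, not taken from any real editor.

```python
import time

class AutoSaver:
    """Save at most once per interval, and only when there are unsaved changes."""

    def __init__(self, save_fn, interval=60.0, clock=time.monotonic):
        self.save_fn = save_fn      # callable that persists the document
        self.interval = interval    # minimum seconds between saves
        self.clock = clock          # injectable so the logic can be tested
        self.dirty = False
        self.last_save = clock()

    def mark_dirty(self):
        """Call on every edit; cheap, performs no I/O."""
        self.dirty = True

    def tick(self):
        """Call periodically (e.g. from a UI timer); returns True if it saved."""
        now = self.clock()
        if self.dirty and now - self.last_save >= self.interval:
            self.save_fn()
            self.dirty = False
            self.last_save = now
            return True
        return False
```

With no edits pending, `tick()` never touches the disk, which is the point being made about not saving on every keypress.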
Ultimately, for revert/versions there should be a timeline slider like there was in Google Wave, where you can go back to your document's state of any point in the past.
btw, affordable SSDs are already large enough for everyday use. My notebook has a 256GB SSD in it, a
Re: (Score:2)
Committing for others to see is a different thing, that's something a user can be expected to understand.
Back to my original question: If not a floppy disk, what icon should be used for this action of committing an edited document to the part of the file system viewable by other users and applications?
btw, affordable SSDs are already large enough for everyday use.
Not when "everyday use" includes storing a large collection of purchased music and purchased movies.
I didn't have to sell my car for [a 256 GB SSD].
But you did have to pay more than one would for the stock hard drive that comes bundled with a low-end laptop. Google Product Search shows 256 GB SSD in the $300-$400 range. Until the ultrabook market matures, auto
Re: (Score:3)
The generic flowchart datastore symbol with an inbound arrow (retrieving something previously committed would use the same symbol with an outbound arrow.)
For products with less technical audiences, a stone tablet with an etching instrument, since committing results in the data being "carved in stone".
Re: (Score:2)
But you did have to pay more than one would for the stock hard drive that comes bundled with a low-end laptop.
You could remove "the stock hard drive that comes bundled with" from that sentence and it would still be true.
Re: (Score:2)
The generic flowchart datastore symbol with an inbound arrow
Thank you. I had forgotten about the flowchart symbols because nowadays none of them appear to see popular use except an oval for module entry and exit, a box for a step, and a diamond for a decision.
Re: (Score:2)
Re: (Score:2)
If the user never saved it, then where is it when the user needs it later? Auto-saved, OK, but where and under what name? There still needs to be a save option, and an icon, even if outdated, is useful for that.
Saved to an internal directory, and will be opened as an untitled document the next time you open the application.
Re: (Score:2)
Why should the user be bothered with it? There aren't many real-life instances where a user creates and it isn't "autosaved".
It's one of the things that OS X Lion is doing - it's asking "why do we still do this?". Lion-aware apps automatically autosave in the background, and have a time-machine like feature that lets them view their document as it existed in the past. If they
Disclosure, drive space, and spinning up (Score:3)
If they write a brilliant paragraph a day ago, then deleted it in the morning, they can view the document as it existed yesterday, copy the paragraph back out, and be done with it.
For one thing, an application that saves (and sends) a document's undo history along with the document can disclose things that the document's author did not want to disclose. I seem to vaguely remember scandals with Word's AutoRecover being used to recover redacted parts of a document. For another, how much of the limited space on the drive should be dedicated to saving a document's undo history since creation, especially when the document is a large layered picture or multitrack audio project?
And that's because people forget to save - why not have the OS do it for them?
I agree, but
Re: (Score:2)
Yes, however I don't think that many users know what an internal hard drive looks like... So using this as an icon for saving is not a solution either. USB sticks and external drives vary too wildly in their looks to be recognized at that size.
Re: (Score:2)
One with the word "Save" on it.
"Save" with no icon in a toolbar full of icons (Score:2)
Re: (Score:2)
Yes, because none of the [working] machines here has a floppy drive and nobody under the age of twenty has ever even seen one except in a museum, you smug wanker.
U+0057 U+2693 = wanker (Score:2)
Re: (Score:2)
Re: (Score:2)
Here, a punch card glyph. Not quite what I expected but still...
http://www.fileformat.info/info/unicode/char/5361/index.htm [fileformat.info]
There is also a card index glyph; will that do?
http://www.fileformat.info/info/unicode/char/1f4c7/index.htm [fileformat.info]
There might not be a punchcard glyph, but there is a minidisk one:
http://www.fileformat.info/info/unicode/char/1f4bd/index.htm [fileformat.info]
and an optical disk one:
http://www.fileformat.info/info/unicode/char/1f4bf/index.htm [fileformat.info]
and a DVD one:
http://www.fileformat.info/info/unicode/char/1f4c0/index.htm [fileformat.info]
I cannot
Re: (Score:3)
They have 14 planes of ~65,536 characters... even after including massive syllabaries, and the unified CJK ideographs, they still had really only used the first plane. Now they're presented with only using about 7% of the space available, and so they started chucking just about every pictograph that they could possibly come up with into it...
I'm sorry, but while I'm down for having every script that is actually used, and every script that has been decoded, I don't see why we should have all of these pictogr
Re: (Score:2)
I thought unicode was unlimited? The coding methods might each have a limit, but the standard is unlimited.
Re: (Score:2)
Re: (Score:2)
If it is a 16 bit standard, how can it be unlimited? It can support at the most 2^16, or 65,536 characters. Where does it get planes from?
UTF-16 is NOT a naive 16-bit encoding, and has a set of surrogate pairs that allow one to construct codepoints of up to 20-bits in a UTF-16 stream. Subtract out the 16-bits per plane, and you're left with 4-bits, which is 16.
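For anyone who wants to see the surrogate arithmetic spelled out, here is a small sketch (the function name is made up). The 20 bits carried by a surrogate pair address the 16 supplementary planes; together with the Basic Multilingual Plane that makes 17 planes, U+0000 through U+10FFFF.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    v = cp - 0x10000               # 20-bit value
    high = 0xD800 + (v >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogates(0x1F4A9)  # PILE OF POO
print(f"{high:04X} {low:04X}")      # prints "D83D DCA9"

# Cross-check against Python's own UTF-16 encoder.
assert "\U0001F4A9".encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```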
I misquoted 14 in my post, the Unicode standard only defines 14 planes, and 2 private use areas.
Re: (Score:2)
If it were a 16-bit standard, it couldn't be unlimited. But it's not. In two ways. First, Unicode is simply a number->meaning table, and doesn't specify actual in memory format. There are a lot of competing standards for that. Second, UTF-16 has 1.1 M values. UTF-32 has 4B. UTF-8 has a 2B or a 1.1M limit depending on the version.
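To make the "number->meaning table versus in-memory format" distinction concrete, this sketch encodes the same few characters in three encoding forms and compares byte counts. The sample characters are arbitrary picks, and the little-endian codec variants are used only to avoid counting a byte order mark.

```python
samples = {
    "U+0041 Latin capital A": "A",
    "U+00E9 e with acute": "\u00E9",
    "U+5361 CJK ideograph": "\u5361",
    "U+1F4A9 pile of poo": "\U0001F4A9",
}

def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))

# The code space tops out at U+10FFFF: 17 planes of 65,536 code points each.
assert 0x10FFFF + 1 == 17 * 65536
```

The code point is the same in every row; only the byte sequences differ.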
Re: (Score:2)
I thought unicode was unlimited? The coding methods might each have a limit, but the standard is unlimited.
The limit is mostly purely arbitrary as newer encodings allow for much more expanded coding sequences. However, due to the way UTF-16 encodes values above U+FFFF it is limited to expressing at most a 20-bit codepoint, meaning that the Unicode standard is basically limited practically to 16 pages of 65536 values. So, short of breaking changes to the UTF-16 standards you're basically SOL.
Re: (Score:2)
I'm sorry, but while I'm down for having every script that is actually used, and every script that has been decoded, I don't see why we should have all of these pictographs
If they are or were in use in real programs, it sucks to not have them in the standard. Unicode started out as a quite political project (e.g. Han Unification) but it has become much more pragmatic over time.
We need the emoji and the other junk in the standard so that we are able to use Unicode as a credible archiving format.
Re: (Score:2)
The first one you link is a Chinese symbol. Looks totally valid to me.
Remember, Chinese has symbols for entire words or ideas, it is not "alphabetical" like most other popular languages.
Re: (Score:2)
Yes, it is. I don't question that character. The others, on the other hand, are a bit silly though.
Re: (Score:2)
Agreed. Myself, I think it would be better to just reserve the space for future use, giving us plenty of expansion room without having to increase the word size (utf8 to utf16 to utf32) - instead of just filling the section up with nonsense.
And where's Tengwar? (Score:2)
They've got symbols for a love hotel, a horse [fileformat.info], and a steaming pile of poo [fileformat.info], along with emoticons, and they still haven't accepted the Tengwar [evertype.com] draft that's been around since '93? Where are these people's priorities!?
Re: (Score:2)
I had no idea but was intrigued to find out myself, and stumbled upon this, which presumably explains it:
http://www.developerfusion.com/news/91207/unicode-6-out-with-2000-new-characters-but-what-support-does-it-have/ [developerfusion.com]
I knew the Japanese would be involved somewhere!
Re: (Score:2)
The "love hotel" symbol is part of the Emoji set. These are a semi-standardized set of emoticons that had widespread use in Japan. It was Google that proposed their inclusion in Unicode. http://sites.google.com/site/unicodesymbols/Home/emoji-symbols [google.com]
Why Slashdot won't adopt it (Score:5, Informative)
Re:Why Slashdot won't adopt it (Score:5, Insightful)
Raise your hand if you couldn't code a parser that detects those characters and takes appropriate action, such as popping bidi characters.
I'd love to be able to write IPA when discussing pronunciation, or actually write out words in other languages, ohm character for discussing electronics, pound and yen signs for currency ... Hey, even a bigger whitelist than what we have now would be great!
Checking for the release of a new version (Score:2)
Raise your hand if you couldn't code a parser that detects those characters and takes appropriate action, such as popping bidi characters.
🙋 If I were writing such a parser, I don't know how I'd get it to automatically check for the release of a new version of the standard and determine which code points are new bidi characters to be popped.
I'd love to be able to write IPA when discussing pronunciation
It'd be nice but not necessary: X-SAMPA.
or actually write out words in other languages
I guess the rationale is that most moderators would not be able to read foreign words without transliteration into Latin characters.
pound and yen signs for currency
£ is Alt+0163 on a Windows machine, and ¥ is Alt+0165. They're probably Ctrl+Shift+U A 3 Enter and Ctrl+Shift+U A
Re:Checking for the release of a new version (Score:5, Funny)
£ is Shift+3, what are you on about?
Re: (Score:2)
Only on UK keyboard layouts.
Re: (Score:2)
I guess the rationale is that most moderators would not be able to read foreign words without transliteration into Latin characters.
So at least give us Latin-1. There are English words which use accents in high registers.
Re: (Score:2)
Re: (Score:2)
Raise your hand if you couldn't code a parser that detects those characters and takes appropriate action, such as popping bidi characters.
Um, so do it and submit a patch against Slashcode?
Re: (Score:2)
Re: (Score:2)
Just admit that it's because it's old and random, there's a few HTML entities working but there's no reason why æ = æ should work and μ = shouldn't - like in micrograms, or uTorrent. It's a geeky site, but it's made for writing English prose with some half-hearted Latin1 support, no math or science.
Re: (Score:2)
Here's the reason: æ = 0xE6 (or 0xC6 for capital) in extended ASCII, whereas Mu is not present in extended ASCII. It appears Slashdot dumps anything outside of that range.
Let's try an experiment:
0xAB and 0xBB:
0xA7 and 0xB6:
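The byte values in this comment can be checked offline. The sketch below assumes "extended ASCII" means ISO-8859-1 (Latin-1), which is a guess about what the poster meant. Note that Greek small mu (U+03BC) is indeed outside Latin-1, but the separate MICRO SIGN (U+00B5) is inside it.

```python
def in_latin1(ch):
    """True if the character has a single-byte ISO-8859-1 encoding."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# ae ligature (both cases), micro sign, Greek mu, guillemets, section sign, pilcrow
for ch in ("\u00E6", "\u00C6", "\u00B5", "\u03BC", "\u00AB", "\u00BB", "\u00A7", "\u00B6"):
    print(f"U+{ord(ch):04X}", ch, in_latin1(ch))
```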
Re: (Score:2)
False! Only a subset is allowed, but anything outside of it most definitely seems to fail.
Re: (Score:2)
There are technical solutions to these problems, such as tracking language/BIDI overrides when embedding strings provided by users (and reversing the effect afterward). You could also do it the "easy" way and just filter out characters based on their Unicode property (e.g. disallow all 'other' characters, which would include these formatting characters).
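The "easy" way described here might look like the following sketch, which uses Python's `unicodedata` module to drop every character whose general category is in the "Other" group (Cc, Cf, Cs, Co, Cn). The function name and the choice to keep newline and tab are assumptions for the example.

```python
import unicodedata

def strip_other(text, keep="\n\t"):
    """Remove characters whose general category starts with 'C', except those in keep."""
    return "".join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith("C")
    )

# U+202E RIGHT-TO-LEFT OVERRIDE is category Cf, so it is removed.
print(repr(strip_other("abc\u202Edef")))
```

One trade-off worth knowing: category Cf also contains the zero-width joiner and non-joiner, which some scripts need for correct rendering, so a blanket filter like this is blunt.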
Re: (Score:3)
For one thing, there was once a fad of posting pornographic ASCII art on Slashdot, so it appears Slashdot disallows any character that would be more useful for glyph art than for English text.
If ASCII can be used for trolling just the same, then there is little point in not implementing Unicode. The point of moderation is to prevent these issues.
For another, there was once a fad of using bidirectionality override control characters for turning text backwards, which would break the layout and allow spoofing a comment's moderation score.
That's because of a buggy/insecure implementation. It doesn't mean it can't be done right.
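One possible shape for "doing it right" is to balance the explicit bidi controls in each comment so that an override cannot leak into the surrounding page. This is only a sketch: the function name is invented, and it covers just the embedding and override characters (LRE, RLE, LRO, RLO, terminated by PDF). Later Unicode versions added further directional controls, so the set would need maintaining.

```python
# Explicit bidi embedding/override openers and their terminator.
OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
PDF = "\u202C"                                      # POP DIRECTIONAL FORMATTING

def close_bidi(text):
    """Drop stray PDFs and append one PDF for every opener left unclosed."""
    depth = 0
    out = []
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF:
            if depth == 0:
                continue        # unmatched pop: discard it
            depth -= 1
        out.append(ch)
    out.append(PDF * depth)     # pop whatever is still open
    return "".join(out)
```

Text without any bidi controls passes through unchanged.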
The next version of the standard (Score:2)
Trolls gonna troll; that's what moderation is for.
At one point, ASCII art spammers were filling pages with sexually explicit ASCII art, such as Goatse, male masturbation, and birds perched on a penis, so fast that moderators could not keep up.
So filter those character ranges.
Blacklisting doesn't work because the next version of the standard, such as Unicode 6.1, may introduce more undesirable character ranges.
Re:The next version of the standard (Score:5, Funny)
...filling pages with sexually explicit ASCII art, such as Goatse, male masturbation, and birds perched on a penis...
Yeah, the way they are going they might actually *have* these characters in the set now...
Re: (Score:2)
Blacklisting doesn't work because the next version of the standard, such as Unicode 6.1, may introduce more undesirable character ranges.
That would lead to the Slashdot "editors" having to maintain their code, and we can't have that.
Re: (Score:2)
Blacklisting doesn't work because the next version of the standard, such as Unicode 6.1, may introduce more undesirable character ranges.
It's not difficult to update a simple file/DB entry/whatever to add more characters to the blacklist. Include a little util to parse the UnicodeData file and automatically blacklist all control characters. But even if you wanted to go with a whitelist instead of a blacklist, there's no reason for the whitelist to be as small as it currently is. And then there's what I assume is a Slashcode bug where non-ASCII characters that are in the whitelist don't come through properly. I've seen numerous posts where a
Re: (Score:2)
They can do that with or without unicode, so how does blocking unicode help?
How often do new versions come out? We aren't talking about Firefox here.
Hundreds of iframes (Score:2)
Unicode has different *pages*. You can filter by page.
New versions of Unicode introduce new pages. If you're blocking a page for some reason, the next version of Unicode might introduce another page that extends the functionality of the old page, reintroducing the behavior that led you to block the old page.
What's stopping us from just creating a Greasemonkey script that translates back and forth from HTML with square brackets and allows the full HTML set
Slashdot's lameness filter would probably confuse those square brackets with ASCII art, and even if not, the comment would likely draw negative moderations from moderators who haven't installed the Greasemonkey script.
by putting every message in its own e.g. IFRAME
There was a time when hundreds of <i
Re: (Score:2)
As for "people could spam ASCII art": People could also flood Slashdot with bizarre textual porn copypasta. The key part of "posting ASCII art faster than the mods can cope" is "faster than the mods can cope", not "ASCII art".
It is fairly weir
Re: (Score:2)
New versions of Unicode introduce new pages. If you're blocking a page for some reason, the next version of Unicode might introduce another page that extends the functionality of the old page, reintroducing the behavior that led you to block the old page.
So use a whitelist instead of a blacklist for pages.
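A whitelist along these lines could be as simple as a table of inclusive code point ranges. The ranges below are hypothetical picks to show the shape of the idea, not a recommendation, and the function names are invented.

```python
# Hypothetical allow-list of inclusive code point ranges.
ALLOWED = [
    (0x0020, 0x007E),  # Basic Latin, printable part
    (0x00A0, 0x00FF),  # Latin-1 Supplement (accented letters, currency signs, ...)
    (0x0250, 0x02AF),  # IPA Extensions
    (0x0370, 0x03FF),  # Greek and Coptic
]

def allowed(ch):
    """True if the character's code point falls inside any allowed range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED)

def filter_text(text, replacement="?"):
    """Replace every character outside the allow-list with a placeholder."""
    return "".join(ch if allowed(ch) else replacement for ch in text)

print(filter_text("5 \u03A9, \u00A3 3, [\u026A] \u202E"))
```

Code points added by a future version of the standard are rejected by default, since nothing outside the table gets through.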
Re: (Score:2)
Until April 2014, when IE 6 passes out of extended support, one can't assume that all supported browsers support CSS max-width.
Who the fuck cares whether Slashdot renders on IE6?!
Although to be fair, it does seem like that is the only browser that Slashdot does care about. All the others probably spend more time supporting Slashdot than Slashdot spends supporting them.
Re: (Score:2)
Looks like extended-ASCII, not necessarily UTF/UCS. For example, 0xE9: é
emoticons? (Score:4, Insightful)
Seriously, emoticons? Who ever thought it a good idea to include those in a standard? Should we have an encoding for hearts as dots over lower case i as well? And little horseys, too? And y with a big tail that wraps around to the front of the word?
Re:emoticons? (Score:4, Informative)
And little horseys, too?
U+1F40E ... no, seriously...
Re: (Score:2)
The U+1F4AF character is a bit harder to explain than little horses, because it relies on a 4-octet character to express something which can easily be expressed with three 1-octet characters.
Smile emoticon at CP437 code 0x01 (Score:2)
Seriously, emoticons? Who ever thought it a good idea to include those in a standard?
Unicode had to be able to round-trip (losslessly encode and decode) all old popular encodings. This includes encoding now called "code page 437", introduced with the first IBM PC, which includes a smile emoticon at code value 0x01. It also includes the encodings associated with the widely distributed system fonts Zapf Dingbats and Wingdings.
Re: (Score:2)
Re:emoticons? (Score:4, Funny)
The next thing will be teenagers building bigger emoticons out of emoticon characters. Then they will have to be included in the standard as well, and so on...
Tetris, Chess, Baseball, and gang symbols (Score:5, Informative)
all the Tetris pieces
The polyominoes up to five squares can be composed from U+2580 (upper half block), U+2584 (lower half block), and U+2588 (full block) characters. Unicode tends not to introduce precomposed ligatures except when needed for round-tripping with pre-Unicode encodings.
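A quick sketch of that composition: pack two rows of a 0/1 grid into each line of text, choosing the upper half, lower half, or full block per column. The function name is made up, and how square the cells look depends on the font's cell proportions.

```python
UPPER, LOWER, FULL, EMPTY = "\u2580", "\u2584", "\u2588", " "

def render(grid):
    """Render rows of 0/1 cells, packing two grid rows into each text line."""
    if len(grid) % 2:
        grid = grid + [[0] * len(grid[0])]   # pad to an even number of rows
    lines = []
    for top, bottom in zip(grid[0::2], grid[1::2]):
        lines.append("".join(
            FULL if t and b else UPPER if t else LOWER if b else EMPTY
            for t, b in zip(top, bottom)
        ))
    return "\n".join(lines)

# The S tetromino on a 2x3 grid fits in a single line of three characters.
print(render([[0, 1, 1],
              [1, 1, 0]]))
```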
glyphs of game pieces of all well known games
A lot of well-known pre-1923 tabletop games' game pieces already exist in Unicode. Chess is U+2654 through U+265F, and Checkers is U+26C0 through U+26C3. A lot of game pieces are simple enough in form that the Geometric Shapes (U+25A0 through U+25FF) represent them just fine. For example, Othello is U+25CB and U+25CF, as is Connect Four. Even the enemy in Fast Eddie for Atari 2600 is in Miscellaneous Technical (U+237E) as is home plate in Baseball (U+2302).
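The ranges quoted here are easy to inspect with Python's `unicodedata` module. The helper below is a throwaway for this sketch, and the names it prints come from whatever Unicode version the Python build ships with.

```python
import unicodedata

def names(first, last):
    """Map each code point in an inclusive range to its Unicode character name."""
    return {
        f"U+{cp:04X}": unicodedata.name(chr(cp), "<unassigned>")
        for cp in range(first, last + 1)
    }

# Chess pieces, then the draughts (checkers) pieces.
for cp, name in names(0x2654, 0x265F).items():
    print(cp, name)
for cp, name in names(0x26C0, 0x26C3).items():
    print(cp, name)
```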
heck, instead of just the suit symbols why not 52 glyphs for a standard deck of cards
Those can already be composed from a Basic Latin letter or number and a suit symbol. Unicode tends not to introduce precomposed ligatures except when needed for round-tripping with pre-Unicode encodings.
throw the Major Arcana tarot cards in there too
I don't know about Tarot, but all twelve signs of the zodiac are in Miscellaneous Symbols, even the "69" looking sign of Cancer (U+264B).
gang symbols
The symbol of "Folk Nation" gangs is similar to that of Judaism: a Star of David (U+2721). The symbol of "People Nation" gangs is similar to that of Islam: a 5-point star and crescent (U+262A).
Re: (Score:2)
It needed to be flexible, so it's a VM now. (Score:2, Offtopic)
"It needed to be flexible, so it's a VM now."
I fear this is the next step. The right to left and line wrapping BS is complicated enough that I'd welcome a specialized VM with loadable bytecode & glyph data. Yes, from a security standpoint this could create a wider attack surface. However, I'd argue it would be less attack surface considering that the VM for my unlimited precision scientific & programming calculator is smaller than my UTF-8 text display implementation.
I'd also argue that it woul
I don't know... (Score:2)
I'm sure we could have found some way to get along without "Mathematical Rising Diagonal" and "Kissing Face".
Re:Stick to ASCII (Score:5, Funny)
Yeah but can you write a pile of poo in ASCII?
http://www.fileformat.info/info/unicode/char/1f4a9/index.htm [fileformat.info]
Re:Stick to ASCII (Score:4, Funny)
This is Slashdot, I'm sure you can find any number of examples of people who've written a pile of poo in ASCII.
Re: (Score:3)
Yeah but can you write a pile of poo in ASCII?
As far as I know, Windows was originally written in ASCII... :)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3)
I'm pretty sure in HTML5, like in HTML4, the document is considered to be made up of Unicode characters and other charsets are considered as encodings of Unicode. Of course the HTML5 spec doesn't include all Unicode characters explicitly; that would be insane.
Re: (Score:2)
Re: (Score:2)
The character entities in HTML are only to try to get around legacy encodings. And since you can specify numerical Unicode entities, all of the Unicode set is accessible, there is no need for explicit names for everything.
If you aren't constrained to legacy encodings, then the obvious approach is just to set the encoding to something sensible, for example UTF8. There are several ways to do this in HTML. http://www.w3.org/TR/html5-diff/#character-encoding [w3.org]
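To illustrate the numeric-reference route: any code point can be written as `&#xHHHH;` in otherwise pure-ASCII markup. The sketch below uses a made-up helper to produce such references and Python's standard `html.unescape` to read them back.

```python
import html

def to_ncr(text):
    """Replace every non-ASCII character with a hexadecimal numeric character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)

encoded = to_ncr("caf\u00E9 \U0001F4A9")
print(encoded)   # prints "caf&#xE9; &#x1F4A9;"

# Decoding the references recovers the original string.
assert html.unescape(encoded) == "caf\u00E9 \U0001F4A9"
```

Python's built-in `str.encode("ascii", "xmlcharrefreplace")` does much the same thing with decimal references.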
Re: (Score:2)
Specifying the "document character set" as unicode means that even if the charset you are writing your document in doesn't support the character you want you can still enter it as a numeric (or named if one is defined) entity, whether it will be displayed is mostly a matter of whether appropriate fonts are installed but generally i'd expect someone who writes Chinese to have Chinese fonts installed.
Generally it's the GUI system's job to handle input and output of text not an individual application. Is it re
I blame Star Trek & LotR. (Score:2)
Well said, that man. If you feel the desire to "write" with stick figures and squiggles use a bastarding graphic, for fuck's sake.
Eklinóringëon my arse.
Re:Stick to ASCII (Score:4, Informative)
ASCII is just 128 characters.
Re: (Score:2)
Re: (Score:3)
They're only "easy" if you have your system configured for ISO-8859-1. Those of us who use UTF-8 get this result: à é.
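The mismatch described here can be reproduced in a few lines: encode the text as UTF-8, then decode the same bytes as ISO-8859-1, which is roughly what happens when client and server disagree about the charset. This is a sketch of the general effect, not a claim about what Slashdot's code actually does.

```python
original = "\u00E0 \u00E9"                   # a-grave, space, e-acute
utf8_bytes = original.encode("utf-8")        # what a UTF-8 client sends
misread = utf8_bytes.decode("iso-8859-1")    # what a Latin-1 reader sees
print(repr(misread))

# The damage is reversible as long as no bytes were dropped along the way.
repaired = misread.encode("iso-8859-1").decode("utf-8")
assert repaired == original
```

Each accented letter becomes two characters because its UTF-8 form is two bytes, and Latin-1 treats every byte as its own character.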
Re: (Score:2)
Hey, so it is the /.'s web server that doesn't do encoding right? I always thought it was the CGI code.
WTF are they using to serve those pages?
Re: (Score:2)
The correct sequence for business, politics and everything is as of now:
#1F648 #1F649 #1F64A
Gotta love the effort that went into providing the proper symbols.
Re: (Score:3, Insightful)
Re:Zomg (Score:5, Funny)
Re: (Score:3)
ASCII leaves off a lot of English punctuation, and accents that are, in fact, used in English (sure, in words of foreign origin, but they are still used.)
Re: (Score:3)
ASCII leaves off a lot of English punctuation, and accents that are, in fact, used in English (sure, in words of foreign origin, but they are still used.)
Some that aren't foreign as well. "Coöperate" is an archaic spelling. Basically, any prefix that ends in "o" that is attached to a word that starts with an "o" can archaically be spelled with a diaeresis, in the French/Dutch method of "this vowel should be pronounced separately, and not as part of a diphthong".
Re: (Score:2)
Re: (Score:3)
English also has the second-worst spelling system on the planet (only outdone by Japanese).
??? WTF are _YOU_ on about? English does not have the worst spelling system on the planet, and Japanese certainly doesn't qualify as the worst. "But they have three different scripts: two syllabaries, and an ideographic set" but...
Look, perhaps I better just demonstrate to you what a real bad spelling system looks like; go look at Irish [wikipedia.org].
Re: (Score:2)
??? WTF are _YOU_ on about?
Can you concisely explain why the English word "psyche" is pronounced the way it is to a non-native speaker of the language?
Re: (Score:2)
It's a loan-word from Greek. It follows the basic English rules for borrowing Greek words.
Re: (Score:2)
The rules for regular English words are no better, to be honest. It's like someone was trying to come up with the most perverted way to make a letter represent something as different as possible from what it does in most European languages (and Latin, where it originates). The only language that's possibly worse in that regard is French, but at least they are consistent in the way they mutilate their phonemes (and most of it is just dropping them altogether), whereas in English you have to guess which of th
Re: (Score:2)
??? WTF are _YOU_ on about?
Can you concisely explain why the English word "psyche" is pronounced the way it is to a non-native speaker of the language?
The word being originally from Greek and pronounced /psyxe/ was transliterated and taken into English. English phonology does not allow for a word to start with /ps/, and so the rules change that to a /s/. English phonology does not allow for a /x/, and so the rules change that to a "k". English phonology does not allow for a word to end with /e/, and so the rules change that to either a /ej/ or an /i/, but more commonly /i/ (e.g. Japanese "sake" is typically pronounced /saki/). All that is left is the
Re: (Score:2)
Can you explain why any word in french is pronounced the way it is? It seems like they have different rules for what letters to pronounce for every word.
You know, "better than French" is not a great achievement. Indeed, one of the reasons why English is in such a sorry shape is because it absorbed an unhealthy dose of French poison as part of its history.
Anyway, the rule of thumb in French seems to be, if you don't know how to pronounce any given letter, just skip it altogether - >50% chance of you getting it right in that case. ~
Anyway the reason you pronounce psyche like that is because it sounds better than psitsh.
Technically, it should be /psixe/, which sounds reasonable to me.
Re:Obligatory XKCD (Score:5, Insightful)
You know that this is the exact situation that Unicode AVOIDED, don't you?
Now we have one standard with 3 different representations. Those replaced literally thousands of standards. Yep, sometimes doing that new standard works.