


Curl Warns GitHub About 'Malicious Unicode' Security Issue (daniel.haxx.se)
A Curl contributor replaced an ASCII letter with a Unicode alternative in a pull request, writes Curl lead developer/founder Daniel Stenberg. And not a single human reviewer on the team (or any of their CI jobs) noticed.
The change "looked identical to the ASCII version, so it was not possible to visually spot this..." The impact of changing one or more letters in a URL can of course be devastating depending on conditions... [W]e have implemented checks to help us poor humans spot things like this. To detect malicious Unicode. We have added a CI job that scans all files and validates every UTF-8 sequence in the git repository.
In the curl git repository most files and most content are plain old ASCII so we can "easily" whitelist a small set of UTF-8 sequences and some specific files, the rest of the files are simply not allowed to use UTF-8 at all as they will then fail the CI job and turn up red. In order to drive this change home, we went through all the test files in the curl repository and made sure that all the UTF-8 occurrences were instead replaced by other kind of escape sequences and similar. Some of them were also used more or less by mistake and could easily be replaced by their ASCII counterparts.
The next time someone tries this stunt on us it could be someone with less good intentions, but now ideally our CI will tell us... We want and strive to be proactive and tighten everything before malicious people exploit some weakness somewhere but security remains this never-ending race where we can only do the best we can and while the other side is working in silence and might at some future point attack us in new creative ways we had not anticipated. That future unknown attack is a tricky thing.
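A minimal sketch of what such a CI job could look like, in Python (the whitelist here is hypothetical, and this is not curl's actual script, which lives in the curl repository):

#!/usr/bin/env python3
# Sketch of a repo scan: every file must be valid UTF-8, and files not on
# the whitelist may not contain any non-ASCII bytes at all.
import subprocess
import sys

UTF8_ALLOWED = {"docs/THANKS"}  # hypothetical whitelist of UTF-8-permitted files

def main() -> int:
    failed = False
    files = subprocess.run(["git", "ls-files"], capture_output=True,
                           text=True, check=True).stdout.splitlines()
    for path in files:
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode("utf-8")  # reject malformed UTF-8 sequences everywhere
        except UnicodeDecodeError as exc:
            print(f"{path}: invalid UTF-8 at byte {exc.start}")
            failed = True
            continue
        if path not in UTF8_ALLOWED and any(b > 0x7F for b in data):
            print(f"{path}: non-ASCII content not allowed in this file")
            failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(main())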
In the original blog post Stenberg complained he got "barely no responses" from GitHub (joking "perhaps they are all just too busy implementing the next AI feature we don't want.") But hours later he posted an update.
"GitHub has told me they have raised this as a security issue internally and they are working on a fix."
The change "looked identical to the ASCII version, so it was not possible to visually spot this..." The impact of changing one or more letters in a URL can of course be devastating depending on conditions... [W]e have implemented checks to help us poor humans spot things like this. To detect malicious Unicode. We have added a CI job that scans all files and validates every UTF-8 sequence in the git repository.
In the curl git repository most files and most content are plain old ASCII so we can "easily" whitelist a small set of UTF-8 sequences and some specific files, the rest of the files are simply not allowed to use UTF-8 at all as they will then fail the CI job and turn up red. In order to drive this change home, we went through all the test files in the curl repository and made sure that all the UTF-8 occurrences were instead replaced by other kind of escape sequences and similar. Some of them were also used more or less by mistake and could easily be replaced by their ASCII counterparts.
The next time someone tries this stunt on us it could be someone with less good intentions, but now ideally our CI will tell us... We want and strive to be proactive and tighten everything before malicious people exploit some weakness somewhere but security remains this never-ending race where we can only do the best we can and while the other side is working in silence and might at some future point attack us in new creative ways we had not anticipated. That future unknown attack is a tricky thing.
In the original blog post Stenberg complained he got "barely no responses" from GitHub (joking "perhaps they are all just too busy implementing the next AI feature we don't want.") But hours later he posted an update.
"GitHub has told me they have raised this as a security issue internally and they are working on a fix."
Re:Yes, unicode is a security issue (Score:4)
Re: Yes, unicode is a security issue (Score:2)
Re: (Score:3)
If you can't be bothered to switch your iphone to use ascii for input on slashdot - I can't be bothered to care either. Seeing those (TM) warnings is like spotting a big red f
Re: (Score:2)
I agree that the only correct character to use as the apostrophe should be the actual, ASCII, 0x27, ', apostrophe. The problem people have with this is that it looks horrible typographically because fonts generally have a vertical symbol (so it can also work as an opening or closing single quote). The correct fix is to use fonts where the apostrophe looks like a closing single quote. Obviously that's a disaster for coding ... so code with a console font. There are plenty of common fonts that ar
Re: Yes, unicode is a security issue (Score:2)
I have an issue with people thinking that the ascii variant looks horrible.
Re: Yes, unicode is a security issue (Score:1)
Re: (Score:2)
There is a genuine issue with Unicode here though. The lack of any official metadata makes things like this common. Characters that render identically should really be variations of each other, not completely separate, but you need metadata to encode that.
Re: (Score:2)
Indeed.
This is just another instance of "want something" - "implement it badly" - "get a mess".
There were zero IT security experts involved with the design of Unicode. I had an opportunity to ask one of the designers about that some 25 years back and he said "I do not think it was considered." Bright-eyed amateurs at work. As in so many other IT areas.
Re: (Score:2)
How would you have designed it, then, broadly speaking?
(Obviously, one cannot put a full design specification in a comment on Slashdot, but a general gist could be quite interesting.)
I am genuinely curious, by the way. No trolling here.
I have no idea. Give me a year or two (full time) together with several groups deeply into GUI text rendering and text processing on OS and library level to get into the complexities of the question. One thing is for sure, I would have completely forbidden rendering of two similar looking or the same glyphs on a code-point. As in "you may not use the name, logo and we will sue you". Another thing is I would have defined a mandatory glyph-set that must be present and made it mandatory that all chars outside
Re: (Score:2)
That is "rendering two similar or the same glyph on _different_ code-points".
Who audits the entire package tree? (Score:2)
Is there anyone who can explain just how software developers or vendors will be able to audit the entire package tree for any NPM, PIP or whatever package they include in their solution?
XKCD Some guy in Nebraska - https://xkcd.com/2347/ [xkcd.com]
Re:Yes, unicode is a security issue (Score:5, Funny)
And I don't miss the stupid emojis either.
:-(
Re: (Score:2)
Re: (Score:2)
Pretty much every major open source package is compromised at this point. The fact that these infiltration attempts are caught only once every few years (xz, now this) just goes to show how many get through.
Or it goes to show that almost nobody is actively trying, and attempts happen only every few years. The absence of detected attacks is not inherently a defect in detection; it can also be a lack of attacks.
More highlighting (Score:2)
Re: (Score:2)
If it can refine the difference to highlight which word in a line was different, maybe it could use a different color (if moving to a 3-color process [britannica.com] isn't too much more expensive) for which characters in that line are different. Or have a checkbox to temporarily highlight non-ASCII UTF-8 characters.
Various diff utilities do highlight things at the character level.
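Python's difflib, for instance, can narrow a changed line down to the exact differing span, and flagging non-ASCII characters on top of that is one more loop (a sketch with made-up example strings):

import difflib

old = "Find the file at https://example.com/file.json"
new = "Find the file at https://ex\u0430mple.com/file.json"  # Cyrillic 'a'

# Character-level opcodes pinpoint exactly which span changed.
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
    if tag != "equal":
        print(tag, repr(old[i1:i2]), "->", repr(new[j1:j2]))

# And any non-ASCII character can be called out with its code point.
for col, ch in enumerate(new):
    if ord(ch) > 0x7F:
        print(f"column {col}: U+{ord(ch):04X} ({ch!r})")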
I imagine Github will tout a CoPilot solution (Score:2)
However, looking for this sort of shenanigans seems like something that could've (and maybe should've) been at least semi-automated a couple decades ago - search for characters outside the typical ASCII range and flag those parts for human review.
No reason we can't automate (Score:2)
However, looking for this sort of shenanigans seems like something that could've (and maybe should've) been at least semi-automated a couple decades ago - search for characters outside the typical ASCII range and flag those parts for human review.
An automated review is not that difficult. For each ASCII character there can be a list of visually similar characters. For example a Latin (ASCII) 'a' would have a Cyrillic 'a' on its list. ...
U+0061: U+0430,
Flagging everything would also include characters that do not look like any ASCII character, which would read as false positives. Or maybe those could be lower-priority warnings, with visually similar characters as higher-priority ones.
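A sketch of that two-tier idea (the table below is a hypothetical three-entry stand-in; a real one would be generated from the Unicode consortium's confusables.txt):

# Hypothetical mini-table: non-ASCII code points that look like ASCII letters.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
}

def scan(line):
    for col, ch in enumerate(line):
        if ord(ch) <= 0x7F:
            continue
        if ch in CONFUSABLES:
            yield col, ch, f"HIGH: looks like ASCII {CONFUSABLES[ch]!r}"
        else:
            yield col, ch, "LOW: non-ASCII, but not a known lookalike"

for col, ch, verdict in scan("https://\u0430pi.example.com/"):
    print(f"col {col}: U+{ord(ch):04X} -> {verdict}")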
7 Bit ASCII (Score:1)
You need that 8th bit Re:7 Bit ASCII (Score:1)
EBCDIC is the One True Standard [xkcd.com].
Re: 7 Bit ASCII (Score:2)
Or even better, 6-bit BCD code like in good old Fortran on an IBM 7094 mainframe.
Re: (Score:2)
IS THAT WHY FORTRAN PROGRAMMERS ARE ALWAYS SHOUTING
(note also the lack of a question mark).
Re: (Score:2)
Unicode is a bug (Score:1)
Vertical double quotes.
Closing double quotes. Opening double quotes.
Homoglyphs.
Arbitrary number of bytes per glyph.
If it ain't ascii it isn't worth expressing in bytes.
Re: (Score:2)
Unicode is fucking ridiculous and so are standards bodies, which seem to be composed not of experts but of industry insiders. Javascript is even worse and the web as a whole is getting progressively worse.
Re: (Score:2)
Now the problem here, different codepoint sequences that look the same and have to be avoided, is handled easily by requiring escape sequences for anything outside the range of printable ASCII.
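In Python that policy is only a few lines (a sketch, assuming "printable ASCII" means 0x20-0x7E):

def escape_non_ascii(s):
    # Keep printable ASCII as-is; render everything else as an escape,
    # so a lookalike code point becomes visibly different from its ASCII twin.
    out = []
    for ch in s:
        if 0x20 <= ord(ch) <= 0x7E:
            out.append(ch)
        elif ord(ch) <= 0xFFFF:
            out.append(f"\\u{ord(ch):04x}")
        else:
            out.append(f"\\U{ord(ch):08x}")
    return "".join(out)

print(escape_non_ascii("p\u0430yment"))  # Cyrillic 'a' -> p\u0430yment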
Re: (Score:2)
If it ain't ascii it isn't worth expressing in bytes.
If you exclusively speak American then you can say everything is US ASCII ... but many who, reasonably, want to express themselves in their own language will want other characters. And the "everything" is not entirely true even for Americans, e.g. 1/100 of a dollar is a cent, which is U+00A2 - which slashdot will not display correctly.
Re: (Score:2)
Also to be fair, and I don't want to be fair, Unicode and multilanguage websites, where the content owners hounded me forever to get the orthography right in 7 languages, were a source of significant and ongoing pain and irritation... but that is actually the whole point of it. Apparently, not everyone speaks ASCII.
Re: (Score:2)
Yeah, but to be fair, Unicode was invented to put those in.
More specifically, Unicode was invented to provide a standard encoding for all living languages. Anything currently used in books, magazines, newspapers, etc.
It was later expanded to include dead languages to help researchers.
Re: Unicode is a bug (Score:1)
And poo emojis to help retards.
Re: (Score:2)
And poo emojis to help retards.
Well I guess that would be phase 3. :-)
Re: (Score:2)
Fortunately the Unicode consortium decided to be apolitical and that it was their job to merge all the world's charsets, rather than editorialize and put their emotions front and centre. If they had taken your approach they would have failed.
Re: (Score:2)
Re: (Score:2)
More specifically, Unicode was invented to provide a standard encoding for all living languages. Anything currently used books, magazines, newspapers, etc.
Plus some dead ones.
Re: (Score:2)
More specifically, Unicode was invented to provide a standard encoding for all living languages.
I thought it was more meant to be "one charset to rule them all", and encompass everything that's put down in print and on screen, including all languages and a whole host of non language symbols, such as the poo emoji which has been around for a surprisingly long time.
Re: (Score:2)
More specifically, Unicode was invented to provide a standard encoding for all living languages.
I thought it was more meant to be "one charset to rule them all", and encompass everything that's put down in print and on screen, including all languages and a whole host of non language symbols, such as the poo emoji which has been around for a surprisingly long time.
I think we are saying the same thing, "character set" being an "encoding" of Unicode code points. Living languages was v1.0. Dead languages came in a later major update. Emoji in a more recent major update.
Re: (Score:2)
Arbitrary number of bytes per glyph.
Yes and no. That's mostly a result of encoding, UTF-8 vs UTF-32. Although there would still be some glyphs that are composed from multiple code points. To oversimplify, imagine two characters, 'A' and '`', creating an accented A glyph.
FWIW, UTF-8 is not difficult to decode, so doing comparisons or detecting malformed UTF-8 isn't too much work. As part of defensive programming I check for proper UTF-8 encoding on any inputs. It's a write-once, use-many-times sort of thing.
If it ain't ascii it isn't worth expressing in bytes.
Bytes, i.e. UTF-8 encoding of code p
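In Python, the defensive check described above is essentially one strict decode (a sketch; Python's strict decoder already rejects overlong forms and lone surrogates):

def utf8_or_reject(data):
    # Strictly decode input bytes; anything malformed raises immediately.
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"malformed UTF-8 at byte {exc.start}") from exc

print(utf8_or_reject("h\u00e9llo".encode("utf-8")))  # OK
# utf8_or_reject(b"\xc0\xaf")  # overlong encoding of '/': raises ValueError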
Re: (Score:2)
Re: (Score:2)
Unicode is arguably the wrong tool for the job. It was designed to represent all human writing, across every language, living or dead. Even within one language, defining the character set unambiguously is difficult. Across multiple languages it's practically impossible. So Unicode goes for an inclusive approach - if something is plausibly a character of a human language, there's at least one way to represent it in Unicode. Possibly multiple ways, which is preferred over no ways. And uniqueness and identity
Spoofing attacks are old (Score:2)
Package Managers (Score:2)
Many traditional distros still ship unusably old versions of some packages - due to some network dependency they literally don't work anymore.
Some are buggy in ways already fixed upstream (e.g. the nvme tool) and just don't work. "Wait a year and we'll ship a version that works".
This pushes people to use upstream packages which often times come with update scripts that run as root.
These would be an ideal place for a malicious "contributor" to put in an update URL he controls.
It would be better for the distros to remove t
AI would have caught this (Score:2)
That's indeed one of the use-cases that an AI can catch more easily than a human.
Patch (Simplified as I couldn't copy&paste from the screenshot):
--- test1.txt 2025-05-17 20:56:18.097357631 +0200
+++ test2.txt 2025-05-17 20:56:33.357317426 +0200
@@ -1 +1 @@
-Find the file at https://githubusercontent.com/...
+Find the file at https://githubusercontent.com/mozilla-firefox/file.json (the leading 'g' was a Unicode homoglyph)
Instruction: "Describe the changes done in this patch"
Input: (the patch)
AI:
In this patch, the following changes were made:
1. **Re
Re: (Score:2)
Also note that the LLM got the actual code point (first question) and the script (second question) wrong. In the AI's defense: it was only a small 12B model.
Not everything needs AI (Score:2)
That's indeed one of the use-cases that an AI can catch more easily than a human.
A very small amount of non-AI code could also catch it. Not everything needs AI.
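For example, a few lines of Python piped over a diff would have named the exact code point (a hypothetical script, same spirit as curl's new CI job):

import sys
import unicodedata

# Flag every non-ASCII character on stdin with its position, code point and name.
for lineno, line in enumerate(sys.stdin, 1):
    for col, ch in enumerate(line, 1):
        if ord(ch) > 0x7F:
            name = unicodedata.name(ch, "UNKNOWN")
            print(f"line {lineno}, col {col}: U+{ord(ch):04X} {name}")

Run as, say, git diff | python3 flag_non_ascii.py. No model, no GPU, no hallucinations.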
Re: (Score:2)
And not everything that's called "AI" actually is "AI" (except to pedants).
Re: (Score:2)
And not everything that's called "AI" actually is "AI" (except to pedants).
I would add that not everything that is AI is Machine Learning. Some of it is old-fashioned humans developing algorithms and stitching them together. AI is really about a family of problems, not necessarily a particular approach to a solution.
Re: Not everything needs AI (Score:1)
I would add that for marketing purposes, matrix inversion can be an AI algorithm.
There are times when I wonder if the fraction of people who believe computers are merely another form of magic is higher than I presumed.
Re: (Score:2)
Re: (Score:2)
You can fit (some) neural networks by computing the pseudoinverse. It will usually overfit your data, and the computation may be too large for your hardware, but mathematically it is possible.
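For the single-linear-layer case that claim is just least squares; a toy NumPy sketch:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features
W_true = rng.normal(size=(5, 1))
y = X @ W_true + 0.01 * rng.normal(size=(100, 1))

# "Training" a linear layer in one shot via the pseudoinverse, no gradients.
W = np.linalg.pinv(X) @ y
print(np.allclose(W, W_true, atol=0.05))  # recovers the weights closely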
Re: Not everything needs AI (Score:1)
Backpropagation has a pseudoinverse operation in every iteration. It's linearized curve fitting for compsci majors who didn't like calc 101 and reinvented it for themselves ;-)
Re: (Score:2)
AI is a stupid catch-all phrase. But we won't get rid of it at this point, so we may as well use it instead of three to five convoluted mathematical formulations to describe what's actually going on.
Re: (Score:2)
Yea, this. It'll also be more reliable, be harder to subvert, have no hallucinations and so on. And it could run on a potato, not need a 16-32GB GPU, and even then be like "oh, and it's only a small 12B model".
Re: (Score:2)
I did it on a mid-range consumer GPU in a few seconds. No need to buy a GPU cluster. Models are getting quite efficient; you can mimic what the ChatGPT 175B model did with a 4B model that runs on your smartphone.
how does this slip through their commit diff? (Score:2)
Re: (Score:1)
Unicode is a security risk (Score:2)
I have been saying this for about 25 years now. It may have its place in a UI (but is dangerous even there), but it has no place whatsoever in source code except for character constants. And now we have to do crap like running detectors for it. What a fail.