Google's Big Sleep LLM Agent Discovers Exploitable Bug In SQLite (scworld.com) 27
spatwei writes: Google has used a large language model (LLM) agent called "Big Sleep" to discover a previously unknown, exploitable memory flaw in widely used software for the first time, the company announced Friday.
The stack buffer underflow vulnerability in a development version of the popular open-source database engine SQLite was found through variant analysis by Big Sleep, which is a collaboration between Google Project Zero and Google DeepMind.
Big Sleep is an evolution of Project Zero's Naptime project, which is a framework announced in June that enables LLMs to autonomously perform basic vulnerability research. The framework provides LLMs with tools to test software for potential flaws in a human-like workflow, including a code browser, debugger, reporter tool and sandbox environment for running Python scripts and recording outputs.
The researchers provided the Gemini 1.5 Pro-driven AI agent with the starting point of a previous SQLite vulnerability, providing context for Big Sleep to search for potential similar vulnerabilities in newer versions of the software. The agent was presented with recent commit messages and diff changes and asked to review the SQLite repository for unresolved issues.
Google's Big Sleep ultimately identified a flaw involving the function "seriesBestIndex" mishandling the use of the special sentinel value -1 in the iColumn field. Since this field would typically be non-negative, all code that interacts with this field must be designed to handle this unique case properly, which seriesBestIndex fails to do, leading to a stack buffer underflow.
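To make the sentinel problem concrete, here is a minimal sketch in C. It is not SQLite's code: the struct, the constants, and the function name are hypothetical, and the column layout is an assumption chosen for illustration. It shows how subtracting a base offset from a field that can hold -1 yields a negative array index, and the range check that prevents the out-of-bounds write.

```c
#include <assert.h>

/* All names and values below are hypothetical; this is not SQLite's source. */
#define ROWID_SENTINEL   (-1) /* iColumn value meaning "the rowid", not a real column */
#define FIRST_ARG_COLUMN 1    /* assumed index of the first constrainable column */
#define N_ARG_COLUMNS    3    /* assumed number of constrainable columns */

struct constraint {
    int iColumn; /* column index, or ROWID_SENTINEL */
};

/*
 * For each constrainable column, record the index of the constraint that
 * refers to it. Unused slots are set to -1.
 * Returns the number of constraints that were recorded.
 */
int record_constraints(const struct constraint *cons, int n_cons,
                       int slots[N_ARG_COLUMNS])
{
    int recorded = 0;

    for (int i = 0; i < N_ARG_COLUMNS; i++) {
        slots[i] = -1;
    }

    for (int i = 0; i < n_cons; i++) {
        int col = cons[i].iColumn;

        /* The guard. Without it, the sentinel would give
         * col - FIRST_ARG_COLUMN == -2, and slots[-2] would be a write
         * before the start of the array. */
        if (col < FIRST_ARG_COLUMN || col >= FIRST_ARG_COLUMN + N_ARG_COLUMNS) {
            continue;
        }

        int slot = col - FIRST_ARG_COLUMN;
        assert(slot >= 0 && slot < N_ARG_COLUMNS); /* invariant after the guard */
        slots[slot] = i;
        recorded++;
    }
    return recorded;
}
```

Calling record_constraints with constraints on columns -1, 2 and 1 records only the last two and skips the sentinel. The range check is deliberately an ordinary if rather than only an assert, since assert is compiled out when NDEBUG is defined.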
Nice find, but... (Score:2)
It was looking at recent commits.
Did it find something other current tools would not have found?
Re: (Score:3)
Re: (Score:2)
Oh damn it, I didn't think to go there. Good catch.
Re: (Score:1)
Re: (Score:2)
"We can't afford to hire humans to look for bugs in legacy code, instead we'll use our billion dollar investment to do this!
If I could spend every day hunting for bugs then I would find quite a few of them. In fact, I even know of several bugs that should be fixed. Except that this is not revenue for the company, as a bug with an easy workaround doesn't need to be fixed. Other bugs I know of involve bad code, but fixing it would require a LOT of time and money: if a line of code is changed then someone need
Re: (Score:2)
More important, did it find bugs that humans could not find? If not, then why didn't humans find it? All these stories about AI doing stuff humans can't are ridiculous, it's like Google is merely doing marketing to sell their AI brand.
Re: (Score:2)
A human probably could have found it if they were looking for it, but how long would it take? If this LLM found it in an hour or two while a human would have taken a week or two, then it's money well spent.
Re: (Score:2)
Where are the details of this? Did a human even TRY to look? How much did that AI cost compared to a human? How many false positives did the AI have?
AI is not free. There are probably billions of dollars being spent here, and now compare it to the salary of a generic junior developer. The snag is, in my experience, that developers are not paid to find bugs, but they are paid to create new features. Bug fixing happens when a test or customer finds one. It's sad, but it's how things work. Absolutely d
Risky to disclose (Score:5, Informative)
This seems risky to disclose considering the nature of sqlite being embedded and how many things that use SQL don't use a shared library or get updated often, if ever.
Re: (Score:3)
You're saying the disclosure is the risky thing, and not using the software in that way?
Re: (Score:1)
You're saying the disclosure is the risky thing, and not using the software in that way?
Fortunately, we found this issue before it appeared in an official release, so SQLite users were not impacted [blogspot.com]
Re: (Score:3)
My tinfoil hat thinks the bug was committed just to sell the idea of AI 'finding' it.
Re:Risky to disclose (Score:5, Informative)
This seems risky to disclose considering the nature of sqlite being embedded and how many things that use SQL don't use a shared library or get updated often, if ever.
Embedded sqlite libraries (and lots of other stuff) not being updated often is the problem there, not the disclosure. It's a broad and deep problem, and a serious one, but holding back disclosures is not how you fix or mitigate it. Holding back disclosure just ensures that more devices/systems are vulnerable for more time.
In cases where it's feasible to notify developers who use a vulnerable library before public disclosure that's the right way to do it, but for widely-used open source libraries like sqlite, there's no way. Any notification to developers using sqlite is a public notification. The best you can do with sqlite is to let the sqlite team notify all paid support contract holders, and it seems likely that was done since the sqlite team was notified a month ago and the public announcement was last week.
Re: (Score:2)
"All historical vulnerabilities reported against SQLite require at least one of these preconditions:
The attacker can submit and run arbitrary SQL statements.
The attacker can submit a maliciously crafted database file to the application that the application will then open and query."
I can't think of anyone using SQLite in a way that would actually present a risk.
Re: (Score:2)
I'm in a quandary right now where a customer has scanned the code (they asked for it), and found some older libraries existed, and those libraries had CVEs for them, and now they demand fixes. Except that we don't know if the CVEs apply (giant libraries that we used 3 files from for instance), or that if we fix the CVEs for the old library do we still need a massive overhaul to use new versions? Probably it's a mix of everything; prove that the CVE can't happen, fix a CVE (with weeks of testing), and migr
Re: (Score:2)
"in a development version"
Caught before release according to TFS.
sentinel (Score:2)
Seems that sentinel values are a bad idea in the first place.
Interesting possibility (Score:3)
Maybe in the near future, software engineers will rely on AI to write the test cases and run the tests... Because, let's face it, how many software engineers like writing test cases?
Re: (Score:3)
I have seen some AI generated test cases - trivial stuff, like "that method returns a string. Verify that the string is equal to " but (a) good for coverage numbers and (b) good for catching accidental modifications to user-visible text.
I look forward to when it can do more sophisticated checking of the code flow and build test cases that exercise different paths. I don't expect it to be perfect but it'd be a nice starting point.
Re: (Score:2)
Ah, unit tests. A way to look extremely busy all the time, and then show the 100% pass rate of your unit tests so that you get a nice bonus. Except that they almost always test the stuff that obviously isn't going to fail but not the complex stuff (boundary conditions, wrap arounds, fault handling, etc). But boy are those guys happy that their "assert(1 + 1 == 2)" tests pass every time. And they'll use automated unit test generation, so now there's 10 times more code, and it's all obscure and unreadabl
Re: (Score:2)
I find writing tests to be relaxing, as long as the code being tested is functional. Object-oriented software test cases are a fucking nightmare.
Big Sleep? (Score:3)
I wonder if the LLM is named after the movie [nytimes.com].
The Big Sleep is one of those pictures in which so many cryptic things occur amid so much involved and devious plotting that the mind becomes utterly confused. And, to make it more aggravating, the brilliant detective in the case is continuously making shrewd deductions which he stubbornly keeps to himself. What with two interlocking mysteries and a great many characters involved, the complex of blackmail and murder soon becomes a web of utter bafflement. Unfortunately, the cunning script-writers have done little to clear it at the end.
Re: (Score:2)
That 1946 reviewer doesn't seem to have liked the picture! It's one of my favorites, though.
Re: (Score:1)
A good use of AI (Score:2)
An area with lots of mostly-automatable work where the result can be checked by humans and false positives are no big deal. Perfect use case for AI.
Grammar (Score:2)