Google's Big Sleep LLM Agent Discovers Exploitable Bug In SQLite (scworld.com)
spatwei writes: Google has used a large language model (LLM) agent called "Big Sleep" to discover a previously unknown, exploitable memory flaw in widely used real-world software for the first time, the company announced Friday.
The stack buffer underflow vulnerability in a development version of the popular open-source database engine SQLite was found through variant analysis by Big Sleep, which is a collaboration between Google Project Zero and Google DeepMind.
Big Sleep is an evolution of Project Zero's Naptime project, which is a framework announced in June that enables LLMs to autonomously perform basic vulnerability research. The framework provides LLMs with tools to test software for potential flaws in a human-like workflow, including a code browser, debugger, reporter tool and sandbox environment for running Python scripts and recording outputs.
The researchers gave the Gemini 1.5 Pro-driven AI agent the starting point of a previous SQLite vulnerability as context, so that Big Sleep could search for similar vulnerabilities in newer versions of the software. The agent was presented with recent commit messages and diff changes and asked to review the SQLite repository for unresolved issues.
Google's Big Sleep ultimately identified a flaw involving the function "seriesBestIndex" mishandling the use of the special sentinel value -1 in the iColumn field. Since this field would typically be non-negative, all code that interacts with this field must be designed to handle this unique case properly, which seriesBestIndex fails to do, leading to a stack buffer underflow.
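For readers who want a concrete picture of the bug class, here is a rough sketch (hypothetical code, not SQLite's actual seriesBestIndex): a field that normally holds a non-negative column index also carries -1 as a sentinel, and a consumer uses it to index a stack array without checking for that case.

/* Illustrative sketch only -- not SQLite's actual code. It shows the general
 * pattern behind this class of bug: a field that usually holds a column
 * index also uses -1 as a "magic" sentinel (e.g. meaning "the rowid"), and
 * a consumer indexes a stack array with it without handling that case. */
#include <stdio.h>

#define N_COLS 4

struct constraint {
    int iColumn;          /* usually 0..N_COLS-1, but -1 is a special sentinel */
};

static void best_index(const struct constraint *c) {
    int usage[N_COLS] = {0};   /* stack buffer indexed by column number */

    /* BUG: no check for the -1 sentinel, so usage[-1] writes one slot
     * before the start of the array -- a stack buffer underflow. */
    usage[c->iColumn] = 1;

    /* A fixed version would handle the sentinel explicitly first:
     *   if (c->iColumn < 0) { ... handle the rowid case ...; return; }   */
    printf("marked column %d\n", c->iColumn);
}

int main(void) {
    struct constraint c = { -1 };   /* the sentinel case the code forgets */
    best_index(&c);
    return 0;
}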
Nice find, but... (Score:2)
It was looking at recent commits.
Did it find something other current tools would not have found?
Re: (Score:2)
Re: (Score:2)
Oh damn it, I didn't think to go there. Good catch.
Re: (Score:1)
Re: (Score:2)
"We can't afford to hire humans to look for bugs in legacy code, instead we'll use our billion dollar investment to do this!
If I could spend every day hunting for bugs then I would find quite a few of them. In fact, I even know of several bugs that should be fixed. Except that this is not revenue for the company, as a bug with an easy workaround doesn't need to be fixed. Other bugs I know has some bad code, but to fix it would require a LOT of time and money: if a line of code is changed then someone need
Re: (Score:2)
More importantly, did it find bugs that humans could not find? If not, then why didn't humans find it? All these stories about AI doing stuff humans can't are ridiculous; it's like Google is merely doing marketing to sell its AI brand.
Re: (Score:2)
A human probably could have found it if they were looking for it, but how long would it take? If this LLM found it in an hour or two while a human would have taken a week or two, then it's money well spent.
Re: (Score:2)
Where are the details of this? Did a human even TRY to look? How much did that AI cost compared to a human? How many false positives did the AI have?
AI is not free. There are probably billions of dollars being spent here; now compare that to the salary of a generic junior developer. The snag is, in my experience, that developers are not paid to find bugs, but they are paid to create new features. Bug fixing happens when a test or customer finds one. It's sad, but it's how things work. Absolutely d
Re: (Score:2)
Risky to disclose (Score:5, Informative)
This seems risky to disclose considering the nature of sqlite being embedded and how many things that use SQL don't use a shared library or get updated often, if ever.
Re: (Score:3)
You're saying the disclosure is the risky thing, and not using the software in that way?
Re: (Score:1)
You're saying the disclosure is the risky thing, and not using the software in that way?
Fortunately, we found this issue before it appeared in an official release, so SQLite users were not impacted [blogspot.com]
Re: (Score:3)
My tinfoil hat thinks the bug was committed just to sell the idea of AI 'finding' it.
Re: Risky to disclose (Score:1)
It most certainly looks like a deliberately introduced vulnerability to test the AI.
Re:Risky to disclose (Score:5, Informative)
This seems risky to disclose considering the nature of sqlite being embedded and how many things that use SQL don't use a shared library or get updated often, if ever.
Embedded sqlite libraries (and lots of other stuff) not being updated often is the problem there, not the disclosure. It's a broad and deep problem, and a serious one, but holding back disclosures is not how you fix or mitigate it. Holding back disclosure just ensures that more devices/systems are vulnerable for more time.
In cases where it's feasible to notify developers who use a vulnerable library before public disclosure that's the right way to do it, but for widely-used open source libraries like sqlite, there's no way. Any notification to developers using sqlite is a public notification. The best you can do with sqlite is to let the sqlite team notify all paid support contract holders, and it seems likely that was done since the sqlite team was notified a month ago and the public announcement was last week.
Re: (Score:2)
"All historical vulnerabilities reported against SQLite require at least one of these preconditions:
The attacker can submit and run arbitrary SQL statements.
The attacker can submit a maliciously crafted database file to the application that the application will then open and query."
I can't think of anyone using SQLite in a way that would actually present a risk.
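For what it's worth, the second precondition in code terms is roughly an application that directly opens and queries a database file it received from an untrusted source, something like the sketch below (public sqlite3 C API; the file path is hypothetical and error handling is trimmed).

/* Minimal sketch of the "maliciously crafted database file" precondition:
 * opening and querying a file that came from an untrusted source. Uses the
 * public sqlite3 C API; the path is hypothetical, error handling trimmed. */
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    sqlite3 *db = NULL;

    /* If "upload.db" was supplied by an attacker, any parsing or query bug
     * in the library becomes reachable from that file's contents. */
    if (sqlite3_open("upload.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return 1;
    }

    char *err = NULL;
    sqlite3_exec(db, "SELECT count(*) FROM sqlite_master;", NULL, NULL, &err);
    if (err) {
        fprintf(stderr, "query failed: %s\n", err);
        sqlite3_free(err);
    }

    sqlite3_close(db);
    return 0;
}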
Re: (Score:2)
I'm in a quandary right now: a customer scanned the code (they asked for it), found that some older libraries existed, found that those libraries had CVEs against them, and now they demand fixes. Except that we don't know if the CVEs apply (giant libraries that we used 3 files from, for instance), or whether, if we fix the CVEs in the old library, we still need a massive overhaul to use new versions. Probably it's a mix of everything: prove that the CVE can't happen, fix a CVE (with weeks of testing), and migr
Re: Risky to disclose (Score:2)
Yeah, I've always felt like CVEs were mostly security theater. As a system administrator, I am going to concentrate on layered security and keeping everything patched. There's almost never any reasonable action I can take on a CVE. That shit is upstream. But then the bosses want to take up my time proving why each CVE doesn't apply. And even when it does, they're not willing to let me shut down the server while upstream patches it, so WTF am I gonna do with that? Maybe once in a blue moon there's a config to
Re: (Score:2)
The snag is that the CVE doesn't really affect us. We have security controls preventing third parties from providing "carefully crafted inputs". But that's not good enough: someone sees a CVE and wants a later library version. I could also fix the CVEs myself; I see the bug, and it's relatively simple to fix. But that's not good enough either, for the same reasons. So it's a big pain in the ass, because fixing the CVE is easy but upgrading the library will take a very long time.
Re: (Score:2)
I can't think of anyone using SQLite in a way that would actually present a risk.
We can hope.
That said, I agree that sqlite is an impressively high-quality piece of software, and its APIs don't encourage app developers to write code in ways that enable arbitrary SQL injection, so... maybe.
I also want to plug paid sqlite support (honestly the main reason I decided to reply). I don't know what it costs, but I've had to use it twice and Dr. Hipp and his team are fantastic. Very responsive and extremely capable. No "Did you try this list of obvious things" after your carefully-written
Re: (Score:2)
"in a development version"
Caught before release according to TFS.
Re: (Score:2)
sentinel (Score:2)
Seems that sentinel values are a bad idea in the first place.
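Roughly the difference being argued for (hypothetical types, not SQLite's actual definitions): make the special case an explicit tag rather than smuggling it through -1.

/* Rough sketch of the alternative to a magic -1 (hypothetical types, not
 * SQLite's): the "no real column / rowid" case becomes an explicit field
 * that the reader and the compiler can see, instead of a sentinel value. */
#include <stdbool.h>
#include <stdio.h>

/* Sentinel style: callers must remember that -1 means "rowid". */
struct col_sentinel { int iColumn; /* -1 == rowid */ };

/* Explicit style: the special case is a separate, named flag. */
struct col_tagged {
    bool is_rowid;
    int  iColumn;   /* only meaningful when is_rowid is false */
};

static void use_tagged(const struct col_tagged *c) {
    if (c->is_rowid) {
        printf("rowid constraint\n");      /* special case is hard to forget */
    } else {
        printf("column %d\n", c->iColumn); /* index is always a real column */
    }
}

int main(void) {
    struct col_tagged c = { .is_rowid = true, .iColumn = 0 };
    use_tagged(&c);
    return 0;
}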
Interesting possibility (Score:3)
Maybe in the near future, software engineers will rely on AI to write the test cases and run the tests... Because, let's face it, how many software engineers like writing test cases?
Re: (Score:3)
I have seen some AI generated test cases - trivial stuff, like "that method returns a string. Verify that the string is equal to " but (a) good for coverage numbers and (b) good for catching accidental modifications to user-visible text.
I look forward to when it can do more sophisticated checking of the code flow and build test cases that exercise different paths. I don't expect it to be perfect but it'd be a nice starting point.
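Something like this, for the trivial case (made-up function, just to show the shape of it):

/* Made-up example of the trivial kind of generated check described above:
 * it pins user-visible text, which is mostly good for coverage numbers and
 * for catching accidental string changes. */
#include <assert.h>
#include <string.h>

static const char *status_label(void) { return "Ready"; }

int main(void) {
    /* Trivial, but fails loudly if someone edits the label by accident. */
    assert(strcmp(status_label(), "Ready") == 0);
    return 0;
}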
Re: (Score:3)
Ah, unit tests. A way to look extremely busy all the time, and then show the 100% pass rate of your unit tests so that you get a nice bonus. Except that they almost always test the stuff that obviously isn't going to fail but not the complex stuff (boundary conditions, wrap-arounds, fault handling, etc). But boy are those guys happy that their "assert(1 + 1 == 2)" tests pass every time. And they'll use automated unit test generation, so now there's 10 times more code, and it's all obscure and unreadable
Re: (Score:2)
I find writing tests to be relaxing, as long as the code being tested is functional. Object-oriented software test cases are a fucking nightmare.
Big Sleep? (Score:3)
I wonder if the LLM is named after the movie [nytimes.com].
The Big Sleep is one of those pictures in which so many cryptic things occur amid so much involved and devious plotting that the mind becomes utterly confused. And, to make it more aggravating, the brilliant detective in the case is continuously making shrewd deductions which he stubbornly keeps to himself. What with two interlocking mysteries and a great many characters involved, the complex of blackmail and murder soon becomes a web of utter bafflement. Unfortunately, the cunning script-writers have done little to clear it at the end.
Re: (Score:2)
That 1946 reviewer doesn't seem to have liked the picture! It's one of my favorites, though.
Re: (Score:1)
A good use of AI (Score:2)
An area with lots of mostly automatable work where the result can be checked by humans and false positives are no big deal. Perfect use case for AI.
Grammar (Score:2)
train AI to discover flaws in source code instead (Score:1)
While it is good news that LLMs are being used to discover potential 0-days, it would be much better if AI could be trained to spot such flaws directly in the code instead of just getting better at running fuzzers against binaries.
Discovering exploitable flaws by analyzing the source code would not only be a real breakthrough, but also major progress toward a more secure code base.
Not AI (Score:2)
"However, the team emphasized that Big Sleep remains âoehighly experimentalâ and that they believe a target-specific fuzzer âoewould be at least as effectiveâ at detecting vulnerabilities as the AI agent in its current state."
It was also only a bug in recently-committed development code, never pushed to release, and there's nothing to say it wouldn't have been caught before then.
Sorry, but this is more AI hyperbole even as the authors literally say "Yeah, you could also find this with a
Re: (Score:2)