Fenrir.Niflheim said: »
Oh what? But look, they can find CVEs (this one is even a remote zero-day vulnerability). I mean, ignore the mountain of false positives, you can sort through those... like what else are you going to do with your time after you are replaced by the AI? It is only going to get better, and at a rapid pace.
This is a fair example, and one I was aware of. The issue I take is that this isn't how AI is being promoted by the industry.
The way most people expect AI to work is this: give me code, I put it in AI, tell AI to find bugs, AI gives me bugs. Now we, here, understand this is not the case, but your typical CISO does not. The goal is to promote the idea that you can remove the person and automate results using AI, something they've claimed for years prior to LLMs and... it wasn't the case. In this case, the bug was found 8% of the time, so between weeding out false positives and running it enough times for it to find the bug, is that really working the way it's being promoted, especially in the context of the Anthropic posts?
In the case above, you had someone who understood the code well enough to find a bug on their own (he kind of hints at this but IMO underplays the value here), who fed the LLM the specific code necessary, AND who understood how LLMs work well enough to know how to prompt it and provide what is needed. You'd also have to have enough familiarity with the code to weed out false positives; again, something that requires manual intervention and review. This code is not hard to get through, for sure, but still... there's a prerequisite that someone can interpret the result and make sense of it.
In the context of the other discussion, exploiting an issue like this is also extremely volatile. Most memory corruption bugs I've found over the course of my career are not practically exploitable, but they get CVEs anyway because they segfault and most people don't care whether they're actually exploitable. Feeding an LLM the code necessary to reliably exploit a UAF bug would require an understanding of the compiled code, allocator internals, thread state, and a number of other factors that it's just not capable of handling well enough to produce a working exploit. So yea, for bug hunting there is some optimization, but not enough to replace people or even (in my experience) provide meaningful output. You still need someone who understands these internals to really turn it into something useful, and it involves correlating so many different factors (some of which are non-deterministic, like allocator state) that LLMs can't begin to handle it.
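To make the volatility point concrete, here's a contrived C sketch (nothing to do with the actual ksmbd bug, just an illustration I'm making up) of why a UAF that segfaults isn't automatically a working exploit: what happens at the stale dereference depends entirely on what the allocator handed out in between, and that's exactly the non-deterministic state an LLM can't track for you.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Contrived illustration, not the ksmbd bug: a session object gets freed
     * by one path while another path still holds a stale pointer to it. */
    struct session {
        void (*log)(const char *msg);  /* function pointer the stale path will call */
        char user[32];
    };

    static void log_msg(const char *msg) { puts(msg); }

    int main(void) {
        struct session *s = malloc(sizeof *s);
        s->log = log_msg;
        strcpy(s->user, "guest");

        free(s);        /* teardown path releases the object... */

        /* ...but another path still uses the stale pointer. Whether this
         * segfaults, prints garbage, or jumps somewhere attacker-controlled
         * depends on what reused the freed chunk in the meantime: heap
         * layout, other threads, timing. Turning that into a *reliable*
         * exploit is the part that needs allocator/thread-state knowledge. */
        s->log(s->user);
        return 0;
    }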
I actually ran a similar test not long ago for a bug (also in the Linux kernel, as it happens) that related to the way state was handled for a certain kernel module. I prompted it and hand-held it to see if I could get it even close to finding the bug, even going so far as to ask "is this a bug?" while pointing out the bug, and it basically told me to read the code myself after giving me the wrong answer repeatedly. Granted, the code in my case is a little more complex, since it relates to data fed in from userspace that is interacted with very indirectly, whereas data pulled through a command handler off the network is a more linear path (which is what was done here). In another case, I was interacting with a service over a publicly available IPC API and it just invented header files, function calls, and data structures that didn't exist, and even pointing it at the code, it couldn't get past whatever it was hallucinating. I used Claude and Copilot, though, so maybe I need to try o3 and replicate his process here a little better than I have in the past (our current work isn't strictly related to this at the moment).
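To illustrate the linear vs. indirect distinction I'm making (hypothetical C sketches, not the actual code from either case):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Stub standing in for the buggy sink in both sketches. */
    static void process_entries(const uint8_t *entries, uint32_t count) {
        printf("processing %u entries at %p\n", count, (const void *)entries);
    }

    /* Linear path: a command handler touches the untrusted buffer right where
     * the bug would live -- one hop from input to sink. */
    static void handle_command(const uint8_t *pkt, size_t len) {
        if (len < 4)
            return;
        uint32_t count;
        memcpy(&count, pkt, sizeof count);   /* attacker-controlled count */
        process_entries(pkt + 4, count);
    }

    /* Indirect path: one call stashes userspace-derived state, and a later,
     * unrelated path consumes it. Spotting the bug now means correlating
     * state across calls (and, in a real driver, across threads and time). */
    static struct {
        const uint8_t *entries;
        uint32_t count;
    } dev_state;

    static void ioctl_store(const uint8_t *buf, uint32_t count) {
        dev_state.entries = buf;   /* saved now... */
        dev_state.count = count;
    }

    static void timer_tick(void) {
        process_entries(dev_state.entries, dev_state.count);   /* ...used much later */
    }

    int main(void) {
        const uint8_t pkt[8] = { 2, 0, 0, 0, 'a', 'b', 0, 0 };
        handle_command(pkt, sizeof pkt);   /* linear: input and sink together */
        ioctl_store(pkt + 4, 2);           /* indirect: store... */
        timer_tick();                      /* ...then consume somewhere else */
        return 0;
    }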
I think the overall point here is that yea, there is value in cases where these tools can optimize what someone who already knows what they are doing is capable of. The problem is that's not the pitch; the pitch is a lot simpler, it's the same pitch that's been out there for years prior to LLMs, and it doesn't match the reality when you are dealing with complex targets. These tools can make someone more efficient at times, and at other times they end up chasing ghosts. That's also before questioning whether a simpler testing method, like just fuzzing the SMB protocol in this case with the proper instrumentation, would've identified the same bug with less work and less core understanding of the code (initially, anyway).
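For what it's worth, "simply fuzzing" here means nothing exotic. Something like a minimal libFuzzer-style harness around whatever request-handler entry point you can lift out of the target (parse_smb2_request below is a made-up stand-in, not the real function), built with ASan/KASAN so the sanitizer flags the corruption for you, is the kind of setup I have in mind:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical entry point standing in for the target's command handler,
     * i.e. whatever dispatches a request once it comes off the wire. */
    int parse_smb2_request(const uint8_t *buf, size_t len);

    /* libFuzzer harness: feed mutated request buffers straight into the
     * handler and let the sanitizer report the memory corruption.
     * Build (roughly): clang -g -fsanitize=fuzzer,address harness.c target.c */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        parse_smb2_request(data, size);
        return 0;
    }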
