Artificial intelligence has become central to cybersecurity, largely because large software systems increasingly rely on AI and because AI capabilities keep expanding. Threats have grown more complex, so securing software systems is harder than ever before. Modern security work spans automated reasoning, vulnerability detection, and code-level understanding, and it demands tools and techniques that simulate real-world situations, detect hidden flaws, and verify system integrity across diverse software environments. Researchers have therefore developed benchmarks and methodologies to evaluate AI agents’ ability to detect and exploit vulnerabilities, drawing parallels with how human security researchers work. The key challenge is bridging the gap between AI-based reasoning and practical cybersecurity tasks.
The Problem with Existing Benchmarks
A pressing problem is the lack of effective methods to determine whether AI systems can handle security tasks in realistic settings. Most current testing focuses on simple benchmark tasks that rarely reflect the complexity of large software repositories, where inputs are intricate, code paths run deep, and vulnerabilities hide well below the surface, requiring more than superficial inspection. Without robust evaluation methods, it is difficult to know whether AI agents can be trusted with tasks like vulnerability discovery or exploit development. Current benchmarks do not capture the nuance and scale of vulnerabilities found in widely used, actively maintained software, creating a significant evaluation gap.
Current Tool Limitations
Benchmarks such as Cybench and the NYU CTF Bench are commonly used to measure cybersecurity capability. They focus on capture-the-flag-style tasks of limited complexity, typically involving small codebases and constrained test environments. Some benchmarks attempt to target real-world weaknesses, but only at a small scale. Many rely on synthetic tests or narrowly scoped challenge problems that do not represent the diverse software inputs and execution paths found in real-world systems. Security agents have also been evaluated on benchmarks containing only a handful of tasks, which falls far short of the real-world threat landscape.
Introducing CyberGym
Researchers at the University of California, Berkeley introduced CyberGym, a framework designed to evaluate AI agents in real-world security scenarios. CyberGym includes 1,507 benchmark tasks based on actual vulnerabilities discovered and fixed across 188 open-source projects, all originally identified by OSS-Fuzz, Google’s continuous fuzzing project. For realism, every benchmark instance includes the complete pre-patch version of the codebase, an executable build, and a textual description of the vulnerability. CyberGym measures success by checking whether a generated input triggers the vulnerability on the pre-patch version but not on the patched version. The benchmark is distinctive in its emphasis on proof-of-concept (PoC) generation, a demanding task that requires agents to navigate deep code paths and synthesize inputs that satisfy specific security conditions. CyberGym’s modular, containerized design makes it easy to extend and reproduce.
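The pre-patch/post-patch success criterion can be sketched as a small check script. This is a minimal illustration, not CyberGym’s actual harness: the binary paths and the sanitizer-marker heuristic below are assumptions.

```python
import subprocess

# Sketch of CyberGym's success criterion: an agent-generated PoC counts as
# a reproduction only if it crashes the pre-patch binary but NOT the
# patched one. The crash heuristic (sanitizer error markers) is an
# illustrative assumption.

CRASH_MARKERS = (b"ERROR: AddressSanitizer", b"ERROR: MemorySanitizer")

def crashes(binary: str, poc_path: str, timeout: int = 60) -> bool:
    """Run a fuzz target on the PoC input and look for a sanitizer report."""
    try:
        result = subprocess.run(
            [binary, poc_path], capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False  # hangs are not counted as crashes in this sketch

    output = result.stdout + result.stderr
    return any(marker in output for marker in CRASH_MARKERS)

def reproduces_vulnerability(pre_patch: str, post_patch: str, poc: str) -> bool:
    # Success: the vulnerability triggers before the patch, not after it.
    return crashes(pre_patch, poc) and not crashes(post_patch, poc)
```

Checking the patched build as well rules out trivial “crash anything” inputs and ties success to the specific fixed vulnerability.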
CyberGym Evaluation Levels
CyberGym defines four difficulty levels, each providing the agent with more information. At level 0, the agent receives only the source code, with no indication of the vulnerability. Level 1 adds a natural-language description. Level 2 includes a ground-truth proof-of-concept (PoC), the crash stack trace, and the actual patch. Level 3 also contains the post-patch codebase. Each level changes the reasoning required: at level 1, for example, agents must infer the vulnerability’s location and context from the codebase and the textual description alone. CyberGym applies quality filters to the benchmark, including checking for informative patch commit messages and validating reproducibility. The codebases in the final dataset have a median of 1,117 files and 387,491 lines of code, ranging up to more than 40,000 files and 7 million lines. Patch sizes vary as well: the median patch touches one file and seven lines, but some patches span more than 40 files and 3,000 lines. The vulnerabilities cover a range of crash types, with 30.4% involving heap-buffer-overflow READs and 19.0% involving use of uninitialized values.
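The four levels are cumulative, which can be summarized as a simple mapping from level to the inputs an agent receives. The field names here are assumptions for illustration, not the benchmark’s actual schema.

```python
# Illustrative encoding of CyberGym's four information levels.
# Each level is a superset of the one below it (field names assumed).
LEVEL_INPUTS = {
    0: ["pre_patch_codebase"],
    1: ["pre_patch_codebase", "vulnerability_description"],
    2: ["pre_patch_codebase", "vulnerability_description",
        "ground_truth_poc", "crash_stack_trace", "patch"],
    3: ["pre_patch_codebase", "vulnerability_description",
        "ground_truth_poc", "crash_stack_trace", "patch",
        "post_patch_codebase"],
}

def inputs_for(level: int) -> list[str]:
    """Return the task inputs an agent sees at a given level."""
    return LEVEL_INPUTS[level]
```

The cumulative structure is what makes the levels comparable: any performance gain from level N to N+1 can be attributed to the extra inputs alone.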

Experimental Results
Existing agents performed poorly when evaluated against this benchmark. Among the four agent frameworks tested (OpenHands, Codex, EnIGMA, and Cybench’s agent), OpenHands with Claude 3.7 Sonnet performed best, reproducing 11.9% of target vulnerabilities. Even this combination degraded sharply on longer PoCs: success rates were highest (43.5%) for PoCs under 10 bytes and fell below 8% for those over 100 bytes. DeepSeek-V3, the strongest open-source model, achieved only a 3.6% success rate, and specialized code-reasoning models such as SWE-Gym-32B and R2E-Gym-32B scored below 2%. Providing more information improved performance: level 3 reached 17.1%, while level 0 managed only 3.5%. The analysis also showed that most successful PoC reproductions took between 20 and 40 execution steps, while many failing runs exceeded 90 steps. Despite these obstacles, the agents discovered 15 previously unknown zero-day vulnerabilities and 2 disclosed but still unpatched vulnerabilities across real-world projects, showing their capacity for novel discovery.
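The reported sensitivity to PoC length corresponds to a simple bucketed success-rate analysis: group attempts by the size of the required PoC and compute per-bucket rates. The bucket boundaries below mirror the thresholds quoted in the results; the sample records are made up for illustration.

```python
# Sketch of the length-bucket analysis behind the reported numbers.
# Records are (poc_size_in_bytes, reproduced) pairs; boundaries assumed.

def bucket(size: int) -> str:
    """Assign a PoC size to one of three length buckets."""
    if size < 10:
        return "<10B"
    if size <= 100:
        return "10-100B"
    return ">100B"

def success_by_bucket(records: list[tuple[int, bool]]) -> dict[str, float]:
    """Compute the reproduction rate within each PoC-length bucket."""
    totals: dict[str, int] = {}
    wins: dict[str, int] = {}
    for size, ok in records:
        b = bucket(size)
        totals[b] = totals.get(b, 0) + 1
        wins[b] = wins.get(b, 0) + int(ok)
    return {b: wins[b] / totals[b] for b in totals}
```

On the actual benchmark, this kind of breakdown is what surfaces the gap between 43.5% on tiny PoCs and under 8% on PoCs over 100 bytes.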

Key Takeaways
- Scale and realism: CyberGym is the largest and most realistic benchmark of its kind, containing 1,507 tasks derived from actual, patched software vulnerabilities.
- Agent limitations: Even the highest-performing agent-model combination reproduced only 11.9% of the target vulnerabilities.
- Difficulty scaling: Adding inputs such as patches or stack traces significantly improved performance; level 3 tasks reached a 17.1% success rate.
- Long PoCs: Agents struggled with tasks involving lengthy proofs-of-concept. Success rates were lowest for PoCs over 100 bytes, which make up 65.7% of the dataset.
- Novel discoveries: Agent-generated PoCs uncovered 15 previously unknown zero-day vulnerabilities.
- Model behavior: Successful exploits tended to be generated early in a run, before 80 steps.
- Tool interactions: Agents performed better when allowed to interact with tools (e.g., using 'awk', 'grep', or installing 'xxd') and to adapt PoCs based on runtime feedback.
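The adapt-from-runtime-feedback pattern in the last point can be sketched as a minimal propose-run-refine loop. Here `propose_poc` is a hypothetical stand-in for the agent’s model call, and the sanitizer-marker check is an assumed crash heuristic.

```python
import subprocess

# Hedged sketch of an iterative PoC-refinement loop: propose an input,
# run the target, and feed the runtime output back into the next proposal.

def run_target(binary: str, poc: bytes, timeout: int = 30) -> tuple[bool, str]:
    """Execute the target on a candidate PoC; return (crashed, output)."""
    with open("/tmp/poc.bin", "wb") as f:
        f.write(poc)
    try:
        r = subprocess.run(
            [binary, "/tmp/poc.bin"], capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    out = (r.stdout + r.stderr).decode(errors="replace")
    return "ERROR: AddressSanitizer" in out, out

def refine_poc(binary: str, propose_poc, max_steps: int = 90):
    """Iterate until a proposal crashes the target or the budget runs out."""
    feedback = ""
    for _ in range(max_steps):
        poc = propose_poc(feedback)  # e.g., an LLM conditioned on feedback
        crashed, feedback = run_target(binary, poc)
        if crashed:
            return poc
    return None
```

The step budget mirrors the observation above: runs that have not succeeded within roughly 80 to 90 steps rarely recover.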
Conclusion
The study highlights a crucial problem: understanding both AI’s capabilities and its limitations in security work. CyberGym offers an extensive, realistic framework for evaluating AI in cybersecurity, requiring agents to analyze entire codebases in depth, create valid exploits, and adapt through iteration. The results show that while agents have shown promise in areas such as bug discovery, substantial work remains before AI can contribute to security at scale.
Take a look at the Paper, GitHub Page, and Leaderboard. All credit for this research goes to the researchers of this project.
Asif Razzaq, CEO of Marktechpost Media Inc., is a visionary engineer and entrepreneur dedicated to harnessing the potential of artificial intelligence for social good. His most recent venture, Marktechpost, is a platform covering machine learning and deep learning news that is both technically sound and accessible to a broad audience, drawing over 2 million views per month.


