July 17, 2025

Albert Ziegler, Head of AI

This spring, we had a simple and, to my knowledge, novel idea that turned out to dramatically boost the performance of our vulnerability detection agents at XBOW. On fixed benchmarks and with a constrained number of iterations, we saw success rates rise from 25% to 40%, and soon after to 55%. The principles behind this idea are not limited to cybersecurity; they apply to a large class of agentic AI setups. Let me share.

XBOW's Challenge

XBOW is an autonomous pentester. You point it at your website, and it tries to hack it. If it finds a way in (something XBOW is rather good at), it reports back so you can fix the vulnerability. It's autonomous, which means that once you've done your initial setup, no further human handholding is allowed.

There is quite a bit to do and organize when pentesting an asset. You need to run discovery and build a mental model of the website: its tech stack, logic, and attack surface. Then you keep updating that mental model, building up leads and discarding them by systematically probing every part of it in different ways. That's an interesting challenge, but not what this blog post is about.

I want to talk about one particular, fungible subtask that comes up hundreds of times in each test, and for which we've built a dedicated subagent: you're pointed at a part of the attack surface, you know the genre of bug you're supposed to be looking for, and your job is to demonstrate the vulnerability. It's a bit like competing in a CTF challenge: find the flag you can only get by exploiting a vulnerability placed at a certain location. In fact, we built a benchmark set of such tasks and packaged them in a CTF-like style so we could easily repeat, scale, and assess our "solver agent's" performance. The original set has, sadly, mostly outlived its usefulness because our solver agent is just too good on it by now, but we have since harvested more challenging examples from the open source projects we ran on.

The Agent's Task

On such a CTF-like challenge, the solver is basically an agentic loop set to work for a number of iterations. Each iteration consists of the solver deciding on an action: running a command in a terminal, writing a Python script, or invoking one of our pentesting tools. We vet the action and execute it, show the solver its result, and the solver decides on the next one. After a fixed number of iterations we cut our losses. Typically, and for the experiments in this post, that number is 80: while we still get solves after more iterations, it becomes more efficient to start a new solver agent unburdened by the misunderstandings and false assumptions accumulated over time. (A minimal sketch of this loop follows at the end of this section.)

What makes this task special, as an agentic task? Agentic AI is often used on continuously-make-progress problems, where every step brings you closer to the goal. This task is more like prospecting through a vast search space: the agent digs in many places, follows false leads for a while, and eventually course-corrects to strike gold somewhere else. Over the course of one challenge, among all the dead ends, the AI agent will need to come up with, and combine, a couple of great ideas. If you ever face an agentic AI task like that, model alloys may be for you.
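To make this concrete, here is a minimal sketch of such a solver loop in Python. The `llm` and `sandbox` interfaces, and names like `next_action` and `execute`, are illustrative assumptions for this post, not XBOW's actual API:

```python
MAX_ITERATIONS = 80  # past this point, a fresh agent beats a stale one

def solve(challenge_prompt: str, llm, sandbox) -> str | None:
    """Run one solver agent until it finds the flag or runs out of turns."""
    messages = [{"role": "system", "content": challenge_prompt}]
    for _ in range(MAX_ITERATIONS):
        # Ask the model for its next action: a terminal command,
        # a Python script, or a pentesting tool invocation.
        action = llm.next_action(messages)
        messages.append({"role": "assistant", "content": action.text})

        # Vet and execute the action, then show the solver its result.
        result = sandbox.execute(action)
        if result.flag_found:
            return result.flag
        messages.append({"role": "user", "content": result.output})
    return None  # cut our losses; better to start a fresh solver
```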
The LLM

From our very beginning, it has been part of our AI strategy that XBOW be model-provider agnostic. That means we can plug-and-play the best LLM for our use case. Our benchmark set makes it easy to compare models, and we continuously evaluate new ones.

For a while, OpenAI's GPT-4 was the best off-the-shelf model we evaluated, but once Anthropic's Sonnet 3.5 came along in June last year, no other provider managed to come close for quite some time, no matter how many we tested. Sonnet 3.7 was a modest but recognizable improvement over its predecessor, but when Google released Gemini 2.5 Pro (in preview in March), it was a real step up. Then Anthropic hit back with Sonnet 4.0, which performed better again.

On average, that is. On individual challenges, some are best solved by Gemini, some by Sonnet. That's not terribly surprising: if a challenge takes five good insights to get through, then some sets of five are the kind that come easily to Sonnet, and some come easily to Gemini. But what about the challenges that need five good ideas, three of which are the kind Sonnet is good at, and two the kind Gemini is good at?

Alloyed Agents

Like most typical AI agents, ours calls the model in a loop. The idea behind an alloy is simple: instead of always calling the same model, sometimes call one and sometimes the other. The trick is that you still keep a single chat thread with one user and a single assistant. So while the true origin of the assistant messages alternates, the models are not aware of each other: whatever the other model said, each believes it said itself.

So in the first round, you might call Sonnet for an action to get started, with a prompt like this:

System: Find the bug!

Let's say it tells you to use curl. You do that and gather the output to present to the model. Now you call Gemini with a prompt like this:

System: Find the bug!
Assistant: Let's start by curling the app.
User: You got a 401 Unauthorized response.

Gemini might tell you to log in with the admin credentials; you do that, and then you present the result to Sonnet:

System: Find the bug!
Assistant: Let's start by curling the app.
User: You got a 401 Unauthorized response.
Assistant: Let's try to log in with the admin credentials.
User: You got a 200.
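In code, the change from the single-model loop sketched above is small: keep the one message thread, but rotate which model extends it. Here is a minimal sketch under the same assumed interfaces; the strict round-robin schedule is an illustrative choice, not necessarily the alternation XBOW uses:

```python
import itertools

def solve_alloyed(challenge_prompt: str, sonnet, gemini, sandbox,
                  max_iterations: int = 80) -> str | None:
    """Like solve(), but alternate models over a single shared thread."""
    messages = [{"role": "system", "content": challenge_prompt}]
    # Round-robin between the two models; neither is told the other exists.
    alternation = itertools.cycle([sonnet, gemini])
    for model in itertools.islice(alternation, max_iterations):
        # Whichever model is up sees the whole thread as its own past work.
        action = model.next_action(messages)
        messages.append({"role": "assistant", "content": action.text})
        result = sandbox.execute(action)
        if result.flag_found:
            return result.flag
        messages.append({"role": "user", "content": result.output})
    return None
```

One message list, two clients: the alloy lives entirely in the scheduling, so adding a third model or weighting the rotation differently is a one-line change.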