# Intro
Hello `neighbours`! There is still little literature on designing offensive research systems and the tools that those systems use. Two likely explanations for this are that teams have internal tooling they don't share (I know of several, east and west) and that it is not exactly clear what a good architecture is for an agent system.
Generally, I visualize these multi-agent systems like this.
![[agents-01.svg]]
In my experience, stack design and implementation considerations change over time as models scale in intelligence and autonomous operation time (very [fast scaling](https://metr.org/)). Workflows have to find a kind of Product-Market-Fit (`PMF`) with model capabilities. Sometimes workflows just aren't possible; other times, model capabilities make prior constraints fully obsolete within a quarter.
Many components don't actually appear in the diagram, like reporting, a shared knowledge system, or mechanisms for farming out specific tasks to the models that excel at them (tool discipline, for example).
In this post, I only want to share some of my engineering thoughts at the `tooling` level (the `green cylinders`). Fundamentally, if we want models to perform a research task, we need to give them access to the same capabilities we would need to perform that same task as subject matter experts. These `tool integrations` themselves are `extremely valuable`, and a `moat` as they say. Even without the orchestrator layer, you can plug these tools directly into your editors (CC, Codex, Cursor, VSCode, etc) giving you access to very capable `(a)sync` research assistants.
Finally, I want to point out that I expect `the labs` to gradually make a large part of the orchestration layer obsolete. You can see this trend clearly if you look at `Claude Code` or `Codex`, where you can spawn sub-agents or even teams formed dynamically or based on specific roles. I also expect that newer models like `Opus 4.6` or `GPT-5.x` (even `Kimi k2.5`) are partially trained with this in mind.
```
'Thou shalt come to a smooth grey stone,' the voice said. 'It is somewhat taller than thy head and as broad as thine arms may reach.'. 'All right,' I said through chattering teeth when I reached the rock he'd described, 'now what?'. 'Speak unto the stone,' the voice said patiently, ignoring the fact that I was congealing in the gale. 'Command it to open. Thou art a man. It is but a rock.'. 'Open!' I thundered, and the rock slid aside.
-Belgarath, The Vale of Aldur
```
# Kahlo, a case-study in MCP tool design
### Background
Just at a very surface level, if we think about the outline of a system that can do autonomous Android user-land security research, we need a core triad of tools.
![[agents-02.svg]]
Earlier last year (`'25`), I was missing a component to allow agents to use [`frida`](https://frida.re/) autonomously. SOTA models are obviously aware of `frida` as a framework, and would sometimes write and even try to execute instrumentation code.
```
frida -U -f com.hello.what -l .\im-so-smart-ai.js
```
Of course that is very cool, and logical. I appreciate the energy, but these types of actions offer:
- Limited interaction options with applications
- Limited control over the execution environment
- Blocking behaviour, the model kicks off a job but can't control, update, or unload it
I built a very crude tool, speed being my priority! This tool would manage instances of `frida` that the model wanted to create; the model could inspect a view of the job `stdio`, and it could kill the job.
![[agents-03.svg]]
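Roughly, the whole surface fit in a handful of calls, something like this (a from-memory sketch with illustrative names, not the actual tool API):
```typescript
// Rough shape of the original crude tool surface (illustrative names, not the real API)
interface CrudeFridaJob {
  job_id: string;
  command: string;   // e.g. "frida -U -f com.hello.what -l script.js"
  stdio: string[];   // rolling view of captured stdout/stderr lines
  running: boolean;
}

// The model got three verbs: start a frida instance, peek at its output, kill it
interface CrudeFridaTool {
  startJob(command: string, script_source: string): Promise<CrudeFridaJob>;
  tailJob(job_id: string, lines?: number): Promise<string[]>;
  killJob(job_id: string): Promise<void>;
}
```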
With some primitive bookkeeping this actually solves a lot of problems, just enough that it stopped me from creating a much more developed interface. But if you think about this for a while (`think about it`) and run through some scenarios, you will quickly see that there are gaps in this design.
Recently, I had occasion to talk to some friends. I was not surprised that some of them have been looking at agentic vulnerability research (VR) assistants! At some point, I got a question about the `frida` part of this triad. I admit that I felt a bit unhappy thinking about my poor design and construction.
Anyway, I came back home and told myself, `there's never been a better time`, so I sat down and rewrote `kahlo` from scratch (AI assisted of course!).
### But what is better?
I have made `kahlo` [open source](https://github.com/FuzzySecurity/kahlo-mcp) and, I think, looking at the source is the best way to get concrete technical details. In this post I want to share some more general lessons learned writing agent tooling (over the past few years), applied to this project specifically. `But what is better?` Robust, scalable tooling has some additional requirements compared to the design above.
- Full lifecycle device/process enumeration
- Fully isolated instrumentation jobs, both in-process and across processes
- The ability to `create`, `adjust` and `unload` instrumentation code
- Better runtime introspection for job and process states
- Bounded event buffers with backpressure handling; ring buffers, cursor-based polling, graceful overflow
- Better control over the execution environment (late/early/~~child~~)
- Distinction between relevant job types (e.g. `oneshot` & `daemon`)
- Play to AI strengths, `fast iteration` on concepts, ability to `persist knowledge`
- Mitigate AI weaknesses, build a library of `standardized code primitives`
- Per-entity locking for concurrent operations, async operations from multiple agents can cause issues
- OS level access to the target through the `ADB` interface
![[agents-04.svg]]
# Lessons learned, lessons shared
## Job Isolation
Agents like to fail/iterate fast. Hook a method, observe output, try a different approach, hook another method. In a shared script model those hooks can accumulate. Agents may forget to clean up, they may try to install duplicate hooks on the same method, there may be erroneous event collection, etc.
The fix is to give each job its own `frida.Script` instance. When cancelled, the script unloads and Frida cleans up everything, hooks, timers, all of it. `Cancel` means `clean slate`. The key point is to relax the mental burden on your agents.
![[agents-05.svg]]
```typescript
interface TargetEntry {
  target: Target;
  session: frida.Session;
  jobScripts?: Map<string, frida.Script>; // job_id -> script
}
```
```typescript
export async function createJobScript(
  target_id: string, job_id: string, source: string
): Promise<frida.Script> {
  return targetOpsLock.withLock(target_id, async () => {
    const entry = targetsById.get(target_id);
    if (!entry) throw new TargetManagerError("NOT_FOUND", `Unknown target: ${target_id}`);
    const script = await entry.session.createScript(source);
    script.destroyed.connect(() => {
      entry.jobScripts?.delete(job_id);
    });
    await script.load();
    entry.jobScripts ??= new Map(); // Lazily create the per-target job script map
    entry.jobScripts.set(job_id, script);
    return script;
  });
}
```
Cancellation is idempotent. Script gone? Target detached? All good.
```typescript
async function unloadJobScriptInternal(target_id: string, job_id: string): Promise<void> {
  const entry = targetsById.get(target_id);
  if (!entry) return;
  const script = entry.jobScripts?.get(job_id);
  if (!script) return;
  try { await script.unload(); } catch { /* best effort */ }
  entry.jobScripts?.delete(job_id);
}
```
The trade-off is memory of course, each script is its own V8 isolate. Typically this should not be an issue, even over extended periods of time. In any case, this server is acting as an enabler for research, and some process instability over hours of runtime is acceptable.
### Concurrent Access
Agents issue parallel tool calls. Duplicate `startJob(X)` calls, or `detachTarget(X)` while `createJobScript(X)` is in flight. These kinds of race conditions would not occur with human researchers, but they do with agents working in parallel.
A global lock would be safe but kills throughput: an agent working on three targets would serialize everything. A `KeyedLock` gives per-entity serialization: operations on the same target serialize, different targets run in parallel.
```typescript
export class KeyedLock {
  private readonly locks = new Map<string, Promise<void>>();

  public async withLock<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const predecessor = this.locks.get(key) ?? Promise.resolve();
    let release!: () => void;
    const ourLock = new Promise<void>((resolve) => { release = resolve; });
    this.locks.set(key, ourLock); // Register synchronously before any await
    try {
      await predecessor;
      return await fn();
    } finally {
      release();
      if (this.locks.get(key) === ourLock) this.locks.delete(key);
    }
  }
}
```
The harness uses several keyed locks with a defined ordering to prevent deadlocks: `targetOpsLock` → `jobOpsLock` → `draftOpsLock` → `moduleOpsLock`. Operations on `target_0xdead` serialize with each other, while operations on `target_0xb33f` run concurrently alongside them.
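To make the ordering concrete, here is a sketch of how a cross-entity operation would nest the keyed locks (the operation itself is hypothetical, the lock names are the ones above):
```typescript
// Hypothetical cross-entity flow: promote a draft (derived from a job) into a module.
// Keyed locks are always acquired in the documented order:
// targetOpsLock -> jobOpsLock -> draftOpsLock -> moduleOpsLock
async function promoteDraftForTarget(
  target_id: string, job_id: string, draft_id: string, module_name: string
): Promise<void> {
  await targetOpsLock.withLock(target_id, () =>
    jobOpsLock.withLock(job_id, () =>
      draftOpsLock.withLock(draft_id, () =>
        moduleOpsLock.withLock(module_name, async () => {
          // ... read job state, snapshot the draft source, write the module version ...
          // Every caller nests in the same direction, so there is no ordering
          // cycle between the keyed locks and therefore no deadlock.
        })
      )
    )
  );
}
```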
### Observability
Each job tracks `events_emitted`, `hooks_installed`, and `errors`. Agents don't increment these manually, the `stdlib` does (more on that later).
```javascript
return {
  method: function (className, methodName, callbacks) {
    // ... hook installation ...
    notifyHookInstalled(1); // Auto-increment
    return result;
  },
  allOverloads: function (className, methodName, handler) {
    // ... hook all overloads ...
    notifyHookInstalled(result.count);
    return result;
  }
};
```
When an agent uses `ctx.stdlib.hook.method()`, the harness knows how many hooks are installed. Observability comes for free. Agents may lose track of what they've instrumented; these small data points keep them grounded.
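From the agent's side, that might look something like this (an illustrative inline job sketch; the hook callback shape and `ctx.emit` are assumptions, check the kahlo source for the real job contract):
```typescript
// Illustrative inline job (CommonJS module, start(params, ctx) signature).
// The callback shape and ctx.emit below are assumptions for this sketch.
module.exports = {
  start: function (params, ctx) {
    // Hooking through the stdlib means hooks_installed is incremented for us
    ctx.stdlib.hook.method("javax.crypto.Cipher", "doFinal", {
      onEnter: function (args) {
        ctx.emit({ kind: "function_call", payload: { method: "Cipher.doFinal" } });
      }
    });
  }
};
```
A later `kahlo_jobs_status` call then reports `hooks_installed: 1` without the job author doing any bookkeeping.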
### Generalizing
- Find the natural `isolation boundary` in your runtime. For Frida it's the script. For a database, a transaction. For containers, a pod.
- Make `cancel` mean `clean slate`. Agents iterate relentlessly, every iteration should start fresh.
- Assume agents will issue conflicting parallel calls. Handle serialization server-side.
- Instrument the primitives. If you want metrics, don't ask agents to call `incrementCounter()`, they won't do any of that over extended periods of time.
- Relax the mental burden on your agents!
## Atomic Operations
Spawn mode lets you instrument an app before it runs, hook `Application.onCreate()`, intercept early application behaviour, etc. The native API would be multi-step: spawn (suspended), attach, inject, resume.
Agents are not going to remember that. They might spawn a process then try to start a job without realizing it's still suspended. Or spawn, attach, but never resume, leaving a zombie. When something fails mid-sequence, they would have to remember proper recovery steps.
The fix is to make the sequence atomic. With `gating="spawn"` (or `gating="child"`), bootstrap code is required at call time. The harness executes spawn → attach → inject → bootstrap → resume as a single operation.
![[agents-06.svg]]
```typescript
if (args.gating === "spawn" && !args.bootstrap) {
  throw new TargetManagerError(
    "INVALID_ARGUMENT",
    `gating='spawn' requires a bootstrap job to install early hooks.`,
    { hint: "Bootstrap is required to avoid suspended process timeout." }
  );
}
```
Don't let the agent make these kinds of multi-step decisions manually. If any step fails, the process is killed.
```typescript
try {
  await startBootstrapJob({ target_id, type: bootstrapType, module_source: bootstrapSource });
  await device.resume(pid);
} catch (err) {
  try { await device.kill(pid); } catch { /* ignore */ }
  entry.target.state = "dead";
  throw new TargetManagerError("UNAVAILABLE", `Bootstrap job failed: ${msg}`);
}
```
No zombies. Agent gets a fully instrumented target or an error.
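For reference, the happy path under the hood is roughly the standard frida-node sequence, just fused into one call (a simplified sketch; `registerTarget` is a hypothetical bookkeeping helper, error handling is shown above):
```typescript
// Simplified happy path of the fused operation (frida-node API)
const pid = await device.spawn(args.package);   // spawned suspended
const session = await device.attach(pid);       // attach while still suspended
registerTarget(target_id, session);             // hypothetical: store the TargetEntry
await startBootstrapJob({                       // early hooks go into their own script
  target_id, type: bootstrapType, module_source: bootstrapSource,
});
await device.resume(pid);                       // resume only once hooks are live
```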
### Generalizing
Multi-step protocols where order matters are fragile. Agents skip steps, reorder them, do them twice. When step B fails they won't undo step A.
Bundle related steps into atomic operations. Require inputs upfront. If anything fails, roll back automatically. The agent should never track intermediate states.
## Tool Descriptions
Teach your agents how to use your tools, they are language processors after all. Your tool descriptions are very important. If the information isn't in the description, assume the agent doesn't know it.
This means tool descriptions need to be dense, usage-guide style, not one-liners. Everything the agent needs to use the tool correctly: modes, prerequisites, platform limitations, common gotchas, workarounds.
```typescript
description:
  "Instruments an Android app process (target) on a device. " +
  "TWO MODES: (1) `mode='attach'`: Attaches to an already-running process. " +
  "Use `kahlo_processes_list` first to find the exact process name " +
  "(Frida process names often differ from Android package identifiers, " +
  "e.g., 'LINE' not 'jp.naver.line.android'). " +
  "(2) `mode='spawn'`: Spawns the app fresh and instruments it before it runs. " +
  "Use the Android package identifier (e.g., 'com.example.app'), NOT the display name. " +
  "**ANDROID LIMITATION**: `gating='child'` does NOT work on Android because apps " +
  "don't spawn children directly - child processes are forked from zygote. " +
  "Use `kahlo_processes_list` polling instead."
```
Some patterns that help:
- *Numbered modes* for tools with multiple behaviours, `TWO MODES: (1)... (2)...`
- *Concrete examples* with real values, `'LINE' not 'jp.naver.line.android'`
- *Prerequisites* explicitly stated, `Use kahlo_processes_list first`
- *Platform limitations (CAPS lol)*, agents need to see these
- *Workarounds* next to limitations, don't just say what doesn't work
The full `kahlo_targets_ensure` description is about 300 words. That's not too much text, it's there to give the agent the right amount of context.
## Error Envelopes
When something fails, the agent needs to know what to do next. `"Error: operation failed"` is pretty useless.
Every error returns a structured envelope.
```typescript
interface KahloToolError {
  code: "NOT_FOUND" | "UNAVAILABLE" | "INVALID_ARGUMENT" | "TIMEOUT" | "INTERNAL";
  message: string;
  retryable?: boolean;
  suggestion?: string;
}
```
The `code` is a stable enum for programmatic branching. `retryable` tells the agent whether trying again makes sense, `UNAVAILABLE` is usually transient, `NOT_FOUND` is not. The `suggestion` field tells the agent exactly what tool to call next.
```typescript
return toolErr({
  code: "NOT_FOUND",
  tool,
  message: err.message,
  retryable: false,
  suggestion: "Verify device_id using kahlo_devices_list",
});
```
For modules with many error paths, a helper keeps suggestions consistent:
```typescript
function getDraftErrorSuggestion(code: DraftManagerError["code"]): string {
  switch (code) {
    case "NOT_FOUND":
      return "Verify draft_id using kahlo_modules_listDrafts";
    case "ALREADY_EXISTS":
      return "A draft with this name already exists. Use a different name";
    case "VALIDATION_ERROR":
      return "Check that source is valid JavaScript in CommonJS format";
    default:
      return "An internal error occurred. Retry or check server logs";
  }
}
```
Every `NOT_FOUND` points to the corresponding `list` tool. Every `UNAVAILABLE` suggests a health check or re-attach. Every `INVALID_ARGUMENT` explains what's wrong with the input.
### Generalizing
- Structure errors for machines: stable codes, retry flags, next-action suggestions.
- Every error should answer: `what should the agent do now?`
- Relax the mental burden on your agents!
## Events vs Artifacts
Jobs produce two kinds of output: small frequent telemetry and large binary blobs. These have fundamentally different workflows.
![[agents-07.svg]]
*Events* are small JSON objects, function calls intercepted, values observed, execution flow. High volume, thousands per second during active instrumentation. Bounded ring buffer in memory, cursor-based polling, drop-oldest when full.
```typescript
interface KahloEvent {
  event_id: string;
  ts: string;
  target_id: string;
  job_id: string;
  kind: string; // "function_call", "log", "job.started"
  level: EventLevel;
  payload: Record<string, unknown>;
  dropped?: { count: number }; // Backpressure signal
}
```
*Artifacts* are binary blobs, memory dumps, extracted files, network captures. Low volume, large size. Persisted to disk immediately, retrieved by ID on demand.
```typescript
interface ArtifactRecord {
  artifact_id: string;
  target_id: string;
  job_id: string;
  type: string; // file_dump, memory_dump, trace, pcap_like, custom
  size_bytes: number;
  sha256: string;
  storage_ref: string; // Path to .bin file on disk
}
```
The ring buffer handles backpressure by evicting old events and notifying the consumer via the `dropped` field. When an agent sees `dropped: { count: 47 }`, it knows events were lost due to slow polling.
```typescript
class RingBuffer<T> {
  private readonly buf: (T | undefined)[];
  private start = 0; // Index of the oldest element
  private len = 0;   // Number of elements currently stored

  constructor(private readonly capacity: number) {
    this.buf = new Array<T | undefined>(capacity);
  }

  public push(item: T): { dropped: number } {
    if (this.len < this.capacity) {
      this.buf[(this.start + this.len) % this.capacity] = item;
      this.len++;
      return { dropped: 0 };
    }
    // Buffer full: overwrite the oldest element and advance the start index
    this.buf[this.start] = item;
    this.start = (this.start + 1) % this.capacity;
    return { dropped: 1 };
  }
}
```
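Cursor-based polling is then a thin layer on top: each event gets a monotonically increasing sequence number, and the consumer hands back the last one it saw. A sketch of the idea (illustrative, not the kahlo implementation):
```typescript
// Illustrative cursor-based polling over a bounded event log (a sketch, not the kahlo source)
class EventLog {
  private entries: { seq: number; event: KahloEvent }[] = [];
  private nextSeq = 0;
  private dropped = 0;

  constructor(private readonly capacity: number) {}

  public append(event: KahloEvent): void {
    this.entries.push({ seq: this.nextSeq++, event });
    if (this.entries.length > this.capacity) {
      this.entries.shift(); // Drop the oldest event
      this.dropped++;
    }
  }

  // The consumer passes the last cursor it saw and gets everything newer,
  // plus a count of events lost to backpressure since its previous poll.
  public readSince(cursor: number): { events: KahloEvent[]; cursor: number; dropped: number } {
    const fresh = this.entries.filter((e) => e.seq > cursor);
    const dropped = this.dropped;
    this.dropped = 0;
    return {
      events: fresh.map((e) => e.event),
      cursor: fresh.length ? fresh[fresh.length - 1].seq : cursor,
      dropped,
    };
  }
}
```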
Artifacts go to disk with atomic rename and SHA-256 verification. Per-target size budget prevents runaway storage (but not really an issue here anyway). The write pattern is crash-safe: write to `.tmp` file first, update index, then `fs.renameSync` to final path. On start-up, orphaned `.tmp` files from interrupted writes are cleaned up automatically.
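A minimal version of that write path could look like this (a sketch using Node's `fs` and `crypto`; the index update and per-target budget checks are omitted):
```typescript
import { createHash, randomUUID } from "node:crypto";
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch of a crash-safe artifact write: hash, write to .tmp, then atomic rename.
// Directory layout and index handling are simplified for illustration.
function writeArtifact(storageDir: string, target_id: string, job_id: string,
                       type: string, data: Buffer): ArtifactRecord {
  const artifact_id = `artifact_${randomUUID()}`;
  const finalPath = path.join(storageDir, `${artifact_id}.bin`);
  const tmpPath = `${finalPath}.tmp`;

  const sha256 = createHash("sha256").update(data).digest("hex");
  fs.writeFileSync(tmpPath, data);   // interrupted writes only ever leave a .tmp behind
  fs.renameSync(tmpPath, finalPath); // atomic on the same filesystem

  return {
    artifact_id, target_id, job_id, type,
    size_bytes: data.length,
    sha256,
    storage_ref: finalPath,
  };
}
```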
### Generalizing
- Not everything fits in tool responses. Design for streaming small data (polling) and retrieving large data (on-demand).
- Dropping old telemetry is acceptable if you tell the consumer. Dropping artifacts is not.
- Make the agent explicitly fetch what it needs, don't push everything.
## Module Lifecycle
The agent does some work to instrument a piece of code, it works, it gives you the data you want. But what happens when the session ends or the context window rolls over? Your work is lost. This is a fundamental problem with ephemeral instrumentation.
The fix I chose here is a three tier document store, `inline` → `draft` → `module`.
![[agents-08.svg]]
*Inline* is ephemeral. Source lives only in the job and conversation context. Good for fast experiments.
*Drafts* are persisted but mutable. Agents call `createDraftFromJob` to save working code. They can then update, iterate on, and test the draft as their understanding increases.
```typescript
interface DraftRecord {
  draft_id: string;
  name?: string;
  source: string;
  created_at: string;
  updated_at: string;
  derived_from_job_id?: string; // Provenance
}
```
*Modules* are versioned and immutable. Once promoted, code is frozen. Semantic versioning, `name@version` references.
```typescript
interface ModuleManifest {
  name: string;
  version: string;
  created_at: string;
  provenance: {
    derived_from_draft_id?: string;
    derived_from_job_id?: string;
  };
}
```
Agent writes inline code, tests it, calls `createDraftFromJob` to save. Iterates on the draft with `updateDraft`. When stable, promotes with `promoteDraft` specifying `patch`, `minor`, or `major` version bump.
```typescript
function calculateNextVersion(currentVersions: string[], strategy: VersionStrategy): string {
  if (currentVersions.length === 0) return "1.0.0";
  // Find the highest existing version, then increment based on strategy
  const [highest] = currentVersions
    .map((v) => {
      const [major, minor, patch] = v.split(".").map(Number);
      return { major, minor, patch };
    })
    .sort((a, b) => b.major - a.major || b.minor - a.minor || b.patch - a.patch);
  switch (strategy) {
    case "major": return `${highest.major + 1}.0.0`;
    case "minor": return `${highest.major}.${highest.minor + 1}.0`;
    case "patch": return `${highest.major}.${highest.minor}.${highest.patch + 1}`;
  }
}
```
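Strung together, the agent-facing flow looks roughly like this (the prefixed `kahlo_modules_*` tool names and parameter shapes are illustrative guesses based on the naming above; `callTool` is a placeholder):
```typescript
// Illustrative lifecycle flow; `callTool`, the prefixed tool names, and the
// parameter shapes are assumptions, not the exact kahlo schemas.
const draft = await callTool("kahlo_modules_createDraftFromJob", {
  job_id: "job_0xb33f",
  name: "pin-auth-tracer",
});

// Drafts stay mutable while understanding improves
await callTool("kahlo_modules_updateDraft", {
  draft_id: draft.draft_id,
  source: "/* refined CommonJS instrumentation source */",
});

// Promotion freezes the code into an immutable, versioned module (e.g. 1.0.0 -> 1.1.0)
await callTool("kahlo_modules_promoteDraft", {
  draft_id: draft.draft_id,
  version_strategy: "minor",
});
```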
Provenance tracking means you can always trace a module back to the job or draft it came from (not super useful here without better bookkeeping). When context is lost, valuable work is persisted.
### Generalizing
- Agents need explicit checkpoint primitives. "Save my work" should be a first-class action.
- Distinguish "working on it" (mutable) from "done" (immutable).
- Record provenance, it can help reconstruct intent.
## Self-Description
Long sessions can make agents pathological (aka `context rot`). Agents lose track of what tools do and start using tools in the wrong context or with the wrong parameters. Having a self-describing tool can help ground the agent when this happens.
`kahlo_mcp_about` returns the complete operational contract as structured JSON. Concepts, workflows, failure modes, common mistakes. Agents can call it anytime to reset their mental model. You can also force them to call it at intervals if you like.
```typescript
description:
  "Returns a compact, machine-usable contract for the kahlo toolkit: " +
  "what targets/jobs/modules are, how events and artifacts flow, " +
  "typical workflows, and expected failure modes. " +
  "Use this to re-ground yourself after long sessions."
```
The response is structured for machine consumption. Each concept gets identity, key fields, and invariants:
```json
{
  "concepts": {
    "target": {
      "summary": "An instrumented Android app process on a specific device",
      "identity": ["target_id"],
      "key_fields": ["device_id", "package", "pid", "state", "mode"],
      "invariants": ["One target per (device, package) pair"]
    },
    "job": {
      "summary": "A unit of work running in its own Frida script instance",
      "identity": ["job_id"],
      "key_fields": ["target_id", "state", "type", "metrics"],
      "invariants": ["Isolated script; cancel = full cleanup"]
    }
  }
}
```
Failure modes tell the agent what to expect and how to recover:
```json
{
  "failure_modes": [{
    "name": "Target app crash",
    "expectation": "Normal during research; do not treat as catastrophic.",
    "symptoms": ["target state becomes dead", "jobs stop heartbeating"],
    "recommended_actions": [
      "Call kahlo_targets_status to confirm state",
      "Re-run kahlo_targets_ensure to reattach"
    ]
  }]
}
```
Common mistakes are explicit with wrong/right pairs:
```json
{
  "common_mistakes": [
    "Wrong signature - WRONG: start(ctx, params) - RIGHT: start(params, ctx)",
    "Manual byte conversion - WRONG: javaBytes[i] - RIGHT: ctx.stdlib.bytes.fromJavaBytes()"
  ]
}
```
Beyond bringing pathological agents back down to earth, it is also possible to use these self-describing tools to dynamically create specialized sub-agents. The orchestrator can call `kahlo_mcp_about`, and build a custom profile for an agent it wants to spawn. Consider also that the `Orchestrator` (or `Objective Lead`) is probably managing agents for all parts of the triad (`JEB`, `BinaryNinja`, `Frida`, `lldb`, etc.).
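As a sketch, the orchestrator side of that can be as simple as folding the contract into the sub-agent's system prompt (`callTool` and `spawnSubAgent` are placeholders for whatever your orchestration layer provides):
```typescript
// Hypothetical orchestrator snippet: ground a freshly spawned frida sub-agent
// in the kahlo contract before handing it a task.
const contract = await callTool("kahlo_mcp_about", {});

const fridaAgent = await spawnSubAgent({
  role: "dynamic-instrumentation",
  systemPrompt: [
    "You operate the kahlo MCP toolkit for Android instrumentation.",
    "Operational contract (concepts, workflows, failure modes):",
    JSON.stringify(contract, null, 2),
  ].join("\n"),
  tools: ["kahlo_*"],
});
```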
### Generalizing
- Provide a meta-tool that explains the other tools. Agents lose context, let them query for it.
- Structure it for machines (JSON with schema).
- Include common mistakes explicitly. Anticipate what agents get wrong and tell them upfront.
## Standard Library
Agents write the same helper code over and over. Convert Java bytes to JavaScript. Safely call a method that might throw. Format a stack trace. Parse an Intent. Each time they write the same code from scratch. Probabilistic code generation is error prone and/or produces inconsistent results.
There is a very classic example I've seen quite a few times. Java's `byte` is signed (-128 to 127). JavaScript expects unsigned (0 to 255). Agent reads `javaByteArray[i]` and gets `-1` instead of `255`.
```javascript
// Agent-written code, wrong
function toHex(javaBytes) {
  var result = "";
  for (var i = 0; i < javaBytes.length; i++) {
    result += javaBytes[i].toString(16).padStart(2, '0'); // -1 → "-1", not "ff"
  }
  return result;
}
```
You can mitigate a lot of these issues by giving agents primitives they can use out-of-the-box. The `kahlo` `stdlib` has 64 functions across 9 namespaces, injected into every job runtime via `ctx.stdlib.*`.
| Namespace | Purpose |
| --------- | ------------------------------------------------------------------ |
| `bytes` | Java signed byte conversion, hex/Base64 encoding |
| `strings` | Null-safe Java/JS string conversion, charset encoding |
| `stack` | Java stack trace capture and formatting |
| `inspect` | Runtime introspection of Java object fields and methods |
| `classes` | Safe class loading, existence checks, regex class search |
| `hook` | Simplified method/overload/constructor hooking with auto-metrics |
| `safe` | Error-safe wrappers that return result objects instead of throwing |
| `intent` | Android Intent parsing and construction |
| `time` | Timestamps, stopwatches, debounce/throttle |
The `bytes` namespace is where the signed byte fix lives. The implementation handles the sign conversion explicitly:
```javascript
fromJavaBytes: function (javaByteArray) {
  var length = javaByteArray.length;
  var result = new Array(length);
  for (var i = 0; i < length; i++) {
    var signedByte = javaByteArray[i];
    result[i] = signedByte < 0 ? signedByte + 256 : signedByte;
  }
  return result;
}
```
Agent calls `ctx.stdlib.bytes.fromJavaBytes(secretKey.getEncoded())` and gets correct unsigned values. No manual conversion involved.
### Why This Matters
Three categories of bugs the `stdlib` prevents:
*Data conversion* bugs. Signed bytes. Null Java strings. Charset encoding. These are tedious, error-prone, and identical across every instrumentation script. Agent writes `bytes[i].toString(16)` and gets wrong output for negative bytes. `ctx.stdlib.bytes.toHex()` just works.
*Exception handling* bugs. Agent hooks a method, method throws, script crashes. Agent calls `Java.use()` on a class that isn't loaded, script crashes. The `safe` namespace returns result objects instead of throwing. The `classes.exists()` check prevents load failures.
```javascript
// Agent-written: crashes if class not loaded
var Cipher = Java.use("javax.crypto.Cipher");

// Stdlib: returns null, doesn't crash
var Cipher = ctx.stdlib.classes.exists("javax.crypto.Cipher")
  ? ctx.stdlib.classes.use("javax.crypto.Cipher")
  : null;
```
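The `safe` namespace applies the same result-object idea; roughly this shape (a sketch of the pattern, not the kahlo source):
```typescript
// Illustrative result-object wrapper in the style of ctx.stdlib.safe (not the kahlo source)
function safeCall<T>(fn: () => T): { ok: boolean; value: T | null; error: string | null } {
  try {
    return { ok: true, value: fn(), error: null };
  } catch (e) {
    // The script keeps running; the agent gets a result it can branch on instead of a crash
    return { ok: false, value: null, error: String(e) };
  }
}

// e.g. const result = safeCall(() => cipher.getIV());
// if (result.ok) { /* use result.value */ } else { /* log result.error and move on */ }
```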
*Boilerplate* bugs. Agent writes stack capture code, forgets to limit depth, gets 100-frame traces. Agent writes intent parsing code, misses edge cases in extra types. Having an `stdlib` is better: we write the code once, we know it works, and we don't have to rely on the judgement of our agents in the moment.
### Metrics Auto-Increment
The hook helpers don't just install hooks, they update job metrics automatically. Each job tracks `hooks_installed`. When you use `ctx.stdlib.hook.method()`, the counter increments. When you use `ctx.stdlib.hook.allOverloads()`, it increments by the number of overloads hooked.
```javascript
function createHookNamespace(onHookInstalled) {
  function notifyHookInstalled(count) {
    if (typeof onHookInstalled === 'function') {
      onHookInstalled(count || 1);
    }
  }
  return {
    method: function (className, methodName, callbacks) {
      // ... hook installation logic ...
      notifyHookInstalled(1);
      return result;
    },
    allOverloads: function (className, methodName, handler) {
      // ... hook all overloads ...
      notifyHookInstalled(result.count); // Increment by actual count
      return result;
    }
  };
}
```
The harness passes the callback when building the runtime:
```javascript
var hookNamespace = createHookNamespace(function(count) {
  jobMetrics.hooks_installed += count;
});
```
Agent calls `kahlo_jobs_status`, sees `hooks_installed: 7`. Observability the agent didn't have to implement.
### Wrong vs Right
The `kahlo_mcp_about` tool includes some common mistakes with wrong/right pairs. These come directly from watching agents make mistakes.
| Pattern | Wrong | Right |
|---------|-------|-------|
| Byte conversion | `javaBytes[i]` | `ctx.stdlib.bytes.fromJavaBytes(javaBytes)` |
| Method signature | `start(ctx, params)` | `start(params, ctx)` |
| Null string | `javaString.toString()` | `ctx.stdlib.strings.fromJava(javaString)` |
| Class loading | `Java.use("Maybe.Exists")` | `ctx.stdlib.classes.exists("Maybe.Exists") ? ...` |
| Stack trace | `new Error().stack` | `ctx.stdlib.stack.captureJava()` |
| Timing | `Date.now() - start` | `ctx.stdlib.time.stopwatch()` |
We pray to the machine gods that when an agent calls the `about` tool, it sees the examples, thinks `wow, human so smart`, and uses our curated functions instead of reinventing the wheel. (Or it may do that after first making a mistake!)
### Generalizing
- Identify repeated patterns in agent-generated code. If you see an agent fall into the same trap more than once you can think about generalizing a solution.
- Provide correct implementations with correct edge case handling. Agents are only `human`: if they don't need to write code, they won't.
- Take advantage of standardized code to build in observability primitives (e.g. hook metrics).
- Document wrong/right pairs explicitly. Agents learn from examples, show them the failure mode and the fix together.
# Conclusion
Models are on a steep incline. They already demonstrate strong proficiency in coding, mathematics and generally obscure (potentially hazardous) knowledge. On narrowly scoped tasks, they routinely perform at an expert level, and they do so with `relentless persistence`.
The real impact of these emerging cyber capabilities is still poorly understood. In my view, the primary risk is not the automation of phishing or other low-effort attacks. With the right tool scaffolding, models are already capable of solving genuinely complex problems across a wide range of technical domains.
I'll leave you with a small demo where `Claude` (in a fresh context) uses my private `JEB` MCP server to decompile a function in a messaging application. Even though the decompiler has issues reconstructing the obfuscated code, `Claude` works around it. This function turns out to be responsible for authenticating user PIN codes. Once `Claude` identifies this, it pivots to `kahlo`, autonomously instruments the application, and brute-forces the PIN.
![[agents-09.mp4]]
All of this happens in a little over `three minutes`! Well-engineered multi-agent systems don't just do things faster, they generate compounding gains. When such systems are allowed to iterate and refine over hours or days, the difference is not incremental but qualitative. AI models are reshaping what sustained research and capability development could look like at scale.