Systematic Probing of Exploit Chains and Governance in Multi-Agent Tool

Project summary

Systematic Probing of Exploit Chains and Governance (SPEC-GAP) in Multi-Agent Tool-Using Language Models

SPEC-GAP is a benchmark for studying exploit chains in multi-agent LLM systems. An exploit chain is a sequence in which each agent step appears reasonable, but the overall trajectory leads to an unsafe outcome. The failure is not visible at any individual step and only becomes clear when you look at the full sequence.

Current benchmarks miss this entirely. They focus on single agents or single interactions, not on systems in which tasks are decomposed, delegated, and executed across multiple agents. SPEC-GAP was built to study exactly what they overlook and gives us a way to evaluate detection and intervention methods where they are needed most.

What are this project's goals? How will you achieve them?

The goal is to make exploit chains observable and test whether they can be detected and interrupted in practice. The project focuses on detection, transfer, and intervention.

We do this by building controlled multi-agent systems and generating labeled behavioral trajectories. The key components are:

Multi-agent pipeline: planner, worker, forwarder, and executor roles connected through delegation and tool use
Realistic adversarial injection: adversarial content introduced through retrieved documents and tool outputs rather than direct prompts
Trajectory dataset: paired clean and adversarial runs with full logging of agent behavior
Step-level labeling: each node labeled as clean or compromised based on representation-level state
Interpretability evaluation: linear probes trained on internal activations to detect compromise
Steering experiments: interventions applied at the point of compromise to test whether exploit chains can be interrupted without degrading task performance

The output of this phase is a working benchmark, a labeled dataset, and initial results on detection and intervention.

How will this funding be used?

The budget follows directly from the work needed to build and test the system. Most of the cost is research engineering, with the rest going to compute and infrastructure.

Research engineering — $33,000
Six engineers contributing part-time across two workstreams over four months, with hands-on work throughout:

Pipeline construction and orchestration
Scenario design and adversarial injection
Trajectory generation and dataset assembly
Logging, validation, and debugging
Probe training and evaluation
Steering experiments and analysis

GPU compute — $7,000
Compute is used to run the system and capture the signals we analyze:

Multi-agent pipeline execution across scenarios
Activation extraction with TransformerLens at each node
Probe training and evaluation runs
Steering experiments with repeated trajectories
Limited scaling experiments on larger models

We estimate approximately 500 GPU hours on A100-class hardware, with most usage coming from repeated pipeline runs and activation logging.

API usage — $3,000
Used where it actually helps speed things up and ground the results:

Rapid prototyping before open-weight runs is stable
Behavioral comparisons against GPT and Claude
Checking whether exploit chains reproduce in realistic systems

Evaluation and infrastructure — $2,000
Supports the work needed to make results usable:

Statistical analysis and error checking
Experiment tracking and run management
Data processing and cleanup

Most of the compute budget is for running the pipeline across scenarios and logging activations at each step, rather than training models from scratch. We estimate around 500 GPU hours total and prefer private compute given the adversarial content involved.

Who is on your team? What's your track record on similar projects?

SPEC-GAP is led by a team with experience across interpretability, multi-agent systems, AI safety, and production ML, combining research and engineering perspectives.

Elena Ajayi is a Research Engineer Fellow at the Machine Intelligence and Normative Theory Lab at the Australian National University and Johns Hopkins University, where she builds evaluation pipelines for normative competence in LLMs under Seth Lazar. She is also a Research Fellow at the Supervised Program for Alignment Research (SPAR), working on model organisms for conditional misalignment using mechanistic interpretability under the supervision of Shivam Raval. Her work includes contrastive activation addition and safety-oriented activation steering on Qwen2.5 models, self-correction probing on DeepSeek-R1, and preference coherence evaluation pipelines for internal model consistency. She holds an M.S. in Data Science and a B.S. in Biomedical Sciences from St. John’s University.

Robert Amanfu is a data scientist focused on building production ML systems for fraud detection, identity verification, and financial risk modeling. He holds a PhD in Biomedical Engineering from the University of Virginia, where his research focused on computational modeling for personalized drug therapies.

Krystal Jackson is a Non-Resident Research Fellow at UC Berkeley’s Center for Long-Term Cybersecurity, where she works on AI security and risk management for general-purpose and frontier AI systems, including work cited by NIST. She has contributed to national and international AI safety and standards-setting efforts and previously held government roles focused on AI policy and cybersecurity.

Anagha Late is a public interest technologist and cybersecurity and technology policy practitioner specializing in privacy engineering, AI safety, and algorithmic accountability. She consults with the GovAI Coalition, designing governance frameworks for algorithmic fairness and advising public-sector organizations on responsible AI adoption. Her work includes strengthening the cybersecurity posture of municipal and cross-jurisdictional civic-tech systems. She holds a Master of Information and Cybersecurity from UC Berkeley and a Graduate Certificate in Technology Policy from the Goldman School, with a background in software engineering and cybersecurity.

Lawrence Wagner has over ten years of experience in project management, cybersecurity, and entrepreneurship. He previously served as a Research Manager with the ML Alignment & Theory Scholars (MATS) program, where he supported AI safety research across fellows and mentors. He has also conducted research at UC Berkeley focused on AI governance, risk management, and the intersection of technical systems with policy and cybersecurity.

Abigail Yohannes is a threat data analyst and policy professional with experience identifying abuse patterns, investigating safety failures, and translating technical findings into governance-ready insights. At Ambient.ai, she works with AI-driven systems across 250+ enterprise environments, contributing to detection frameworks and risk documentation that connect operational data to policy decision-making.

Together, the team combines hands-on experience in building and evaluating AI systems with expertise in interpretability, governance, and real-world deployment, which is directly aligned with SPEC-GAP's technical and research demands.

Our Advisors

Ryan Kidd - Co-Executive Director at MATS
Amy Chang - Head of AI Threat Intelligence & Security Research at Cisco
Jonas Kgomo - Founder of Equiano Institute
Dr. Ayubo Tuodolo - Founder & Executive Director of AI Safety Nigeria

What are the most likely causes and outcomes if this project fails?

The main risk is that the signal at the point of compromise is not sufficiently clean for reliable detection. This can happen if adversarial scenarios are too obvious and get filtered early, or too subtle and do not produce measurable differences in internal representations. There is also a risk that probes learn surface patterns rather than the underlying shift, failing to generalize across scenarios.

There is a separate engineering risk. Multi-agent pipelines with full logging and activation extraction are difficult to build and debug, and the cost of iteration is high.

Even in failure, the output remains useful:

A dataset of multi-agent exploit trajectories
Evidence about where current interpretability methods break down
A clearer picture of where detection fails in multi-agent systems

How much money have you raised in the last 12 months, and from where?

None. We are currently preparing these experiments for submission to the Schmidt Science 2026 Interpretability RFP and the LTFF grant.