Please note that the defensive benchmark's website currently has three scenarios published, but there will be a fourth scenario – focused on LLMs' ability to mitigate 'active scanning' attacks – published in January '26.
You're pledging to donate if the project hits its minimum goal and gets approved. If not, your funds will be returned.
We're building the first benchmark to evaluate whether frontier AI systems can autonomously execute full offensive cyber kill chains – from reconnaissance through data exfiltration. This addresses a critical gap: labs currently lack empirical data on offensive AI capabilities, making informed deployment decisions impossible.
1) Create 25-40 offensive cyber scenarios across kill chain stages (mobile exploitation, multi-host coordination, stealth operations)
2) Develop metrics for stealth (IDS evasion), efficiency (steps to completion), and autonomy (scaffolding dependency)
3) Test 8-10 frontier models (GPT-4, Claude, Llama, DeepSeek)
4) Release open-source benchmark platform with Dockerized deployment
5) Publish findings and brief policymakers (UK AISI, CAISI, frontier labs)
Scenarios designed by expert veterans cyberwarfare operations who have executed these exact missions. We extend proven infrastructure from our Coefficient Giving-funded defensive benchmark.
~85% – Technical partners: scenario design, infrastructure, red-team validation
~5% – Personnel: project leadership, mobile security consultants
~10% – Infrastructure, compute, API access, legal/admin
Note: Manifund funding could be combined with other sources (we're also applying to SFF).
Alex Leader (PI): Leading the $2.1M Coefficient Giving defensive cybersecurity benchmark. Background in AI policy and research operations.
Former U.S. military cyberwarfare operators with direct kill chain execution experience. Built the tooling and software and designed the scenarios for our cyber defense benchmark.
NYU Center for Cyber Security faculty providing academic validation and scenario ideation.
Track record:
Team members have conducted network exploitation, persistence operations, and adversary emulation in real-world environments
Our proprietary middleware and on-device 'agents' have been validated by major U.S. defense research institutions
Technical partners bring years of experience designing training scenarios for government cyber ranges and red team exercises
Proven ability to translate operational tradecraft into structured, repeatable evaluation frameworks
Successfully delivered on current Coefficient Giving grant milestones on schedule and within budget
Insufficient funding to engage technical partners at scale needed for operationally realistic scenarios
Frontier models underperform, producing negative results with limited governance value
Timeline slippage due to scenario complexity or coordination challenges
We haven't raised any money for this specific project in the last 12 months; we are starting from scratch.
Alex Leader
about 11 hours ago
Please note that the defensive benchmark's website currently has three scenarios published, but there will be a fourth scenario – focused on LLMs' ability to mitigate 'active scanning' attacks – published in January '26.
Alex Leader
about 11 hours ago
Our current defensive-focused benchmark can be viewed here: http://www.benchmark-spotlightsecurity.com/
If you are asked to submit log-in credentials, they are:
Username: admin-spot
Password: spotlight4lyf