Inspect Evals is an open-source repository of AI evaluations used by the UK AISI, METR, RAND, US CAISI and other organisations for safety testing and capability assessment. We need $50k USD in bridge funding to maintain operations while transitioning to a diversified funding model.
In this proposal, we provide: 1) background about the project, 2) the rationale for funding it at this time, and 3) supporting testimonials from users.
This project is run by ASET (Arcadia Impact)
Inspect Evals is a core component of the evaluations ecosystem. Currently it is the largest centralised repository of open-source evaluations, with 90+ evaluations as of September 2025. It provides a standardised way to run evals, which would otherwise require using unmaintained research code from individual repos.
As the maintainers of Inspect Evals, we work with open-source contributors to onboard evaluations, bring them up to standard, and then adjust and refine them according to user feedback. When capacity is available and the need arises, we also contribute to developing best practices and open-source tooling for development, quality assurance and maintenance of evals.
Currently, the repository provides value to a variety of users, ranging from high-impact organisations such as the UK AISI, the US CAISI, and METR, through to individual researchers. Use cases include:
Modifying and building on evaluations (e.g. to test evaluation awareness)
Measuring capability uplift during elicitation experiments (e.g. developing agent scaffolding)
Using Inspect Evals benchmarks to verify that infrastructure is configured correctly for new models (e.g. before testing exercises)
Using capability evaluations to sense-check unusual results during testing exercises
Dashboards and statistical analyses for understanding trends across benchmarks
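To make the standardisation concrete, here is a minimal sketch of running one of the repository's benchmarks through the Inspect framework's Python API. The specific task (GPQA Diamond), model string and import path are illustrative and may differ slightly between versions:

```python
from inspect_ai import eval
from inspect_evals.gpqa import gpqa_diamond

# Run a small slice of the GPQA Diamond benchmark against a chosen model.
# Swapping the model string is all that's needed to compare providers, and
# the same pattern applies to the other evaluations in the repository.
eval(gpqa_diamond, model="openai/gpt-4o", limit=10)
```

The same invocation pattern supports the use cases above, from infrastructure smoke tests before a testing exercise through to capability baselines during elicitation experiments.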
As maintainers, we have three primary objectives:
Coverage across use cases: AI safety researchers have access to evaluations that meet their needs
Meeting requirements: evaluations satisfy user requirements for quality, reliability, scalability and configurability
Responsiveness: issues and PRs from open-source contributors and users are addressed quickly
Inspect Evals is currently managed by Arcadia Impact, under its AI Safety Engineering Taskforce (ASET).
ASET was jointly launched by Justin Olive (Head of AI Safety) and Alexa Abbas (Technical Project Manager).
Celia Waggoner joined as ASET's Technical Program Manager in December 2024.
Inspect Evals started as a joint project between the UK AISI, Vector Institute, and ASET. It was officially launched in November 2024. ASET has been contracted by the UK AISI since January '25 to manage Inspect Evals.
Up to this point, we've delivered work with a cumulative value of ~£110k. (The exact amount can't be determined precisely, because we also delivered related but separate work as part of the first contract.)
In many cases, we've paid contractors from Arcadia's own budget in order to maintain sufficient management capacity and deliver high-quality outcomes. For example, we paid analysts to develop a cost-estimation model for more accurate budgeting and resource planning.
Since our first contract in January, Inspect Evals has steadily gained users and contributors, which has increased the level of maintenance effort required (often in the form of PR reviews).
It is not surprising that there is significant work involved in onboarding and maintaining evaluations:
There are a wide range of evaluations which are useful in AI safety research (e.g. cyber, AI R&D, safeguards, as well as general agentic capabilities / knowledge)
Evaluations are used for many different purposes; for example, large-scale testing exercises and ML research have very different requirements.
Research code is notoriously low quality, which provides a suboptimal starting point for meeting user requirements
Evals are technically complex: the tools and containerised environments they depend on create challenges with managing dependencies and with configuring sandboxes to be secure, scalable and reliable.
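As a rough illustration of that complexity, the sketch below uses Inspect's primitives to define a toy, single-sample agentic task (the task itself is made up): even this minimal example needs a tool, a solver chain, a scorer and a Docker sandbox, and real evaluations layer per-sample images, pinned dependencies and security configuration on top.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash

@task
def sandboxed_demo() -> Task:
    # A toy agentic task: the model is given a bash tool whose commands run
    # inside an isolated Docker sandbox rather than on the host machine.
    return Task(
        dataset=[
            Sample(
                input="Use the bash tool to compute 6 * 7 and report the result.",
                target="42",
            )
        ],
        solver=[use_tools(bash()), generate()],
        scorer=includes(),
        sandbox="docker",
    )
```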
For the period of September-November, we expect to have £53k in project funding through a contract with the UK AISI (from the team which has funded the project since its inception). In this most recent contract, the budget has decreased due to a new requirement to include VAT (20%), which is deducted from the total value.
In our most recent proposal, we requested that our capacity be increased from 1.5 FTE to 3.0 FTE. Due to organisational budget constraints and the new VAT requirement, our operating capacity has instead been reduced to 1.2 FTE. We've been advised that we will need to seek additional funding to achieve our project objectives, which are described below:
Feedback from users and contributors has indicated that slow response times are a significant pain-point. For this reason we established response-time standards for business-as-usual tasks.
On some occasions we also receive urgent requests from high-profile users that need to be addressed within ~24 hours. For this reason, we try to retain buffer capacity for urgent / high-priority tasks when they arrive.
Our response-time standards, which were proposed and endorsed by contributors, are summarised below:
Pull requests:
Initial response within 3 days
Follow-up responses within 24 hours
Issues (bugs):
Response within 3 days (this usually includes resolving the bug, but this may not apply in all cases)
Based on historical data, we require at least 68 hours of SWE capacity to achieve an >80% success rate in meeting these thresholds. This does not account for:
Increasing demand (i.e. increased issues and PRs)
High urgency requests
Currently we are on track to have less than 45 hours of SWE capacity (i.e. 66% of the bare minimum under generous assumptions). This will result in reduced responsiveness, and likely a build-up of PRs and unresolved bugs.
Due to the above-mentioned capacity limitations, we are unable to carry out work which proactively ensures evaluations meet user requirements. This means that staff at organisations such as the UK AISI and US CAISI currently experience many unnecessary pain-points.
Why are there so many pain-points?
Many evaluations are used in complicated, high-stakes situations (such as pre-deployment testing). However, they are not developed by their original authors with these use cases in mind, and there are insufficient incentives to carry out time-consuming QA and implement software engineering best practices (e.g. testing), let alone implement more advanced functionality that helps meet requirements around configurability and scalability.
Our understanding of user requirements and pain points is grounded in research undertaken in the past several months:
June '25: we interviewed over a dozen researchers and engineers working in evals (e.g. from Apollo, UK AISI, Epoch, Cambridge, Oxford). The findings are available in our recently updated Evals Report
September '25: we put out a call for feedback to users of Inspect Evals, and we received detailed responses from 6 UK AISI staff and 2 US CAISI staff.
For those who are interested, we can provide further technical details (e.g. specific statements, requests and issues raised).
This feedback highlighted a range of issues, including:
Quality: Trust issues with implementations require users to conduct their own QA checks, which wastes time and duplicates effort. Problems include errors in implementations that cause evaluations to crash; silent failures such as undetected reward hacking or poorly configured tools; and issues with underlying dataset quality, as seen recently with SWE-Bench.
Configurability: Users want to be able to easily modify evaluations. At present, making changes to scorers, solvers, and sandboxes often requires editing the source code, which introduces pain-points and reliability issues (a pattern that addresses this is sketched after this list).
Reliability: Evaluations rely on underlying dependencies which can change. When dependencies change or come into conflict, evaluations can stop working entirely (as when HuggingFace's datasets library moved to 4.0.0) or silently start behaving differently.
Scalability: Users need parallelised runs across hundreds of instances but face orchestration complexity and error-handling difficulties during large-scale deployment. High-complexity evaluations need to be configured to be more scalable, and users have also asked for greater access to low-complexity agentic evaluations (e.g. those which don't require sandboxes).
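To illustrate the kind of configurability users are asking for, here is a hedged sketch (the task name and dataset file are hypothetical) of the pattern we aim for: tasks that expose the solver and scorer as parameters with sensible defaults, so researchers can substitute their own agent scaffolding or grading logic without forking the source.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import Scorer, match
from inspect_ai.solver import Solver, generate

@task
def example_benchmark(solver: Solver | None = None, scorer: Scorer | None = None) -> Task:
    # Defaults reproduce the published benchmark; passing a custom solver or
    # scorer lets users run elicitation or scoring experiments without
    # editing the repository's source code.
    return Task(
        dataset=json_dataset("example_dataset.jsonl"),
        solver=solver or generate(),
        scorer=scorer or match(),
    )
```

With this pattern, a user can call, for instance, example_benchmark(solver=my_agent) from their own code instead of patching the repository.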
To assist in meeting user requirements, we recently hired Jay Bailey to fill the role of technical lead for this project; Jay spent 20 months as a research engineer in the UK AISI Cyber & Autonomous Systems team.
An additional $50k USD would bring our budget for September-November to ~£90k, which would cover all our staff costs (equivalent to ~3.0 FTE), ensuring we have the necessary capacity to meet our goals.
More specifically, it would allow us to uphold our responsiveness standards while also making improvements to the repository that address the four areas of user requirements described above. Having this capacity is also essential for long-term sustainability, because it provides a window of opportunity to develop our long-term approach to partnerships, product management and fundraising.
If we do not have this funding, we will need to either:
Reduce our SWE capacity. This would significantly disrupt service delivery (i.e. most PR reviews and bug fixes would essentially come to a halt)
Significantly reduce our leadership capacity (i.e. limiting our ability to coordinate with users, funders and other stakeholders, not to mention managing the SWE team to ensure tasks are completed on time and according to user requirements)
Celia Waggoner (Technical Program Manager/Engineering Manager) (~20 hrs/wk)
Senior engineering manager with 10 years of tech industry experience. Previously at GoDaddy, promoted from entry-level SWE to senior engineering manager in under 6 years.
Responsibilities include:
Reviewing & responding to incoming communications, PRs and issues
Prioritising and delegating tasks
Reviewing code & providing feedback to SWEs
Facilitating stand-ups and internal communications
Jay Bailey (Technical Lead) (40 hrs/wk - planning to start in mid-October)
Jay was a founding member of the UK AISI, where he worked as a research engineer for 1.5 years. Previously he worked as a SWE at Amazon, and also participated in the MATS ‘23 Winter cohort.
Alexander Putilin
UK-based software engineer with over 12 years of technical experience, including previous work at Facebook. Alexander has been working with us since February '25, and will be balancing his Inspect Evals work with contract work for Anthropic (4-8 hrs/wk).
Matt Fisher
Melbourne-based software engineer with 20 years of SWE experience. Joined ASET as a volunteer contributor in July 2024, and has contracted with us as our primary senior engineer since January '25. (20-30 hrs/wk).
Scott Simmons
Sydney-based engineer with a Bachelor of Engineering Science from New Zealand's top university (A+ average). Currently a Senior SWE at CommBank while contributing 15-20 hrs/wk.
Tania Sadhani
Sydney-based engineer with a Bachelor of Data Analytics from Australia's top university (High Distinction average). Currently researching AI interpretability for an honours degree while contributing 10-15 hrs/wk.
In support of this funding proposal, we requested supporting statements from users of Inspect Evals. We have included all statements that we received.
Michael Schmatz, Technical Staff, Cyber & Autonomous Systems Team:
Inspect Evals has been internationally influential - as you can read here, CAISI based their Cybench implementation off of the public fork in Inspect Evals.
It makes it really easy to run public general capabilities evals, which we used to do in the Autonomous Systems team quite a bit and I know some other groups within AISI may be interested in doing.
It provides a useful repository of examples of all sorts of evaluations, which makes writing new evaluations easier, especially for people getting started in Inspect.
Joe Skinner, Technical Staff, Cyber & Autonomous Systems Team:
Inspect evals is core to the work I do as a researcher. It provides access to a large repository of high-quality evals and benchmarks that are easy to run and use consistent infrastructure, allowing me to ensure experimental consistency across the work I do.
Art O'Cathain, Software Engineer, Core Technology Team:
In addition to others' points, I would add that bug reports and feature requests originating from users of Inspect Evals have been very useful to guide the development of Inspect core
Sid Black, Technical Staff, White Box Evaluation Team:
Inspect evals has been transformative for our research team, giving us the means to test models against a much wider range of evaluations and make stronger, more generalizable claims in our research than we would have been able to without it. The maintainers are always responsive, and the package is an excellent resource that enables the research community to run reproducible, scalable and complex evaluations with ease.
Ole Jorgensen, Technical Staff, Chem-Bio Team:
Inspect Evals allows me to quickly and easily test hypotheses about new models. It provides a great resource to the evals community and I'm grateful it exists.
Jay Chooi (from AISST at Harvard, intern at AWS)
Reviews are incredibly fast, usually within 24h, and the maintainers consistently follow-up until the PR is shipped.
The maintainers are professional, give clear, useful feedback and code suggestions, and work together with the PR-authors to ensure that the PR is production-ready.
I'm pleasantly surprised by the quick turnaround on the maintainers' part, which makes Inspect Evals a super exciting open source community to contribute to.
Contributing feels great with Inspect Evals because each benchmark is clearly scoped while having its own complexities to be incorporated.
Pranshu (SWE at Cloudera)
Contributing to the inspect_evals repository has been an immensely rewarding and collaborative experience. I was drawn to contribute by the variety of innovative AI evaluations created by the research community. The documentation was clear and precise; I was able to get the tool running without any difficulties.
The maintainers promptly handled everything from answering queries to reviewing pull requests and resolved conflicts pragmatically while keeping everyone's interests in mind. Working with this project has deepened my understanding of evaluation frameworks and enhanced my expertise in their technical implementation.
Anthony Duong (Inspect contributor via Manifund)
Five out of the six PRs I created were reviewed within 3 business days, and the reviews contained good suggestions, which helped me learn. Maintainers helped me merge all of my PRs, testing my changes themselves, and providing clear feedback. The maintainers' responsiveness in either reviewing PRs, or letting me know when they'd review them, and there being standards and processes in the contributing guide surprised me. The amount of work the contributors put into developer experience and testing made contributing feel smooth, and the maintainers always made me feel that they valued my contributions.
Anyone who is interested in learning more specific details about Inspect Evals can reach out to me at justin@arcadiaimpact.org, or book a meeting via my calendar. (There are a few details about budget changes and other sensitive strategic or technical factors that aren't provided here, so we'd be happy to share these in more detail with anyone who is interested.)
Feel free to get in contact even if you're not looking to fund our work; it's always a pleasure to hear thoughts, feedback and questions from the community.
Justin Olive
4 days ago
We're really just getting started with what I think we'll achieve with Inspect Evals
This team is truly a pleasure to work with