Project summary
One month of full-time work contributing software to Inspect.
What are this project's goals? How will you achieve them?
Goals:
Make improvements to Inspect.
Add more evals to Inspect.
Test my fit as a software engineer for evals.
Build career capital.
How I'll achieve them:
Concrete ways:
Port benchmarks not yet in Inspect (e.g. TheAgentCompany, RE-Bench, and MLGym).
Develop Python packages implementing collections of Inspect solvers, tools, and scorers (e.g. Inspect Cyber).
Implement realistic test environments like WebArena for testing a wider range of agent scenarios in contained settings.
Build tools for analyzing log files and reviewing transcripts to identify reasons for failure (if Docent doesn't already cover all of this).
Build tools for presenting collections of results in dashboards (i.e. contribute to https://github.com/ArcadiaImpact/inspect_evals_dashboard).
Build tools for LM agents to use (e.g. search through https://github.com/aorwall/moatless-tools for tools that might be useful and implement them in Inspect).
Default ways/in general:
How will this funding be used?
This funding is meant to replace as much of my industry salary as possible (full replacement would be about $15,000 per month).
Who is on your team? What's your track record on similar projects?
Just me. I maintain open-source projects like SAELens, Neuronpedia, and SAEDashboard.
What are the most likely causes and outcomes if this project fails?
Causes:
Outcomes:
I don't:
Make any significant improvements to Inspect.
Add many evals to Inspect.
Learn my fit as a software engineer for evals.
Build career capital.
How much money have you raised in the last 12 months, and from where?
None.