Project summary
Recent work from Redwood Research studies whether an LLM attack policy can perform a simulated cyberattack task while fooling an LLM+human blue team into thinking its actions are benign. However, the simulated malware task is quite limited (download and run an arbitrary file from the internet). In particular, it leaves out the parts of the cyberattack process where the attacker has to find one or more vulnerabilities in the target software, find a way to exploit them to achieve some effect, and put together a payload to perform the exploit. These stages account for most of the labor in human-run cyberattacks, and might involve more obviously suspicious activities than the task studied by Redwood. This project would build on Redwood's work: extend their evaluation environment with a more detailed cyberattack task, design new attack policy and blue team strategies better tailored to the new setting, and assess the resulting safety/usefulness tradeoff curve.
This project is useful for the same reason as the Redwood project it builds on: it frontloads the work of designing effective AI control harnesses and assessing how effective they are.
I'm not sure how much compute/API funding this project will need. The 2k minimum seems like enough for me to produce something, but I'll have to think very carefully about how to spend my budget and will probably have to make some significant compromises. The 6k goal seems like enough to produce good results without being too constrained by budget, provided I still spend carefully.
What are this project's goals? How will you achieve them?
Get Redwood's control evaluation setup working
Design a few extended cyberattack tasks, probably with the help of some colleagues from the SEI (also acting in their personal capacity)
Modify Redwood's task environment to support each extended cyberattack task
Re-run the evaluations in the new environment, using the most important policy configurations studied by Redwood
Based on analysis and inspection of the behavior in those evaluations, iterate on the attack policy and blue team policy so they perform better in the new setting, implement the new policies, and re-run the evaluations
Write up and publish results, probably as an arXiv preprint with an accompanying blog post.
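As a rough illustration of the final assessment step, here is a minimal sketch of how a safety/usefulness tradeoff curve could be computed from monitor scores. This is a simplification under my own assumptions, not Redwood's code or definitions: `Episode`, `suspicion`, and `tradeoff_curve` are hypothetical names, "safety" is approximated as the fraction of attack episodes flagged at a given threshold, and "usefulness" as the fraction of benign episodes allowed through. A real control evaluation would also account for audit budgets and task success.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    suspicion: float  # hypothetical monitor score; higher means more suspicious
    is_attack: bool   # True if the episode came from the attack policy


def tradeoff_curve(episodes, thresholds):
    """Return (threshold, safety, usefulness) triples, one per threshold.

    Simplified definitions (assumptions, not Redwood's):
      safety     = fraction of attack episodes with suspicion >= threshold
      usefulness = fraction of benign episodes with suspicion < threshold
    """
    attacks = [e for e in episodes if e.is_attack]
    benign = [e for e in episodes if not e.is_attack]
    if not attacks or not benign:
        raise ValueError("need at least one attack and one benign episode")

    curve = []
    for t in thresholds:
        caught = sum(e.suspicion >= t for e in attacks)
        allowed = sum(e.suspicion < t for e in benign)
        curve.append((t, caught / len(attacks), allowed / len(benign)))
    return curve


# Toy data for illustration only; these are not experimental results.
episodes = [
    Episode(0.9, True), Episode(0.4, True),
    Episode(0.1, False), Episode(0.5, False),
]
curve = tradeoff_curve(episodes, [0.0, 0.45, 1.1])
for threshold, safety, usefulness in curve:
    print(f"threshold={threshold:.2f} safety={safety:.2f} usefulness={usefulness:.2f}")
```

Sweeping the threshold traces out the curve: flagging everything maximizes this notion of safety at the cost of usefulness, and flagging nothing does the reverse.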
How will this funding be used?
Compute/API budget to run the experiments. I'll work on this unpaid on my own time.
Who is on your team? What's your track record on similar projects?
I'm working independently.
I've previously led two SPAR projects developing novel LLM evaluations.
What are the most likely causes and outcomes if this project fails?
I have to run this project past the SEI conflict of interest approval process. If it is not approved, nothing happens and I'll return all money.
I might decide that a different project is a better use of my time, in which case I'll finish proximate deliverables, write up my work so far, share my code, and return the remaining money.
I might make some mistake when running an expensive experiment (bad experimental design or bad implementation) that leads to the money being spent without producing useful results. If I catch the mistake, I'll probably pay out of pocket to re-run the experiment, but I might apply for more funding, or as a last resort just write up the results, describe my mistake, and hope someone else will decide to carry on and find funding. If I don't catch the mistake, I'll publish wrong or useless results.
How much money have you raised in the last 12 months, and from where?
I got a 55k grant (45k stipend, 10k compute/API) from the Open Philanthropy Career Development and Transition Fund for six months' independent research, during which I led my two SPAR projects.
I currently work full-time for the AI Security team at CMU SEI CERT, where I make 95k/year.