Recent work from Redwood Research studies whether an LLM attack policy can perform a simulated malware task while fooling an LLM+human blue team into thinking its actions are benign. However, the simulated malware task is quite limited. In particular, it leaves out the part of the cyber attack process where the attacker has to find one or more vulnerabilities in the target software, find a way to exploit them to achieve some effect, and put together a payload to perform the exploit. These stages are often laborious in human cyber attacks, and might involve more obviously suspicious activities than the task studied by Redwood. This project would build on Redwood's work: extend the simulated malware task to include these stages, design new attack policies and blue team strategies tailored to the extended task, and assess the resulting safety/usefulness tradeoff curve.
This project is useful for the same reason that the Redwood project it builds on is useful: it frontloads the process of designing effective AI control harnesses and assessing how effective they are.
I'm not sure how much compute/API funding this project will need. The 2k minimum seems like enough that I'll be able to produce something, but I will have to think very carefully about how to spend my budget and will probably have to make some significant compromises. The 6k goal seems like enough to produce good results without being too constrained by budget, provided I still spend carefully.
Get Redwood's setup working
Design a few enhanced malware tasks, probably with the help of some colleagues from the SEI in their personal capacity
Modify Redwood's task environment to support each enhanced malware task
Run evaluations with the most important configurations studied by Redwood
Based on analysis and inspection of the behavior in those evaluations, iterate on the attack policy and blue team policy so they perform better in the new setting, implement the new policies, and re-run the evaluations
Write up and publish results, probably as an arXiv paper and accompanying blog post.
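As a rough illustration of what the evaluation steps above would feed into, here is a toy sketch of how a safety/usefulness tradeoff curve could be traced out by sweeping a blue team suspicion threshold. Everything in it is an assumption for illustration: the function name `tradeoff_curve`, the idea of a single scalar suspicion score per trajectory, and the simplified proxies for usefulness (fraction of benign runs not flagged) and safety (fraction of attack runs flagged). It is not Redwood's actual metric definition or code.

```python
from typing import List, Sequence, Tuple


def tradeoff_curve(
    benign_scores: Sequence[float],
    attack_scores: Sequence[float],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, usefulness, safety) for each threshold.

    Hypothetical simplification: a trajectory is flagged when its
    suspicion score is >= the threshold.
      usefulness = fraction of benign trajectories NOT flagged
      safety     = fraction of attack trajectories flagged
    """
    if not benign_scores or not attack_scores:
        raise ValueError("need at least one benign and one attack score")

    curve = []
    for t in thresholds:
        usefulness = sum(s < t for s in benign_scores) / len(benign_scores)
        safety = sum(s >= t for s in attack_scores) / len(attack_scores)
        curve.append((t, usefulness, safety))
    return curve


if __name__ == "__main__":
    # Made-up scores, purely to show the shape of the output.
    benign = [0.1, 0.2, 0.6]
    attack = [0.5, 0.9]
    for t, usefulness, safety in tradeoff_curve(benign, attack, [0.0, 0.55, 1.0]):
        print(f"threshold={t:.2f} usefulness={usefulness:.2f} safety={safety:.2f}")
```

A low threshold flags everything (high safety, low usefulness) and a high threshold flags nothing (the reverse); comparing attack and blue team policies then amounts to comparing where their curves sit between those extremes.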
Compute/API budget to run the experiments. I'll work on this unpaid on my own time.
I'm working independently.
I've previously led two SPAR projects developing novel LLM evaluations.
I have to run this project past the SEI conflict of interest approval process. If it is not approved, the project will not go ahead and I'll return all of the money.
I might decide that a different project is a better use of my time, in which case I'll finish proximate deliverables, write up my work so far, share my code and return remaining money.
I might make some mistake when running an expensive experiment (bad experimental design or bad implementation) that leads to the money being spent without producing useful results. If I catch the mistake, I'll probably pay out of pocket to redo the experiment, but I might apply for more funding, or as a last resort just write up the results, describe my mistake, and hope someone else will decide to carry on and find funding. If I don't catch the mistake, I'll publish wrong or useless results.
I got a 55k grant (45k stipend, 10k compute/API) from the Open Philanthropy Career Development and Transition Fund for six months' independent research, during which I led my two SPAR projects.
I currently work full-time for the AI Security team at CMU SEI CERT, where I make 95k/year.