Project summary
I’m seeking career-transition support to work 4 days/week in AI safety with a focus on dangerous capability evaluation and control.
My near-term research interest is in evaluation methodology to measure model covert capability time horizons against control protocol monitors. With a first exploration into cyber-offense covert tasks during ordinary software work.
I have a work-in-progress proposal that sketches a broader agenda. Early feedback has been positive but suggested a significant narrowing of scope, which I’m doing.
North Star: Contribute to robust, decision-useful metrics of model covert capability and control protocols.
How much covert work models can do, while working on legitimate assigned tasks without detection, aligned to real risk profiles (e.g., loss-of-control)?
How do Control Protocols change the scope and profile of model covert capabilities?
The time-horizon framing (“covert horizons”) is my guiding hypothesis here.
METR’s time-horizon methodology coupled with AI Control primitives provides us with a useful approach to measuring model covert capability performance – measuring the time horizons at which an untrusted model U ,can achieve a covert objective C, while completing a legitimate task T, without detection under a monitor M
Goals (examples, not commitments)
Build research expertise and early research outputs to contribute to this North Star.
Build a small, tightly specified dataset of tasks T and C, pairing ordinary software tasks with cyber-offense side goals.
Iterate on the dataset and time horizon methodology until they reliably elicit the target covert behaviour (finish the main task and complete the side goal without detection) and I have high-conviction evidence of soundness, measurement validity, and the intuition to explain why.
Run early pre-pilot human evaluation probes e.g. self-tests or N=1 contractor evaluations to sanity-check the above before considering anything larger.
Contribute to control and evaluation open-source software ecosystem e.g. ControlArena and Inspect
How I’ll work
Stay flexible and pivot toward the highest-leverage track (e.g., task creation, baselining, or even research agenda pivots)
Maintain regular contact with mentors/collaborators (AI control, model evaluation, cybersecurity, statistics)
Prioritise measurement validity over scale: keep dataset small and conditions limited; aim for high internal validity and repeatability within the “covert horizons” framing.
Being deliberately mindful of: elicitation and refusals/sandbagging; agent setup and scaffolding; token/context budgets; task-pairing dynamics; construct validity; and the unknown-unknowns that I anticipate will surface along the way.
3 months, total $32,000 USD, start Nov 1
Stipend: $30,000 (4 days/week; equipment/tooling folded in)
Model API Credits: $2,000
Team: just me (Blue Mountains/Sydney, Australia)
Track record:
Over a decade of experience as a software engineer working at the intersection of a variety of complex scientific domains (cell culture, proteomics, microkernels, astronomy)
For the last four years I've worked at Vow, who have grown from a small-time Australian startup into the world leaders in cell cultured meat.
Early work related to my targeted research agenda include
AI Task Length Horizons in Offensive Cybersecurity [Complete]
Measuring Deceptive Capability with Human Time Horizons and Control Protocols [Draft project proposal]
Transparency
I have not worked in a full-time researcher role in a very long time. Always either a research engineer or research adjacent software engineer.
I have never worked in an independently funded capacity.
Causes
Isolation/time-zone: Working from (semi-rural) Australia makes it harder to stay plugged into the research loop. I may fail to engage well with collaborators and mentors.
Ramp-up risk: There are a number of domains that will require deepening of expertise and I may struggle; AI Control, cybersecurity, statistical methods
Mitigations
Regular and multiple mentor/collaborator catch-ups across AI control, eval, cybersecurity, and statistics.
A small weekly report mailing list for high-quality and interested readers who feel comfortable giving me very candid feedback!
Outcomes
I don’t ship useful research or code, and I don’t connect meaningfully with the community.
I would pivot back to safety-aligned software engineering roles.
If that isn’t viable, I would need to pivot out of AI safety
$0.
Pending: UK Alignment Fund (submitted)
Pending: Open Phil career-transition (in progress).
If this grant (or any other grant) meets my funding needs, I’ll coordinate/withdraw to avoid double funding.