Project summary
I’m seeking career-transition support to work 4 days/week in AI safety with a focus on dangerous capability evaluation and control.
My near-term research interest is in evaluation methodology to measure model covert capability time horizons against control protocol monitors. With a first exploration into cyber-offense covert tasks during ordinary software work.
I have a work-in-progress proposal that sketches a broader agenda. Early feedback has been positive but suggested a significant narrowing of scope, which I’m doing.
What are this project's goals? How will you achieve them?
North Star: Contribute to robust, decision-useful metrics of model covert capability and control protocols.
How much covert work models can do, while working on legitimate assigned tasks without detection, aligned to real risk profiles (e.g., loss-of-control)?
How do Control Protocols change the scope and profile of model covert capabilities?
The time-horizon framing (“covert horizons”) is my guiding hypothesis here.
METR’s time-horizon methodology coupled with AI Control primitives provides us with a useful approach to measuring model covert capability performance – measuring the time horizons at which an untrusted model U ,can achieve a covert objective C, while completing a legitimate task T, without detection under a monitor M
Goals (examples, not commitments)
How I’ll work
Stay flexible and pivot toward the highest-leverage track (e.g., task creation, baselining, or even research agenda pivots)
Maintain regular contact with mentors/collaborators (AI control, model evaluation, cybersecurity, statistics)
Prioritise measurement validity over scale: keep dataset small and conditions limited; aim for high internal validity and repeatability within the “covert horizons” framing.
Being deliberately mindful of: elicitation and refusals/sandbagging; agent setup and scaffolding; token/context budgets; task-pairing dynamics; construct validity; and the unknown-unknowns that I anticipate will surface along the way.
How will this funding be used?
3 months, total $32,000 USD, start Nov 1
Stipend: $30,000 (4 days/week; equipment/tooling folded in)
Model API Credits: $2,000
Who is on your team? What's your track record on similar projects?
Team: just me (Blue Mountains/Sydney, Australia)
Track record:
Over a decade of experience as a software engineer working at the intersection of a variety of complex scientific domains (cell culture, proteomics, microkernels, astronomy)
For the last four years I've worked at Vow, who have grown from a small-time Australian startup into the world leaders in cell cultured meat.
Early work related to my targeted research agenda include
Transparency
What are the most likely causes and outcomes if this project fails?
Causes
Isolation/time-zone: Working from (semi-rural) Australia makes it harder to stay plugged into the research loop. I may fail to engage well with collaborators and mentors.
Ramp-up risk: There are a number of domains that will require deepening of expertise and I may struggle; AI Control, cybersecurity, statistical methods
Mitigations
Regular and multiple mentor/collaborator catch-ups across AI control, eval, cybersecurity, and statistics.
A small weekly report mailing list for high-quality and interested readers who feel comfortable giving me very candid feedback!
Outcomes
I don’t ship useful research or code, and I don’t connect meaningfully with the community.
I would pivot back to safety-aligned software engineering roles.
If that isn’t viable, I would need to pivot out of AI safety
How much money have you raised in the last 12 months, and from where?
$0.
If this grant (or any other grant) meets my funding needs, I’ll coordinate/withdraw to avoid double funding.