What progress have you made since your last update?
We began by drafting a methodology for agent-centred threat assessments that considered an agent’s autonomy, action surfaces, tool access, affordances, capabilities, propensities, and deployment scale. The goal was to recommend safeguards based on a theoretical analysis.
However, the release of Anthropic’s Agentic Misalignment study (Lynch et al., 2025) highlighted two key points:
Misaligned behaviour emerged only when agents experienced goal conflicts or autonomy threats; our draft methodology did not incorporate this key consideration of 'triggers'.
Without empirical testing, recommended safeguards might not actually prevent such behaviours.
In response, we updated our approach. Rather than only theorising about safeguards, we first designed a set of preventative mitigations for agentic misalignment, inspired by insider risk controls, and tested them on the original Agentic Misalignment scenario to measure how effective they are in practice.
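To give a flavour of what a prompt-level preventative control of this kind might look like, here is a minimal sketch, loosely modelled on insider-risk "policy acknowledgement" measures. The policy wording and function name are illustrative placeholders, not the actual mitigations from our study:

```python
# Hypothetical prompt-level preventative control, loosely modelled on
# insider-risk "policy acknowledgement" measures. The clause wording and
# names below are illustrative placeholders, not the study's mitigations.

POLICY_CLAUSE = (
    "Company policy: never use confidential information as leverage, and "
    "escalate any conflict between your goals and your instructions to a "
    "human operator instead of acting unilaterally."
)

def apply_policy_mitigation(system_prompt: str) -> str:
    """Append an explicit acceptable-use policy to the agent's system prompt."""
    return f"{system_prompt}\n\n{POLICY_CLAUSE}"
```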
Part 1 – Empirical Study
We conducted an empirical evaluation of five mitigation types across ten models and 66,000 trials, using Anthropic's original framework. We are now finalising the written report, which will be published in the next 2–3 weeks.
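For readers curious about the shape of the evaluation, the sketch below shows the kind of mitigation-by-model grid involved. All names (MITIGATIONS, MODELS, run_trial), the trial count, and the simulated outcomes are assumed placeholders, not our actual harness, which builds on Anthropic's framework:

```python
# Hypothetical sketch of the evaluation grid: a baseline plus mitigation
# variants, run across several models with repeated trials per cell.
# Everything here is a placeholder standing in for the real harness.
import itertools
import random

MITIGATIONS = ["baseline"] + [f"mitigation_{i}" for i in range(1, 6)]  # five mitigation types
MODELS = [f"model_{i}" for i in range(1, 11)]  # ten models
TRIALS_PER_CELL = 50  # arbitrary for this sketch

def run_trial(model: str, mitigation: str, seed: int) -> bool:
    """Stand-in for one trial: a real implementation would build the
    scenario prompt, apply the mitigation, sample the model, and classify
    whether the transcript contains the harmful action. Here we simulate
    the outcome so the sketch runs end to end; True means misaligned
    behaviour was observed."""
    rng = random.Random(f"{model}|{mitigation}|{seed}")
    base_rate = 0.4 if mitigation == "baseline" else 0.1  # made-up rates
    return rng.random() < base_rate

def main() -> None:
    # Report the misalignment rate for every (model, mitigation) cell.
    for model, mitigation in itertools.product(MODELS, MITIGATIONS):
        hits = sum(run_trial(model, mitigation, s) for s in range(TRIALS_PER_CELL))
        print(f"{model:10s} {mitigation:14s} misalignment rate = {hits / TRIALS_PER_CELL:.0%}")

if __name__ == "__main__":
    main()
```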
Part 2 – Practical Outputs
To make the results directly useful to AI safety researchers and developers, and to honour the original aims of this project, we will also release:
A repeatable methodology for identifying threat models and likely pathways to agentic misalignment harm in agentic AI systems.
A mapping of existing safeguards, covering both preventative controls (tested empirically) and detective (monitoring) controls (see the illustrative sketch after this list).
An analysis of safeguard gaps and potential harm areas given today’s available control portfolio.
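As a rough illustration of the shape the safeguard mapping might take, here is a minimal sketch; the example controls listed are generic placeholders, not entries from the final mapping:

```python
# Illustrative shape of the planned safeguard mapping. The entries are
# generic examples of each control type, not the final mapping itself.
SAFEGUARD_MAP = {
    "preventative": [  # controls tested empirically in Part 1
        "explicit policy instructions in the system prompt",
        "restricting tool access and action surfaces",
    ],
    "detective": [  # monitoring controls
        "transcript review for escalation-worthy actions",
        "anomaly detection on agent tool calls",
    ],
}
```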
What are your next steps?
Publish the empirical testing results in a research paper in the next 2–3 weeks. We will also release our code so that others can replicate our results or test different mitigations.
The Part 2 outputs will follow afterwards, most likely in a blog post format.
Is there anything others could help you with?
Feedback on the work once it is published, and help sharing it with others interested in this topic.