What progress have you made since your last update?
We began by drafting a methodology for agent-centred threat assessments that considered an agent’s autonomy, action surfaces, tool access, affordances, capabilities, propensities, and deployment scale. The goal was to recommend safeguards based on a theoretical analysis.
However, the release of Anthropic’s Agentic Misalignment study (Lynch et al., 2025) highlighted two key points:
Misaligned behaviour emerged only when agents experienced goal conflicts or autonomy threats; our draft methodology did not incorporate this key consideration of 'triggers'.
Without empirical testing, recommended safeguards might not actually prevent such behaviours.
In response, we updated our approach. Rather than only theorising about safeguards, we first designed a set of preventative mitigations for agentic misalignment, inspired by insider risk controls, and tested them on the original Agentic Misalignment scenario to measure how effective they are in practice.
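To give a flavour of what a prompt-level preventative control of this kind might look like, here is a minimal sketch, loosely modelled on insider-risk "policy acknowledgement" measures. The policy wording and function name are illustrative placeholders, not the actual mitigations from our study:

```python
# Hypothetical prompt-level preventative control, loosely modelled on
# insider-risk "policy acknowledgement" measures. The clause wording and
# names below are illustrative placeholders, not the study's mitigations.

POLICY_CLAUSE = (
    "Company policy: never use confidential information as leverage, and "
    "escalate any conflict between your goals and your instructions to a "
    "human operator instead of acting unilaterally."
)

def apply_policy_mitigation(system_prompt: str) -> str:
    """Append an explicit acceptable-use policy to the agent's system prompt."""
    return f"{system_prompt}\n\n{POLICY_CLAUSE}"
```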
Part 1 – Empirical Study
We conducted an empirical evaluation of five mitigation types across ten models and 66,000 trials, using Anthropic's original framework. We are now finalising the written report, which will be published in the next 2–3 weeks.
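For readers curious about the shape of the evaluation, the sketch below shows the kind of mitigation-by-model grid involved. All names (MITIGATIONS, MODELS, run_trial), the trial count, and the simulated outcomes are assumed placeholders, not our actual harness, which builds on Anthropic's framework:

```python
# Hypothetical sketch of the evaluation grid: a baseline plus mitigation
# variants, run across several models with repeated trials per cell.
# Everything here is a placeholder standing in for the real harness.
import itertools
import random

MITIGATIONS = ["baseline"] + [f"mitigation_{i}" for i in range(1, 6)]  # five mitigation types
MODELS = [f"model_{i}" for i in range(1, 11)]  # ten models
TRIALS_PER_CELL = 50  # arbitrary for this sketch

def run_trial(model: str, mitigation: str, seed: int) -> bool:
    """Stand-in for one trial: a real implementation would build the
    scenario prompt, apply the mitigation, sample the model, and classify
    whether the transcript contains the harmful action. Here we simulate
    the outcome so the sketch runs end to end; True means misaligned
    behaviour was observed."""
    rng = random.Random(f"{model}|{mitigation}|{seed}")
    base_rate = 0.4 if mitigation == "baseline" else 0.1  # made-up rates
    return rng.random() < base_rate

def main() -> None:
    # Report the misalignment rate for every (model, mitigation) cell.
    for model, mitigation in itertools.product(MODELS, MITIGATIONS):
        hits = sum(run_trial(model, mitigation, s) for s in range(TRIALS_PER_CELL))
        print(f"{model:10s} {mitigation:14s} misalignment rate = {hits / TRIALS_PER_CELL:.0%}")

if __name__ == "__main__":
    main()
```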
Part 2 – Practical Outputs
To make the results directly useful to AI safety researchers and developers, and to honour the original aims of this project, we will also release:
A repeatable methodology for identifying threat models and likely pathways to agentic misalignment harm in agentic AI systems.
A mapping of existing safeguards, covering both preventative controls (tested empirically) and detective (monitoring) controls (see the illustrative sketch after this list).
An analysis of safeguard gaps and potential harm areas given today’s available control portfolio.
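As a rough illustration of the shape the safeguard mapping might take, here is a minimal sketch; the example controls listed are generic placeholders, not entries from the final mapping:

```python
# Illustrative shape of the planned safeguard mapping. The entries are
# generic examples of each control type, not the final mapping itself.
SAFEGUARD_MAP = {
    "preventative": [  # controls tested empirically in Part 1
        "explicit policy instructions in the system prompt",
        "restricting tool access and action surfaces",
    ],
    "detective": [  # monitoring controls
        "transcript review for escalation-worthy actions",
        "anomaly detection on agent tool calls",
    ],
}
```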
What are your next steps?
Publish the empirical testing results in a research paper in the next 2–3 weeks. We will also release our code so that others can replicate our results or test different mitigations.
The Part 2 outputs will follow afterwards, most likely in a blog post format.
Is there anything others could help you with?
Feedback on the work once it is published, and help sharing it with others interested in this topic.