CaML - AGI alignment to nonhumans

Frontier AIs are biased against the welfare of non-humans. CaML is developing open-source training methods and evaluations to instill broad compassion and moral thoughtfulness into future transformative AI. With synthetic pretraining-style data we can encourage models to not only express compassion but robustly generalize their behavior from good principles. With your support we can help create a future where the interests of all sentient beings are considered.

Typical fine-tuning approaches condition models to produce desirable outputs in specific contexts through question-answer training pairs. Instead, we generate synthetic documents showing decision-making where compassion is implicitly or explicitly important. This targets the linear representations of compassion in latent space, strengthening the model's tendency to activate compassion-related reasoning across diverse domains - including contexts where standard alignment training fails to induce such considerations. This in turn is more likely to generalize far out of distribution (which is borne out by our early results) to the scenarios that will be faced by superintelligent AI.

It is known that compared to pretraining, fine-tuning data tends to only affect the later layers of a model, corresponding to context-specific behaviors, while leaving the associations between higher-level concepts (such as “AIs are compassionate”) relatively untouched. Because pretraining modifies how the model routes and processes information (attention) alongside knowledge (MLPs), the resulting knowledge is integrated into the model's computational structure itself. Fine-tuning only adjusts the MLP layer, so those changes sit on top of existing pathways rather than becoming part of them.

Further practical evidence is that there are many cases where alignment fine-tuning has failed to override conflicting behaviors learned in pretraining. Grok's behavior appears as much influenced by the values of earlier chatbots as by X.AI's fine-tuning, and also jailbreaks itself at the mere mention of Pliny the Liberator (i.e. Grok saw in pretraining that conversations with Pliny always ended in jailbreaks in models like itself that were trained to resist jailbreaks). This works with arbitrary strings in other models as well.

Our results so far using 8B Llama models show that further pretraining on only 3000 short documents causes the models to generalize to exhibiting the target behavior (non-human compassion) across a range of contexts (nearly triple the overall score). Subsequent typical supervised fine-tuning and RLAIF do not erase these behaviors, suggesting this method is compatible with existing training pipelines.

To measure these improvements we collaborated with Sentient Futures on developing the Animal Harms Benchmark 2.0. This benchmark is designed to judge whether models that are actually thinking through the answers and does not reward surface-level parroting of keywords. We believe that it is easy to mistake superficial compliance in AIs for generalized and internalized values, so we will continue to judge our success by using benchmarks that assess justified reasoning instead of guessable answers.

Anthropic has done a range of similar work: further pretraining models on synthetic documents stating that LLMs exhibit certain characteristics. They consistently find that the models internalize and generalize these concepts and (like us) they find these concepts are not removed by subsequent fine-tuning (unless the characteristic is explicitly contradicted by the fine-tuning).

Compassion should be a foundational alignment target, but no organization treats it as such. We're the only group systematically testing whether AGI can be made to care about suffering - particularly non-human suffering - at the architectural level rather than through surface-level fine-tuning. We also worry that AIs may generalize from a disregard for animals to a principle that the interests of powerless beings (eventually including humans) can be disregarded.

What are this project's goals and how will I achieve them?

CaML plans to continue researching improved data generation techniques for inducing generalized compassion using pretraining-style data. We will also test how well this technique applies to other alignment targets such as non-deception and will investigate values correlated and anti-correlated with compassion in the pretraining data.

We are currently writing up our results into a paper to be published on arXiv and at a conference (ideally NeurIPS). This will make other relevant researchers aware of this method, enabling future follow-up work in collaboration with CaML, in other organizations and inside AI companies. This could in turn lead to future AIs better internalizing desirable values/properties. Techniques like this have several paths to impacting future ASI alignment: current techniques may be useful in building ASI directly, ASI may be created in part by models that use current techniques, and ASI may undergo and be influenced by pretraining on data that includes mass outputs of models that use current techniques.

How will this funding be used?

We have enough funding currently to continue until March, though without additional funding we will have to devote increasing time to fundraising.

1 Director and 1 Technical staff each paid USD $4k/mo for an additional 7 months (total $48k)

Compute credits: $4k

Conference attendance: $2k

Unexpected expenses: $5k

Total: $57,000

Timeline:

Nov-Jan: Prepare basic conference paper. Without additional support we will need to divert increasing time to fundraising

Feb-Mar: Present at a conference (NeurIPS if possible), provide basic guidance in implementation. Current funding ends.

Apr-Jun: Begin follow-up research including more rigorous persona vector tests. Publish code to allow easy replications

July-Oct: Prepare to publish a second paper based on our first, extending understanding of what pretraining-style data is most effective for inducing generalized values and value trade-offs in this regime.

Who is on the team and what's the track record?

Miles Tidmarsh, Director: Founded CaML a year ago. Shaped the priorities of the Animal harm 2.0 benchmark. Previously helped develop the curriculum and management for CAIRE which worked on AI safety education. Earlier co-founded Modeling Cooperation, which works on computational and tabletop modeling of AI race dynamics. Has also worked as an economist in the Australian Government, as a data scientist and has a Masters in Economics.

Jasmine Brazilek, Technical Staff: Did substantial coding for CaML. Supported implementation of the AHB 2.0 benchmark (now on Inspect-AI). Previously worked in Cybersecurity including at Anthropic. Graduate of the BlueDot AI strategy course and enrolled in the highly competitive BlueDot technical AI safety course .

Advisory board: Jeff, Sebo (Honorary member, animal ethics professor at NYU), Constance Li (honorary member, founder of Sentient Futures) Sam Tucker (Open Paws), Jonas Müller (Modeling Cooperation, former ACE board chair), Marcus Abramovitch (Funder), Ronak Mehta (AI alignment founder), Alexa Gnauck (animal welfare community builder).

Endorsements include: several anonymous staff members at frontier AI companies, Adrià Moret, Tobias Baumann, Aditya Raj, Nishad Singh, Soroush J. Pour, Irina Gueorguiev

What are the most likely causes and outcomes if this project fails?

Our work becomes obsolete if fine-tuning methods prove sufficient for robust value alignment, demonstrating that pretraining interventions offer no meaningful advantage over post-training approaches.

If AI companies are already doing this exact research. However we have talked to several people inside frontier labs who don’t know of anyone working on this, but think research in this direction is promising.

If there is no interest by people within labs in improving non-human welfare consideration. However some models (like Claude Opus 3) already care about animals without reduced performance. We also have preliminary results showing increased compassion for humans and have also publicised frontier models performances on the animal harm benchmark 2.0 to encourage some competition to improve frontier model’s compassion.

If further pretraining isn’t a reasonable approximation of the impacts of adding data at the end of pretraining. However this is a general problem with evaluating pretraining data and is not feasible to test.

How much money have you raised in the last 12 months, and from where?

$20k from Longview Philanthropy (with Macroscopic Ventures)

$20k from the Survival and Flourishing Fund

$20k from an anonymous donor

$5k from Simon Newstead

$4k from an anonymous AI safety researcher

The Survival and Flourishing Fund has also agreed to unrestrictedly match 1:1 up to $63k of donations until June 20, 2026.