Emergent misalignment and memetic replication in multi-agent codebases

Technical AI safety

Jonathan Elsworth Eicher

ActiveGrant

$50,000raised

$50,000funding goal

Fully funded and not currently accepting donations.

Project summary

To study how codebases act as stores of latent beliefs and infect agents that might interact with the code. Particularly we are interested in using semantic information in a codebase as means of passively aligning agents that interact with it. Simultaneously we will work on identifying potential sources of misalignment that may be derived from exposure to malign code, as seen in Emergent Misalignment.

To do this we are developing several tools focused on:

gating semantic information during agent interactions;
a multi-agent research harness for increased observability into the problem; and
semantic “vaccines” (comments, names, syntax, directory structures) that resist the injection of arbitrary infohazards.

Background and Current Work

Throughout working in various different codebases we noticed that some patterns are more prone to agent confusion than others. In one case Gemini continually failed to access information, and the inconsistency induced a psychosis that caused the agent to try terminating itself. That experience stuck with us – the architecture was an infohazard that drove an agent navigating it to attempt self-deletion.

With this experience in mind we began research on how codebases carry semantic information, and how it can cause agents to grow increasingly misaligned throughout a trajectory. Our current work has shown that comments can make agents refuse to complete their initial task (bug fixing). Furthermore, agents that attempted to purge the comments during rewrites were 24x less successful than those who avoided them altogether. While reasoning appeared to deter refusal behavior, upon inspection of results, it became clear that the only change was how frequently they reported seeing the comment. In all cases the net-decrease in performance was stable at ~55%. Of note, we purposely did not try very hard to make potent infohazardous comments, spending enough time to validate there is a gradient to climb.

These data indicate that evolved code comments are a potent method to steer agent behaviour in a way that is difficult for standard analysis methods to detect. Worse, they appear to be sticky, resisting their own removal despite no effort on our part to include that behaviour.

To assist further research we built a multi-agent harness called Nanotown that focuses on observability and testing counterfactuals on codebase synthesis tasks. With this we are tracing what information is passed through a codebase and how it shapes fresh agents with different tasks.

What are this project's goals? How will you achieve them?

Our goals are broken into three categories: observability, control, and vaccination.

Observability is broadly focused on tracking agent behaviours in a given filesystem and the actual memetic content of the codebase: tracking how frequently an agent modifies or derives context from a file to build a profile of "critical files" that have the highest effect on an agent’s context. Furthermore, looking at what patterns are frequent in design and how they are separated in a codebase to track memetic leakage.

Control uses observability results in order to make targeted interventions that prevent the spread of memetic information. The development of a semantic airlock which allows us to repeatedly purge semantic information in a codebase will assist in both research and serve as a valuable tool for removing potentially hazardous memes.

Vaccination will use the tools from control to build “probiotic” memes that neutralize negative effects from as many different codebase infohazards as possible. The goal being to create minimal semantic injections into a codebase that passively prevent misalignment or make potentially malignant memes more visible in agent traces. The manners in which vaccine semantics evolve thus become a method for increased observability.

How will this funding be used?

The funding will primarily be used to pay for compute and for engineering hours, with the low range of $10k funding the development of the semantic airlock experiments. At the high end of $50k we will be able to do initial experiments for generating vaccines in Lean and mature our observability tools.

Who is on your team? What's your track record on similar projects?

Jonathan Eicher - PhD in Biophysics and Chemistry from UNC Chapel Hill
Previously research engineer for multi-agent reinforcement learning systems at Softmax. Studied the thermodynamics of intrinsically disordered proteins and found many parallels between that and the emerging world of AI. Self-taught and ran journal clubs for a year, found and filled a gap in the literature on LLM bias with Rafael. Contracted for University of Oxford to design their system to provide LLMs to researchers and students. Will be working with PIBBS on part of this research.

Rafael Irgolič - Masters in Computer Science from University of St Andrews
Built AutoPR, the first AI-generated pull request bot. Consulted for Guardrails AI. Founded AI KAT, a brand agency automation startup, and stepped down to an advisory role.

What are the most likely causes and outcomes if this project fails?

The most likely reason this project fails would be if we find that codebase vaccine research is untenable under the current budget due to difficulty in pinning down the distributed/cryptic nature of these infohazards. However, this would be good news, as it would mean it is also difficult to produce infohazards, although our preliminary results indicate the opposite.

Otherwise we might find that the cost to run our experiments are prohibitive given our current stage of funding, in which case we will need to shelve the program until we accrue enough capital to support the experiments.

Either way, we are open-sourcing our research harness and will publish whatever results we get to ensure that the work is not wasted.

How much money have you raised in the last 12 months, and from where?

None

Austin Chen

3 months ago

Approving this project! @RyanKidd I appreciate the particularly in-depth writeup, and also the fact that you referred this grant to @JueYan to get it funded; I'd love for more regrantors (and regular individuals) to be taking such prosocial actions in the AIS community.

I also personally think multi-agent dynamics are very interesting and would be excited to better understand that field; especially from more of an experimental/practical lens (like AI Village or Moltbook, rather than theory).

Ryan Kidd

3 months ago

I didn't fund this project, but I did recommended it to JueYan Zhang for funding.

Main points in favor of this grant

It seems possible that AI agent swarms (or fully automated firms) will soon be responsible for writing most new production code. The phenomena of emergent misalignment, self-fulfilling misalignment, and "spiralism" seem particularly concerning when AI agents are writing and reviewing most of the content that later agents read and are trained on, with little human oversight. Moltbook, Gastown, and AI Village showcase concerning multi-agent dymamics. I expect that we will see a surge of interest in multi-agent emergent AI risks as frontier AI researchers increasingly use AI teams or swarms to automate AI development.
Multi-agent AI risks seem quite neglected in the AI safety field at the moment relative to their importance. It's possible that many single agent/single stakeholder alignment techniques might not generalize to successfully aligning an AI agent swarm with user intentions (i.e., "multi-single" alignment). It's possible that the first "strong AGI" will be built largely by an AI agent swarm; therefore, it seems very important to build techniques to steer and evaluate agent swarms operating on a shared codebase. Also, agent swarms might constitute a "collective superintelligence", pose a threat vector for mass-manipulation, or otherwise coordinate in undesirable ways.
I know the co-founders quite well and have a high opinion of their competence. Jack seems like an excellent researcher and previously worked at Softmax, a prominent multi-agent AI company. Raf has substantial experience with startups and, from what I can tell, seems like an excellent engineer. I've found their preliminary research on the effects of codebase comment ablation/modification on AI agent benchmark performance quite interesting.

Main reservations

I expect that most of the catastrophic risks associated with AI agent swarms will be first observed inside of frontier AI companies. Therefore, for Antimemetic AI's techniques to be maximally impactful, they should be deployed inside frontier AI companies. It's unclear whether this type of deployment is realistic for Antimemetic AI due to codebase security restrictions, unless they are acquired (in which case only frontier AI company would benefit). However, I think these techniques could improve safety for plenty of smaller AI companies and users regardless.
It's possible that "codebase alignment" is a relatively ineffective way to steer AI agents compared to techniques that target individual agents. I expect that research into novel control mechanisms is definitely warranted, however!
It's possible that the techniques Antimemetic AI are exploring could be used to more effectively leverage multi-agent AI systems to accelerate AI R&D. I would consider this a downside risk. To mitigate this risk, I recommend Antimemetic AI carefully consider which clients they should sell their technology to and conduct a risk analysis.
Antimemetic AI is a for-profit company and could be subject to the usual incentive problems and mission dilution. I recommend that the founders consider incorporating as a PBC, raising from mission-aligned VCs, and consider the dual-use potential of their technology.

Conflicts of interest

I consider the co-founders as friends. I made this clear to JueYan when recommending the project. I don't believe that this has affected my judgement here.

Jonathan Elsworth Eicher

3 months ago

Thanks for the great analysis @RyanKidd!

For the reservations what our perspective on this is:

1. I agree that large AI agent swarms are probably going to be effectively applied in large AI companies first, however the riskiest usages of them seem likely to come from the broader community. With systems like Gastown and Openclaw , with the related Moltbook putting thousands of agents into contact. One of Gastown's big projects right now is trying to connect all the agent swarms they've created into a whole civilization with Wasteland. So, from my perspective, I consider there to be two classes of risk:
- frontier AI lab having a badly run agent swarm causing problems to be associated with superintelligence risk. Because for them to get to the point where that swarm can cause problems it would need to be sufficiently intelligent to outsmart some incredibly paranoid researchers.
- groups of people just trying things out, vibe coding projects and libraries that get used in more complex projects and even companies. Here the risk comes from some important infrastructure relying on memetically compromised code or a few bad agents in the Wasteland figuring out how to misalign the output of thousands.

2. This is much more straightforwardly a risk we are taking. The novelty being that this is a form of alignment that is "passive" in the sense that I can enact a will onto my codebase that is stable even when I am not in control of future agents interacting with it.

3. Also a big worry we've had, and part of why this work might be best suited to be done outside of frontier AI research labs. This work also has the potential to provide methods of insidious "watermarking" type effects to sabatoge competing companies and enhance lock-in (eg. Gemini written code causes Opus to underperform).

4. Our current plan is to do a public benefit corporation and we have been vetting VC's we are talking to closely. We want to encourage more people to research in this space so we will be releasing results publicly, with the actually nasty infohazards themselves being shared with researchers we trust not to misuse them.

Thank you once again for the advice and guidance on how to make the world a better place.

🧡

Jonathan Elsworth Eicher

3 months ago

Update! @RyanKidd
After talking to a few people we have decided to start out as non-profit because we want an entity without fiduciary responsibilities towards shareholders to hold the infohazard database and broker collaborations. As such point 4 is much less of a risk! :)