You're pledging to donate if the project hits its minimum goal and gets approved. If not, your funds will be returned.
To study how codebases act as stores of latent beliefs and infect agents that might interact with the code. Particularly we are interested in using semantic information in a codebase as means of passively aligning agents that interact with it. Simultaneously we will work on identifying potential sources of misalignment that may be derived from exposure to malign code, as seen in Emergent Misalignment.
To do this we are developing several tools focused on:
gating semantic information during agent interactions;
a multi-agent research harness for increased observability into the problem; and
semantic “vaccines” (comments, names, syntax, directory structures) that resist the injection of arbitrary infohazards.
Background and Current Work
Throughout working in various different codebases we noticed that some patterns are more prone to agent confusion than others. In one case Gemini continually failed to access information, and the inconsistency induced a psychosis that caused the agent to try terminating itself. That experience stuck with us – the architecture was an infohazard that drove an agent navigating it to attempt self-deletion.
With this experience in mind we began research on how codebases carry semantic information, and how it can cause agents to grow increasingly misaligned throughout a trajectory. Our current work has shown that comments can make agents refuse to complete their initial task (bug fixing). Furthermore, agents that attempted to purge the comments during rewrites were 24x less successful than those who avoided them altogether. While reasoning appeared to deter refusal behavior, upon inspection of results, it became clear that the only change was how frequently they reported seeing the comment. In all cases the net-decrease in performance was stable at ~55%. Of note, we purposely did not try very hard to make potent infohazardous comments, spending enough time to validate there is a gradient to climb.
These data indicate that evolved code comments are a potent method to steer agent behaviour in a way that is difficult for standard analysis methods to detect. Worse, they appear to be sticky, resisting their own removal despite no effort on our part to include that behaviour.
To assist further research we built a multi-agent harness called Nanotown that focuses on observability and testing counterfactuals on codebase synthesis tasks. With this we are tracing what information is passed through a codebase and how it shapes fresh agents with different tasks.
Our goals are broken into three categories: observability, control, and vaccination.
Observability is broadly focused on tracking agent behaviours in a given filesystem and the actual memetic content of the codebase: tracking how frequently an agent modifies or derives context from a file to build a profile of "critical files" that have the highest effect on an agent’s context. Furthermore, looking at what patterns are frequent in design and how they are separated in a codebase to track memetic leakage.
Control uses observability results in order to make targeted interventions that prevent the spread of memetic information. The development of a semantic airlock which allows us to repeatedly purge semantic information in a codebase will assist in both research and serve as a valuable tool for removing potentially hazardous memes.
Vaccination will use the tools from control to build “probiotic” memes that neutralize negative effects from as many different codebase infohazards as possible. The goal being to create minimal semantic injections into a codebase that passively prevent misalignment or make potentially malignant memes more visible in agent traces. The manners in which vaccine semantics evolve thus become a method for increased observability.
The funding will primarily be used to pay for compute and for engineering hours, with the low range of $10k funding the development of the semantic airlock experiments. At the high end of $50k we will be able to do initial experiments for generating vaccines in Lean and mature our observability tools.
Jonathan Eicher - PhD in Biophysics and Chemistry from UNC Chapel Hill
Previously research engineer for multi-agent reinforcement learning systems at Softmax. Studied the thermodynamics of intrinsically disordered proteins and found many parallels between that and the emerging world of AI. Self-taught and ran journal clubs for a year, found and filled a gap in the literature on LLM bias with Rafael. Contracted for University of Oxford to design their system to provide LLMs to researchers and students. Will be working with PIBBS on part of this research.
Rafael Irgolič - Masters in Computer Science from University of St Andrews
Built AutoPR, the first AI-generated pull request bot. Consulted for Guardrails AI. Founded AI KAT, a brand agency automation startup, and stepped down to an advisory role.
The most likely reason this project fails would be if we find that codebase vaccine research is untenable under the current budget due to difficulty in pinning down the distributed/cryptic nature of these infohazards. However, this would be good news, as it would mean it is also difficult to produce infohazards, although our preliminary results indicate the opposite.
Otherwise we might find that the cost to run our experiments are prohibitive given our current stage of funding, in which case we will need to shelve the program until we accrue enough capital to support the experiments.
Either way, we are open-sourcing our research harness and will publish whatever results we get to ensure that the work is not wasted.
None