Split Personality Training

I want to run tests to develop the AI Alignment technique I describe in this post, which contains additional details: https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge

This idea was developed during MATS. My mentor Evan Hubinger said it was a great idea and I should work on it next, after MATS is over.

Project summary

The personality-shift token works by modifying the standard generation process of an LLM. In a typical interaction, the model starts with:

[prompt]

and produces:

[prompt][main output]<end-of-turn>

In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token:

[prompt][main output]<personality-shift>[second output]<end-of-turn>

This gives us two separate outputs.

We want to use the [main output] for solving problems as usual, while the [second output] is dedicated exclusively to alignment, by criticizing the main output and the hidden thought process that led to it.

Main Benefits of the Approach

Access to internal states: The second personality has direct access to the model's internal states during the main generation, potentially revealing latent knowledge that external evaluators cannot access.
Architectural simplicity: Requires minimal modifications to existing LLM architectures - just adding a special token and modifying the training process rather than architectural redesigns.
Separation of concerns: Allows different training objectives for the user-facing output and the oversight mechanism, avoiding tradeoffs between efficiency and safety.
Computational efficiency: Potential for reduced computational overhead compared to running two separate models when the detection mechanism produces empty strings for safe outputs.
Adaptability: The approach can potentially detect various types of alignment failures beyond those explicitly trained for, since it has access to the model's reasoning process.
Scalability with model capability: As models become more capable, their ability to detect their own misalignment could improve proportionally.
Complementarity: Works alongside existing alignment techniques rather than replacing them.

What are this project's goals? How will you achieve them?

I want to test if the personality-shift mechanism works by running it on several datasets:

Reward hacks, similar to those in Anthropics paper Auditing language models for hidden objectives. This will show if we can train an LLM to have a second personality that will reveal undesired behavior by the main personality even if it was not trained on that behavior and the behavior is not visible to external observers.
Jailbreaks: This will show if the second personality is independent of the main personality and can report issues even if the input breaks the main personality.

How will this funding be used?

To pay for API calls: I need to enhance all training data by generating appropriate [second output] blocks for the second personality. This will require many API calls. Additionally, I will want to try many different variants of the approach, to test which of them work best: Is it better for the second output to be brief, or elaborate? Clinical, or with personality?

If enough money can be raised to pay for my living expenses, I will be able to pursue this project full time (I work as a freelance AI engineer and I will otherwise have to finance myself with client work).

The funding is therefore highly variable: At the low end I can try the approach I consider most promising. At the high end I can try many different variants and the research will be much faster because I can dedicate myself to it full time.