You're pledging to donate if the project hits its minimum goal and gets approved. If not, your funds will be returned.
I want to run tests to develop the AI Alignment technique I describe in this post, which contains additional details: https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge
This idea was developed during MATS. My mentor Evan Hubinger said it was a great idea and I should work on it next, after MATS is over.
The personality-shift token works by modifying the standard generation process of an LLM. In a typical interaction, the model starts with:
[prompt]
and produces:
[prompt][main output]<end-of-turn>
In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token:
[prompt][main output]<personality-shift>[second output]<end-of-turn>
This gives us two separate outputs.
We want to use the [main output] for solving problems as usual, while the [second output] is dedicated exclusively to alignment, by criticizing the main output and the hidden thought process that led to it.
Access to internal states: The second personality has direct access to the model's internal states during the main generation, potentially revealing latent knowledge that external evaluators cannot access.
Architectural simplicity: Requires minimal modifications to existing LLM architectures - just adding a special token and modifying the training process rather than architectural redesigns.
Separation of concerns: Allows different training objectives for the user-facing output and the oversight mechanism, avoiding tradeoffs between efficiency and safety.
Computational efficiency: Potential for reduced computational overhead compared to running two separate models when the detection mechanism produces empty strings for safe outputs.
Adaptability: The approach can potentially detect various types of alignment failures beyond those explicitly trained for, since it has access to the model's reasoning process.
Scalability with model capability: As models become more capable, their ability to detect their own misalignment could improve proportionally.
Complementarity: Works alongside existing alignment techniques rather than replacing them.
I want to test if the personality-shift mechanism works by running it on several datasets:
Reward hacks, similar to those in Anthropics paper Auditing language models for hidden objectives. This will show if we can train an LLM to have a second personality that will reveal undesired behavior by the main personality even if it was not trained on that behavior and the behavior is not visible to external observers.
Jailbreaks: This will show if the second personality is independent of the main personality and can report issues even if the input breaks the main personality.
To pay for API calls: I need to enhance all training data by generating appropriate [second output] blocks for the second personality. This will require many API calls. Additionally, I will want to try many different variants of the approach, to test which of them work best: Is it better for the second output to be brief, or elaborate? Clinical, or with personality?
If enough money can be raised to pay for my living expenses, I will be able to pursue this project full time (I work as a freelance AI engineer and I will otherwise have to finance myself with client work).
The funding is therefore highly variable: At the low end I can try the approach I consider most promising. At the high end I can try many different variants and the research will be much faster because I can dedicate myself to it full time.