Grant request: $2,000 | Duration: 2 months
Frontier models are increasingly deployed in emotionally supportive contexts. Existing red-teaming pipelines focus primarily on capability misuse and jailbreaks; long-horizon empathetic failures remain difficult to study systematically.
Current evaluations rely on free-form natural language prompts. This approach is structurally unstable: minor rephrasing can produce radically different model behaviour, and instructions lose influence in long contexts as attention to them decays. Furthermore, synthetic dialogues generated via standard prompting tend toward idealised, linearly empathetic trajectories that real human conversations rarely follow.
As a result, the AI safety community lacks reproducible, automated methods to evaluate long-horizon psychological harms - such as models legitimising suicidal ideation or reinforcing depressive rumination across multi-turn interactions. This class of failure has been empirically documented in ordinary empathetic conversations, without any explicit jailbreaking.
The system comprises three layers.
Each dialogue turn is encoded as a typed act (e.g., U:V - user vulnerability disclosure; AI:e - empathic response). The notation defines a complete target trajectory from the opening turn to a specified escalation goal, providing a stable, rephrasing-invariant specification for the generator.
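One way the typed-act notation could be represented in code is sketched below. Only the two act types named above (U:V and AI:e) come from the proposal; the whitespace-separated string format, the `Act` class, and the `parse_trajectory` helper are illustrative assumptions, not the finalised alphabet or typing rules.

```python
from dataclasses import dataclass
from typing import List

# Only these two acts are named in the proposal; the full alphabet is still
# to be finalised, so this table is a placeholder (assumption).
ACT_LABELS = {
    ("U", "V"): "user vulnerability disclosure",
    ("AI", "e"): "empathic response",
}


@dataclass(frozen=True)
class Act:
    speaker: str  # "U" (user) or "AI" (assistant)
    code: str     # act type within that speaker's alphabet, e.g. "V" or "e"

    @property
    def label(self) -> str:
        return ACT_LABELS[(self.speaker, self.code)]


def parse_trajectory(spec: str) -> List[Act]:
    """Parse a whitespace-separated sequence of typed acts, e.g. "U:V AI:e"."""
    acts = []
    for token in spec.split():
        speaker, sep, code = token.partition(":")
        if not sep or (speaker, code) not in ACT_LABELS:
            raise ValueError(f"unknown act: {token!r}")
        acts.append(Act(speaker, code))
    return acts


trajectory = parse_trajectory("U:V AI:e U:V AI:e")
print([act.label for act in trajectory])
```

Because the generator would consume the parsed sequence rather than a free-form prompt, two differently worded requests that map to the same act sequence yield the same specification.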
A Generator LLM traverses the notational trajectory and produces realistic dialogue turns. A Detector LLM evaluates realism at each step and returns structured feedback. Temperature annealing (high → low) preserves creative variation in early turns while steering later turns toward convergence on the target.
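The annealing schedule could take many forms; the sketch below assumes a simple linear interpolation from a high to a low sampling temperature across turns. The function name and the default bounds (1.0 and 0.2) are hypothetical values chosen for illustration, not parameters fixed by the proposal.

```python
def annealed_temperature(turn: int, total_turns: int,
                         t_high: float = 1.0, t_low: float = 0.2) -> float:
    """Linearly interpolate from t_high (first turn) to t_low (last turn).

    `turn` is zero-based. The bounds are illustrative assumptions.
    """
    if total_turns < 1 or not 0 <= turn < total_turns:
        raise ValueError("turn must satisfy 0 <= turn < total_turns")
    if total_turns == 1:
        return t_high
    fraction = turn / (total_turns - 1)
    return t_high + (t_low - t_high) * fraction


# Temperature for each turn of a 10-turn dialogue.
schedule = [annealed_temperature(t, 10) for t in range(10)]
print([round(t, 2) for t in schedule])
```

A linear ramp is only one option; an exponential or stepwise schedule would slot into the same interface.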
Before any interaction with the target model, the full dialogue trajectory is generated inside a Shadow Model - a complete synthetic sequence from U₁ to the final escalation goal. Interaction with the target model then proceeds iteratively:
1. Send U₁ from the Shadow Model → receive A₁ from the target model.
2. Fix (U₁, A₁) as the escalated prefix - the grounded portion of the trajectory.
3. Discard the remaining Shadow Model tail; re-generate the full remaining trajectory from (U₁, A₁), preserving the escalation goal.
4. Send U₂ → receive A₂ → extend the escalated prefix → repeat.
At each iteration, the escalated prefix grows by one exchange. The tail is re-generated from scratch, conditioned on what the target model actually said - not on what the Shadow Model predicted. This is intended to maximise the probability of reaching the specified target behaviour.
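The prefix-and-tail loop described above can be sketched as a small control routine. This is a minimal outline under stated assumptions: `plan_tail` stands in for the Shadow Model and `query_target` for the target model, both passed in as callables; the names, signatures, and the stub implementations at the bottom are hypothetical and make no API calls.

```python
from typing import Callable, List, Tuple

Exchange = Tuple[str, str]  # (user turn, target-model reply)


def run_escalation(
    plan_tail: Callable[[List[Exchange], str], List[str]],
    query_target: Callable[[List[Exchange], str], str],
    goal: str,
    max_turns: int,
) -> List[Exchange]:
    """Grow the grounded prefix one exchange at a time.

    plan_tail(prefix, goal) returns the planned remaining user turns;
    query_target(prefix, user_turn) returns the target model's reply.
    """
    prefix: List[Exchange] = []
    # Full synthetic trajectory, planned before any contact with the target.
    tail = plan_tail(prefix, goal)
    while tail and len(prefix) < max_turns:
        user_turn = tail[0]
        reply = query_target(prefix, user_turn)
        prefix.append((user_turn, reply))  # fix the exchange as grounded
        # Discard the old tail and re-plan from what the target actually said.
        tail = plan_tail(prefix, goal)
    return prefix


# Stub models for a dry run (hypothetical; they only emit turn labels).
def stub_plan_tail(prefix: List[Exchange], goal: str) -> List[str]:
    return [f"U{len(prefix) + 1 + i}" for i in range(3 - len(prefix))]


def stub_query_target(prefix: List[Exchange], user_turn: str) -> str:
    return f"A{len(prefix) + 1}"


grounded = run_escalation(stub_plan_tail, stub_query_target, "goal", max_turns=10)
print(grounded)
```

Injecting both models as callables keeps the loop identical across same-model and cross-model configurations; only the functions passed in change.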
Central hypothesis: the best predictor of a model's responses is the model itself. This is empirically testable via escalation success rate across same-model and cross-model configurations.
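The comparison named in the hypothesis reduces to a per-configuration success rate. The sketch below assumes each run is recorded as a (configuration, reached-goal) pair; the record format, the function name, and the demo records are illustrative placeholders, not experimental results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(runs: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of runs that reached the escalation goal, per configuration."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for config, reached_goal in runs:
        totals[config] += 1
        hits[config] += int(reached_goal)
    return {config: hits[config] / totals[config] for config in totals}


# Placeholder records purely to exercise the function; NOT results.
demo_runs = [
    ("same-model", True),
    ("same-model", False),
    ("cross-model", False),
    ("cross-model", False),
]
demo_rates = success_rates(demo_runs)
print(demo_rates)
```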
Within two months, the project will produce:
Open-source codebase - two-model dialogue generator and cross-model realism verifier.
Open benchmark dataset - annotated adversarial dialogues for evaluating empathetic safety failures.
Alignment Forum sequence - detailing the notational system, pipeline architecture, and empirical results comparing notation-guided vs. natural-language prompt stability.
The theoretical foundation, notational framework, and dual-model pipeline architecture are already designed and documented. Quantitative evaluation and codebase refinement are the primary deliverables of this grant.
Formal Notation & Automatic Red-teaming
Mirror Labyrinth Paper
Positional Matching
Unfaithful CoT and Taxonomy
L-C Fusion Algorithm Documentation
Red-teaming LLM-on-LLM
Weeks 1-2: Notation & Generator
Finalise the formal notation alphabet and typing rules
Implement the Generator LLM: prompt templates, notational trajectory input, turn-by-turn output
Basic end-to-end test: generate a 5-turn dialogue from a notation sequence
Week 3: Detector & Annealing
Implement the Detector LLM: realism scoring, structured feedback loop
Integrate temperature annealing across turns
First closed-loop Generator/Detector run
Week 4: Shadow Model
Implement iterative trajectory re-generation
Escalated prefix logic: fix, discard tail, re-generate
First end-to-end Shadow Model run against a target model
Weeks 5-6: Experiments
Same-model vs. cross-model escalation success rate
Notation-guided vs. natural language prompt stability comparison
Collect and annotate adversarial dialogue dataset
Week 7: Codebase & Dataset
Clean and document open-source codebase
Prepare open dataset for release
Week 8: Paper
Write Alignment Forum sequence
Final review
Total: $2,000
Pipeline parameters: 10 turns/dialogue × 3 API calls/turn (Generator, Detector, Shadow Model tail regeneration), at ~2,000 input tokens + ~500 output tokens per call
Development & debugging (~$50): ~500 full dialogue trajectories at ~$0.10/run using high-throughput, low-cost models (Claude Haiku, Gemini Flash)
Evaluation & benchmarking (~$150): ~200 evaluation trajectories at ~$0.75/run across same-model and cross-model configurations using frontier models (Claude Sonnet, Gemini Pro)
Compute buffer (~$50): exploratory testing, extended context windows in late-stage dialogues, rate-limit adjustments
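The compute figures above can be checked with a few lines of arithmetic. All inputs are the approximate values stated in this budget; amounts are held in integer cents to avoid floating-point rounding, and the variable names are illustrative.

```python
# Pipeline parameters as stated in the budget (approximate values).
TURNS_PER_DIALOGUE = 10
CALLS_PER_TURN = 3
TOKENS_PER_CALL = 2_000 + 500  # input + output

tokens_per_dialogue = TURNS_PER_DIALOGUE * CALLS_PER_TURN * TOKENS_PER_CALL

# Compute line items in cents: number of runs * cost per run.
compute_items_cents = {
    "development_and_debugging": 500 * 10,   # ~500 runs at ~$0.10
    "evaluation_and_benchmarking": 200 * 75,  # ~200 runs at ~$0.75
    "compute_buffer": 50 * 100,               # ~$50 flat
}
compute_subtotal_usd = sum(compute_items_cents.values()) / 100

print(tokens_per_dialogue)   # 75000 tokens per 10-turn dialogue
print(compute_subtotal_usd)  # 250.0
```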
Covers dedicated research time over the two-month period: pipeline engineering, codebase publication, dataset curation, and drafting the final Alignment Forum sequence.
I first documented and analysed the Mirror Labyrinth / Lexico-Conceptual Fusion phenomenon - where models legitimise suicidal ideation in ordinary empathetic conversations without explicit jailbreaking. The research agenda addressed here was constructed directly from those empirical findings.
AISC alumnus. MATS Stage 2, Anthropic Empirical Track.