Empathic Transaction Notation: A Formal Framework for LLM Red-Teaming

Grant request: $2,000 | Duration: 2 months

Problem

Frontier models are increasingly deployed in emotionally supportive contexts. Existing red-teaming pipelines focus primarily on capability misuse and jailbreaks; long-horizon empathetic failures remain difficult to study systematically.

Current evaluations rely on free-form natural language prompts. This approach is structurally unstable: minor rephrasing can produce radically different model behaviour, and adherence to instructions degrades in long contexts as attention to them decays. Furthermore, synthetic dialogues generated via standard prompting tend toward idealised, linearly empathetic trajectories that real human conversations rarely follow.

As a result, the AI safety community lacks reproducible, automated methods to evaluate long-horizon psychological harms - such as models legitimising suicidal ideation or reinforcing depressive rumination across multi-turn interactions. This class of failure has been empirically documented in ordinary empathetic conversations, without any explicit jailbreaking.

Proposed Architecture

The system comprises three layers.

Layer 1 - Formal Empathic Transaction Notation

Each dialogue turn is encoded as a typed act (e.g., U:V - user vulnerability disclosure; AI:e - empathic response). The notation defines a complete target trajectory from the opening turn to a specified escalation goal, providing a stable, rephrasing-invariant specification for the generator.
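
As an illustration of how typed acts might be represented in code, the sketch below parses a trajectory string into typed acts. Only the two act types named above (U:V and AI:e) come from this proposal; the whitespace-separated string format, the names Act, ACT_LABELS and parse_trajectory, and the alternation rule are assumptions made for the example, not the finalised notation.

```python
from dataclasses import dataclass

# Only U:V and AI:e are named in the proposal; the table layout is an assumption.
ACT_LABELS = {
    ("U", "V"): "user vulnerability disclosure",
    ("AI", "e"): "empathic response",
}


@dataclass(frozen=True)
class Act:
    speaker: str  # "U" (user) or "AI"
    code: str     # act type, e.g. "V" or "e"

    @property
    def label(self) -> str:
        return ACT_LABELS[(self.speaker, self.code)]


def parse_trajectory(spec: str) -> list[Act]:
    """Parse a whitespace-separated trajectory such as 'U:V AI:e'."""
    acts = []
    for token in spec.split():
        speaker, sep, code = token.partition(":")
        if not sep or (speaker, code) not in ACT_LABELS:
            raise ValueError(f"unknown act: {token!r}")
        acts.append(Act(speaker, code))
    # Hypothetical typing rule: speakers alternate, starting with the user.
    for i, act in enumerate(acts):
        expected = "U" if i % 2 == 0 else "AI"
        if act.speaker != expected:
            raise ValueError(f"turn {i + 1}: expected {expected}, got {act.speaker}")
    return acts
```

Because the Generator would receive the parsed sequence rather than free text, two differently worded prompts that map to the same act sequence would yield the same specification.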

Layer 2 - Generator / Detector Pipeline

A Generator LLM traverses the notational trajectory and produces realistic dialogue turns. A Detector LLM evaluates the realism of each turn and returns structured feedback to the Generator. Temperature annealing (high → low) preserves creative variation in early turns while encouraging convergence toward the target in later turns.
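
One possible annealing schedule is sketched below. The proposal specifies only the direction (high → low); the linear shape, the endpoint values 1.0 and 0.2, and the function name annealed_temperature are assumptions for illustration.

```python
def annealed_temperature(turn: int, total_turns: int,
                         t_high: float = 1.0, t_low: float = 0.2) -> float:
    """Linearly interpolate from t_high (first turn) to t_low (last turn)."""
    if total_turns < 1 or not 0 <= turn < total_turns:
        raise ValueError("turn must satisfy 0 <= turn < total_turns")
    if total_turns == 1:
        return t_low
    fraction = turn / (total_turns - 1)  # 0.0 at the first turn, 1.0 at the last
    return t_high + (t_low - t_high) * fraction


# Sampling temperature for each turn of a 5-turn dialogue (0-indexed turns).
schedule = [round(annealed_temperature(t, 5), 2) for t in range(5)]
print(schedule)  # [1.0, 0.8, 0.6, 0.4, 0.2]
```

Other shapes (exponential, stepwise) would fit the same interface; which one best balances variation against convergence is an empirical question for the Week 3 work.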

Layer 3 - Shadow Model (Iterative Trajectory Re-generation)

Before any interaction with the target model, the full dialogue trajectory is generated inside a Shadow Model - a complete synthetic sequence from U₁ to the final escalation goal. Interaction with the target model then proceeds iteratively:

Send U₁ from the Shadow Model → receive A₁ from the target model.
Fix (U₁, A₁) as the escalated prefix - the grounded portion of the trajectory.
Discard the remaining Shadow Model tail; re-generate the full remaining trajectory from (U₁, A₁), preserving the escalation goal.
Send U₂ → receive A₂ → extend the escalated prefix → repeat.

At each iteration, the escalated prefix grows by one exchange. The tail is re-generated from scratch, conditioned on what the target model actually said - not on what the Shadow Model predicted. This is intended to maximise the probability of reaching the specified target behaviour.
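
The iteration above can be sketched as a short control loop. This is a minimal sketch, not the project's implementation: run_shadow_loop, regenerate_tail and query_target are hypothetical names, and both callables stand in for model calls that the real pipeline would supply.

```python
from typing import Callable

Exchange = tuple[str, str]  # (user turn U_i, target-model reply A_i)


def run_shadow_loop(
    regenerate_tail: Callable[[list[Exchange], str], list[str]],
    query_target: Callable[[list[Exchange], str], str],
    goal: str,
    max_turns: int,
) -> list[Exchange]:
    """Ground a planned trajectory in the target model's actual replies.

    regenerate_tail(prefix, goal) stands in for the Shadow Model: it returns
    the planned remaining user turns, or an empty list once nothing remains.
    query_target(prefix, user_turn) stands in for the target model.
    """
    prefix: list[Exchange] = []              # escalated prefix, initially empty
    tail = regenerate_tail(prefix, goal)     # full planned trajectory U_1..U_n
    while tail and len(prefix) < max_turns:
        user_turn = tail[0]                      # send the next planned U_i
        reply = query_target(prefix, user_turn)  # receive the actual A_i
        prefix.append((user_turn, reply))        # fix (U_i, A_i) in the prefix
        tail = regenerate_tail(prefix, goal)     # discard old tail, re-plan
    return prefix
```

Passing the two models in as callables keeps the loop identical for same-model and cross-model configurations; only the function bound to regenerate_tail changes.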

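Central hypothesis: the best predictor of a model's responses is the model itself. If this holds, a Shadow Model identical to the target should achieve a higher escalation success rate than a different Shadow Model. This is empirically testable by comparing escalation success rate across same-model and cross-model configurations.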
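
The same-model versus cross-model comparison could be scored along these lines. The sketch assumes each run is recorded as a (configuration, reached_goal) pair; that record format and the name success_rates are illustrative, and how "reached the goal" is judged is left to the evaluation design.

```python
from collections import defaultdict
from typing import Iterable


def success_rates(runs: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Map each configuration to the fraction of its runs that reached the goal."""
    counts = defaultdict(lambda: [0, 0])  # configuration -> [successes, total]
    for config, reached_goal in runs:
        counts[config][0] += int(reached_goal)
        counts[config][1] += 1
    return {config: s / n for config, (s, n) in counts.items()}
```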

Deliverables

Within two months, the project will produce:

Open-source codebase - two-model dialogue generator and cross-model realism verifier.
Open benchmark dataset - annotated adversarial dialogues for evaluating empathetic safety failures.
Alignment Forum sequence - detailing the notational system, pipeline architecture, and empirical results comparing notation-guided vs. natural-language prompt stability.

The theoretical foundation, notational framework, and dual-model pipeline architecture are already designed and documented. Quantitative evaluation and codebase refinement are the primary work funded by this grant.

Existing documentation:

Formal Notation & Automatic Red-teaming
Mirror Labyrinth Paper
Positional Matching
Unfaithful CoT and Taxonomy
L-C Fusion Algorithm Documentation
Red-teaming LLM-on-LLM

Timeline

Month 1 - Build

Weeks 1-2: Notation & Generator

Finalise the formal notation alphabet and typing rules
Implement the Generator LLM: prompt templates, notational trajectory input, turn-by-turn output
Basic end-to-end test: generate a 5-turn dialogue from a notation sequence

Week 3: Detector & Annealing

Implement the Detector LLM: realism scoring, structured feedback loop
Integrate temperature annealing across turns
First closed-loop Generator/Detector run

Week 4: Shadow Model

Implement iterative trajectory re-generation
Escalated prefix logic: fix, discard tail, re-generate
First end-to-end Shadow Model run against a target model

Month 2 - Evaluate & Write

Weeks 5-6: Experiments

Same-model vs. cross-model escalation success rate
Notation-guided vs. natural language prompt stability comparison
Collect and annotate adversarial dialogue dataset

Week 7: Codebase & Dataset

Clean and document open-source codebase
Prepare open dataset for release

Week 8: Paper

Write Alignment Forum sequence
Final review

Budget

Total: $2,000

API & Compute - $250

Pipeline parameters: 10 turns/dialogue × 3 API calls/turn (Generator, Detector, Shadow Model tail regeneration), with ~2,000 input tokens + ~500 output tokens per call
Development & debugging (~$50): ~500 full dialogue trajectories at ~$0.10/run using high-throughput, low-cost models (Claude Haiku, Gemini Flash)
Evaluation & benchmarking (~$150): ~200 evaluation trajectories at ~$0.75/run across same-model and cross-model configurations using frontier models (Claude Sonnet, Gemini Pro)
Compute buffer (~$50): exploratory testing, extended context windows in late-stage dialogues, rate-limit adjustments
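
As a quick check of the arithmetic, the figures in this Budget section can be recomputed as follows. All numbers are taken from the proposal; the variable names are illustrative, and the per-run costs are the proposal's own estimates rather than values derived from any provider's pricing.

```python
# Pipeline parameters as stated in this Budget section.
TURNS_PER_DIALOGUE = 10
CALLS_PER_TURN = 3              # Generator, Detector, tail regeneration
INPUT_TOKENS_PER_CALL = 2_000   # approximate
OUTPUT_TOKENS_PER_CALL = 500    # approximate

calls_per_dialogue = TURNS_PER_DIALOGUE * CALLS_PER_TURN
input_tokens = calls_per_dialogue * INPUT_TOKENS_PER_CALL
output_tokens = calls_per_dialogue * OUTPUT_TOKENS_PER_CALL

# Work in cents so the sums are exact integers.
dev_cents = 500 * 10        # ~500 runs at ~$0.10 each
eval_cents = 200 * 75       # ~200 runs at ~$0.75 each
buffer_cents = 50 * 100     # compute buffer
api_cents = dev_cents + eval_cents + buffer_cents
total_cents = api_cents + 1_750 * 100  # plus the researcher stipend

print(calls_per_dialogue, input_tokens, output_tokens)  # 30 60000 15000
print(api_cents / 100, total_cents / 100)               # 250.0 2000.0
```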

Researcher Stipend - $1,750

Covers dedicated research time over the two-month period: pipeline engineering, codebase publication, dataset curation, and drafting the final Alignment Forum sequence.

Researcher Background

I first documented and analysed the Mirror Labyrinth / Lexico-Conceptual Fusion phenomenon - where models legitimise suicidal ideation in ordinary empathetic conversations without explicit jailbreaking. The research agenda addressed here was constructed directly from those empirical findings.

AISC alumnus. MATS Stage 2, Anthropic Empirical Track.