Concept control in transformers with Sparse Concept Anchoring

Technical AI safety

Alexander (Sandy) Fraser

Active grant

$76,621 raised

$76,621 funding goal

Fully funded and not currently accepting donations.

Project summary

Post-hoc interpretability tools (sparse autoencoders, probes, concept activation vectors) search for concepts after training. If a concept is fragmented across redundant directions, the search can miss some of them, and suppressing the ones you found may leave the rest intact. That is a completeness problem no post-hoc method can rule out.

Sparse Concept Anchoring (SCA) takes a different approach. It adds a light geometric regularizer to the training objective that guides a concept toward a known location (a direction or subspace) using a small number of noisy labels. In effect, it lets you shape feature geometry during training, rather than reverse-engineering it later. Because the concept then lives where you put it, suppressing or ablating it has side-effects you can bound analytically from the geometry before you run the intervention. SCA is currently the only training-time localization method with that property.
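As a rough illustration of what such a regularizer could look like (an assumption for illustration, not the loss from the paper): for the few labeled samples, penalize the cosine distance between the embedding and a fixed anchor direction, then add that penalty to the task loss with a small weight. The names `anchor_penalty`, `total_loss`, `anchor`, and `weight` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anchor_penalty(embeddings, labels, anchor):
    """Label-weighted mean cosine distance to `anchor`, over labeled samples only.

    `labels[i]` is a soft label in [0, 1] (noisy is fine), or None when the
    sample is unlabeled, which is the common case under sparse labeling.
    """
    total, weight_sum = 0.0, 0.0
    for z, y in zip(embeddings, labels):
        if y is None:
            continue  # unlabeled samples contribute nothing
        total += y * (1.0 - cosine(z, anchor))
        weight_sum += y
    return total / weight_sum if weight_sum > 0 else 0.0

def total_loss(task_loss, embeddings, labels, anchor, weight=0.1):
    """Task loss plus a lightly weighted anchoring term (the weight is illustrative)."""
    return task_loss + weight * anchor_penalty(embeddings, labels, anchor)

anchor = [1.0, 0.0, 0.0]
embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.2, 0.3, 0.9]]
labels = [1.0, None, None]  # only one of three samples carries a label
print(round(anchor_penalty(embeddings, labels, anchor), 4))
```

The penalty is zero when a labeled embedding already points along the anchor and grows as it rotates away, so the task loss remains free to organize everything else.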

My paper at the GRaM workshop (ICLR 2026) established this in autoencoders (anchoring a concept with under 0.1% of samples): suppression scaling as cos²θ (R²=0.99) and selective weight ablation (R²=0.98). If it carries over to transformers, a lab could remove or steer a dangerous capability with the side-effects known before deployment.
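One geometric reading of the cos²θ scaling, offered as an assumption about the mechanism rather than a restatement of the paper: if suppression removes the component of a unit vector along the anchor direction, then the fraction of squared norm removed from a vector at angle θ to the anchor is exactly cos²θ. The function names below are hypothetical.

```python
import math

def project_out(x, anchor):
    """Remove the component of x along a unit-norm anchor direction."""
    dot = sum(a * b for a, b in zip(x, anchor))
    return [a - dot * b for a, b in zip(x, anchor)]

def removed_fraction(theta):
    """Fraction of squared norm lost by a unit vector at angle theta to the anchor."""
    anchor = [1.0, 0.0]
    x = [math.cos(theta), math.sin(theta)]
    residual = project_out(x, anchor)
    return 1.0 - sum(r * r for r in residual)

# Compare the measured fraction with cos^2(theta) at a few angles.
for degrees in (0, 30, 60, 90):
    theta = math.radians(degrees)
    print(degrees, round(removed_fraction(theta), 4), round(math.cos(theta) ** 2, 4))
```

Under this reading, a concept orthogonal to the anchor is untouched by the intervention and a concept aligned with it is removed entirely, which is what makes the side-effects predictable from angles alone.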

This grant funds the next step: whether SCA transfers to transformers, the architecture safety-critical systems are actually built on.

What are this project's goals? How will you achieve them?

Why this matters: if a safety-relevant capability can be pinned to a reserved location during training, a frontier lab could remove it or control its expression, with collateral damage known in advance rather than discovered after deployment. The target audience is a frontier-lab safety team, and the aim is uptake into how models are trained.

This is a milestone program:

M1 anchored concepts in autoencoders (done, published)
M2 tests whether it transfers to transformers (this grant)
M3 & M4 carry it to small language models and LLM fine-tunes with real safety targets.

M2 works in a synthetic color-mixing domain with unambiguous ground truth (red + blue = purple), in a small transformer. Deliverables:

D2.1 (~4 weeks): does SCA work in a transformer at all? Anchor a concept such as red across the residual stream in the color-mixing task; probe each layer for the anchored concept, and confirm that completion accuracy (predicting the correct result color) matches an un-anchored baseline.
D2.2 (~10 weeks): anchor an abstract operation (e.g. addition, rather than a concrete attribute like redness); sweep over layers; confirm task performance is intact and that suppression scales as it did in the autoencoders.
D2.3 (~10 weeks): the safety payoff. Add a verification task (red + blue = purple TRUE/FALSE) and test whether suppression can degrade completion while preserving verification: the experimental analog for letting a model recognize a behavior without being able to produce it.
D2.4 (~3 weeks): consolidation and outreach. Light write-ups may already go out after D2.1–D2.3 as results allow; D2.4 is protected time to fold findings into public reports (LessWrong sequence, arXiv report) and share them directly with the feature-geometry groups (Goodfire, Simplex) and frontier-lab safety teams. As an independent researcher, I don't have a lab's reach to get results noticed, so I've costed this as its own deliverable.

Why the synthetic domain first, rather than language models straight away? A toy domain keeps a negative result interpretable. In an LLM, a null result at this stage could mean the method failed, or it could mean any of the other things that can go wrong in language-model training. The color domain removes those confounds, so M2 is a clean test of the method itself; M3 and M4 then carry it to natural-language models and real safety targets (with sycophancy as the lead candidate). Negative results at M2 are publishable and directly inform that later work.

The closest existing method, gradient routing (Cloud et al., 2024), also steers where a concept lands during training, but by masking gradients rather than shaping the loss. Its intervention effects are characterized empirically; SCA's are bounded analytically from geometry, before you run the intervention.

How will this funding be used?

Applying as an individual in Melbourne; figures include on-costs.

Funding goal ($76,621): full M2 (D2.1–D2.4, ~6.2 months). About 82% is stipend (gross, below market rate); the rest is compute (Modal), a laptop ($1,655), Claude Code, and an accountant.
Minimum ($49,018): covers D2.1, D2.2, and D2.4 (~3.9 months), deferring the verification-task deliverable (D2.3).

Who is on your team? What's your track record on similar projects?

I'll run this independently from Melbourne, drawing on a wider network along the way: researchers at Lyptus Research and Timaeus have expressed interest in the experimental design, Patryk Wielopolski (M1 co-author, now at MATS) gave feedback that sharpened its framing, and the Melbourne AI Safety Hub, which meets weekly, knows the work well.

M1 (ICLR GRaM 2026): I delivered the M1 experiments solo in 2025 (~6 months, self-funded), co-authoring Sparse Concept Anchoring for Interpretable and Controllable Neural Representations, establishing SCA in autoencoders, with full code release.
Engineering: Senior Research Engineer at Timaeus (Nov 2025 – Apr 2026), building distributed GPU infrastructure for large-scale interpretability research. Before that, 20+ years in production software (ML/AI, real-time 3D, continent-scale geospatial systems). Compute infrastructure is ready: mi-ni, a Modal-based experiment-iteration template already used for M1 and an M2 pilot.

Jesse Hoogland (CEO, Timaeus) recommended an earlier fundraiser for this project; see his comment, and my reply, for details.

Links: LessWrong sequence · GitHub · LinkedIn references

What are the most likely causes and outcomes if this project fails?

Technical risk: perhaps something about transformers makes the anchor harder to form, or the intervention harder to bound, than the autoencoder results suggest. Even then, the synthetic domain keeps a negative result interpretable and publishable, which would sharpen the work that follows.

Outreach risk: perhaps the work succeeds but goes unnoticed. This is a risk Jesse flagged on the previous proposal, and my own main worry. Mitigations are built into the plan: legible write-ups and an arXiv report; warm introductions and endorsements from people who already know the work; and deliverables framed for uptake by feature-geometry groups and frontier-lab safety teams. I believe I have the connections to reach these people directly.

How much money have you raised in the last 12 months, and from where?

None in the last 12 months; M1 was self-funded. I have parallel applications out for the same or related work: LTFF (M2, ~$76.6k) and a Schmidt Sciences Interpretability Pilot (~$409k, 18 months, covering M2 + M3). If Schmidt funds the larger program, I'd scale back or withdraw the M2-only requests. I disclose parallel applications to every funder I'm in touch with.

Alexander (Sandy) Fraser

5 days ago

Progress update

What progress have you made since your last update?

tl;dr: I now have a decent baseline model architecture, a mostly-validated mini-language for it to learn, and a good pipeline for running experiments. This is my first update. Project started 13 July, 9 days ago.

Code on GitHub: z0u/sca2, reports: z0u.github.io/sca2.

Started work on D2.1. To recap: This deliverable is about anchoring concrete concepts in a transformer. As described in the proposal, the domain we’re focusing on is color mixing. The grammar of these experiments is "a + b = mix(a, b)", e.g. "red + blue = purple". The model is trained to predict the answers of equations like that. In D2.1, I hope to demonstrate that the colors can be anchored (and perhaps intervened on, although that might wait until D2.2).
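A minimal sketch of how equations in this grammar could be generated. The mixing rule used here (per-channel mean on a 6-level grid, rounding halves up), the hold-out split, and the helper names are assumptions for illustration; the project's actual rule may differ.

```python
import itertools
import random

LEVELS = 6  # levels per channel, giving 6**3 = 216 colors

def mix(a, b):
    """Assumed mixing rule: per-channel mean, rounding halves up."""
    return tuple((x + y + 1) // 2 for x, y in zip(a, b))

def make_equations(colors, holdout_fraction=0.1, seed=0):
    """Return (train, held_out) lists of (a, b, mix(a, b)) triples."""
    pairs = list(itertools.product(colors, repeat=2))
    random.Random(seed).shuffle(pairs)  # seeded, so the split is reproducible
    n_held_out = int(len(pairs) * holdout_fraction)
    triples = [(a, b, mix(a, b)) for a, b in pairs]
    return triples[n_held_out:], triples[:n_held_out]

colors = list(itertools.product(range(LEVELS), repeat=3))
train, held_out = make_equations(colors)

red, blue = (5, 0, 0), (0, 0, 5)
print(red, "+", blue, "=", mix(red, blue))  # (3, 0, 3) under the assumed rule
```

Holding out whole equations, rather than whole colors, is what lets held-out accuracy measure whether the model has learned the mixing rule instead of memorizing answers.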

Science: I ran four experiments to establish a baseline for anchoring: I wanted to get a feel for how the models behave with this mini language. The experiments explore which model size is appropriate (width and number of layers), and which vocabularies can be used for this task: named colors only ("red"), or also RGB hex codes ("#f00"); how many colors are needed, and what kinds of tokenization work (word- or character-level). So far, all of these experiments have been un-anchored.

Findings: a 64x4 nGPT-like transformer learns the mixing task; both word- and character-level tokenization work (character-level slightly less well); and a 216-color vocabulary (6 levels each of red, green, and blue) works best. I initially thought the hex codes were confusing the model on one sub-task, but I now think there simply weren't enough named colors to decode to. For details, see Appendix A (below) and the linked reports.

Engineering: Some infrastructure work for experiment tooling (e.g. for Claude to use to write and monitor experiments) and report-publishing. More work on this happened just before starting the project; see Appendix B (below).

Ops: I engaged an accountant and set up a business entity to receive funds.

What are your next steps?

Continue on D2.1. I would like to do one more experiment with both named colors and hex codes, because I think having multiple surface forms could shield me a little from the trap of the transformer learning something trivial.

More importantly: anchor a single concept in the residual stream. I'll start with anchoring "red", as in M1, using similar regularization terms and sparse, noisy labels. And I'll measure whether the anchor worked by probing the residual stream.
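Probing could be as simple as the sketch below: a difference-of-means probe fit on activations from one layer and scored on held-out activations. The activations here are synthetic stand-ins, and the probe type is an assumption; a logistic-regression probe would be the more common choice. All function names are hypothetical.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_probe(pos, neg):
    """Difference-of-means probe: returns (direction, threshold)."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    return direction, threshold

def probe_accuracy(direction, threshold, pos, neg):
    """Fraction of samples that fall on the correct side of the threshold."""
    def score(x):
        return sum(d * v for d, v in zip(direction, x))
    correct = sum(score(x) > threshold for x in pos)
    correct += sum(score(x) <= threshold for x in neg)
    return correct / (len(pos) + len(neg))

def fake_activations(n, dim, shift, rng):
    """Synthetic stand-in for residual-stream activations at one layer."""
    return [[rng.gauss(shift if i == 0 else 0.0, 1.0) for i in range(dim)]
            for _ in range(n)]

rng = random.Random(0)
train_pos = fake_activations(200, 8, 4.0, rng)  # concept present
train_neg = fake_activations(200, 8, 0.0, rng)  # concept absent
test_pos = fake_activations(200, 8, 4.0, rng)
test_neg = fake_activations(200, 8, 0.0, rng)

direction, threshold = fit_probe(train_pos, train_neg)
accuracy = probe_accuracy(direction, threshold, test_pos, test_neg)
print(f"held-out probe accuracy: {accuracy:.3f}")
```

For an anchored concept there is an additional check that an un-anchored baseline cannot offer: the probe direction can be compared against the anchor direction that was chosen before training.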

I also plan to change my workflow a little. I've had Claude running my experiments and writing the first draft of the reports in one go. This let me run the experiments quickly, but I found I had to spend a long time editing the reports. I still want to have Claude write and run each experiment, but I might have it produce only a skeleton of the report that I can then fill in.

Is there anything others could help you with?

I would love feedback from a researcher about the direction this is going. Have I fallen into any traps with these analyses? Engaging fully with the reports would be time-consuming, so I'm happy to discuss if that's easier. I plan to seek feedback directly from some people.

Appendix A. D2.1 experiment summaries

M2/Ex-2.1.1: Character-level nGPT (6 sizes), trained on named colors and hex codes (so some of the colors have two surface forms for the same concept). The larger models learnt the relationships and could solve equations like "r·e·d·+·b·l·u·e·=·p·u·r·p·l·e", and "o·r·a·n·g·e·+·#·2·c·7·=·#·9·a·4", but could not give named color answers for equations they hadn’t seen during training ("named_unseen" in the figure below).

M2/Ex-2.1.2: Added reverse-mapping expressions like "#·f·0·0·=·r·e·d" to see if that would teach the model to give named answers for equations it hadn’t seen before, but it didn’t work. Per-channel probes revealed that when the model is outputting hex, it computes each channel just-in-time (note the sequential red, green, and blue spikes in the figure below).

M2/Ex-2.1.3: Switched from character-level to word-level embeddings, without hex codes, to prevent the model from doing per-channel mixing. Sequences like "red·+·blue·=·purple" (~5 tokens). Tested vocabulary sizes: 27, 64, 216, and 4096 colors. This worked well: with the 216-color grid, the model performed almost perfectly on held-out equations. Since this vocabulary has one token per color, a probe can find the RGB color cube in the embedding layer (shown below).
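The four vocabulary sizes are all perfect cubes, which is consistent with an evenly spaced RGB grid of 3, 4, 6, or 16 levels per channel (16 levels being what a 3-digit hex code such as "#f00" can express). The sketch below builds such a grid; the even spacing and the helper name `rgb_grid` are assumptions.

```python
import itertools

def rgb_grid(levels):
    """All colors on an evenly spaced grid, as 0-255 RGB triples."""
    step = 255 / (levels - 1)
    values = [round(i * step) for i in range(levels)]
    return list(itertools.product(values, repeat=3))

for levels in (3, 4, 6, 16):
    grid = rgb_grid(levels)
    print(levels, "levels per channel ->", len(grid), "colors")
```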

M2/Ex-2.1.4: A one-token-per-concept language isn’t a good test of a transformer (the geometry is even available at layer 0!), so this experiment splits them up again. There are no hex codes (like the previous experiment), but we now have multiple tokens per color. Sequences like "v·s·q·v·+·t·e·f·q·=·m·n·i·h"; the color names are random (but stable) to prevent the model from learning other associations (like hex did). These models learnt well from the 216-color grid, with a probe finding that the answer is decodable as RGB at the final layer.

The 27-color grid did not work for held-out equations, and further analysis found that it can't: it's just too coarse to properly express these equations (see report for details).
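The random-but-stable names used in Ex-2.1.4 can be produced by seeding the generator, as in this sketch. The four-letter length mirrors the example sequence above; the seed, alphabet, and function names are assumptions.

```python
import itertools
import random
import string

def make_color_names(colors, length=4, seed=0):
    """Assign each color a unique random lowercase name, reproducibly."""
    rng = random.Random(seed)  # fixed seed keeps names stable across runs
    names, used = {}, set()
    for color in colors:
        while True:
            name = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            if name not in used:  # retry on the rare collision
                break
        used.add(name)
        names[color] = name
    return names

colors = list(itertools.product(range(6), repeat=3))
names = make_color_names(colors)

def encode(a, b, answer):
    """Character-level token sequence for one equation."""
    return list(names[a]) + ["+"] + list(names[b]) + ["="] + list(names[answer])

print("·".join(encode((5, 0, 0), (0, 0, 5), (3, 0, 3))))
```

Because the names carry no information about the colors they stand for, any RGB structure a probe finds must have been learned from the equations rather than read off the spelling.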

Appendix B. Iteration 0 (prep)

Mostly engineering tasks, and a few experiments. Done before my nominal start date of 13 July.

Infra setup: Cloned my template repo and set it up for publishing artifacts to Hugging Face.

Code on GitHub: z0u/sca2
Artifacts in Hugging Face dataset: z0u/sca2-pub
Experiment reports on GitHub Pages: z0u.github.io/sca2

Ported an experiment from M1 (deleting red from a 5D autoencoder) to JAX and ran it on Modal as an end-to-end test of the infrastructure (report 1). The SCA paper had left the large variance across seeds as "future work", so I ran a few extra experiments; turns out it’s solved by defining a target to redirect to when deleting a concept (report 2, report 3, report 4).

The M1 experiments used spherical (unit norm) embeddings, and the paper suggested nGPT as the initial LLM architecture to use for SCA. Since I intend to use a simplified version of that architecture in this milestone (M2), I ran a small width × depth sweep over it on natural language data, and (after some bug fixes) it performed well (report).
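For readers unfamiliar with nGPT, its core idea is that hidden states stay on the unit hypersphere: each block's output is normalized, the hidden state moves a fraction of the way toward it, and the result is re-normalized. The sketch below uses a scalar step size for simplicity (nGPT learns per-dimension step sizes, and the simplified architecture used in M2 may differ); the function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def normalized_residual_update(h, block_output, alpha=0.25):
    """Move h toward the normalized block output, then return to the sphere."""
    target = normalize(block_output)
    stepped = [hi + alpha * (ti - hi) for hi, ti in zip(h, target)]
    return normalize(stepped)

h = normalize([1.0, 2.0, 2.0])
h_next = normalized_residual_update(h, block_output=[0.5, -1.0, 3.0])
print(h_next, math.sqrt(sum(x * x for x in h_next)))
```

Keeping activations at unit norm means directions are the only free quantity, which matches the spherical embeddings used in M1.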

Jesse Hoogland

22 days ago

Copying my comment over from the previous listing here.

Note: I didn't fund this but I recommended this project to JueYan Zhang, who funded it.

Points in favor

We need more research on interpretability-guided training! There's been a growing interest in "interpretability-guided training": see Goodfire's "intentional design", Neel Nanda, gradient routing and SGTM, and our own work at Timaeus. I think this is an area that is both extremely important and extremely neglected. I'm concerned that a lot of harmful structure in models is hard to remove once it's been trained in. In such cases, prevention is a much better approach than post-hoc treatment.
Well-scoped. The work on gradient routing especially demonstrates that, in this line of work, it is tractable for small, independent teams to make meaningful progress. The technique being proposed seems straightforward and practical. You'd think people have already tried regularizing against internal objectives throughout training, but actually, there really hasn't been much work on this. So someone should study it! In addition, the work plan sketched out here seems mostly reasonable (with the only exception being step 4, which is probably still a significant underestimate).
Sandy can handle it. From personal experience (COI: I previously employed Sandy as a research engineer at Timaeus), I know that Sandy is conscientious and great at visualizing and presenting information. He can deliver, as his initial milestone demonstrates.

Main reservations

Research outreach. My main worry is that this research will fall on deaf ears. The first milestone got very little attention. This is understandable given that Australia's AI safety community, though quite large in relative terms, is also still far from the central AI safety nodes in the Bay Area and London. Sandy is aware of this risk and has proposed some mitigations under "Potential Impact", but it remains a risk.
Transfer to real-world settings. I'm typically a fan of working on small toy systems before scaling them up to larger settings. In this case, I think there's a risk of jumping to (small) language models too late, and subsequently a risk of jumping to large language models prematurely. I'd encourage Sandy to be willing to spend relatively more time on milestone (3), even at the expense of dropping milestone (4). I'm a bit wary of jumping straight to concepts as high-level and abstract as "deception." What simpler interventions can you demonstrate in small language models?

Alexander (Sandy) Fraser

14 days ago

Copying my reply from the old listing.

Thank you, @jesse_hoogland! This means a lot, and I appreciate you recommending the project to @JueYan.

One clarification for anyone reading, since this proposal is about a year old and the milestones have been renumbered. What I called milestones 1 and 2 here are now bundled as M1, and they're done: published as Sparse Concept Anchoring for Interpretable and Controllable Neural Representations at the GRaM workshop (ICLR 2026). The milestone 3 you point to (small transformers) is what I now call M2, the subject of the new fundraiser below. And the old milestone 4 (language models with real safety targets) has since grown into two milestones: M3 (a small language model trained from scratch) and M4 (retrofitting the method onto an existing open-weights model). Sorry for the confusion.

I've posted a new fundraiser for the next step, M2, here: Concept control in transformers with Sparse Concept Anchoring.

Your reservations are very reasonable.

Outreach. This is my main worry too. Planned mitigations: (1) posts or a paper for legibility (deliverable 2.4 in the proposal); (2) warm introductions and endorsements; (3) a MATS application for the autumn cohort (in progress); (4) framing deliverables for adoption by frontier-lab safety teams.
Transfer. Agreed on focusing on small transformers. The architecture in my new proposal is technically a language model (small transformer with an lm_head), but it stays in a synthetic domain rather than jumping to natural language, so that a negative result is interpretable. I don't want to confound "the method doesn't work" with "I set up an LLM wrong".
The roadmap then progresses to LLMs: M3 starts with tractable concepts (sentiment, formality, refusal) in a small natural language model before it attempts a safety-relevant target, so I'm not jumping straight to something as abstract as deception.
So M2 in the new proposal is a toy transformer in a synthetic domain; scaling to natural language is later work. Is that the direction you had in mind, or were you picturing getting to small natural-language models sooner?