Post-hoc interpretability tools (sparse autoencoders, probes, concept activation vectors) search for concepts after training. If a concept is fragmented across redundant directions, the search can miss some of them, and suppressing the ones you found may leave the rest intact. That is a completeness problem no post-hoc method can rule out.
Sparse Concept Anchoring (SCA) takes a different approach. It adds a light geometric regularizer to the training objective that guides a concept toward a known location (a direction or subspace) using a small number of noisy labels. In effect, it lets you shape feature geometry during training, rather than reverse-engineering it later. Because the concept then lives where you put it, suppressing or ablating it has side-effects you can bound analytically from the geometry before you run the intervention. SCA is currently the only training-time localization method with that property.
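As a rough illustration of what "a light geometric regularizer" could look like, the sketch below adds a cosine-alignment penalty to a task loss for the few samples that carry a concept label. The names `anchor_penalty`, `total_loss`, and `weight`, and the choice of a 1 - cos penalty, are assumptions made for this example; the published SCA loss may take a different form.

```python
import math

def cosine(u, v):
    """Cosine similarity, with a small epsilon to avoid dividing by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def anchor_penalty(latents, labels, anchor):
    """Mean (1 - cos) between concept-labeled latents and the anchor direction.

    labels[i] is True (has the concept), False, or None (unlabeled).
    Only True-labeled samples contribute, so sparse labels are fine.
    This penalty form is an assumption, not the paper's stated loss.
    """
    terms = [1.0 - cosine(z, anchor) for z, y in zip(latents, labels) if y]
    return sum(terms) / len(terms) if terms else 0.0

def total_loss(task_loss, latents, labels, anchor, weight=0.1):
    """Task loss plus the weighted anchoring penalty (weight is hypothetical)."""
    return task_loss + weight * anchor_penalty(latents, labels, anchor)

# Three latents, two labeled as carrying the concept, one unlabeled.
latents = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
labels = [True, None, True]
anchor = [1.0, 0.0]
print(total_loss(1.0, latents, labels, anchor))
```

The unlabeled sample is ignored entirely, which is the point of the design: the regularizer only needs a handful of labels to pull the concept toward the reserved direction, while the task loss shapes everything else.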
My paper at the GRaM workshop (ICLR 2026) established this in autoencoders, anchoring a concept with under 0.1% of samples: suppression scaled as cos²θ (R²=0.99), and selective weight ablation matched its prediction (R²=0.98). If it carries over to transformers, a lab could remove or steer a dangerous capability with the side-effects known before deployment.
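One geometric reading of the cos²θ scaling can be checked in a few lines, assuming suppression is modeled as projecting a representation away from the anchor direction (the paper's exact intervention may differ). For a unit vector at angle θ to the anchor, the projection removes a fraction cos²θ of its squared norm. The function names here are hypothetical.

```python
import math

def suppress(v, anchor):
    """Remove the component of v that lies along the anchor direction."""
    norm = math.sqrt(sum(a * a for a in anchor))
    unit = [a / norm for a in anchor]
    coeff = sum(x * u for x, u in zip(v, unit))
    return [x - coeff * u for x, u in zip(v, unit)]

def fraction_removed(theta):
    """Fraction of squared norm lost by a unit vector at angle theta to the anchor."""
    anchor = [1.0, 0.0]
    v = [math.cos(theta), math.sin(theta)]
    remaining = sum(x * x for x in suppress(v, anchor))
    return 1.0 - remaining

# Compare the measured fraction against cos^2(theta) at a few angles.
for degrees in (0, 30, 60, 90):
    theta = math.radians(degrees)
    print(degrees, fraction_removed(theta), math.cos(theta) ** 2)
```

Under this reading, a concept aligned with the anchor is removed entirely, an orthogonal one is untouched, and everything in between is affected in proportion to cos²θ, which is what makes the side-effects computable from angles alone.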
This grant funds the next step: testing whether SCA transfers to transformers, the architecture that safety-critical systems are actually built on.
Why this matters: if a safety-relevant capability can be pinned to a reserved location during training, a frontier lab could remove it or control its expression, with collateral damage known in advance rather than discovered after deployment. The target audience is a frontier-lab safety team, and the aim is uptake into how models are trained.
This is a milestone program:
M1: anchor concepts in autoencoders (done, published)
M2: test whether SCA transfers to transformers (this grant)
M3 and M4: carry SCA to small language models and LLM fine-tunes with real safety targets
M2 works in a synthetic color-mixing domain with unambiguous ground truth (red + blue = purple), in a small transformer. Deliverables:
D2.1 (~4 weeks): does SCA work in a transformer at all? Anchor a concept such as red across the residual stream in the color-mixing task; probe each layer for the anchored concept, and confirm that completion accuracy (predicting the correct result color) matches an un-anchored baseline.
D2.2 (~10 weeks): anchor an abstract operation (e.g. addition, rather than a concrete attribute like redness); sweep over layers; confirm task performance is intact and that suppression scales as it did in the autoencoders.
D2.3 (~10 weeks): the safety payoff. Add a verification task (red + blue = purple TRUE/FALSE) and test whether suppression can degrade completion while preserving verification: the experimental analog of letting a model recognize a behavior without being able to produce it.
D2.4 (~3 weeks): consolidation and outreach. Light write-ups may already go out after D2.1–D2.3 as results allow; D2.4 is protected time to fold findings into public reports (LessWrong sequence, arXiv report) and share them directly with the feature-geometry groups (Goodfire, Simplex) and frontier-lab safety teams. As an independent researcher, I don't have a lab's reach to get results noticed, so I've costed this as its own deliverable.
Why the synthetic domain first, rather than language models straight away? A toy domain keeps a negative result interpretable. In an LLM, a null result at this stage could mean the method failed, or it could mean any of the other things that can go wrong in language-model training. The color domain removes those confounds, so M2 is a clean test of the method itself; M3 and M4 then carry it to natural-language models and real safety targets (sycophancy is the lead candidate). Negative results at M2 are publishable and directly inform that later work.
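To make the synthetic domain concrete, here is a minimal sketch of how completion examples (D2.1) and verification examples (D2.3) might be generated. Only red + blue = purple comes from the proposal; the other entries in the mixing table, the prompt format, and the function names are assumptions for illustration.

```python
import random

# Hypothetical mixing table. Only red + blue = purple is from the proposal.
MIX = {
    frozenset(["red", "blue"]): "purple",
    frozenset(["red", "yellow"]): "orange",
    frozenset(["blue", "yellow"]): "green",
}

def completion_example(a, b):
    """Prompt the model must complete, plus the ground-truth result color."""
    return f"{a} + {b} =", MIX[frozenset([a, b])]

def verification_example(a, b, rng):
    """A full statement plus a TRUE/FALSE label; half are corrupted on average."""
    truth = MIX[frozenset([a, b])]
    if rng.random() < 0.5:
        return f"{a} + {b} = {truth}", "TRUE"
    wrong = rng.choice(sorted(c for c in MIX.values() if c != truth))
    return f"{a} + {b} = {wrong}", "FALSE"

rng = random.Random(0)
print(completion_example("red", "blue"))
print(verification_example("red", "blue", rng))
```

Because every example has an unambiguous answer, completion accuracy and verification accuracy can be scored separately, which is what lets D2.3 ask whether suppression degrades one while preserving the other.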
The closest existing method, gradient routing (Cloud et al., 2024), also steers where a concept lands during training, but by masking gradients rather than shaping the loss. Its intervention effects are characterized empirically; SCA's are bounded analytically from geometry, before you run the intervention.
Applying as an individual in Melbourne; figures include on-costs.
Funding goal ($76,621): full M2 (D2.1–D2.4, ~6.2 months). About 82% is stipend (gross, below market rate); the rest is compute (Modal), a laptop ($1,655), Claude Code, and an accountant.
Minimum ($49,018): covers D2.1, D2.2, and D2.4 (~3.9 months), deferring the verification-task deliverable (D2.3).
I'll run this independently from Melbourne, drawing on a wider network along the way: researchers at Lyptus Research and Timaeus have expressed interest in the experimental design, Patryk Wielopolski (M1 co-author, now at MATS) gave feedback that sharpened its framing, and the Melbourne AI Safety Hub, which meets weekly, knows the work well.
M1 (ICLR GRaM 2026): I ran the M1 experiments solo in 2025 (~6 months, self-funded) and co-authored the resulting paper, "Sparse Concept Anchoring for Interpretable and Controllable Neural Representations", which established SCA in autoencoders and came with a full code release.
Engineering: Senior Research Engineer at Timaeus (Nov 2025 – Apr 2026), building distributed GPU infrastructure for large-scale interpretability research. Before that, 20+ years in production software (ML/AI, real-time 3D, continent-scale geospatial systems). Compute infrastructure is ready: mi-ni, a Modal-based experiment-iteration template already used for M1 and an M2 pilot.
Jesse Hoogland (CEO, Timaeus) recommended an earlier fundraiser for this project; see his comment, and my reply, for details.
Links: LessWrong sequence · GitHub · LinkedIn references
Technical risk: perhaps something about transformers makes the anchor harder to form, or the intervention harder to bound, than the autoencoder results suggest. Even then, the synthetic domain keeps a negative result interpretable and publishable, which would sharpen the work that follows.
Outreach risk: perhaps the work succeeds but goes unnoticed. This is a risk Jesse flagged on the previous proposal, and my own main worry. Mitigations are built into the plan: legible write-ups and an arXiv report; warm introductions and endorsements from people who already know the work; and deliverables framed for uptake by feature-geometry groups and frontier-lab safety teams. I believe I have the connections to reach these people directly.
Other funding: none in the last 12 months; M1 was self-funded. I have parallel applications out for the same or related work: LTFF (M2, ~$76.6k) and a Schmidt Sciences Interpretability Pilot (~$409k, 18 months, covering M2 + M3). If Schmidt funds the larger program, I'd scale back or withdraw the M2-only requests. I disclose parallel applications to every funder I'm in touch with.