Salaries for SAE Co-occurrence Project

Science & technology · Technical AI safety

Matthew A. Clarke

Not funded · Grant
$0 raised

Matthew A. Clarke, Hardik Bhatnagar, Joseph Bloom

Project summary

We’re seeking funding (mainly for living expenses) for Matthew Clarke and Hardik Bhatnagar to finish a project that Joseph and Matthew have been working on, studying the co-occurrence of SAE latents.

Scientific details:

  • The paper “Not All Language Model Features Are Linear” claims that not all fundamental units of language model computation are one-dimensional. We are trying to answer related questions, including:

    1. What fraction of SAE latents might best be understood in groups rather than individually?

    2. Do low-dimensional subspaces mapped by co-occurring groups of features ever provide a better unit of analysis (or possibly intervention) than individual SAE latents?

  • Our initial results (see the measurement sketch after this list) suggest more extensive co-occurrence structure in smaller SAEs and more subspaces of interest than previously found by Engels et al., including subspaces that may track:

    • Uncertainty, such as between various plausible hypotheses about how a word is being used.

    • Continuous quantities, such as how far through a 10-token URL a token occurs.
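
A minimal sketch of the kind of co-occurrence measurement involved, assuming a precomputed array of SAE latent activations; the `latent_acts` name, the activation threshold, and the Jaccard-style statistic are illustrative assumptions rather than the project’s exact metric:

```python
# Minimal sketch: pairwise co-occurrence of SAE latents over a token stream.
# Assumes `latent_acts` is a (n_tokens, n_latents) array of SAE latent
# activations computed elsewhere (e.g. with SAELens); the Jaccard-style
# score below is illustrative, not necessarily the project's exact metric.
import numpy as np

def cooccurrence_matrix(latent_acts: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Return an (n_latents, n_latents) matrix of Jaccard co-occurrence scores."""
    active = latent_acts > threshold                        # boolean firing mask per token
    joint = active.T.astype(float) @ active.astype(float)   # tokens where both latents fire
    counts = active.sum(axis=0).astype(float)               # tokens where each latent fires
    union = counts[:, None] + counts[None, :] - joint
    with np.errstate(divide="ignore", invalid="ignore"):
        jaccard = np.where(union > 0, joint / union, 0.0)
    np.fill_diagonal(jaccard, 0.0)                          # ignore self co-occurrence
    return jaccard

# Example with random sparse data standing in for real SAE activations.
rng = np.random.default_rng(0)
acts = rng.random((10_000, 512)) * (rng.random((10_000, 512)) < 0.02)
cooc = cooccurrence_matrix(acts)
print(cooc.shape)  # (512, 512)
```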

What are this project's goals? How will you achieve them?

Goal 1: Produce a LessWrong post / academic paper comprehensively studying SAE latent co-occurrence. We’ll achieve this by:

  1. Finishing our current draft. We have most of a draft written and mainly want to add a few more experimental results and do a good job of communicating our results to the community.

  2. Reproducing our existing results on larger models / more SAEs. Our methods include measuring latent co-occurrence and generating co-occurrence networks (sketched below) on Gemma Scope SAEs (so far we’ve studied Joseph’s GPT-2 small feature-splitting SAEs).

  3. (Stretch Goal): Train probes of various kinds and study causal interventions on feature subspaces to provide more conclusive evidence of the need to reason about some features as groups rather than individually. 
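
A minimal sketch of turning such a co-occurrence matrix into a co-occurrence network and reading off candidate latent groups, assuming the `cooccurrence_matrix` helper sketched above; the edge threshold and the use of connected components are illustrative choices, not necessarily the exact pipeline used on the GPT-2 small or Gemma Scope SAEs:

```python
# Minimal sketch: build a co-occurrence network from a co-occurrence matrix
# and extract candidate groups of latents to study jointly. The threshold
# and the use of connected components are illustrative assumptions.
import networkx as nx
import numpy as np

def cooccurrence_graph(cooc: np.ndarray, edge_threshold: float = 0.3) -> nx.Graph:
    """Add an edge between two latents whenever their co-occurrence exceeds the threshold."""
    graph = nx.Graph()
    graph.add_nodes_from(range(cooc.shape[0]))
    rows, cols = np.where(np.triu(cooc, k=1) > edge_threshold)
    graph.add_weighted_edges_from(
        (int(i), int(j), float(cooc[i, j])) for i, j in zip(rows, cols)
    )
    return graph

def latent_groups(graph: nx.Graph, min_size: int = 2) -> list[set[int]]:
    """Connected components of size >= min_size: candidate groups of co-occurring latents."""
    return [c for c in nx.connected_components(graph) if len(c) >= min_size]

# Usage, continuing from the co-occurrence sketch above:
# graph = cooccurrence_graph(cooc, edge_threshold=0.3)
# for group in latent_groups(graph):
#     print(sorted(group))
```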

Goal 2: Provide further mentorship to Matthew and Hardik. We’ll achieve this by:

  1. Meeting for 1-2 hours a week to discuss the results / the state of the project.

  2. Joseph will review work done by Matthew / Hardik and assist with the write-up. 

How will this funding be used?

MVP: $6400 USD

  • Research Assistant Salaries: $3000 per person for one month, total $6000

  • Compute Budget: Compute costs for 1 month: code requires A100s, cost of $1.19 per hour on RunPod, estimate of $50 per person per week: total $400

Lean: $9600 USD

  • Research Assistant Salaries: $3000 per person for 1.5 months, total $9000

  • Compute Budget: Compute costs for 1.5 months: code requires A100s, cost of $1.19 per hour on RunPod, estimate of $50 per person per week: total $600

Ideal: $18800 USD

  • Research Assistant Salaries: $3000 per person for 2 months, total $18000

  • Compute Budget: Compute costs for 2 months: code requires A100s, cost of $1.19 per hour on RunPod, estimate of $50 per person per week: total $800 (see the sanity check below)
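
As a sanity check on the compute line items, a short hedged sketch; the A100-hours figure is derived from the quoted $1.19/hour and $50/week numbers rather than stated in the budget:

```python
# Sanity check of the compute budget: A100s at $1.19/hour on RunPod,
# estimated at $50 per person per week for two research assistants.
A100_PER_HOUR = 1.19      # USD, RunPod rate quoted above
WEEKLY_PER_PERSON = 50.0  # USD, estimate quoted above
PEOPLE = 2

print(f"Implied A100-hours per person per week: {WEEKLY_PER_PERSON / A100_PER_HOUR:.0f}")  # ~42

for label, weeks in [("MVP (1 month)", 4), ("Lean (1.5 months)", 6), ("Ideal (2 months)", 8)]:
    print(f"{label}: ${WEEKLY_PER_PERSON * PEOPLE * weeks:.0f} compute")
# MVP: $400, Lean: $600, Ideal: $800, matching the compute totals above.
```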

Who is on your team? What's your track record on similar projects?

Joseph Bloom - Mentor / Supervisor

  • Recent Mechanistic Interpretability Research Mentoring: LASR scholars recently published “A is for Absorption”, demonstrating that the sparsity objective also encourages undesirable “gerrymandering” of information (Neel’s tweet here; accepted to the Interpretable AI workshop at NeurIPS).

  • Mechanistic Interpretability Research: MATS alumnus (Nanda stream). Various works include publishing exceptionally popular open-source SAEs, Understanding SAE Features with the Logit Lens, Linear Representations underpinning spelling in GPT-J, and various publications on Decision Transformer Interpretability. Work mentioned in the Circuits Thread Interpretability Update.

  • Mechanistic Interpretability Infrastructure: Author of SAELens. Previous maintainer of TransformerLens. Co-founder of Decode Research (building Neuronpedia). Author of the DecisionTransformerInterpretability library.

Matthew Clarke - Research Assistant / Mentee:

  • Mechanistic Interpretability Research: 3 months of solo work on this project as part of PIBBSS (interrupted by surgery with a long recovery); see the end-of-fellowship talk on this project: Examining Co-occurrence of SAE Features - Matthew A. Clarke - PIBBSS Symposium.

  • 10 years of experience in academic research, focusing on modelling the regulatory networks of cancer cells to better understand and treat this disease.

  • Four first-author publications in leading scientific journals, and experience leading successful collaborations as part of the Jasmin Fisher Lab at UCL.

  • Now transitioning into mechanistic interpretability research, aiming to apply the skills learned from understanding biological regulatory networks to problems in AI safety, and vice versa.

Website: https://mclarke1991.github.io/ 

Hardik Bhatnagar - Research Assistant / Mentee:

  • Research scholar in the LASR Labs program, mentored by Joseph Bloom on the Feature Absorption project. The project paper is on arXiv (submitted to ICLR) and on LessWrong.

  • Previously worked at Microsoft Research on understanding model jailbreaks and how harmful concepts are represented in large language models as a function of training (pretraining, finetuning, RLHF).

  • Previously worked in computational neuroscience for a year, studying the mechanisms of human visual saliency using fMRI experiments.

What are the most likely causes and outcomes if this project fails?

  • Technical Challenges: We could post what we have today (though the results would be less convincing and the claims less solid), but the remaining work involves larger models and possibly some newer methods, which may fail to add value if Matthew / Hardik can’t get them working. Joseph assigns a < 20% chance of substantial complications in execution / implementation.

  • Team members are offered other opportunities: Both Hardik and Matthew are applying to MATS and other AI safety roles, and may leave the project sooner than is optimal. This would probably be a good outcome, and we will know whether it is the case before receiving funding.

How much money have you raised in the last 12 months, and from where?

Via Decode Research / Neuronpedia, projects associated with Joseph have received substantial funding (> $500K USD). The PIBBSS and LASR programs, in which Joseph has mentored, have also received funding.
