Neuronpedia (neuronpedia.org) accelerates interpretability researchers who use Sparse Autoencoders (SAEs) by acting as a comprehensive reference and fast-feedback analysis platform. I am collaborating with Joseph Bloom (SERI-MATS) to plan, build, and test the most important features. This is a significant pivot from the previous "crowdsourcing via gamification" project.
Usage by Joseph Bloom: Uploaded to Neuronpedia, comment by Neel Nanda, direct link to directions, original SAE post
Walkthrough / usage tutorial by exploring OpenAI's MLP directions
Example Search / live inference
The problem we’re solving is: The amount and complexity of data involved in understanding neural networks is growing rapidly due to unsupervised methods like Sparse Autoencoders and Automated Interpretability. This will inevitably raise the bar for researchers contributing to Mechanistic Interpretability without access to great infrastructure / engineers. We’ve already made it easier for some researchers to work with Sparse Autoencoders trained on GPT2-SMALL and want to accelerate more researchers working on more models / projects.
The way we solve this problem is: Distributed infrastructure which supports sharing and analysis of common datasets. We’re inspired by the vast amounts of data shared between biomedical researchers in databases which contain data about genes, proteins, and diseases. But this is still a young field, so we don’t know exactly what kinds of data / resources are going to be most useful to collaborate on and/or share. Therefore, we think a good compromise between scaling fast (around existing techniques) and waiting to see what comes next is to keep an agile mentality while working with researchers to solve the problems currently blocking them when they try to make sense of neural network internals.
The particular solutions we’re excited to expand include:
Hosting SAEs trained by organisations like OpenAI and researchers like Joseph Bloom. Sparse Autoencoders provide an unprecedented ability to decompose model internals into meaningful components. We’re going to accelerate research by sharing model transparency as widely as possible with the research community.
Generation of feature dashboards. Data visualization is incredibly useful for summarising data and enabling researchers to build intuition. However, generating feature dashboards at scale and storing them is a challenging engineering feat which many researchers / labs will be able to sidestep by using Neuronpedia.
Generation and Scoring of Explanations for Sparse Autoencoder Features. Scaling interpretability research may require significant automation, which has already been integrated into Neuronpedia. Benchmarking, improving, and speeding up our automatic interpretability features could further accelerate our understanding of model internals. See an example of an automatic explanation for a feature here (a minimal scoring sketch appears after this list).
Interactive interfaces which enable:
Mechanistic Interpretability via SAE features: Since we’re already supporting live inference and calculation of feature activations, it’s plausible that we could support a limited degree of live experimentation, such as sampling with features pinned (steering with features, as sketched after this list) or feature ablations. Other features we could build might include search over prompt templates designed to find features involved in common algorithms.
Red-Teaming of SAE Features: We already provide the ability to type in text on feature dashboards, which enables a weak form of red-teaming. However, we could expand this, for example by enabling users to provide a hypothesis which GPT-4 uses to generate text on which a feature may or may not fire (see the red-teaming sketch after this list).
Exploration of SAE Quality and Evaluations: SAEs are a new technique where mileage may vary depending on a variety of factors. It may be useful to develop features that create transparency around SAE quality and compare results between SAEs (see the metrics sketch after this list).
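To make the automated scoring concrete, here is a minimal sketch of the "simulate and score" approach in the style of OpenAI's Automated Interpretability work: an explanation is scored by how well activations simulated from it correlate with the feature's real activations. The `simulate_activations` callable is a hypothetical stand-in for a GPT-4-based simulator.

```python
# Minimal sketch of "simulate and score" explanation scoring.
# `simulate_activations` is a hypothetical callable: given an explanation and
# a token sequence, it asks a model like GPT-4 to predict one activation
# value per token.
import numpy as np

def score_explanation(explanation, tokens, true_activations, simulate_activations):
    """Score = correlation between real and simulated activations."""
    simulated = np.asarray(simulate_activations(explanation, tokens), dtype=float)
    true = np.asarray(true_activations, dtype=float)
    # A constant simulation (or constant truth) has undefined correlation;
    # treat that case as a score of 0.
    if simulated.std() == 0 or true.std() == 0:
        return 0.0
    return float(np.corrcoef(true, simulated)[0, 1])
```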
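A minimal sketch of sampling with a feature pinned (steering), assuming a TransformerLens model and an SAE decoder direction loaded separately; the random `decoder_direction` below is a placeholder for the real decoder row of the feature you want to steer with.

```python
# Minimal sketch of steering: add a multiple of an SAE feature's decoder
# direction to the residual stream at every position during generation.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT2-SMALL

decoder_direction = torch.randn(model.cfg.d_model)  # placeholder: load from your SAE
decoder_direction = decoder_direction / decoder_direction.norm()
steering_strength = 8.0  # how hard to pin the feature
layer = 6  # which residual stream to steer

def steering_hook(resid, hook):
    # resid: (batch, seq, d_model); add the feature direction at every position
    return resid + steering_strength * decoder_direction.to(resid.device)

with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", steering_hook)]):
    print(model.generate("The movie was", max_new_tokens=20))
```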
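A minimal sketch of hypothesis-driven red-teaming, assuming a hypothetical `feature_activation` helper that runs the subject model, applies the SAE encoder, and returns the feature's max activation on a text: GPT-4 writes texts that should and should not match the user's hypothesis, and mismatches are surfaced as counterexamples.

```python
# Minimal sketch of red-teaming a feature explanation with GPT-4.
from openai import OpenAI

client = OpenAI()

def generate_test_texts(hypothesis, should_fire, n=5):
    prompt = (
        f"A model feature is hypothesised to fire on: {hypothesis}\n"
        f"Write {n} short texts that should {'strongly match' if should_fire else 'NOT match'} "
        "this hypothesis, one per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return [line for line in resp.choices[0].message.content.splitlines() if line.strip()]

def red_team(hypothesis, feature_activation, threshold=0.0):
    # Counterexamples in either direction are evidence against the hypothesis.
    for should_fire in (True, False):
        for text in generate_test_texts(hypothesis, should_fire):
            fired = feature_activation(text) > threshold
            if fired != should_fire:
                print(f"Counterexample ({'missed' if should_fire else 'false fire'}): {text!r}")
```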
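And a minimal sketch of the kind of quality metrics such a feature might surface, assuming an `sae` object with `encode`/`decode` methods: L0 sparsity (mean number of active features per input) and the fraction of activation variance explained by the reconstruction.

```python
# Minimal sketch of basic SAE quality metrics on a batch of activations.
import torch

@torch.no_grad()
def sae_quality_metrics(sae, activations):
    feature_acts = sae.encode(activations)   # (batch, n_features)
    recon = sae.decode(feature_acts)         # (batch, d_model)

    l0 = (feature_acts > 0).float().sum(dim=-1).mean()
    mse = (recon - activations).pow(2).mean()
    total_var = (activations - activations.mean(dim=0)).pow(2).mean()
    return {
        "l0": l0.item(),
        "mse": mse.item(),
        "frac_variance_explained": (1 - mse / total_var).item(),
    }
```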
Specific Milestones we think are worth working towards include:
By July 1st, release a technical note sharing Neuronpedia feature details and three case studies showing how Neuronpedia has accelerated research into neural network internals. This technical note will focus on basic feature exploration, red-teaming, and search-related features.
By October 1st, release a second technical note sharing Neuronpedia feature details and three further case studies showing how Neuronpedia has accelerated research into neural network internals. This technical note could focus on more advanced features developed in consultation with the research community. For example, we might build features that enable comparison of features across layers to understand how representations change with depth, or comparison of features across models to understand the universality of features found between models.
$250k funds Neuronpedia for 1 year, from Feb 2024 to Feb 2025.
Breakdown:
$90k is state, federal, and local taxes (source for California) - possibly higher as a 1099 contractor.
$3-5k/mo for inference servers (currently using CPU), databases, etc
$3k/mo for contractors, potentially an intern, and other help
Remaining $5-$7k/mo is living costs (rent, insurance, utilities, etc).
This is closer to the lower bound of keeping Neuronpedia running. Less than this leaves no cushion for adding significant amounts of data/features to Neuronpedia - and we want to be free, open, and frictionless hosting for all interpretability researchers. Example: we are currently hosting data for GPT2-SMALL, which is tiny and can sort of be run on CPU. We would love to study GPT2-XL, but that requires a GPU with a decent amount of memory, which gets expensive super fast.
More ($500k) would be ideal. I'm incredibly time-starved right now and am doing every role it takes to run Neuronpedia end-to-end - 100% of the coding, UI/UX/design, ops, fixing bugs, PMing tasks, keeping up with requests, comms, finding grants, writing, testing, etc. I currently work 70+ hours a week and spend my other time thinking about Neuronpedia. With more funding I'd delegate many tasks to contractors to build features faster and in parallel.
I'm an ex-Apple engineer who launched a few apps, and last year I went full time into interpretability, building Neuronpedia. The Neuronpedia infrastructure I've built and the knowledge I've gained during that time are helping me move much faster on the new research pivot. Previously, my most popular app got over 1,000,000 organic downloads, and my writing has been featured in the Washington Post, Forbes, FastCompany, etc. I have contributed code to OpenAI's public repository for Automated Interpretability - new features, some fixes (another), etc.
My collaborator is Joseph Bloom (MATS), who recently published residual directions for GPT2-SMALL. Joseph has 2+ years of experience at a SaaS platform doing data processing and live exploration of biomedical data, and is currently an independently funded mechanistic interpretability researcher. We collaborated to upload his directions to Neuronpedia, where all 294k directions can be browsed, filtered, and searched. Joseph is providing feedback on how Neuronpedia can help and accelerate interp research (dogfooding, PM-ing/prioritization of features, connecting with the community of interp researchers), as well as training/uploading new models.
Not enough feedback (I super-need people to quickly and bluntly tell me "hey that idea/feature doesn't work or isn't useful, scrap it" and also ideally "work on this idea/feature instead")
Not iterating on useful features quickly enough
Not enough resources to sustain the service (hosting, inference, etc.) with good uptime, and to scale to larger models
Bad UX and/or bugs: Researchers have difficulty using it and give up on it because it's too confusing
All previous funding has been exhausted as of Feb 9th, 2024 - this is now being funded by me personally. Previous short-term grants include $2,500 from Manifund - details available upon request.
Applied to OpenAI's Superalignment Fast Grants and the LTFF a few days ago. No status yet.