Neuronpedia - Open Interpretability Platform

Technical AI safety

Johnny Lin

Active Grant
$2,500 raised
$250,000 funding goal

Project summary

Neuronpedia (neuronpedia.org) accelerates interpretability researchers who use Sparse Autoencoders (SAEs) by acting as a comprehensive reference and fast-feedback analysis platform. I am collaborating with Joseph Bloom (SERI-MATS) to plan, build, and test the most important features. This is a significant pivot from the previous "crowdsourcing via gamification" project.

  • Usage by Joseph Bloom: Uploaded to Neuronpedia, comment by Neel Nanda, direct link to directions, original SAE post

  • Walkthrough / usage tutorial by exploring OpenAI's MLP directions

  • Example Search / live inference

What are this project's goals and how will you achieve them?

The problem we’re solving: the amount and complexity of data involved in understanding neural networks is growing rapidly due to unsupervised methods like Sparse Autoencoders and automated interpretability. This will inevitably raise the bar for contributing to Mechanistic Interpretability for researchers without access to great infrastructure / engineers. We’ve already made it easier for some researchers to work with Sparse Autoencoders trained on GPT2-SMALL, and we want to accelerate more researchers working on more models / projects.

The way we solve this problem: distributed infrastructure which supports sharing and analysis of common datasets. We’re inspired by the vast amounts of data shared between biomedical researchers in databases containing information about genes, proteins, and diseases. But this is still a young field, so we don’t know exactly what kinds of data / resources are going to be most useful to collaborate on and/or share. Therefore, we think a good compromise between scaling fast (around existing techniques) and waiting to see what comes next is to keep an agile mentality while working with researchers to solve the problems currently blocking them when they try to make sense of neural network internals.

The particular solutions we’re excited to expand include:

  • Hosting SAEs trained by organisations like OpenAI and researchers like Joseph Bloom. Sparse Autoencoders provide an unprecedented ability to decompose model internals into meaningful components. We’re going to accelerate research by sharing model transparency as widely as possible with the research community.

  • Generation of feature dashboards. Data visualization is incredibly useful for summarising data and enabling researchers to build intuition. However, generating feature dashboards at scale and storing them is a challenging engineering feat, one which many researchers / labs will be able to sidestep by using Neuronpedia.

  • Generation (and scoring) of explanations for Sparse Autoencoder features. Scaling interpretability research may require significant automation, which has already been integrated into Neuronpedia. Benchmarking, improving, and speeding up our automated interpretability features could further accelerate our understanding of model internals. See an example of an automatic explanation for a feature here.

  • Interactive interfaces which enable:

    • Mechanistic Interpretability via SAE features: Since we’re already supporting live inference and calculation of feature activations, it’s plausible that we could support a limited degree of live experimentation, such as sampling with features pinned (steering with features) or feature ablations; a minimal code sketch of this workflow follows this list. Other features we could build might include search over prompt templates designed to find features involved in common algorithms.

    • Red-Teaming of SAE Features: We already provide the ability to type in text on feature dashboards, which enables a weak form of red-teaming. However, we could expand features here, such as enabling users to provide a hypothesis which GPT-4 uses to generate text on which a feature may or may not fire.

    • Exploration of SAE Quality and Evaluations: SAEs are a new technique where mileage may vary depending on a variety of factors. It may be useful to develop features that create transparency around SAE quality and compare results between SAEs. 
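
To make the two central ideas above concrete (decomposing activations into SAE features, and steering by pinning a feature), here is a minimal PyTorch sketch. It is not Neuronpedia's actual implementation: the SparseAutoencoder class, the dimensions, and the feature index are illustrative placeholders with random weights, where a real workflow would load a trained SAE and splice the steered vector back into the model at the hook point.

    # Minimal sketch (illustrative only): a toy SAE with random weights, showing
    # (1) how a residual-stream activation decomposes into sparse feature activations
    # and (2) how pinning a feature before decoding yields a "steered" activation.
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, d_features: int):
            super().__init__()
            self.W_enc = nn.Parameter(torch.randn(d_model, d_features) * 0.02)
            self.b_enc = nn.Parameter(torch.zeros(d_features))
            self.W_dec = nn.Parameter(torch.randn(d_features, d_model) * 0.02)
            self.b_dec = nn.Parameter(torch.zeros(d_model))

        def encode(self, resid: torch.Tensor) -> torch.Tensor:
            # ReLU keeps only positively activating features, giving a sparse code.
            return torch.relu((resid - self.b_dec) @ self.W_enc + self.b_enc)

        def decode(self, feats: torch.Tensor) -> torch.Tensor:
            return feats @ self.W_dec + self.b_dec

    d_model, d_features = 768, 24576      # GPT2-SMALL-like sizes, chosen for illustration
    sae = SparseAutoencoder(d_model, d_features)
    resid = torch.randn(d_model)          # stand-in for one token's residual-stream vector

    feats = sae.encode(resid)             # per-feature activations, as shown on a dashboard
    print("top features:", feats.topk(5).indices.tolist())

    # Steering: pin one (arbitrary, illustrative) feature to a high value and decode;
    # in a real setup the steered vector would replace the activation at the hook point.
    pinned = feats.clone()
    pinned[1234] = 10.0
    steered_resid = sae.decode(pinned)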

Specific Milestones we think are worth working towards include:

  1. By July 1st, release a technical note sharing Neuronpedia feature details and three different case studies of how Neuronpedia has accelerated research into neural network internals. This technical note will focus on basic feature exploration, red-teaming, and search-related features.

  2. By October 1st, release a second technical note sharing Neuronpedia feature details and three different case studies of how Neuronpedia has accelerated research into neural network internals. This technical note could focus on more advanced features developed in consultation with the research community. For example, we might build features that enable comparison of features across layers to understand how representations change with depth, or compare features across models to understand the universality of features found between models; a minimal sketch of one such comparison follows.
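
As one concrete starting point for the cross-layer / cross-model comparison described in milestone 2, features from two SAEs can be matched by cosine similarity of their decoder directions. The sketch below assumes both SAEs decode into the same d_model-dimensional space and uses random placeholder matrices in place of real hosted SAE weights.

    # Sketch: match features between two SAEs (e.g. trained on adjacent layers of the
    # same model) by cosine similarity of their decoder directions. W_dec_a / W_dec_b
    # are placeholders for decoder matrices of shape (n_features, d_model).
    import torch
    import torch.nn.functional as F

    n_features, d_model = 4096, 768              # small toy sizes; real SAEs are larger
    W_dec_a = torch.randn(n_features, d_model)   # e.g. a layer-6 residual-stream SAE
    W_dec_b = torch.randn(n_features, d_model)   # e.g. a layer-7 residual-stream SAE

    a = F.normalize(W_dec_a, dim=-1)
    b = F.normalize(W_dec_b, dim=-1)
    sims = a @ b.T                               # pairwise cosine similarities

    best_sim, best_match = sims.max(dim=-1)      # closest B-feature for each A-feature
    print(f"feature 0 in SAE A best matches feature {best_match[0].item()} "
          f"in SAE B (cosine similarity {best_sim[0].item():.3f})")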

How will this funding be used?

$250k funds Neuronpedia for 1 year, from Feb 2024 to Feb 2025.

Breakdown (a rough arithmetic check is sketched at the end of this answer):

  • $90k is state, federal, and local taxes (source for California) - possibly higher as a 1099 contractor.

  • $3-5k/mo for inference servers (currently using CPU), databases, etc.

  • $3k/mo for contractors, potentially an intern, and other help

  • The remaining $5-7k/mo covers living costs (rent, insurance, utilities, etc.).

This is closer to the lower bound of what it takes to keep Neuronpedia running. Less means there's no cushion for adding significant amounts of data/features to Neuronpedia - and we want to be free, open, and frictionless hosting for all interpretability researchers. Example: we are currently hosting data for GPT2-SMALL, which is tiny and can more or less be run on CPU. We would love to support GPT2-XL, but that requires a GPU with a decent amount of memory, which gets expensive very fast.

More ($500k) would be ideal. I'm incredibly time-starved right now and am doing every role it takes to run Neuronpedia end-to-end - 100% of the coding, UI/UX/design, ops, fixing bugs, PMing tasks, keeping up with requests, comms, finding grants, writing, testing, etc. I currently work 70+ hours a week and spend much of my remaining time thinking about Neuronpedia. With more funding I'd delegate many tasks to contractors to build features faster and in parallel.
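
For transparency, the arithmetic behind the breakdown above, with the monthly ranges taken directly from the list (a rough check, not a precise budget):

    # Rough arithmetic check of the budget breakdown above (all figures in USD).
    taxes = 90_000                   # state, federal, and local taxes
    servers = (3_000, 5_000)         # per month: inference servers, databases, etc.
    contractors = (3_000, 3_000)     # per month: contractors, a potential intern, other help
    living = (5_000, 7_000)          # per month: rent, insurance, utilities, etc.

    low = taxes + 12 * (servers[0] + contractors[0] + living[0])
    high = taxes + 12 * (servers[1] + contractors[1] + living[1])
    print(f"annual total: ${low:,} to ${high:,}")   # $222,000 to $270,000, bracketing the $250k ask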

Who is on your team and what's your track record on similar projects?

  • I'm an ex-Apple engineer who founded a few apps and last year went full time into interpretability, building Neuronpedia. The Neuronpedia infrastructure I've built and the knowledge I've gained during that time are helping me move much faster on the new research pivot. Previously, my most popular app got over 1,000,000 organic downloads, and my writing has been featured in the Washington Post, Forbes, FastCompany, etc. I have code contributions to OpenAI's public repository for Automated Interpretability: new features, some fixes (another), etc.

  • My collaborator is Joseph Bloom (MATS), who recently published residual directions for GPT2-SMALL. Joseph has 2+ years of experience at a SaaS platform doing data processing and live exploration of biomedical data and is currently an independently funded mechanistic interpretability researcher. We collaborated to upload his directions to Neuronpedia, where all 294k directions can be browsed, filtered, and searched. Joseph is providing feedback on how Neuronpedia can help and accelerate interp research (dogfooding, PM-ing/prioritization of features, connecting with the community of interp researchers), as well as training/uploading new models.

What are the most likely causes and outcomes of this project's failure? (premortem)

  • Not enough feedback (I super-need people to quickly and bluntly tell me "hey that idea/feature doesn't work or isn't useful, scrap it" and also ideally "work on this idea/feature instead")

  • Not iterating on useful features quickly enough

  • Not enough resources to sustain the service (hosting, inference, etc.) with good uptime, and to scale to larger models

  • Bad UX and/or bugs: Researchers have difficulty using it and give up on it because it's too confusing

What other funding are you or your project getting?

  • All previous funding has been exhausted as of Feb 9th, 2024 - this is now being funded by me personally. Previous short-term grants include $2,500 from Manifund - details available upon request.

  • Applied to OpenAI's Superalignment Fast Grants and the LTFF a few days ago. No status yet.

Comments: 9 · Donations: 1

Similar projects (8):

  • Cadenza Labs: AI Safety research group working on own interpretability agenda (Cadenza Labs) - "We're a team of SERI-MATS alumni working on interpretability, seeking funding to continue our research after our LTFF grant ended." Categories: Science & technology, Technical AI safety, Global catastrophic risks. $7.81K raised.

  • Simplex - building our research team (Adam Shai) - "Fund a new research agenda, based on computational mechanics, bridging mechanism and behavior to develop a rigorous science of AI systems and capabilities." Categories: Science & technology, Technical AI safety. $0 raised.

  • Act I: Exploring emergent behavior from multi-AI, multi-human interaction (ampdot) - "Community exploring and predicting potential risks and opportunities arising from a future that involves many independently controlled AI systems." Categories: Technical AI safety, EA Community Choice, Forecasting, Global catastrophic risks. $67.8K raised.

  • Building Tooling to Map how Ideas Spread (Francisco Carvalho) - "The nooscope will deliver public tools to map how ideas spread, starting with psyop detection, within 18 months." Categories: Science & technology, Forecasting, Global catastrophic risks. $2.63K raised of a $100K goal.

  • WhiteBox Research: Training Exclusively for Mechanistic Interpretability (Brian Tan) - "1.9 FTE for 9 months to pilot a training program in Manila exclusively focused on Mechanistic Interpretability." Categories: Technical AI safety. $12.4K raised.

  • WhiteBox Research’s AI Interpretability Fellowship (Brian Tan) - "~4 FTE for 9 months to fund WhiteBox Research, mainly for the 2nd cohort of our AI Interpretability Fellowship in Manila." Categories: Technical AI safety, EA Community Choice. $2K raised.

  • Synthetic pretraining to make AIs more compassionate (Miles Tidmarsh) - "Enabling Compassion in Machine Learning (CaML) to develop methods and data to shift future AI values." Categories: Technical AI safety, Animal welfare. $0 raised of a $159K goal.

  • Keep Apart Research Going: Global AI Safety Research & Talent Pipeline (Apart Research) - "Funding ends June 2025: Urgent support for proven AI safety pipeline converting technical talent from 26+ countries into published contributors." Categories: Technical AI safety, AI governance, EA community. $131K raised.