hi anton - great questions! also lol @ subscription - if only!
i'll preface by saying that i'm by no means an expert in mechanistic interpretability, and I apologize for not including more detailed justification on the grant application or website. if you've been doing this a while, you probably know more than me, and your question of "why try to understand neurons?" is probably best answered by someone with an academic background in this.
Re: Usefulness of neuron naming
People aren't currently using GPT2-SMALL as their daily chatbot, but what we learn from smaller models can ideally be applied to larger ones, and the idea is that eventually we'd run new campaigns to help identify neurons in larger models. Purely for example's sake, maybe we're able to identify a neuron (or group of neurons) for violent actions - in that case we might try to update the model to reduce that neuron's influence. Of course this can quickly turn into a potential trolley problem (maybe changing that part affects some other part negatively) - but having this data is likely better than not having it.
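To make that a bit more concrete, here's a rough sketch (not anything Neuronpedia actually does today) of what "reducing a neuron's influence" could look like: zeroing out a single MLP neuron in GPT2-SMALL with a PyTorch forward hook and seeing how the model's output changes. The layer and neuron indices are made up purely for illustration.

```python
# Hypothetical sketch: ablate one MLP neuron in GPT2-small via a forward hook.
# LAYER and NEURON are made-up indices, not a real "violence" neuron.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER, NEURON = 5, 1234  # purely illustrative

def ablate_neuron(module, inputs, output):
    # output: post-GELU MLP activations, shape [batch, seq, 3072]
    output[..., NEURON] = 0.0
    return output

# model.transformer.h[LAYER].mlp.act is the activation module, so hooking its
# output lets us zero that neuron's activation on every forward pass.
handle = model.transformer.h[LAYER].mlp.act.register_forward_hook(ablate_neuron)

tokens = tokenizer("The protest turned", return_tensors="pt")
with torch.no_grad():
    logits = model(**tokens).logits  # logits with the neuron ablated

handle.remove()  # restore normal behavior
```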
Aside from the explanations themselves, data on how users arrive at a good explanation can also be useful - what activations are users looking at? Which neurons tend to be easy to explain and which aren't? Etc.
There is a larger question of the usefulness of looking at individual neurons vs other "units", as highlighted in the 2nd premortem. You're correct that Neuronpedia will likely eventually need to adapt to analyzing units beyond single neurons. This is high priority on the TODO.
Re: Can't the network generate the explanation itself?
Yes, that's exactly how the existing explanations are generated. It basically uses GPT-4 to guess what the neurons in GPT2-SMALL are related to. Please see this paper from OpenAI: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
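For a rough flavor of the approach (this is my paraphrase, not OpenAI's actual prompt or code): you show GPT-4 the tokens of a passage alongside the neuron's activation on each token and ask it to guess what the neuron responds to. The (token, activation) pairs below are made up.

```python
# Hypothetical sketch of the automated-explanation idea: ask GPT-4 to name the
# concept a neuron responds to, given per-token activations.
from openai import OpenAI

client = OpenAI()

# Made-up (token, activation) pairs for one neuron on one snippet.
records = [("The", 0.1), ("riot", 8.7), ("turned", 0.3), ("violent", 9.2), ("quickly", 0.4)]

prompt = (
    "Here are tokens from a passage, each followed by a neuron's activation.\n"
    + "\n".join(f"{tok}\t{act:.1f}" for tok, act in records)
    + "\nIn one short phrase, what does this neuron seem to respond to?"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # e.g. "words related to violence"
```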
The issue is that these explanations aren't great, and that's why Neuronpedia solicits human help to solve these neuron puzzles.
Re: How do you score the explanation suggested by the user?
The scoring uses a simulator from OpenAI's Automated Interpretability work, based on the top known activating texts and their activations. You can see how it works here: https://github.com/openai/automated-interpretability/tree/main
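The core idea, in simplified form (this is a sketch, not the repo's actual code): the simulator predicts per-token activations from the explanation alone, and the score reflects how well those predictions correlate with the neuron's real activations.

```python
# Minimal sketch of explanation scoring: correlate simulated activations
# (predicted from the explanation) against the neuron's real activations.
import numpy as np

def explanation_score(real_activations, simulated_activations):
    """Pearson correlation between real and simulated activations (~1.0 is best)."""
    real = np.asarray(real_activations, dtype=float)
    simulated = np.asarray(simulated_activations, dtype=float)
    return float(np.corrcoef(real, simulated)[0, 1])

# Made-up activations over the same tokens:
real = [0.1, 8.7, 0.3, 9.2, 0.4]
simulated = [0.0, 7.5, 1.0, 8.0, 0.2]  # what the simulator predicted from the explanation
print(explanation_score(real, simulated))  # close to 1.0 -> a good explanation
```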
One of the things the game currently does not do (that I would like to do given more resources) is re-score all explanations whenever a new high-activation text is found. This would mean higher quality (more accurate) scores. Also, larger models (even GPT2-XL) require expensive GPUs to run activation testing on new texts.
again, i'm no expert in this - i'm fairly new to AI, but I want to build useful things. let me know if you have further questions and i'll try my best to answer!