Joseph Bloom - Independent AI Safety Research

Project summary

I would like to continue studying offline-RL agents using mechanistic interpretability in order to understand goals and agency. I believe the derived insights may help predict, detect and/or prevent AI misalignment.

Key Activities:

Research mechanistic interpretability (MI) of trajectory transformer models.
Build and maintain open source tooling (e.g. the TransformerLens package).
Mentor, support and advise other researchers/engineers.
Possibly: Start a 501(c)(3) foundation modeled on the Farama Foundation to accelerate alignment tools/infrastructure.

Key reasons:

Research: My work is likely to produce insights into alignment-relevant questions, including foundational MI, goal representations and the validation of new MI techniques.
Tooling/Open Source: Open source packages that enable better MI will lead to faster innovation and adoption.
Community: I’d continue to help others develop technical skills, prioritize research directions, apply for funding and contribute to open source projects.

Project goals

Concretely, I’d like to broaden my current research scope from Decision Transformers to offline-RL transformers more generally. This will involve training models and then using current or novel MI approaches to look for goal representations and to understand the mechanisms by which next-token predictors “simulate” agents.

Conceptually, I’d like to:

Better understand transformers/prosaic AI in general.

Reduce my confusion about questions such as why GPT-4 isn’t more agentic, or to what extent it can be said to have goals.

I expect the impact of this work to be that I will publish toy models, tools, analyses and experimental results which improve the state of public knowledge around agency and goals in transformer models.

How will this funding be used?

Salary: 50k

Taxes: 40k

Travel/Conferences: 5k

Computing budget: 10k

Work Requirements: 5k

Total: 110k

Another 140k would go towards:
1. Starting a foundation to organize better tools for independent researchers working on alignment.
2. Hiring a research intern.

How could this project be actively harmful?

A) This person could show enough promise that they are headhunted by capabilities labs.
B) Open source tooling for mechanistic interpretability could possibly be used for harmful purposes.

What other funding is this person or project getting?

I am confident that I should be doing AI alignment work given my skill set and so will seek funding from other sources. I have no current applications with other funders. I am interviewing for a role as a Research Engineer at DeepMind.