
Joseph Bloom - Independent AI Safety Research

Technical AI safety

Joseph Bloom

Complete Grant
$51,400 raised

Project summary

I would like to continue studying offline-RL agents using mechanistic interpretability in order to understand goals and agency. I believe the derived insights may help predict, detect and/or prevent AI misalignment. 


Key Activities:

  • Research mechanistic interpretability (MI) of trajectory transformer models.

  • Build and maintain open-source tooling (e.g., the TransformerLens package); a brief sketch of its workflow follows this list.

  • Mentor, support, and advise other researchers/engineers.

  • Possibly: start a 501(c)(3) foundation modeled on the Farama Foundation to accelerate alignment tools/infrastructure.
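
For context on the tooling mentioned above, here is a minimal sketch of the core TransformerLens workflow: loading a pretrained model and caching its internal activations for inspection. The model name and prompt are arbitrary placeholders, not taken from this project.

```python
# Minimal TransformerLens workflow: load a pretrained model and cache all
# intermediate activations in one forward pass. The model name and prompt
# are arbitrary placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# run_with_cache returns the logits plus an ActivationCache holding every
# intermediate activation (residual stream, attention patterns, MLP outputs).
logits, cache = model.run_with_cache("The agent moved toward the goal")

# Example: the attention pattern of layer 0,
# shape [batch, n_heads, query_pos, key_pos].
print(cache["pattern", 0].shape)
```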

Key reasons:

  • Research: My work is likely to produce insights into alignment-relevant questions, including foundational MI, goal representations, and the validation of new MI techniques.

  • Tooling/Open source: Open-source packages that enable better MI will lead to faster innovation and adoption.

  • Community: I’d continue to help others develop technical skills, prioritize research directions, apply for funding, and contribute to open-source projects.

Project goals

Concretely, I’d like to broaden my current research scope from Decision Transformers to offline-RL transformers more generally. This will involve training models and then applying current or novel MI approaches to look for goal representations and to understand the mechanisms by which next-token predictors “simulate” agents.
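
To make “looking for goal representations” concrete, here is a schematic sketch of one standard approach: training a linear probe to decode a goal variable from model activations. Everything in it is a stand-in; the random activations tensor substitutes for cached hidden states from a trained offline-RL transformer, and goal_labels for per-trajectory goal annotations.

```python
# Hedged sketch: probe for a linearly decodable "goal" in activations.
# The data here is synthetic; in practice `activations` would be hidden
# states cached at some layer of a trained offline-RL transformer, and
# `goal_labels` the goal each trajectory was conditioned on.
import torch
import torch.nn as nn

d_model, n_goals = 128, 4
activations = torch.randn(1000, d_model)          # stand-in for cached hidden states
goal_labels = torch.randint(0, n_goals, (1000,))  # stand-in for trajectory goals

probe = nn.Linear(d_model, n_goals)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(probe(activations), goal_labels)
    loss.backward()
    opt.step()

# High probe accuracy is only weak evidence that the goal is linearly
# represented at the probed layer; causal tests would still be needed.
acc = (probe(activations).argmax(-1) == goal_labels).float().mean()
print(f"probe accuracy: {acc.item():.2f}")
```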


Conceptually, I’d like to:

  • Better understand transformers/prosaic AI in general.

  • Reduce my confusion about questions like why GPT-4 isn’t more agentic, or to what extent it could be said to have goals.


I expect the impact of this work to be that I publish toy models, tools, analyses, and experimental results which improve the state of public knowledge around agency and goals in transformer models.

How will this funding be used?

Salary: $50k
Taxes: $40k
Travel/conferences: $5k
Computing budget: $10k
Work requirements: $5k

Total: $110k

Another $140k would go towards:

1. Starting a foundation to organize better tools for independent researchers working on alignment.

2. Hiring a research intern.

How could this project be actively harmful?

A) This person could show enough promise that they are headhunted by capabilities labs.
B) Open-source tooling for mechanistic interpretability could be used for bad purposes.

What other funding is this person or project getting?

I am confident that I should be doing AI alignment work given my skill set and so will seek funding from other sources. I have no current applications with other funders. I am interviewing for a role as a Research Engineer at DeepMind.

Similar projects
Jaeson Booker
Jaeson's Independent Alignment Research and work on Accelerating Alignment
Collective intelligence systems, Mechanism Design, and Accelerating Alignment
Technical AI safety · $0 raised

Lawrence Chan
Exploring novel research directions in prosaic AI alignment
3 months
Technical AI safety · $30K raised

Lucy Farnik
Discovering latent goals (mechanistic interpretability PhD salary)
6-month salary for interpretability research focusing on probing for goals and "agency" inside large language models
Technical AI safety · $1.59K raised

Robert Krzyzanowski
Scaling Training Process Transparency
Compute and infrastructure costs
Technical AI safety · $5.15K raised

Matthew A. Clarke
Salaries for SAE Co-occurrence Project
Working title: “Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces”
Science & technology · Technical AI safety · $0 raised

Alexander Bistagne
Alignment Is Hard
Proving Computational Hardness of Verifying Alignment Desiderata
Technical AI safety · $6.07K raised

Jesse Hoogland
Scoping Developmental Interpretability
6-month funding for a team of researchers to assess a novel AI alignment research agenda that studies how structure forms in neural networks
Technical AI safety · $145K raised