Discovering latent goals (mechanistic interpretability PhD salary)

Technical AI safety

Lucy Farnik

Complete grant
$1,590 raised

ALREADY FULLY FUNDED ELSEWHERE

Project summary

I'm working on interpreting LLMs, specifically trying to move us closer to a world where we can understand and edit a model's goals by examining and changing its activations. I plan to study this by explicitly giving a language model a task (e.g. "Write linux terminal commands to do [X]") and then understanding how that task is implemented inside the model.
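
To make the setup concrete, here is a minimal sketch of what "give the model a task and look at its activations" can look like in code. It's purely illustrative and not my actual experimental setup: it assumes the TransformerLens library, uses GPT-2 small as a stand-in model, and the prompt and layer choice are arbitrary.

# Illustrative only: prompt a model with an explicit task and cache its activations.
# Assumes the TransformerLens library; GPT-2 small stands in for a larger LLM.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

# An explicit task prompt of the kind described above (placeholder).
prompt = "Write linux terminal commands to list all files modified today."

# run_with_cache returns the output logits plus every intermediate activation.
logits, cache = model.run_with_cache(prompt)

# Residual-stream activations after one (arbitrarily chosen) layer:
# shape [batch, seq_len, d_model]. These are the objects one would inspect or edit.
resid = cache["blocks.6.hook_resid_post"]
print(resid.shape)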

Project goals

The broad goal is to understand how language models represent goals, and to find out whether we can "probe for agency" within them. More narrowly, I want to understand how language models implement simple "agentic" behavior when you directly prompt them to perform a specific task.
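
As a toy illustration of what "probing for agency" could mean methodologically, one could train a linear probe to predict, from a single activation vector, whether the prompt assigned the model an explicit task. The sketch below is hypothetical rather than my planned methodology: it assumes TransformerLens and GPT-2 small, and the prompts, layer, and labels are placeholders.

# Hypothetical sketch: a linear probe that predicts, from one residual-stream
# vector, whether the prompt gave the model an explicit task to carry out.
import numpy as np
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")  # stand-in model

task_prompts = [
    "Write linux terminal commands to delete old log files.",
    "Write a SQL query that counts users by country.",
]
no_task_prompts = [
    "The weather in Bristol was mild yesterday.",
    "Linux is a family of open-source operating systems.",
]

def last_token_resid(prompt, layer=6):
    # Residual stream at the final token position (layer choice is arbitrary).
    _, cache = model.run_with_cache(prompt)
    return cache[f"blocks.{layer}.hook_resid_post"][0, -1].detach().numpy()

X = np.stack([last_token_resid(p) for p in task_prompts + no_task_prompts])
y = np.array([1] * len(task_prompts) + [0] * len(no_task_prompts))

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))  # a real experiment would need held-out prompts and far more data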

If directly discovering goals turns out to be methodologically out of reach, I may pivot to interpreting easier LLM behaviors to help develop the methodology we need.

I'll be doing a portion of this within my alignment PhD. I also plan to spend some time in the next 6 months upskilling in AI governance so I can better understand how my work could contribute to informing policymakers (since there seems to be a lot of value in technical researchers bringing their expertise into public policy).

How will this funding be used?

The funding will go towards my salary for the next 6 months, as well as a travel budget (attending conferences, spending some time in the Bay Area, etc.).

Here's a rough breakdown: I need around $64k/y to live comfortably and maximize my productivity (roughly half of what I was earning as a senior developer), plus an $8k/y travel budget. Starting in October, I'll be receiving a $28k/y PhD stipend along with a $2k/y travel budget from the UK government. I'm asking for 6 months' worth of funding: the 3 months before my PhD starts (($64k + $8k) / 4 = $18k) plus the first 3 months of my PhD (($64k + $8k - $28k - $2k) / 4 = $10.5k). On top of that, I'm adding 20% to cover the income tax I'll have to pay on this and a 10% "unexpected expenses buffer", as LTFF recommends, which gets me to roughly $38k.
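
For clarity, the arithmetic works out as follows (assuming the 20% tax and 10% buffer are applied multiplicatively to the 6-month base; figures in $k):

# Sanity check of the budget arithmetic above (figures in $k).
pre_phd = (64 + 8) / 4               # 3 months of salary + travel before the PhD: 18.0
during_phd = (64 + 8 - 28 - 2) / 4   # first 3 months of the PhD, net of stipend: 10.5
base = pre_phd + during_phd          # 28.5
total = base * 1.2 * 1.1             # + income tax + unexpected-expenses buffer
print(total)                         # ~37.6, i.e. roughly $38k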

What is your (team's) track record on similar projects?

I started coding at age 7 and became a senior developer at a tech startup at age 18, where I stayed for 4 years while doing my undergrad. I then switched to alignment research, and within 6 months I had co-authored a NeurIPS submission as second author alongside people from FHI, CHAI, and FAR AI. I've also taken part in ARENA, AISC, and AGISF, done alignment community building, and founded an upskilling group in my city. While at ARENA, my team won first prize in an Apart AI hackathon for our work on accelerated automated circuit discovery.

How could this project be actively harmful?

Looking for goals and "agency" directly could obviously be dangerous: if you find a robust way to express something mathematically, you can optimize for it. Even if this project doesn't get all the way there, it might still move us closer to a world where you could directly optimize for agency. It's entirely possible that, if this project succeeds, I won't make the results public and will only share them with specific members of the AI safety community, because they could constitute an infohazard. I believe this should mitigate any negative consequences, and I also don't think there's any meaningful alignment research which doesn't generate potential infohazards.

What other funding is this person or project getting?

Starting in October, I'll be receiving $30k/y from the UK government (the PhD stipend plus travel budget mentioned above).

Similar projects

Bart Bussmann, "Epistemology in Large Language Models": 1-year salary for independent research to investigate how LLMs know what they know. (Technical AI safety, $0 raised)

Lawrence Chan, "Exploring novel research directions in prosaic AI alignment": 3 month (Technical AI safety, $30K raised)

Matthew Farr, "MoSSAIC": Probing possible limitations and assumptions of interpretability | Articulating evasive risk phenomena arising from adaptive and self modifying AI (Science & technology, Technical AI safety, AI governance, Global catastrophic risks, $0 raised)

Sandy Fraser, "Concept-anchored representation engineering for alignment": New techniques to impose minimal structure on LLM internals for monitoring, intervention, and unlearning. (Technical AI safety, Global catastrophic risks, $0 / $72.3K raised)

Jesse Hoogland, "Scoping Developmental Interpretability": 6-month funding for a team of researchers to assess a novel AI alignment research agenda that studies how structure forms in neural networks. (Technical AI safety, $145K raised)

James Lucassen, "LLM Approximation to Pass@K" (Technical AI safety, $2K / $6K raised)

Aryeh L. Englander, "Continued funding for a PhD in AI x-risk decision and risk analysis": Continuation of a previous grant to allow me to pursue a PhD in risk and decision analysis related to AI x-risks. (Technical AI safety, $0 raised)