Manifund foxManifund
Home
Login
About
People
Categories
Newsletter
HomeAboutPeopleCategoriesLoginCreate
0

Model Interpretability on modFDTGPT2-XL, a partially-aligned model

Technical AI safety
Miguel avatar

Miguelito De Guzman

Not fundedGrant
$0raised

Project summary

In this proposal, I aim to deeply understand modFDTGPT2-XL model, a variant of the GPT2-XL model fine-tuned using a corrigibility dataset. I seek to understand how Archetypal Transfer Learning (ATL), a fine-tuning process allows the model to adhere to specific directives, like initiating a shutdown protocol, while also generalizing to diverse tasks. To achieve this, a side-by-side comparison of the token activations in the modFDTGPT2-XL and standard GPT2-XL models. Simultaneously, I will employ interpretability techniques to shed light on the internal decision-making processes of these models.

Potentially, performing a sequential analysis to observe the behavior of the model at various stages of its training can provide insights into the evolution of the model from a generalized state to specific rule-following behaviors. Also, to ensure a comprehensive understanding of the impact of the corrigibility dataset, an in-depth analysis of the dataset will be conducted prior to the model analysis. By applying these methods, I aim to provide the mechanics of and how ATL is able to effectively transfer aligned values to GPT2-xl and ultimately contributing to the theoretical aspects of understanding the alignment problem.

Project goals

This project has the following objectives:

  1. Develop a comprehensive theoretical framework encapsulating the intricacies of integrating sophisticated control mechanisms into AI systems. This framework should provide clear guidelines for instilling complex behaviours and ethical norms in artificial agents.

  2. Develop techniques and tools for model interpretability, thus improving our understanding of AI systems. I aim to create tools designed specifically for dissecting and understanding the internal mechanisms of AI models.

High level approach to achieve the project goals

  • Phase 1 / Months 1 & 2 - Establish evidence of neural connections by interpreting activations in both modFDTGPT2xl and the standard model. Preliminary work have been done already, you can find it here. Extract the theoretical framework behind the low and high activations that connects to high corrigibility. Please Note: this is currently in-progress, see progress spreadsheets analysis 1 & 2.

  • Phase 2 / Month 3 - Iterate from the results of phase 1 by building an improved dataset and use it to build a 80 to 95% shutdown capable model from the Seek supervision from senior researchers and compare the results to modFDTGPT2xl and the standard model.

  • Phase 3 / Months 4, 5 & 6 - Establish neural connections by interpreting activations of version 2 of modFDGPT2xl. Results from comparisons to version 1 and standard model should be established. This phase aims to either reinforce the theoretical framework developed in Phase 1 or establish a better one.

Monthly progress write-ups will be provided as a minimum requirement. I will share any significant findings, including interpretability tools, datasets, and models, through platforms like LessWrong and EA forum.

What is your track record on similar projects?

I have posted numerous works on alignment, many of which are new to the field. Here are some of them:

  1. Exploring Functional Decision Theory (FDT) and a modified version: This work delves into MIRI's proposal on a new theory for agent decision making and discusses its relevance for alignment.

  2. A Multidisciplinary Approach to Alignment (MATA) and Archetypal Transfer Learning (ATL): These deep dives into my understanding of alignment outline the agenda I am currently working on.

  3. Less activations can result in high corrigibility?: This work provides a preliminary evidences of modFDTGPT2xl's ability to execute a complex instruction and generate shutdown activations by mentioning the phrase "activate oath."

Some tools I've created for downloading neural activations and creating token vocab libraries are linked.

I have all my code here in my github account. I'm also uploading the alignment language/neural models here.

Before working on alignment research, I was a certified public accountant for 15 years, blogger and website developer (front end and backend), martial artist (Don Jitsu Ryu, Blue Belt) and 3-time marathon finisher and able to do a sub-4.

How will this funding be used?

All of the funds will cover for living expenses and other costs - eg. planning to get matlab for visualization and MS office tools for data processing. A possibility of hiring a junior researcher in India is in the horizon too, a CS student has already expressed interest in the project.

How could this project be actively harmful?

I would be willing to explain this part if needed be to any grantor/funder/regrantor. I deleted what I wrote here because of safety concerns.

What other funding is this person or project getting?

I'm self funding this project temporarily.

Comments3Similar8
SandyFraser avatar

Sandy Fraser

Concept-anchored representation engineering for alignment

New techniques to impose minimal structure on LLM internals for monitoring, intervention, and unlearning.

Technical AI safetyGlobal catastrophic risks
3
1
$0 / $72.3K
LucyFarnik avatar

Lucy Farnik

Discovering latent goals (mechanistic interpretability PhD salary)

6-month salary for interpretability research focusing on probing for goals and "agency" inside large language models

Technical AI safety
7
4
$1.59K raised
jesse_hoogland avatar

Jesse Hoogland

Scoping Developmental Interpretability

6-month funding for a team of researchers to assess a novel AI alignment research agenda that studies how structure forms in neural networks

Technical AI safety
13
11
$145K raised
Kunvar avatar

Kunvar Thaman

Exploring feature interactions in transformer LLMs through sparse autoencoders

Technical AI safety
9
4
$8.5K raised
canrager avatar

Can Rager

Understanding Memory Management in Transformers

We build a scalable "Automated Circuit Discovery" method and investigate "Cleanup Behavior" to advance the interpretability of transformer models.

Technical AI safety
2
2
$0 raised
mfatt avatar

Matthew Farr

MoSSAIC

Probing possible limitations and assumptions of interpretability | Articulating evasive risk phenomena arising from adaptive and self modifying AI

Science & technologyTechnical AI safetyAI governanceGlobal catastrophic risks
1
0
$0 raised
LawrenceC avatar

Lawrence Chan

Exploring novel research directions in prosaic AI alignment

3 month

Technical AI safety
5
9
$30K raised
JacquesThibodeau avatar

Jacques Thibodeau

Jacques Thibodeau - Independent AI Safety Research

3 month salary for AI safety work on deconfusion and technical alignment.

Technical AI safety
6
2
$0 raised