
Wise AI: Fine-Tune an Open Source Model


Oliver Klingefjord

Completed grant: $5,550 raised

This grant is for the Meaning Alignment Institute. A detailed proposal and project plan can be found here.

Project summary

We believe we need AI that’s not just intelligent, but morally astute. We are working towards models that can do superhuman moral reasoning, where such reasoning can be checked or evaluated by humans, or by lesser models through scalable supervision. Together with OpenAI, we've taken a step towards this: we used theories of moral learning to gather data about convergent values from humans.

This grant covers our next step: generating synthetic data according to these theories, fine-tuning a model on it, and qualitatively evaluating the model with crowd workers.
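
As a rough illustration of what the fine-tuning step could look like, here is a minimal sketch assuming a standard supervised causal-LM setup with Hugging Face transformers. The base model name, the data file, and the prompt/response record format below are placeholders for illustration, not the project's actual choices.

```python
# A minimal sketch of the fine-tuning step, assuming a standard supervised
# causal-LM setup with Hugging Face transformers. The base model name, the
# data file, and the prompt/response record format are illustrative
# placeholders, not the project's actual choices.
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder open-source base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# Hypothetical synthetic examples: each pairs a morally loaded prompt with a
# response written to reflect the values elicited for that context.
with open("synthetic_wisdom_data.jsonl") as f:
    examples = [json.loads(line) for line in f]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

for example in examples:
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    batch = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")
    # Standard causal-LM objective: the labels are the input ids themselves.
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["input_ids"],
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("wise-model-checkpoint")
tokenizer.save_pretrained("wise-model-checkpoint")
```

In practice one would batch examples, use a learning-rate schedule, and likely apply parameter-efficient fine-tuning (e.g., LoRA), but the structure of the step is the same.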

This 4-6 month project will result in an open-sourced wise model, a wisdom alignment dataset, and an academic paper. We hope this can spark a race in the alignment community towards wise AI.

What are this project's goals and how will you achieve them?

Traditionally, AI alignment has been defined as alignment to “operator intent”. With more powerful models deployed in social contexts, this definition is becoming unworkable:

“Aligned with operator intent” means aligned with our current societal incentive structure, which will cause problems. Replacing humans in key decision-making pipelines with AI systems aligned only to operator intent is analogous to introducing intelligent, obedient “sociopaths” with no regard for the values and social norms that currently prevent misaligned incentives from destroying us. There are many examples of humans disobeying orders (e.g., orders to launch nuclear missiles, or to execute profitable but unethical business moves) that illustrate this point.

Therefore, alignment includes the broader question of “what to align towards”. This broader notion has been defined as alignment to both operator intent and human values.

So far, the work done to define human values has been vague, equating human values with moral judgements or revealed preferences, and disregarding the contextuality of values (a constitution cannot cover the many situations an LLM will find itself in; this is why our legal system relies on case law and precedent, not just constitutions).

We’re writing a paper in which we argue a good alignment target for human values should be the following:

  • Robust to manipulation.

  • Fine-grained with regard to contexts.

  • Generalizable to new situations.

  • Auditable & interpretable for humans.

  • Scalable, such that more elicited data yields a better model.

  • Legitimate, such that participants and users of the resulting model agree it is operating on a fair selection of human values.

The goal of this project is to pave the way for a values alignment approach – informed by a theory of moral learning fleshed out by philosophers like Charles Taylor, Ruth Chang, and others – that we believe will meet these criteria.

Based on RAG experiments, we expect interacting with a model trained on a moral graph to be more like interacting with an agent that has a sense of the moral situation it is in – instead of one that provides static bullet-point lists, or refuses requests that fail to meet the overly broad HHH criteria.
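
To make the moral graph idea concrete, here is a minimal sketch of one possible representation, where nodes are values articulated for a context and edges record that participants judged one value wiser than another in that context. The field names and the retrieval helper are illustrative assumptions, not our actual schema.

```python
# A minimal sketch of a moral graph representation. The field names and the
# retrieval helper below are illustrative assumptions, not the project's
# actual schema.
from dataclasses import dataclass, field


@dataclass
class ValueNode:
    id: str
    context: str                   # e.g. "user is facing a difficult choice"
    attention_policies: list[str]  # what someone living by this value attends to


@dataclass
class WisdomEdge:
    from_value: str  # the value participants judged less comprehensive
    to_value: str    # the value participants judged wiser in this context
    context: str


@dataclass
class MoralGraph:
    nodes: dict[str, ValueNode] = field(default_factory=dict)
    edges: list[WisdomEdge] = field(default_factory=list)

    def wisest_values(self, context: str) -> list[ValueNode]:
        """Return the values for a context that no edge marks as superseded."""
        superseded = {e.from_value for e in self.edges if e.context == context}
        return [
            n for n in self.nodes.values()
            if n.context == context and n.id not in superseded
        ]
```

In the RAG experiments, something like wisest_values(context) would be retrieved into the prompt at inference time; the fine-tuned model is instead meant to internalize these context-to-value mappings.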

How will this funding be used?

This project will take roughly 4-6 months and result in an open-sourced wisdom alignment dataset, a fine-tuned model, and an academic paper.

The budget will be used for fine-tuning compute, inference compute, crowd workers (for evaluation), and salaries for 1 full-time AI researcher, 1 full-time project lead, and 1 part-time AI engineer.

Who is on your team and what's your track record on similar projects?

Joe Edelman (MIT, Dartmouth, co-founder of CHT), Oliver Klingefjord (AI Objectives), and Ivan Vendrov (Google Scholar, Anthropic, advisor) did the prior work with OpenAI that led to this grant.

Ryan Lowe (Google Scholar; OpenAI, InstructGPT) will be advising.

What are the most likely causes and outcomes if this project fails?

It could be that a moral graph needs to be bigger than we anticipate in order to meaningfully improve upon existing approaches. If so, we will most likely still be able to validate our hypothesis in a more narrowly defined context, but this might be less convincing to alignment researchers.

What other funding are you or your project getting?

The final $25k is already committed from another source.

Similar projects

  • Oliver Klingefjord, AI-Driven Market Alternatives for a post-AGI world: develop an LLM-based coordinator and test against consumer spending with 200 people (Science & technology, Technical AI safety, AI governance, Global health & development). $15.7K raised.

  • Tianyi (Alex) Qiu, Moral Progress in AI to Prevent Premature Value Lock-in: early exploration, agenda-setting, technical infrastructure, and early community building (Science & technology, Technical AI safety, Long-Term Future Fund, Global catastrophic risks). $0 raised.

  • Lisa Thiergart, Activation vector steering with BCI (Technical AI safety). $30.3K raised.

  • Scott Viteri, Attention-Guided-RL for Human-Like LMs: compute funding (Technical AI safety). $3.1K raised.

  • Peter Vamplew, Mitigating Reward Misspecification in Reinforcement Learning Using Multiple Independent Reward Specifications and Multi-objective Reinforcement Learning. $0 raised.

  • Jesse Hoogland, Scoping Developmental Interpretability: 6-month funding for a team of researchers to assess a novel AI alignment research agenda that studies how structure forms in neural networks (Technical AI safety). $145K raised.

  • Kabir Kumar, AI-Plans.com: alignment research platform. $0 raised.