Research on cognitive bias in LLMs + exacerbation by RLHF

Science & technology · Technical AI safety · ACX Grants 2024

Jon Bogard

Not funded · Impact certificate
$0 raised
$14,000 valuation

Longer description of your proposed project

Trained on human data, LLMs may inherit many of the psychological properties of the humans who produced that training data. While this has been tentatively shown in some social and moral domains (e.g., racial bias), we seek to demonstrate, in two phases, that many *cognitive* biases that afflict human judgment (e.g., base-rate neglect) also become problems for LLMs.

In Phase 1, using various language models, we will systematically evaluate the extent to which LLMs pass validated benchmarks of rationality (e.g., Stanovich & West's "rationality quotient"). We will then compare actual performance against predictions made by lay people and a special sample of computer scientists; we expect that people see LLMs as more ideal reasoners than they are and fail to account for the biases they inherit.

In Phase 2, we will test the hypothesis that, contrary to the common view that LLMs become more rational as they advance, Reinforcement Learning from Human Feedback (RLHF) can actually exacerbate cognitive bias in LLMs. We will first give a sample of human raters a battery of questions known to produce bias; bias will be assessed by having participants choose between two possible responses to each question, an intuitive (wrong) answer and a correct answer, mimicking the human-feedback component of the RLHF process. We will then fine-tune a model with these human responses and compare the accuracy of LLM responses pre- versus post-RLHF. We expect that RLHF will push LLMs further from ideal reasoning whenever human judgments systematically deviate from rationality. Altogether, we hope to demonstrate ways in which LLMs may inherit human cognitive biases, and how RLHF may exacerbate them, in the hopes of improving the chances of aligning AI reasoning with human goals.
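
To make the Phase 2 comparison concrete, here is a minimal Python sketch of how a pre- versus post-RLHF bias score could be computed. The example item, the ask_model() interface, and the model handles are hypothetical placeholders for illustration; they are not specified in the proposal.

```python
# Hypothetical sketch: score a model's answers to a bias battery before and
# after RLHF-style fine-tuning. Item wording, ask_model(), and the model
# handles below are illustrative placeholders, not part of the proposal.

# Each item pairs a correct answer with the intuitive (biased) distractor.
BIAS_ITEMS = [
    {
        "question": (
            "1% of people have disease X. A test for X is 90% accurate. "
            "Someone tests positive. Is it more likely that they (A) have "
            "the disease or (B) do not have the disease?"
        ),
        "correct": "B",     # base-rate neglect item: the low prior dominates (~8% posterior)
        "intuitive": "A",
    },
    # ... further items drawn from a validated battery (e.g., rationality-quotient style)
]


def ask_model(model, question: str) -> str:
    """Placeholder: pose `question` as a forced A/B choice and return 'A' or 'B'."""
    raise NotImplementedError


def bias_rate(model, items) -> float:
    """Fraction of items on which the model picks the intuitive (wrong) answer."""
    picks = [ask_model(model, item["question"]) for item in items]
    biased = sum(pick == item["intuitive"] for pick, item in zip(picks, items))
    return biased / len(items)


# Intended comparison (model handles are placeholders):
#   rate_before = bias_rate(base_model, BIAS_ITEMS)
#   rate_after = bias_rate(rlhf_tuned_model, BIAS_ITEMS)
# The Phase 2 hypothesis predicts rate_after > rate_before on items where
# human raters themselves tend to prefer the intuitive answer.
```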

Describe why you think you're qualified to work on this

My PhD is in judgment & decision making, and this project is joint with two other faculty members with similar expertise: Lucius Caviola (GPI/Oxford) and Josh Lewis (NYU). The engineer we would hire with this grant money has unmatched experience doing extremely similar work and comes recommended by top engineers in a colleague's CS lab.

Other ways I can learn about you

https://www.jonbogard.com/

How much money do you need?

$14,000

Links to any supporting documents or information

This was the quote from an engineer for his time plus the compute purchase.

Estimate your probability of succeeding if you get the amount of money you asked for

>90% chance of learning something publishable in a general science/psychology journal and useful for the community; >65% chance of publication in a top 3 journal.

Similar projects

Scott Viteri: Attention-Guided-RL for Human-Like LMs (Compute Funding). Technical AI safety. $3.1K raised.
Bart Bussmann: Epistemology in Large Language Models. 1-year salary for independent research to investigate how LLMs know what they know. Technical AI safety. $0 raised.
Laurence Aitchison: Bayesian modelling of LLM capabilities from evals. $32K raised.
James Lucassen: LLM Approximation to Pass@K. Technical AI safety. $2K of $6K raised.
Lucy Farnik: Discovering latent goals (mechanistic interpretability PhD salary). 6-month salary for interpretability research focusing on probing for goals and "agency" inside large language models. Technical AI safety. $1.59K raised.
Lisa Thiergart: Activation vector steering with BCI. Technical AI safety. $30.3K raised.