Manifund

Funding requirements

Sign grant agreement
Reach min funding
Get Manifund approval

Hallucination Detector

🐮

Oscar Balcells Obeso

Proposal · Grant
Closes June 8th, 2025
$6,000 raised
$6,000 minimum funding
$6,000 funding goal
Fully funded and not currently accepting donations.

Project summary

Building on our ICLR 2025 oral paper "Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models," developed during MATS, we propose to advance the detection and mitigation of factual hallucinations in open-ended text generation. Our previous work identified internal representations in language models that encode knowledge awareness about entities (people’s names, movies, songs…), and demonstrated their causal relationship to hallucinatory behaviors. We now aim to develop robust, generalizable methods for detecting factual inaccuracies in open-ended generations.
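As a rough illustration of the underlying idea (a minimal sketch, not the authors' actual pipeline): one way to exploit these internal representations is to train a linear probe on a model's hidden states to separate entities the model plausibly knows from made-up ones, and then use low "known entity" scores at generation time as a hallucination signal. The model name, probe layer, and labeled prompts below are all placeholder assumptions.

```python
# Minimal sketch: a linear probe on hidden states for "entity knowledge awareness".
# Assumes a HuggingFace causal LM; model, layer, and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"   # placeholder; the paper studies larger open models
LAYER = 6             # hypothetical probe layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# Hypothetical labels: 1 = entity the model likely knows, 0 = invented entity.
examples = [
    ("Fact about the movie 'The Godfather':", 1),
    ("Fact about the movie 'Zorblax Prime':", 0),
    # ... more prompts pairing real and made-up entities ...
]

X = torch.stack([last_token_hidden(p) for p, _ in examples]).numpy()
y = [label for _, label in examples]

# A linear probe on the residual stream; at generation time, a low
# "known entity" probability can flag a likely factual hallucination.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict_proba(X))
```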

How will this funding be used?

The funding will primarily be used to pay for compute (API credits & GPU time). A portion ($1.2K) will also reimburse previous research costs that we've personally covered to maintain research continuity.

Who is on your team? What's your track record on similar projects?

Oscar Balcells Obeso is a Master's student at ETH Zurich and a MATS 2025 scholar advised by Neel Nanda. His previous research includes "Refusal in Language Models Is Mediated by a Single Direction" (NeurIPS 2024) and "Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models" (ICLR 2025 oral).

Javier is a Research Engineer at the Barcelona Supercomputing Center and a fellow MATS 2025 scholar advised by Neel Nanda. He is the author of "Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models" and has more than 10 first-author publications overall in leading NLP conferences such as ICLR, ACL, and EMNLP (https://scholar.google.com/citations?user=ZNsw8ZUAAAAJ&hl=es). His industry experience includes research internships at Meta (FAIR), Apple's Machine Translation team, and Amazon's Books Science division. Javier holds a BSc in Electronic Engineering and an MSc in Data Science.

How much money have you raised in the last 12 months, and from where?

We received a 4-month grant from the Long-Term Future Fund (LTFF) in September 2024 that funded our participation in the MATS extension program.

Comments (1) · Offers (1)
Offering $6,000

Neel Nanda

1 day ago

I've supervised Javier & Oscar for a paper before, and they've done great work. I think this project seems promising: to my knowledge there's been little work on long-form hallucination detection, I expect there's low-hanging fruit here, and the results seem decent so far. This is a fairly small grant, and it will help unblock the project, so it feels like a no-brainer to me.

I think that detecting hallucinations in long-form content is a good downstream task to improve monitoring methods on (and find insights that translate to higher stakes tasks like control or deception), in addition to being near-term useful for making models safer. Improving factuality is likely to also help with making systems more profitable, which is slightly accelerationist, but I am generally pro work that legibly demonstrates how safety is incentivised by profit-seeking.

CoI Note: I am a (very hands-off) supervisor for this project and will likely be a co-author on the resulting paper. But my quality bar for spending time on a project is notably higher than my bar for regrants, so I don't feel too concerned.