Building on our ICLR 2025 oral paper "Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models," developed during MATS, we propose to advance the detection and mitigation of factual hallucinations in open-ended text generation. Our previous work identified internal representations in language models that encode knowledge awareness about entities (e.g., people's names, movies, or songs) and demonstrated their causal role in hallucinatory behavior. We now aim to develop robust, generalizable methods for detecting factual inaccuracies in open-ended generations.
The funding will primarily be used for compute (API credits and GPU time). A portion ($1.2K) will also reimburse research expenses we have personally covered to maintain research continuity.
Oscar Balcells Obeso is a Master's student at ETH Zurich and a MATS 2025 scholar advised by Neel Nanda. His previous research includes "Refusal is Mediated by a Single Direction" (NeurIPS 2024) and "Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models" (ICLR 2025, oral).
Javier is a Research Engineer at the Barcelona Supercomputing Center and a fellow MATS 2025 scholar advised by Neel Nanda. He is a co-author of "Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models" and has more than 10 first-author publications in leading ML and NLP conferences such as ICLR, ACL, and EMNLP (https://scholar.google.com/citations?user=ZNsw8ZUAAAAJ&hl=es). His industry experience includes research internships at Meta (FAIR), Apple's Machine Translation team, and Amazon's Books Science division. Javier holds a BSc in Electronic Engineering and an MSc in Data Science.
We received a 4-month grant from the Long-Term Future Fund (LTFF) in September 2024 that funded our participation in the MATS extension program.