Detecting and mitigating AI fabrication from its output geometry

Project summary

A model doesn't drift into a fabrication. It commits to one in a single token, then writes the rest of the answer as if that token were true. After that it's just faithfully continuing from its own invention, calm and fluent, and it looks completely normal. That's why checking the finished answer barely works.

But the commitment leaves a mark, right there in the model's output. When a model knows the answer, its runner-up guesses cluster tight, a few close paraphrases of the truth. When it's fabricating, that mass scatters, because any invented detail would do. The tell isn't in the word the model picks. It's in the disorganization of the alternatives behind it, and that disorganization has a shape you can read off the output alone, in one forward pass, without ever opening up the model. It's also why reading only the word the model chose misses it: the signal is in everything it almost chose instead.
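As a rough illustration of what "clustered" versus "scattered" alternatives can mean numerically, here is a minimal sketch. It takes the top-k token probabilities at one position, sets aside the chosen token, and measures how evenly the remaining mass is spread. The normalized-entropy score and the name `runner_up_dispersion` are stand-ins chosen for this example, not the measurement used in this project, and both probability lists are made up.

```python
import math

def runner_up_dispersion(top_probs):
    """Normalized entropy (0 to 1) of the runner-up mass at one token position.

    top_probs: top-k probabilities for that position, in any order.
    The largest entry is treated as the chosen token and set aside.
    Illustrative stand-in only, not the project's actual measurement.
    """
    ranked = sorted(top_probs, reverse=True)
    runners = ranked[1:]
    total = sum(runners)
    if len(runners) < 2 or total <= 0.0:
        return 0.0
    # Renormalize the runner-up mass so it sums to 1, skipping zeros.
    shares = [p / total for p in runners if p > 0.0]
    entropy = -sum(s * math.log(s) for s in shares)
    # Divide by the maximum possible entropy for this many runners.
    return entropy / math.log(len(runners))

# Made-up examples: runner-up mass concentrated on two alternatives,
# versus the same mass spread evenly over nineteen.
clustered = [0.60, 0.25, 0.10, 0.02, 0.01, 0.01, 0.01]
scattered = [0.60] + [0.40 / 19] * 19

print(round(runner_up_dispersion(clustered), 3))
print(round(runner_up_dispersion(scattered), 3))
```

Both examples give the chosen token the same probability, 0.60, so a confidence-only check would treat them alike. Only the spread of the alternatives differs.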

I read that shape live, while the answer is still being written, early enough that a system can stop and retrieve, or hand the answer to a check, before the fabricated claim reaches anyone.

Confident fabrications are the dangerous ones, the ones a model states with full confidence and no hedge, and confidence alone can't catch them. On the one model tested so far, my full read separates those fabrications from the truth at about 0.95 AUC (0.5 is a coin flip, 1.0 is perfect), where the standard methods sit near the coin flip, at 0.50 to 0.60. That gap held up under cross-validation and a permutation test.
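For readers less familiar with the two checks named above, here is a minimal sketch of how an AUC and a label-permutation test can be computed from per-item scores. It is not the project's evaluation code: the function names are hypothetical and the toy scores and labels are invented for the example.

```python
import random

def auc(scores, labels):
    """Probability that a random fabricated item outscores a random true one.

    labels: 1 for fabricated, 0 for true. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one item of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def permutation_p_value(scores, labels, n_perm=2000, seed=0):
    """Share of label shuffles whose AUC is at least the observed AUC."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real link between score and label
        if auc(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy data, invented for this example only.
toy_scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.35, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]

print(auc(toy_scores, toy_labels))
print(permutation_p_value(toy_scores, toy_labels))
```

The permutation test asks how often a score with no real relationship to the labels would do as well by luck. A small p-value means the observed separation is unlikely to be an accident of the sample.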

This grant is to build that intervention for real, measure how much fabrication it actually stops before it reaches a user, and publish what I find.

What are this project's goals? How will you achieve them?

The first is the one that matters most, and the one I haven't built yet: turn the signal into a working intervention. Right now I can catch the commitment token. What I want to build is the loop around it: fire at that token, pause, retrieve or verify, then continue or abstain, and measure how much it cuts the fabrications that reach a user. I have the detector. I don't have the closed loop or that end-to-end number yet. That's what this grant pays for.
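The shape of that loop can be sketched in a few lines. This is a minimal sketch of the control flow only, under stated assumptions: `guarded_generate`, the `verify` callback, the threshold, and the toy token stream are all hypothetical, and the real loop would sit inside the generation step rather than read a prepared list.

```python
from typing import Callable, Iterable, List, Tuple

def guarded_generate(
    stream: Iterable[Tuple[str, float]],
    threshold: float,
    verify: Callable[[str, str], bool],
) -> Tuple[str, List[int]]:
    """Release tokens until a flagged one fails verification, then abstain.

    stream: (token, signal_score) pairs in generation order.
    verify: hypothetical check (retrieval or a second model) that receives
            the text so far and the flagged token, and returns True to keep it.
    Returns the text released to the user and the positions that fired.
    """
    released: List[str] = []
    fired: List[int] = []
    for position, (token, score) in enumerate(stream):
        if score >= threshold:
            fired.append(position)  # the trigger fires at this token
            if not verify("".join(released), token):
                # Abstain: the flagged token is never released.
                released.append(" [unverified, stopping here]")
                break
        released.append(token)
    return "".join(released), fired

# Toy stream with invented scores; the year is the flagged token.
toy_stream = [("The", 0.1), (" paper", 0.2), (" was", 0.1),
              (" published", 0.2), (" in", 0.15), (" 1987", 0.9), (".", 0.1)]

def always_reject(context: str, token: str) -> bool:
    return False

def always_accept(context: str, token: str) -> bool:
    return True

print(guarded_generate(toy_stream, 0.8, always_reject))
print(guarded_generate(toy_stream, 0.8, always_accept))
```

Because the stream is consumed lazily, stopping at the flagged token also stops generation, which is the "pause" step. What the sketch cannot show is the number this grant is for: how often the trigger fires on true statements, and what that costs.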

Second, harden the evidence. Two different things are already solid here, and they're worth separating. The measurement itself works across 18 model architectures: the geometry comes out the same from one model to the next, and it runs on nothing more than the top-20 token probabilities a closed API will hand you. Separately, on the hardest case, a model as confident inventing as recalling, it tells fabrication from truth at about 0.95 AUC, where confidence and the standard scores sit near chance. The honest gap is that the 0.95 is on one model so far, and the signal's strength varies from one architecture to the next. So this grant extends that confident-fabrication result across more models and generators at full precision, and maps where the signal is strong and where it goes faint. That's compute and careful runs, not new invention.

Third, write it up and get it in front of people who can use it and try to break it. A clean public writeup of the method and the honest results, and a public demo people can try. I'd rather the field check this than take my word for it.

How will this funding be used?

Mostly my time, three or four months of it, to build the intervention loop and run the experiments.

The rest is hardware. Everything right now runs on one 12 GB GPU, and the larger models I need to test don't fit, so they spill into system RAM and crawl. To run those models at full precision, and to keep the whole pipeline on my own machine instead of someone's cloud, I need a workstation with more VRAM, more RAM, and storage for the model checkpoints and the per-token capture data. It's a one-time cost in the low thousands, and across a few months of heavy runs it comes out cheaper than renting equivalent GPUs by the hour. A few models are too big even for that, so a small amount stays set aside for cloud time on those.

Who is on your team? What's your track record on similar projects?

Just me. I worked in non-destructive testing and now work in IT operations, and the ML is self-taught, built on a single RTX 3060 with the bigger models spilling into system RAM. The track record is the work itself: a fabrication signal tested across 18 models. In a head-to-head on labeled probes it beat semantic entropy (Farquhar et al., 2024) by about 0.22 to 0.25 AUC at a tenth of the compute, and it didn't fail on the confident fabrications that method misses, where semantic entropy actually drops below chance. On the confident-fabrication test the full read separates fabrication from truth at about 0.95 AUC, where confidence and sampling-based methods sit near chance. There's a working public demo, and a patent pending on the trigger. I named the core measurement the Doodle Residual, after my twin brother; he's the reason I do any of this.

What are the most likely causes and outcomes if this project fails?

The detection isn't the risk. On confident fabrication the full read already reaches about 0.95 AUC, where every standard method sits near chance, and it held up under a permutation test. The honest risks are two. One: the sparse case, a single invented fact buried in an otherwise correct, grounded answer. No cheap one-pass method does well there yet, mine included at about 0.60. Two, and this is the real one for the grant: I have the trigger, but I haven't built and measured the closed loop, so I can't yet tell you how much automatically acting on it cuts what reaches a user once you count the false alarms. If that end-to-end number comes out modest, it's still a useful, honest result, and I'd publish the ceiling. What won't happen is the detection failing. On the case that matters, it already works. This money pays to find out how much of the fabrication you can actually stop.

How much money have you raised in the last 12 months, and from where?

None. Self-funded so far: my own time and computer.