A deterministic verifier for AI outputs (an alternative to LLM-as-judge)

Project summary

The default way we check AI output is to ask another model whether it looks right. That second model has the same blind spots as the first. I've built a working prototype of the alternative: a small trusted checker that verifies a claim by executing it, and returns a signed result anyone can re-run. Refuted means an executed counterexample. Confirmed comes tagged with exactly how much assurance it earned. Abstain is a real option, never a fake pass.
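To make the three outcomes concrete, here is a minimal sketch of what such a result could look like. It is not the prototype's code. Every name in it (`Verdict`, `Result`, `check`, `signature_ok`) is hypothetical, the claims are toy integer predicates, and the HMAC with a shared demo key is only a stand-in for a real public-key signature.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, Optional


class Verdict(Enum):
    REFUTED = "refuted"      # an executed counterexample exists
    CONFIRMED = "confirmed"  # no counterexample, with a stated assurance level
    ABSTAIN = "abstain"      # the check could not be carried out


@dataclass(frozen=True)
class Result:
    verdict: Verdict
    assurance: str          # how much the verdict actually earned
    witness: Optional[int]  # the counterexample input, if any
    signature: str          # hex digest over the fields above


def _payload(verdict: Verdict, assurance: str, witness: Optional[int]) -> bytes:
    body = {"verdict": verdict.value, "assurance": assurance, "witness": witness}
    return json.dumps(body, sort_keys=True).encode()


def _make(verdict: Verdict, assurance: str, witness: Optional[int], key: bytes) -> Result:
    # Assumption: HMAC-SHA256 stands in for a real signature scheme.
    sig = hmac.new(key, _payload(verdict, assurance, witness), hashlib.sha256).hexdigest()
    return Result(verdict, assurance, witness, sig)


def signature_ok(result: Result, key: bytes) -> bool:
    expected = hmac.new(
        key, _payload(result.verdict, result.assurance, result.witness), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, result.signature)


def check(claim: Callable[[int], bool], inputs: Iterable[int],
          exhaustive: bool, key: bytes) -> Result:
    """Execute `claim` on every input and report what was actually established."""
    tested = 0
    for x in inputs:
        tested += 1
        try:
            holds = claim(x)
        except Exception:
            # The claim could not be executed: abstain rather than pass.
            return _make(Verdict.ABSTAIN, "claim raised during execution", None, key)
        if not holds:
            return _make(Verdict.REFUTED, "executed counterexample", x, key)
    if tested == 0:
        return _make(Verdict.ABSTAIN, "no inputs to execute", None, key)
    if exhaustive:
        assurance = f"exhaustive over the {tested} supplied inputs"
    else:
        assurance = f"sampled {tested} inputs, no counterexample found"
    return _make(Verdict.CONFIRMED, assurance, None, key)


KEY = b"demo-key"  # hypothetical key, for illustration only
confirmed = check(lambda x: x * x >= x, range(-5, 6), True, KEY)
refuted = check(lambda x: 2 * x > x, range(-2, 3), True, KEY)
abstained = check(lambda x: 1 / x > 0, [0], True, KEY)
```

In this sketch a refutation carries the input that broke the claim, so anyone can re-run the claim on `refuted.witness` and see it fail, and a confirmation says only what was executed, never more.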

What are this project's goals? How will you achieve them?

On a labelled set of smart-contract vaults, ten independent LLM auditors flagged a safe vault as vulnerable ten times out of ten, and missed a real bug 40% of the time. My checker ran on the same set with zero false positives and zero false negatives, and it produces a witness you can re-run yourself. Wherever an output has checkable structure (code, math, structured data, rules, contracts), executing the check beats grading it with a bigger model.

How will this funding be used?

Open-source the kernel, publish a benchmark across several domains, and work on the open problem that gates the whole idea: getting a usable specification cheaply for outputs that don't come with one. If it doesn't pan out, I'll publish that too.

Who is on your team? What's your track record on similar projects?

Single-author prototype, not yet shipped. It has been audited adversarially twice; the audits found only cosmetic overclaims and no soundness holes. The hard part is unsolved and I'm saying so up front. I pre-register what I expect and retract when it fails: I dropped this program's earlier framing after five pre-registered tests came back negative.

Independent researcher based in Italy, working on AI you can verify rather than just trust. 2026 papers on representation limits and substrate scaling (on Zenodo). I also build AionNexus (industrial predictive maintenance) and NormaAI (EU compliance tooling). github.com/Dan23RR, daniel.culotta@gmail.com

What are the most likely causes and outcomes if this project fails?

Most likely cause: someone else gets there first.