Project summary
I'm building an open-source, real-time hallucination suppression system for locally-run LLMs. The system monitors token-level entropy of the model's output distribution during generation and dynamically adjusts sampling parameters (Temperature, Min-P) to suppress hallucinations before they reach downstream actions like tool calls or code edits.
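The sensor/actuator loop described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the gain constants and clipping ranges are placeholders, and the proportional rule stands in for the full controller described later.

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (nats) of the model's next-token distribution."""
    z = logits - logits.max()            # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def adjust_sampling(entropy: float, setpoint: float,
                    temp: float, min_p: float,
                    k_temp: float = 0.1, k_minp: float = 0.02):
    """Illustrative proportional adjustment: entropy above the setpoint
    cools the distribution and raises the min-p floor; below, it relaxes
    both.  Gains and bounds are placeholder values."""
    error = entropy - setpoint
    temp = float(np.clip(temp - k_temp * error, 0.1, 1.5))
    min_p = float(np.clip(min_p + k_minp * error, 0.0, 0.5))
    return temp, min_p
```

In a generation loop, `token_entropy` runs on each step's logits before sampling, and the adjusted `temp`/`min_p` feed the next sampling call.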
The approach is grounded in established work: entropy-based uncertainty estimation detects confabulations in LLMs (Farquhar et al., 2024, Nature), and token-level entropy correlates with hallucination probability (Huang et al., 2025, ACM TOIS). Entropix (xjdr, 2024) demonstrated that entropy and varentropy are actionable signals during generation, using them to switch between discrete sampling strategies at each token. This project closes the loop: it replaces Entropix's open-loop, rule-based strategy switching with continuous feedback control and a dynamic setpoint. Entropix also never published large-scale benchmark evaluations; this project builds that validation from the ground up.
The controller uses a 4th-order state-space formulation that tracks the entropy error signal, its integral, velocity, and acceleration. The acceleration term is the key contribution. It catches the characteristic upward curvature that precedes a hallucination spike, enabling intervention before it peaks.
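A minimal sketch of such a controller, using simple finite-difference estimates for velocity and acceleration. The gain values are illustrative placeholders, not the project's tuned parameters:

```python
import numpy as np

class EntropyStateController:
    """4th-order state feedback on the entropy error signal:
    state = [error, integral of error, velocity, acceleration].
    The acceleration term reacts to upward curvature in entropy
    before the spike itself peaks."""

    def __init__(self, setpoint: float, gains=(0.5, 0.05, 0.2, 0.4), dt: float = 1.0):
        self.setpoint = setpoint
        self.K = np.asarray(gains, dtype=float)   # [Kp, Ki, Kv, Ka] -- placeholders
        self.dt = dt
        self.state = np.zeros(4)                  # [e, ∫e, ė, ë]
        self._prev_e = 0.0
        self._prev_v = 0.0

    def update(self, entropy: float) -> float:
        e = entropy - self.setpoint
        integral = self.state[1] + e * self.dt
        v = (e - self._prev_e) / self.dt          # first difference: velocity
        a = (v - self._prev_v) / self.dt          # second difference: acceleration
        self._prev_e, self._prev_v = e, v
        self.state = np.array([e, integral, v, a])
        # control output: weighted sum of the four state components,
        # mapped downstream onto sampling parameters (temperature, min-p)
        return float(self.K @ self.state)
```

Because the acceleration term is a second difference of the error, a run of tokens whose entropy is curving upward produces a large control output even before the entropy itself is high.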
What are this project's goals? How will you achieve them?
The current stage of the project is a single-model entropy controller operating on the logits of Qwen 3.5 2B at 4-bit quantization, with preliminary validation against MATH benchmark problems. Early results show a 3.5-point accuracy improvement over an uncontrolled baseline, achieved with untuned gains, which indicates significant room for further improvement.
The GPU upgrade unlocks the next phase: a dual-model architecture in which a smaller reference model (Qwen 3.5 0.8B) runs alongside the 9B target model on the same GPU, providing an adaptive entropy setpoint and enabling additional sensor channels. These include KL divergence between the two models' distributions, speculative decoding acceptance rate as a measure of model agreement, and KV cache health monitoring via SVD of the Value matrix.

A fourth channel draws on quantum-inspired signal design: a rolling window of logit vectors is embedded as a density operator, and its von Neumann entropy is tracked over time, adapting the density-operator framework introduced by Gong, Sedai, and Medda (arXiv:2511.21515, 2025). The resulting early-warning signal detects hallucination onset through structural shifts in the distribution of distributions rather than a spike in instantaneous entropy, catching confident wrong answers that naive entropy misses. The computation is GPU-native: the density operator is constructed directly from logits already resident in VRAM, so its overhead on top of normal inference is essentially zero.

Each channel feeds an independent controller, and the controllers' outputs are combined by weighted sum at the actuator level. The resulting multi-channel system catches failure modes entropy alone cannot: confident wrong answers, slow context degradation, and repetition loops.
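One plausible reading of the density-operator signal, sketched in NumPy. The exact embedding in Gong, Sedai, and Medda may differ; the window size, normalization, and eigenvalue floor here are illustrative assumptions:

```python
import numpy as np

def von_neumann_entropy(logit_window: np.ndarray) -> float:
    """Embed a rolling window of logit vectors (shape W x V) as a
    density operator and return its von Neumann entropy.

    Each row is softmaxed into a probability vector, l2-normalized
    into a unit 'state vector', and the density operator is the
    average of their outer products, so trace(rho) == 1."""
    z = logit_window - logit_window.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    psi = p / np.linalg.norm(p, axis=1, keepdims=True)     # unit state vectors
    rho = (psi.T @ psi) / psi.shape[0]                     # (1/W) sum |psi><psi|
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]                                 # drop numerical noise
    return float(-(lam * np.log(lam)).sum())
```

A window of near-identical distributions yields a rank-1 operator and entropy near zero; a window whose distributions are shifting structurally yields a higher-rank operator and rising von Neumann entropy, even if each instantaneous distribution is individually confident.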
The experimental roadmap follows a simple decision rule: first implement the quantum-inspired von Neumann entropy signal, then move on to the dual-model architecture to enable dynamic entropy targeting.
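A hypothetical sketch of dynamic entropy targeting, where the reference model's entropy on the same context supplies the setpoint. The margin, floor, and the direction of the offset are assumptions for illustration, not decisions the project has made:

```python
import numpy as np

def adaptive_setpoint(ref_logits: np.ndarray, margin: float = 0.2,
                      floor: float = 0.1) -> float:
    """Dynamic entropy target for the main model, derived from the
    reference model's next-token entropy on the same context.
    `margin` and `floor` are illustrative tuning parameters."""
    z = ref_logits - ref_logits.max()
    p = np.exp(z) / np.exp(z).sum()
    p = p[p > 0]
    ref_entropy = float(-(p * np.log(p)).sum())
    # assumption: allow the larger model somewhat less entropy than
    # the small reference model shows on the same context
    return max(ref_entropy - margin, floor)
```

The point of the dynamic setpoint is that some contexts are legitimately high-entropy (open-ended questions) and some are not (arithmetic); a fixed target cannot distinguish them, while a reference model tracking the same context can.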
All code and results will be released as open-source.
How will this funding be used?
$1,000 toward a used NVIDIA RTX 3090 (24GB VRAM), including shipping and tax, as well as any unforeseen expenses (PSU, cables, etc.).
I'm currently running experiments on Qwen 3.5 2B at 4-bit quantization on an RTX 3070 with 8GB VRAM. The 2B model fits comfortably and gives 157 tokens/sec, but it is not the model I want to validate against; the 9B is the target. To fit in 8GB, the 9B must be aggressively quantized, which degrades output quality and leaves no room for the reference model the next phase requires.
A 24GB GPU makes the dual-model architecture feasible, with both models resident at better quantization and headroom for KV cache, SVD computation, and telemetry. The 3090 also provides roughly double the memory bandwidth of the 3070 (936 GB/s vs. 448 GB/s), which directly increases inference throughput. Faster generation means more experiments per night; the bandwidth gain multiplies the pace of the research, not just its ceiling. Hardware only, no stipend. Everything is open-source.
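The memory and bandwidth arithmetic behind these claims, as a back-of-envelope check. The 0.56 bytes-per-parameter figure is an assumed average for 4-bit quantization including format overhead, and the throughput ceiling assumes decoding is memory-bandwidth-bound:

```python
def weight_gb(params_b: float, bytes_per_param: float = 0.56) -> float:
    """Approximate weight footprint in GB for a model of `params_b`
    billion parameters at an assumed 4-bit-plus-overhead density."""
    return params_b * bytes_per_param

main = weight_gb(9.0)    # ~5.0 GB for the 9B target at 4-bit
ref = weight_gb(0.8)     # ~0.4 GB for the 0.8B reference model
print(f"weights: {main + ref:.1f} GB of 24 GB, leaving headroom for KV cache")

# bandwidth-bound decoding reads roughly the full weights per token:
# tokens/sec ceiling ~ memory bandwidth / bytes read per token
for name, bw_gbs in [("RTX 3070", 448), ("RTX 3090", 936)]:
    print(f"{name}: ~{bw_gbs / main:.0f} tok/s ceiling for the 9B")
```

Under these assumptions both models occupy well under a quarter of 24GB, and the bandwidth ratio (936/448 ≈ 2.1x) carries directly into the per-token decoding ceiling.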
Who is on your team? What's your track record on similar projects?
Solo independent researcher, full-time on this project. I hold a Master's in Mathematics from Montana State University, where I was a graduate teaching assistant for differential equations and numerical linear algebra, which are the direct mathematical foundations of this work (state-space models and SVD respectively).
I've been building and operating autonomous AI agents on fully local infrastructure for the past year. Currently: Qwen 3.5 2B on llama.cpp with CUDA, Ubuntu server, working agentic pipeline with tool use and autonomous code generation. The controller design emerged from observing this system's real failure modes and recognizing that classical control theory applies directly to steering LLM generation.
Preliminary validation results have been posted openly on LessWrong. I have no prior formal publications; this would be my first formal research output.
What are the most likely causes and outcomes if this project fails?
Most likely failure: entropy control is a dead end. The signal might spike simultaneously with hallucination rather than before it, leaving no window for intervention. If preliminary validation shows this, I publish the negative result. Empirical data on whether token-level entropy acceleration precedes hallucination is valuable regardless of whether the controller works.
Second failure mode: the controller oscillates or over-corrects, suppressing hallucinations but also suppressing creative output. This is a tuning problem rather than a fundamental flaw, but it's possible no gain settings achieve a good safety/fluency tradeoff.
Third: the single-model controller works but the dual-model extension doesn't add meaningful value. The reference model's entropy might not provide a better setpoint than a fixed or rolling-average target. In that case, the simpler single-model system is still a useful contribution and would be released as is.
In all cases, validation data and analysis get published openly.
How much money have you raised in the last 12 months, and from where?
$0. This is my first funding application. The project has been entirely self-funded to date, including all hardware and infrastructure.