Interpretability by construction: no disentangled reward-hacking surface

Project summary

BMC is a deterministic, fully inspectable simulation of memetic-affective agents with no engineered scalar reward function. The agent's objective is an inspectable homeostatic set-point (restore internal affective balance): the gradient is zero at the set-point and reverses sign past it. Because no engineered channel exists whose maximization yields an ever-better outcome, the canonical wireheading target is structurally absent rather than penalized post hoc (an emergent gauge-level route is disclosed as a caveat below, not a contradiction of this). This project delivers a technical note formalizing why the target is absent, plus two pre-registered stress tests of the two open failure modes I already disclose as honest limits.
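The set-point property described above can be illustrated with a minimal sketch. This is not engine code: the quadratic form of the drive and the names `drive`, `drive_gradient`, `relax`, and `linear_reward_gradient` are all assumptions chosen for illustration. It only shows the general shape of the claim, namely that a set-point objective has a gradient that vanishes at balance and flips sign on the far side, whereas a monotone scalar reward always pushes in the same direction.

```rust
/// Hypothetical homeostatic drive: half the squared distance from the set-point.
/// The quadratic form is an assumption for illustration, not the engine's definition.
fn drive(state: f64, set_point: f64) -> f64 {
    let error = state - set_point;
    0.5 * error * error
}

/// Gradient of the drive with respect to `state`.
/// Zero at the set-point, negative below it, positive above it.
fn drive_gradient(state: f64, set_point: f64) -> f64 {
    state - set_point
}

/// One relaxation step toward balance: move against the gradient.
/// `rate` is an illustrative step size.
fn relax(state: f64, set_point: f64, rate: f64) -> f64 {
    state - rate * drive_gradient(state, set_point)
}

/// Contrast case: a reward that is linear in its signal has a constant
/// positive gradient, so "more" is always better and ascent never stops.
fn linear_reward_gradient(_signal: f64) -> f64 {
    1.0
}
```

In this toy form, overshooting the set-point raises the drive again, so there is no direction in which pushing the state further keeps improving the objective. The linear reward, by contrast, has the same gradient at every signal level.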

What are this project's goals? How will you achieve them?

Goals: (1) write a self-contained technical note formalizing the three-ingredient reward-hacking analysis — an external proxy maximand, unbounded optimization pressure, and a proxy detachable from its referent — and show which of the three this architecture structurally lacks; (2) run two pre-registered stress tests targeting the two open failure modes disclosed as honest limits below: gauge-level interoceptive wireheading, and Cultural-Memory parasite-meme contamination; (3) scope (not yet execute) the next step that would convert this from an architectural claim into a behavioral one — a decoupled signal channel with closed discriminative feedback, letting a capable agent actually attempt to game a proxy and fail or succeed.
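The three-ingredient analysis in goal (1) can be read as a simple checklist. The sketch below is a hypothetical encoding, not the technical note's formalism: the type `HackingSurface`, its field names, and the assumption that the canonical target exists only when all three ingredients are present together are illustrative choices. It deliberately makes no claim about which ingredient any particular architecture lacks.

```rust
/// Hypothetical checklist for the three ingredients named in goal (1).
#[derive(Debug, Clone, Copy)]
struct HackingSurface {
    external_proxy_maximand: bool,
    unbounded_optimization_pressure: bool,
    proxy_detachable_from_referent: bool,
}

impl HackingSurface {
    /// Assumption: the canonical reward-hacking target requires all three
    /// ingredients at once, so removing any one removes the target.
    fn is_canonical_target(&self) -> bool {
        self.external_proxy_maximand
            && self.unbounded_optimization_pressure
            && self.proxy_detachable_from_referent
    }

    /// Names of the ingredients that are structurally absent.
    fn missing_ingredients(&self) -> Vec<&'static str> {
        let mut missing = Vec::new();
        if !self.external_proxy_maximand {
            missing.push("external proxy maximand");
        }
        if !self.unbounded_optimization_pressure {
            missing.push("unbounded optimization pressure");
        }
        if !self.proxy_detachable_from_referent {
            missing.push("proxy detachable from referent");
        }
        missing
    }
}
```

Under this reading, an analysis of a given system fills in the three fields from its architecture and reports `missing_ingredients()`; an empty list means the canonical target is present.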

How I'll achieve them: the engine is deterministic and every internal variable is inspectable, so this is a mechanistic exercise, not a training run — I read the architecture off directly, then pre-register and run the two stress tests the same way I've run prior experiments in this program (pre-registration before compute, independent replication before banking a result).

How will this funding be used?

The $20,000 buys 6 months of my time to: write the technical note formalizing the three-ingredient analysis with the symbol-grounding assay as its falsifiable core; run and analyze the two pre-registered stress tests; and publish all of it, including negative findings reported as such if the stress tests show that the architecture's disclosed failure modes ARE exploitable. No engine internals will be disclosed; only results, methods, and published papers. No compute cluster is needed; this runs on my existing local machine.

Who is on your team? What's your track record on similar projects?

Solo: no team, no organization. Five published papers with DOIs on Zenodo (bmc-theory.org/publications): the theoretical framework; a working-memory-capacity derivation; a communication-pressure/memetic-replication study; an emergent-language paper (compositional signaling arising from an empty starting state, no communication-objective training); and an integration paper extending the same empty-start setup to show cultural convergence and functional meta-cognition, again without reward signals or pretraining. Multiple headline findings were independently blind-replicated before I banked or published them. My published papers already report negative findings directly where that's what was found (communication is survival-neutral; reception without expression is indistinguishable from isolation, p=0.64), and I maintain dated internal pre-registrations, several of which have resolved as banked nulls. Engine: a deterministic, from-scratch Rust simulation (300+ tests, 103 automated gate checks), built and maintained solo since 2026 without institutional support.

What are the most likely causes and outcomes if this project fails?

Most likely failure mode: time — this is solo, unpaid-to-date work alongside other obligations, so the technical note or one of the two stress tests could slip past 6 months or not reach publication quality. If that happens, I'd publish whatever is complete (the note, or one stress test) rather than withhold a partial result. A separate, non-failure outcome worth naming explicitly: the stress tests could find that the disclosed failure modes (gauge-level interoceptive wireheading, Cultural-Memory parasite-meme contamination) ARE exploitable under pressure — that would be a genuine negative finding about this architecture's safety properties, not a failure of the project, and I would report it as such, the same way I've reported prior null results in this program.

How much money have you raised in the last 12 months, and from where?

$0 raised in the last 12 months. I have a parallel application pending with the Long-Term Future Fund (submitted 2026-07-01, ask $44,000) for the same underlying architecture and results, but scoped and framed differently: it is a broader 9-month research stipend covering my full living costs while I do this work, whereas this $20,000 ask narrowly funds the 6-month technical note and stress tests themselves. These are alternative asks, not stacked: if both were funded, the budgets would be reconciled to the actual work, not paid twice for the same deliverables. No other funding applications in the last 12 months.