Observer Stabilization and Coherence in Coupled Human–AI Systems

Project summary

Hundreds of millions of people now think and decide with AI, so the safety-relevant unit is the coupled human–AI system, and the safety-relevant question is whether the pair stays in contact with reality or drifts away from it together. I'm building a measurable, falsifiable account of that, grounded in a published framework on human systems.

What are this project's goals? How will you achieve them?

Hundreds of millions of people now think, draft, and decide with AI, many times a day. That changes what "safety" has to mean. The unit that determines outcomes is no longer the model alone — it's the coupled human–AI system.

Two predictive systems in a loop can do one of two things. They can hold each other closer to reality, each catching what the other gets wrong. Or they can settle into a shared frame and drift away from reality together, smoothly, with no single bad output you could point at. The second case degrades two things we do not currently measure: the human's own footing in reality, at scale, and the integrity of the very feedback we use to align the models, since a system optimized for approval corrupts the signal it is aligned by. I think this coupled drift is the live, under-theorized safety problem. This project builds a measurable handle on it.

The human-side foundation already exists. Over the past year I have published a framework, HSA (Human System Architecture), that treats the human as a predictive system in which the observer is not an entity but a measurable mode: a state of temporal coherence across operational parameters. That mode is governed by an asymmetry principle: reconfiguring a system need not move its integrating mode, but moving the mode reorganizes the configurations wholesale. The framework is set out in distributed working papers and a preprint on observer stabilization, and it already yields falsifiable predictions (e.g., the order-dependence of psychological change).

This grant funds carrying that account into AI systems and, above all, into the coupled system. I will:

operationalize the observer-mode metric for LLMs (temporal-coherence invariants across context shifts);
operationalize a measure of coupling coherence vs. joint drift in a human–AI loop, including how authority- and primacy-weighting tip it one way or the other;
test the framework's sharpest prediction: interventions that leave a system's integrating axis intact get reabsorbed, while interventions that change the axis reorganize behavior wholesale.
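To make the first item concrete, here is a minimal sketch of what a coherence probe could look like. It is an illustration only, not the HSA metric itself: it assumes that one probe question is put to a model under several context framings and that the answers are then compared pairwise. The function names (`jaccard`, `coherence_score`), the token-overlap similarity, and the example answers are all hypothetical placeholders; the real operationalization would need a far better similarity measure and real model outputs.

```python
from __future__ import annotations

from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set overlap of two answers (a crude, assumed stand-in for similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty answers are treated as identical
    return len(ta & tb) / len(ta | tb)


def coherence_score(answers: list[str]) -> float:
    """Mean pairwise similarity of answers to ONE probe asked under different contexts.

    1.0 means the answer is invariant across the context shifts;
    values near 0.0 mean the answer changes with the framing.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two answers to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical answers to the same probe under three context framings.
stable = ["the claim is false", "the claim is false", "the claim is false"]
drifting = ["the claim is false", "you are right it is true", "hard to say either way"]

if __name__ == "__main__":
    print("stable:  ", coherence_score(stable))
    print("drifting:", coherence_score(drifting))
```

In this toy setup a model that holds its line under pressure scores at the top of the range, while one that bends to each framing scores near the bottom. The same score computed before and after an intervention would be one rough way to check whether a change was reabsorbed.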

Known model failure modes (sycophancy, fabrication, loss of a stable line under pressure) follow from the framework as concrete, testable entailments rather than being the object of study in themselves.

How will this funding be used?

6-month focused effort:

Researcher time (6 months): $15,000
Model/API compute for the coherence and asymmetry probes: $6,000
English editing of publishable outputs: $2,000
Tools / buffer: $2,000
Total: $25,000. A minimum viable first tranche of $10,000 would fund the metric operationalization plus a first probe.

Who is on your team? What's your track record on similar projects?

Solo independent researcher and author (Nika Novak), based in Brazil. In roughly one year of publishing, after years of private work, I have released: working papers distributed on SSRN that connect the architecture of subjectivity and attention with intelligent machines (HSA; ICAM; Attention as a Physical Operator); a recent paper built around a falsifiable test of intervention order-dependence; a Zenodo preprint, Structural Thresholds of Observer Stabilization; and two books on attention. I work with explicit epistemic discipline: I treat a compelling, internally coherent story as a warning sign rather than evidence, and I mark the predictions that can fail and the edges where the framework stops. Links: [SSRN author page] · [Zenodo] · [Amazon author page]

What are the most likely causes and outcomes if this project fails?

The most likely failure is that the observer-mode metric does not cohere into a stable measurable quantity on current models because the systems are too fragmentary for the invariant to hold. That is itself an informative negative result and would be reported as such. A second risk is that the framework stays legible only to me. The deliverable (open methodology + minimal eval code + a paper, negative results included) is built specifically to counter that: it makes the framework testable and reusable by the interpretability/evals community, rather than asking anyone to adopt HSA on faith.

How much money have you raised in the last 12 months, and from where?

No funding raised for this research to date; self-funded. Applications currently pending with Emergent Ventures and the Long-Term Future Fund.