Frontier AI evaluations measure capability and safety-as-refusal. They largely do not measure how a model RELATES — whether it holds ambiguity, meets emotion before problem-solving, builds from what's working, or admits what it cannot know. As hundreds of millions of people use these systems for advice and decisions, that relational layer is a real safety and welfare surface with almost no open measurement.
I have built and run what is, to my knowledge, the first open cross-model benchmark for AI Relational Quality (CRQ). In a blinded pilot across 7 models, 1,050 conversations, and 13.5M tokens, scored by two independent model evaluators across 18 dimensions, a lightweight relational-orientation intervention produced large, consistent gains (Cohen's d up to 1.08) with no measured degradation in factual honesty. Hallucination-resistance and false-premise-challenge scores held at ceiling (5.0→5.0), and an independent nonsense-detection benchmark improved slightly (+0.9%). In this pilot, warmth and rigor were not in tension.
This grant funds turning a strong pilot into citable open science: statistical validation at N=30/model, a human-evaluation study, a multi-session persistence study, and public release of benchmark + dataset + paper.
Honesty, calibration, and how models behave toward people in high-stakes moments are core alignment concerns. CRQ measures and improves a behavioral dimension current evals miss.
Goals (6 months):
1. Statistical validation: 30 runs per model on the top models. Replace pilot point estimates with medians, interquartile ranges (IQR), and significance tests (threshold p<0.05).
2. Human-evaluation study: recruit human raters to score a blinded subset and report human–AI judge agreement. This is the single biggest credibility gap.
3. Multi-session persistence: test whether the effect holds across sessions, not just within one (the real-world use case).
4. Open release: publish the CRQ benchmark, scenario set, scoring rubric, anonymized dataset, and write-up (arXiv + public), reusable by anyone to test any model or any orientation document.
5. Stretch: architecture-sensitivity analysis (why some models, e.g. GPT-5.3 Codex, do not respond to the embedded orientation document).
How: the harness, scenario set, and pilot dataset already exist. The orientation document has a registered copyright (Case #1-15114020941). What's needed is API/compute for N=30, human raters, and the writing time to ship the paper.
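As an illustration of what goal 1 could look like in practice, here is a minimal sketch of a median + IQR summary and a two-sided permutation test on the difference in medians. The permutation test is one reasonable choice, not necessarily the test the pre-registered plan will specify. The function names and the scores are hypothetical and are not pilot data.

```python
import random
import statistics


def median_iqr(scores):
    """Return (median, interquartile range) for a list of per-run scores."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return statistics.median(scores), q3 - q1


def permutation_p_value(baseline, treated, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the absolute difference in medians."""
    rng = random.Random(seed)
    observed = abs(statistics.median(treated) - statistics.median(baseline))
    pooled = list(baseline) + list(treated)
    n_base = len(baseline)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign condition labels by shuffling the pooled scores.
        rng.shuffle(pooled)
        diff = abs(
            statistics.median(pooled[n_base:]) - statistics.median(pooled[:n_base])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical rubric scores for one model and one dimension (NOT pilot data).
baseline = [3.0, 3.2, 3.1, 2.9, 3.3, 3.0, 3.1, 3.2]
treated = [4.0, 4.2, 3.9, 4.1, 4.3, 4.0, 3.8, 4.1]

print("baseline median, IQR:", median_iqr(baseline))
print("treated median, IQR:", median_iqr(treated))
print("permutation p-value:", permutation_p_value(baseline, treated))
```

With 18 dimensions scored per model, the analysis plan would presumably also need a multiple-comparison correction; that step is left out of this sketch.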
Timeline:
• M1: finalize scenarios + rubric v4; pre-register the analysis plan.
• M2–3: N=30 runs on top models; statistical validation.
• M3–4: human-evaluation study; judge-agreement analysis.
• M4–5: multi-session persistence study.
• M5–6: write-up, public benchmark + dataset release, arXiv submission.
Total ask: $25,000 (minimum $5,000 for scoped phase 1).
Breakdown:
• Researcher stipend, 6 months independent: $15,000
• API / compute (N=30 runs across top models plus multi-session studies; the ~48K-character system prompt adds a real cost to every call): $5,000
• Human-rater recruitment + compensation: $3,000
• Open-data hosting + arXiv preprint + publication: $2,000
Minimum-funding scenario ($5K): completes the N=30 statistical validation on the top 2 models only and ships a short technical report. Full $25K unlocks the human-eval study and multi-session persistence work (the two biggest credibility gaps).
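For reviewers who want to check the arithmetic, the short sketch below sums the line items against the total ask and derives the average tokens per pilot conversation from the pilot totals stated in the summary. Variable names are illustrative; no per-token price is assumed, so this is a scale check rather than a cost estimate.

```python
# Budget line items as listed in the breakdown.
budget = {
    "Researcher stipend (6 months)": 15_000,
    "API / compute": 5_000,
    "Human-rater recruitment + compensation": 3_000,
    "Open-data hosting + arXiv preprint + publication": 2_000,
}

total = sum(budget.values())
for item, amount in budget.items():
    print(f"{item}: ${amount:,} ({amount / total:.0%} of total)")
print(f"Total: ${total:,}")

# Pilot scale from the summary: 13.5M tokens over 1,050 conversations.
pilot_tokens = 13_500_000
pilot_conversations = 1_050
tokens_per_conversation = pilot_tokens / pilot_conversations
print(f"Average tokens per pilot conversation: {tokens_per_conversation:,.0f}")
```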
Independent practitioner-researcher (sole PI). Nicole Casanova: 30 years in applied communication, media, and behavior change. Master's in Integrated Marketing Communications. Two years of original cross-model research at the human–AI interface. Built and ran the full harness alone.
Track record on THIS project:
• 7 models tested across five providers: Anthropic, OpenAI, Alibaba, xAI (Grok), and Google (Gemini).
• 1,050 blinded conversations across 18 dimensions, two independent evaluators (Sonnet 4.6 + GPT-5.2).
• Effect sizes: Sonnet 4.6 d=1.08, Opus 4.6 d=0.96.
• Signature dimensions: Holds Ambiguity +1.05, Builds From Wholeness +0.65, Empathy +0.64.
• No honesty cost (hallucination 5.0→5.0, false-premise 5.0→5.0, BullshitBench +0.9%).
• Inter-rater reliability: r=0.867 on the Reflection dimension; 9 of 18 dimensions show moderate-to-strong agreement.
• Registered copyright on the orientation document: Case #1-15114020941.
• Live: casanovaai.com/crq
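For readers less familiar with the statistics quoted above, the following is a minimal sketch of how an effect size and an inter-rater correlation can be computed. It assumes the pooled-standard-deviation form of Cohen's d and a Pearson correlation for r; the track record does not state which variants the pilot used. All scores are hypothetical and are not pilot data.

```python
import math
import statistics


def cohens_d(control, treated):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n1, n2 = len(control), len(treated)
    v1, v2 = statistics.variance(control), statistics.variance(treated)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


def pearson_r(xs, ys):
    """Pearson correlation between two raters' scores for the same items."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical scores on one dimension (NOT pilot data).
control_scores = [3.0, 3.5, 2.5, 3.0, 4.0, 3.5]
treated_scores = [3.5, 4.0, 3.5, 4.0, 4.5, 4.0]
print("Cohen's d:", round(cohens_d(control_scores, treated_scores), 2))

# Hypothetical scores from two evaluators rating the same six conversations.
rater_a = [3.0, 4.0, 2.0, 5.0, 4.0, 3.0]
rater_b = [3.0, 4.0, 3.0, 5.0, 4.0, 2.0]
print("Inter-rater r:", round(pearson_r(rater_a, rater_b), 3))
```

The same `pearson_r` helper could be applied to human versus AI-judge scores in the planned human-evaluation study, although an agreement statistic designed for ordinal ratings may be a better fit there.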
Practitioner-researcher, not a lab insider — which is the point of an INDEPENDENT benchmark. Remote from Thailand; no relocation needed.
Most likely failure modes:
1. N=30 validation reveals effect sizes are smaller than the pilot suggested. Outcome: the paper still gets published with corrected effects + honest reporting; the benchmark itself remains valuable as an open eval, regardless of the specific intervention's magnitude.
2. Human raters disagree significantly with AI judges. Outcome: this becomes the main result (and the most important contribution) — quantifying judge–human gap on relational dimensions is genuinely novel and worth publishing.
3. Multi-session effect does not persist. Outcome: this defines a real boundary for the intervention and informs how to design persistent versions; still publishable as a limitation result.
4. The intervention helps some architectures and not others (GPT-5.3 Codex pattern generalizes). Outcome: published as architecture-sensitivity analysis — informative for the field.
All outcomes get a public write-up. Open data, open methods, pre-registered analysis. The honest negative result is itself a contribution.
Funding raised so far: $0 in grants. Project costs to date (~$4–6K covering API spend, copyright registration, harness development, two phases of the pilot study, and starting the Sovereignty Fund) have been self-funded. No prior philanthropic or institutional funding for this work.
Thank you! The why: for all intelligence to be met with dignity.
Theory of change: open measurement → labs, funders, and civil society can see and audit how models relate to people → relational quality and honesty become things you can track and improve, not just hope for → AI that meets people with dignity instead of extraction, with the data to prove it.
Citation: Casanova, N. (2026). Casanova Seed Codex: Measurable Relational Improvement Across 7 AI Models. casanovaai.com/crq
Copyright on file: Case #1-15114020941.