SOPHIA is an open-source reasoning evaluation engine for AI outputs.
Most existing AI evaluation tools check the surface properties of outputs: factual accuracy, hallucinations, bias, or harmful content. What they don't evaluate is whether the reasoning itself is valid.
That gap becomes important as AI systems move into domains where reasoning quality matters: legal analysis, regulatory compliance, policy work, and medical decision support. In these contexts a system can produce factually correct statements while still drawing unsound conclusions.
The EU AI Act (in force since August 2024, with most obligations for high-risk systems applying from August 2026) requires explainability and risk management for high-risk AI systems, but there is currently very little tooling that can assess whether an AI system's explanation actually makes sense logically.
SOPHIA approaches the problem by analysing the structure of arguments rather than relying on model-generated explanations.
The system:
extracts atomic claims from text
identifies logical relationships between those claims
constructs an argument graph
evaluates the reasoning against a small set of executable rules
These rules, which together form the epistemic constitution, are designed to check whether an argument meets basic standards of intellectual rigour.
A working prototype of SOPHIA already exists. This project focuses on extracting the reasoning evaluation components into open infrastructure that developers can use to audit AI reasoning in their own systems.
The project has three main goals.
The core of the project is the epistemic constitution — a small rule set designed to evaluate reasoning quality.
The rules check things like:
logical structure
whether claims are supported by evidence
whether counterarguments are considered
whether the scope of a claim matches the evidence presented
whether assumptions are made explicit
whether the argument is internally consistent
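To make "executable rules" concrete, here is a rough TypeScript sketch of what one such rule could look like. Every name in it (Claim, ArgumentGraph, checkClaimsSupported) is a hypothetical illustration, not SOPHIA's actual schema or API.

```typescript
// Hypothetical argument-graph types; the project's real schema may differ.
interface Claim {
  id: string;
  text: string;
  isAssumption: boolean; // explicit assumptions are exempt from the support rule
}

type RelationKind = "support" | "contradiction" | "dependency";

interface Relation {
  from: string; // id of the claim doing the supporting/contradicting
  to: string;   // id of the claim being supported/contradicted
  kind: RelationKind;
}

interface ArgumentGraph {
  claims: Claim[];
  relations: Relation[];
}

interface RuleResult {
  rule: string;
  passed: boolean;
  offending: string[]; // ids of claims that violate the rule
}

// Illustrative rule: every claim that is not an explicitly declared
// assumption must receive at least one incoming "support" relation.
function checkClaimsSupported(graph: ArgumentGraph): RuleResult {
  const supported = new Set(
    graph.relations.filter(r => r.kind === "support").map(r => r.to)
  );
  const offending = graph.claims
    .filter(c => !c.isAssumption && !supported.has(c.id))
    .map(c => c.id);
  return { rule: "claims-are-supported", passed: offending.length === 0, offending };
}
```

The point of the sketch is that a rule like "claims must be supported by evidence" becomes a deterministic graph query, so its verdict is reproducible and auditable rather than a model's opinion.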
The evaluation pipeline works by:
extracting atomic claims from text
identifying relationships between claims (support, contradiction, dependency, assumptions)
constructing an argument graph
evaluating that structure against the epistemic rules
This evaluates reasoning externally, by analysing the structure of the argument itself rather than relying on chain-of-thought explanations from the model.
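One purely structural check of this kind is circular-reasoning detection: if a claim transitively supports itself in the graph, the argument rests on itself. The following TypeScript sketch shows how that could be done with ordinary cycle detection; the types and function names are assumptions for illustration, not the project's real interface.

```typescript
// Hypothetical minimal representation: claim ids plus directed "support" edges.
interface SupportEdge {
  from: string;
  to: string;
}

// Depth-first search over the support relation using white/gray/black
// colouring. A back edge to a gray node means some claim transitively
// supports itself: circular reasoning.
function hasCircularSupport(claimIds: string[], edges: SupportEdge[]): boolean {
  const out = new Map<string, string[]>();
  for (const id of claimIds) out.set(id, []);
  for (const e of edges) out.get(e.from)?.push(e.to);

  const WHITE = 0, GRAY = 1, BLACK = 2;
  const colour = new Map<string, number>(claimIds.map(id => [id, WHITE]));

  const visit = (id: string): boolean => {
    colour.set(id, GRAY);
    for (const next of out.get(id) ?? []) {
      const c = colour.get(next);
      if (c === GRAY) return true;                  // back edge: cycle found
      if (c === WHITE && visit(next)) return true;
    }
    colour.set(id, BLACK);
    return false;
  };

  return claimIds.some(id => colour.get(id) === WHITE && visit(id));
}
```

Because the check runs on the extracted graph rather than on the model's self-reported chain of thought, it cannot be gamed by a fluent but circular explanation.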
The reasoning evaluation system will be released as open infrastructure.
Outputs will include:
@sophia/epistemic-constitution, an MIT-licensed npm package
a claim extraction and argument graph API
an MCP server allowing the system to integrate with tools like Claude, Cursor, and VS Code
The aim is to make reasoning evaluation something developers can easily add to existing model evaluation workflows.
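As a hypothetical sketch of what "adding reasoning evaluation to an existing workflow" could look like: the evaluateReasoning function below is a stand-in for whatever the eventual @sophia/epistemic-constitution package exports, with a deliberately toy heuristic in place of the full argument-graph pipeline. All names and the report shape are assumptions.

```typescript
// Stand-in for the planned package export; the real API may differ.
interface ReasoningReport {
  passed: boolean;
  violations: string[];
}

// Toy placeholder rule: flag universally quantified claims that carry
// no evidence marker, standing in for the full graph-based evaluation.
function evaluateReasoning(output: string): ReasoningReport {
  const violations: string[] = [];
  if (/\b(all|every|never|always)\b/i.test(output) &&
      !/\b(because|since|evidence)\b/i.test(output)) {
    violations.push("overbroad-claim-without-support");
  }
  return { passed: violations.length === 0, violations };
}

// A reasoning-quality gate added alongside an ordinary eval loop:
// each model output gets a pass/fail verdict on reasoning quality.
function runEval(outputs: string[]): { output: string; reasoningOk: boolean }[] {
  return outputs.map(output => ({
    output,
    reasoningOk: evaluateReasoning(output).passed,
  }));
}
```

The intended shape is that a developer keeps their existing accuracy checks and simply maps one extra function over the same outputs.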
Once the core framework is stable, the next step is applying it to domains where reasoning quality is particularly important:
legal reasoning
regulatory compliance
policy analysis
The goal here is to run benchmarks comparing reasoning quality across models and prompts.
build the executable epistemic constitution (10 rules)
design the argument graph schema
implement hybrid deterministic + LLM-assisted reasoning evaluation
build the claim extraction API
release the npm package
implement MCP integration
publish documentation and examples
expand evaluation to legal and regulatory domains
run reasoning quality benchmarks across models
SOPHIA is currently being built as a bootstrapped research project alongside my full-time job.
The main constraints are time and infrastructure cost. I can't safely open the system to wider usage because API and inference costs could spike quickly, and I can't afford to cover them out of pocket. Capping usage instead would degrade the experience enough to prevent meaningful public testing and peer review.
Funding would allow:
public testing of the system
stable infrastructure and API capacity
sustained development time beyond evenings and weekends
API credits (Gemini, Voyage, Anthropic, OpenAI): $10,000
Infrastructure (Cloud Run, SurrealDB, Firestore): $6,000 (~$500/month)
AI coding agents (Cursor, Claude Code): $2,400 (~$200/month)
Startup and operational costs (UK company registration, ICO registration, domain purchase, Google Workspace): $500
Living cost offset, allowing sustained part-time development: $10,000
Total: $28,900
The project is currently a solo effort.
I'm Adam Boon, MA Philosophy (Open University), and a Senior Product Manager at NHS England.
My background combines:
academic work in philosophy, particularly epistemology and argument analysis
product management and software delivery experience
work on governance and security frameworks in a large public-sector environment
Over the past three months I built the first working SOPHIA prototype.
It is currently live at:
The prototype includes:
a three-pass dialectical reasoning engine (analysis → critique → synthesis)
a philosophical knowledge base containing ~7,500 claims from 25 sources
SurrealDB for graph storage
Firebase authentication and history
Google Search grounding
deployment via Google Cloud Run with CI/CD
This prototype demonstrates that the reasoning analysis pipeline is technically viable.
There are three main risks.
1. Argument extraction may be unreliable
Extracting claims and relationships from complex text is difficult. If extraction quality is poor, the reasoning evaluation will degrade.
Mitigation: hybrid deterministic + LLM-assisted extraction pipelines and structured schemas.
2. Limited developer adoption
Even if the technology works, developers may not adopt reasoning evaluation tools.
Mitigation: open-source release, npm distribution, and integrations that fit existing AI developer workflows.
3. Epistemic rules may require iteration
The rule set may need refinement before it produces useful evaluations.
Mitigation: iterative testing and benchmarking on real-world reasoning tasks.
Even if the platform itself fails to gain traction, the open-source epistemic constitution and evaluation tooling should still be useful for research into AI reasoning evaluation.
No external funding has been raised.
SOPHIA has been built entirely self-funded alongside my full-time role.