LLM agents increasingly choose which tool to call from a large registry, and knowing why the agent picked the tool it did is a precondition for safe deployment: incident analysis, capability auditing, and red-teaming all need a routing signal humans can actually inspect. Today that signal is usually a cosine-similarity score inside a 1,536-dimensional embedding space.
I’ve built and shipped Meridian, an open-source MCP server that replaces opaque embedding routing with a deterministic orbital classifier: every candidate receives a physics signature, a celestial class, and a one-line decision rule explaining why it ranked where it did.
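To make "deterministic and inspectable" concrete, here is a minimal sketch of the shape of such a decision. Everything in it is hypothetical: the keyword-overlap scorer, the class names, and the interface are illustrative stand-ins, not Meridian's actual signature or classifier.

```typescript
// Hypothetical sketch only: Meridian's real classifier derives a physics
// signature; this stand-in uses keyword overlap to show the output shape.
interface RoutingDecision {
  skill: string;          // candidate tool/skill name
  score: number;          // deterministic, recomputable by hand
  celestialClass: string; // coarse bucket, e.g. "planet" = strong match
  rule: string;           // the one-line explanation a human can audit
}

function route(task: string, skills: Map<string, string[]>): RoutingDecision[] {
  const taskTerms = new Set(task.toLowerCase().split(/\W+/).filter(Boolean));
  const decisions: RoutingDecision[] = [];
  for (const [skill, keywords] of skills) {
    const hits = keywords.filter((k) => taskTerms.has(k.toLowerCase()));
    const score = hits.length / Math.max(keywords.length, 1);
    decisions.push({
      skill,
      score,
      celestialClass: score > 0.5 ? "planet" : score > 0 ? "asteroid" : "debris",
      rule: hits.length > 0
        ? `matched ${hits.length}/${keywords.length} keywords: ${hits.join(", ")}`
        : "no keyword overlap with task",
    });
  }
  return decisions.sort((a, b) => b.score - a.score);
}
```

The point is the output contract, not the scorer: the same inputs always produce the same ranking, and the `rule` string is the audit trail that a bare cosine score cannot provide.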
Live now:
MCP endpoint: mcp.ask-meridian.uk
GitHub: Meridian MCP Repository
Meridian v2.1.0 already ships with:
OAuth 2.1 + PKCE
stdio + Streamable-HTTP transports
deterministic orbital routing
47 unit tests
npm package + GHCR image
Cloudflare Worker deployment
GitHub Actions CD pipeline
This $5,000 grant funds a public benchmark for tool-routing failures:
a labelled routing dataset
a two-judge evaluation matrix
an open-source eval harness
and a reproducible write-up
The goal is to make tool-routing failure rates a measurable, comparable, and citable metric, in the same way perplexity standardized language-model evaluation.
1. Labelled task→skill dataset (~500 pairs)
Tasks spanning coding, research, operations, and creative domains. Each task includes:
one correct skill
four distractors
labels from paid human annotators via Prolific or Surge AI
Published publicly on HuggingFace; a possible record shape is sketched below.
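A dataset record could look like the following; the field names and schema are my illustration, not the published HuggingFace format.

```typescript
// Illustrative record shape for the labelled task→skill dataset.
interface RoutingExample {
  id: string;
  task: string;                                  // natural-language task
  domain: "coding" | "research" | "operations" | "creative";
  correctSkill: string;                          // the one right answer
  distractors: [string, string, string, string]; // four plausible wrong picks
  annotatorAgreement?: number;                   // fraction of labellers agreeing
}

const example: RoutingExample = {
  id: "ops-0042",
  task: "Roll back the last deployment and notify the on-call channel.",
  domain: "operations",
  correctSkill: "deployment_rollback",
  distractors: ["deploy_service", "read_logs", "create_incident", "send_email"],
  annotatorAgreement: 0.8,
};
```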
2. Two-judge evaluation matrix
Routing judged by:
Anthropic Sonnet 4.6
xAI Grok-4
Plus:
self-hosted BGE-large-en embedding baseline running on Modal
This creates a reproducible comparison (see the sketch after this list) between:
frontier-model routing
embedding-based routing
deterministic routing
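Under the hood the comparison reduces to a single scoring loop; the sketch below assumes every system under test is wrapped as a function that picks one skill from a fixed candidate list, which is my framing rather than a finalized harness design.

```typescript
// Each judge or router is reduced to the same signature.
type Router = (task: string, candidates: string[]) => Promise<string>;

interface EvalExample {
  task: string;
  correctSkill: string;
  distractors: string[];
}

async function accuracy(pick: Router, dataset: EvalExample[]): Promise<number> {
  let correct = 0;
  for (const ex of dataset) {
    // A real run would shuffle candidate order per example so LLM
    // judges are not rewarded for position bias.
    const candidates = [ex.correctSkill, ...ex.distractors];
    if ((await pick(ex.task, candidates)) === ex.correctSkill) correct++;
  }
  return correct / dataset.length;
}
```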
3. Open-source evaluation harness (CLI)
One-command evaluation (see the adapter sketch below) against:
Meridian
LangChain routers
LlamaIndex routers
vanilla embedding systems
MCP-compatible routing systems
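One plausible shape for that harness is a single adapter contract every framework implements; the interface and the cosine baseline below are an assumption about the design, not the shipped CLI.

```typescript
// Hypothetical adapter contract: Meridian, LangChain, LlamaIndex, and
// plain embedding routers all plug into the harness through this shape.
export interface RouterAdapter {
  name: string;
  route(task: string, candidates: string[]): Promise<string>;
}

// A vanilla cosine-similarity baseline built from any embed() function,
// e.g. a self-hosted BGE-large-en endpoint.
export function embeddingAdapter(
  embed: (text: string) => Promise<number[]>,
): RouterAdapter {
  const cosine = (a: number[], b: number[]): number => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  };
  return {
    name: "embedding-baseline",
    async route(task, candidates) {
      const taskVec = await embed(task);
      const scored = await Promise.all(
        candidates.map(async (c) => [c, cosine(taskVec, await embed(c))] as const),
      );
      return scored.sort((a, b) => b[1] - a[1])[0][0];
    },
  };
}
```

The one-command invocation would then be something like `meridian-eval --adapter langchain --dataset <hf-path>` (hypothetical flags).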
4. Public write-up
A LessWrong / Alignment Forum post with:
reproducible code
benchmark methodology
routing analysis
published dataset
Dataset downloadable on HuggingFace by month 3
Harness runnable externally with one command by month 4
At least one external framework adopts or cites the eval within 6 months
Cross-classifier results produce publishable insight regardless of outcome:
If deterministic routing wins → interpretable routing is viable
If it loses → interpretable routing carries measurable capability cost
Anthropic + xAI API credits (judges + baselines) — $1,500
Human labelling (~500 pairs) — $1,400
Starlink hardware + 4 months service — $950
Cloudflare Workers Paid + GitHub Models inference — $400
GPU compute (Modal / Lambda) — $300
Buffer (~10%) — $450
No salary or stipend is included. Development work is performed independently alongside contracting income.
Reduced scope:
single judge (Sonnet 4.6 only)
~250 labelled pairs
no Starlink reliability layer
Still useful, but less reproducible and less citable.
Independent solo engineer.
Meridian MCP
GitHub Repository
Lens — WebXR Vision Lab pairing SmolVLM + Meridian routing
Lens Repository
lens.ask-meridian.uk
Photon — photonic retrieval router using the Meridian backend
Photon Repository
Writing & architecture notes
ask-meridian.uk/blog
Published work includes:
classifier walkthroughs
deterministic routing analysis
OAuth operator-pays architecture
Cloudflare Workers vs GitHub Pages deployment trade-offs
Embedding or LLM-based routing may outperform Meridian.
Outcome:
still produces a useful public benchmark
still yields a publishable result
clarifies whether interpretability costs capability
Routing datasets are difficult to label reliably.
Mitigation:
second-pass review
manual validation sampling
reduced dataset size if noise exceeds threshold
Possible outcome:
benchmark remains useful as a public reference
still functions as an internal regression metric for Meridian
Lower funding reduces:
dataset size
judge diversity
reproducibility
The project still ships regardless of funding outcome.
$0 external funding raised.
All work to date has been self-funded through contracting income, with approximately £1,500 (~$1,900 USD) spent across 2025–2026 on:
Cloudflare Workers Paid
GitHub Models inference overage
domains/DNS
monitoring
infrastructure