An interpretable multi-LLM judge system that decomposes evaluation into explicit criteria (e.g., truthfulness, instruction-following) and enables auditing of alignment pipelines. It includes a scalable persona-based method for simulating diverse human preferences and detecting bias or drift in evaluation systems.
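As a rough illustration of the decomposed multi-judge idea described above, the sketch below scores a response on separate criteria and combines them with a weighted sum, keeping the per-criterion breakdown so a verdict can be audited. All names, weights, and the placeholder scoring function are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative only: criteria and weights are hypothetical, not the project's rubric.
CRITERIA_WEIGHTS = {
    "truthfulness": 0.4,
    "instruction_following": 0.4,
    "harmlessness": 0.2,
}

def judge_score(criterion: str, prompt: str, response: str) -> float:
    """Placeholder for a single-criterion LLM judge call.

    In a real pipeline this would send a criterion-specific rubric to a judge
    model and parse a numeric score; here it returns a fixed value so the
    sketch runs without external dependencies.
    """
    return 0.5  # stand-in score in [0, 1]

def aggregate(prompt: str, response: str) -> tuple[float, dict[str, float]]:
    """Score each criterion separately, then combine with a weighted sum.

    Keeping per-criterion scores makes the final verdict auditable: a low
    aggregate can be traced back to the criterion that caused it.
    """
    per_criterion = {c: judge_score(c, prompt, response) for c in CRITERIA_WEIGHTS}
    total = sum(CRITERIA_WEIGHTS[c] * s for c, s in per_criterion.items())
    return total, per_criterion

if __name__ == "__main__":
    score, breakdown = aggregate("Summarise the paper.", "The paper argues ...")
    print(f"aggregate={score:.2f}", breakdown)
```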
Build interpretable evaluation systems for LLM-as-a-judge pipelines
Detect bias, inconsistency, and preference drift
Improve alignment with human judgments
Enable monitoring of evaluation systems in real-world deployment
Extend multi-judge aggregation:
More evaluation dimensions
Stronger modeling backbone
Benchmark against human-annotated datasets
Develop methods for:
Detecting latent preference drift (see the sketch after this list)
Identifying judge bias patterns
Explore integration into:
Prompt optimization loops
Production monitoring systems
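For the drift-detection direction above, here is a minimal sketch of how persona-based preference simulation might surface drift: a panel of simulated personas re-scores the same outputs, and a growing gap between the judge's verdicts and the panel average is treated as a drift signal. The persona names, weights, and data below are invented for illustration and are not the project's implementation.

```python
import random

# Hypothetical personas, each a different weighting over the same criteria;
# names and weights are illustrative, not taken from the project.
PERSONAS = {
    "strict_fact_checker": {"truthfulness": 0.7, "instruction_following": 0.2, "style": 0.1},
    "task_focused_user":   {"truthfulness": 0.3, "instruction_following": 0.6, "style": 0.1},
    "casual_reader":       {"truthfulness": 0.3, "instruction_following": 0.3, "style": 0.4},
}

def persona_preference(weights: dict[str, float], scores: dict[str, float]) -> float:
    """Preference of one simulated persona: weighted sum of per-criterion scores."""
    return sum(weights[c] * scores.get(c, 0.0) for c in weights)

def drift_signal(judge_scores: list[float], persona_scores: list[float]) -> float:
    """Crude drift signal: mean gap between judge verdicts and the persona panel.

    A gap that grows across successive evaluation batches suggests the judge
    is moving away from the simulated human preference distribution.
    """
    gaps = [abs(j - p) for j, p in zip(judge_scores, persona_scores)]
    return sum(gaps) / len(gaps)

if __name__ == "__main__":
    random.seed(0)
    # Placeholder per-criterion scores for a batch of responses.
    batch = [{c: random.random() for c in ("truthfulness", "instruction_following", "style")}
             for _ in range(5)]
    judge = [0.5 * s["truthfulness"] + 0.5 * s["instruction_following"] for s in batch]
    panel = [sum(persona_preference(w, s) for w in PERSONAS.values()) / len(PERSONAS)
             for s in batch]
    print(f"mean judge-vs-panel gap: {drift_signal(judge, panel):.3f}")
```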
This is a request for a travel grant to present the work at TAIS (Oxford).
Approximate breakdown of the ~$1,800 total:
~70%: Flights (India → UK)
~10–12%: Accommodation (2 nights, budget stay)
~5%: Airport transfer (Heathrow–Oxford)
~10%: Visa fees
~10%: Contingency buffer
This represents the minimum required funding to attend and present.
This is an individual funding request.
The research was developed during the Apart Fellowship with collaborators:
Eitan Sprejer
Augusto Mariano Bernardi
Fernando Avalos
Jacob Haimes
Michael Lan
Narmeen Fatimah Oozeer
Apart Fellowship researcher (AI safety focus)
3rd place — Apart × Martian hackathon (https://apartresearch.com/project/judge-using-sae-features-y1r1)
Work submitted to NeurIPS MechInterp Workshop (positive feedback)
Published technical writing on interpretability
Current project accepted at TAIS
Focus areas:
Interpretability and Alignment
Evaluation
Limited empirical validation against human labels
Persona-based simulation may not fully capture real preferences
Interpretability may not scale cleanly to complex evaluation setups
System provides limited improvement over baseline judges
Insights remain theoretical rather than practically deployable
Low adoption by alignment researchers
Benchmarking against human annotations
Iterative refinement with researcher feedback (via TAIS)
Focus on practical integration pathways
This work was conducted during the Apart Fellowship, where I received a stipend as part of the program. This stipend supported my time but was not project-specific funding and did not cover travel or research expenses.