An interpretable multi-LLM judge system that decomposes evaluation into explicit criteria (e.g., truthfulness, instruction-following) and enables auditing of alignment pipelines. It includes a scalable persona-based method for simulating diverse human preferences and detecting bias or drift in evaluation systems.
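As a rough illustration of the decomposed multi-judge idea described above, the sketch below scores a response on separate criteria and combines them with a weighted sum, keeping the per-criterion breakdown so a verdict can be audited. All names, weights, and the placeholder scoring function are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative only: criteria and weights are hypothetical, not the project's rubric.
CRITERIA_WEIGHTS = {
    "truthfulness": 0.4,
    "instruction_following": 0.4,
    "harmlessness": 0.2,
}

def judge_score(criterion: str, prompt: str, response: str) -> float:
    """Placeholder for a single-criterion LLM judge call.

    In a real pipeline this would send a criterion-specific rubric to a judge
    model and parse a numeric score; here it returns a fixed value so the
    sketch runs without external dependencies.
    """
    return 0.5  # stand-in score in [0, 1]

def aggregate(prompt: str, response: str) -> tuple[float, dict[str, float]]:
    """Score each criterion separately, then combine with a weighted sum.

    Keeping per-criterion scores makes the final verdict auditable: a low
    aggregate can be traced back to the criterion that caused it.
    """
    per_criterion = {c: judge_score(c, prompt, response) for c in CRITERIA_WEIGHTS}
    total = sum(CRITERIA_WEIGHTS[c] * s for c, s in per_criterion.items())
    return total, per_criterion

if __name__ == "__main__":
    score, breakdown = aggregate("Summarise the paper.", "The paper argues ...")
    print(f"aggregate={score:.2f}", breakdown)
```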
Build interpretable evaluation systems for LLM-as-a-judge pipelines
Detect bias, inconsistency, and preference drift
Improve alignment with human judgments
Enable monitoring of evaluation systems in real-world deployment
Extend multi-judge aggregation:
More evaluation dimensions
Stronger modeling backbone
Benchmark against human-annotated datasets
Develop methods for:
Detecting latent preference drift (see the sketch after this list)
Identifying judge bias patterns
Explore integration into:
Prompt optimization loops
Production monitoring systems
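For the drift-detection direction above, here is a minimal sketch of how persona-based preference simulation might surface drift: a panel of simulated personas re-scores the same outputs, and a growing gap between the judge's verdicts and the panel average is treated as a drift signal. The persona names, weights, and data below are invented for illustration and are not the project's implementation.

```python
import random

# Hypothetical personas, each a different weighting over the same criteria;
# names and weights are illustrative, not taken from the project.
PERSONAS = {
    "strict_fact_checker": {"truthfulness": 0.7, "instruction_following": 0.2, "style": 0.1},
    "task_focused_user":   {"truthfulness": 0.3, "instruction_following": 0.6, "style": 0.1},
    "casual_reader":       {"truthfulness": 0.3, "instruction_following": 0.3, "style": 0.4},
}

def persona_preference(weights: dict[str, float], scores: dict[str, float]) -> float:
    """Preference of one simulated persona: weighted sum of per-criterion scores."""
    return sum(weights[c] * scores.get(c, 0.0) for c in weights)

def drift_signal(judge_scores: list[float], persona_scores: list[float]) -> float:
    """Crude drift signal: mean gap between judge verdicts and the persona panel.

    A gap that grows across successive evaluation batches suggests the judge
    is moving away from the simulated human preference distribution.
    """
    gaps = [abs(j - p) for j, p in zip(judge_scores, persona_scores)]
    return sum(gaps) / len(gaps)

if __name__ == "__main__":
    random.seed(0)
    # Placeholder per-criterion scores for a batch of responses.
    batch = [{c: random.random() for c in ("truthfulness", "instruction_following", "style")}
             for _ in range(5)]
    judge = [0.5 * s["truthfulness"] + 0.5 * s["instruction_following"] for s in batch]
    panel = [sum(persona_preference(w, s) for w in PERSONAS.values()) / len(PERSONAS)
             for s in batch]
    print(f"mean judge-vs-panel gap: {drift_signal(judge, panel):.3f}")
```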
This is a request for a travel grant to present the work at TAIS (Oxford).
Approximate breakdown of the ~$1,800 total:
~70%: Flights (India → UK)
~10–12%: Accommodation (2 nights, budget stay)
~5%: Airport transfer (Heathrow–Oxford)
~10%: Visa fees
~10%: Contingency buffer
This represents the minimum required funding to attend and present.
This is an individual funding request.
The research was developed during the Apart Fellowship with collaborators:
Eitan Sprejer
Augusto Mariano Bernardi
Fernando Avalos
Jacob Haimes
Michael Lan
Narmeen Fatimah Oozeer
Apart Fellowship researcher (AI safety focus)
3rd place — Apart × Martian hackathon (https://apartresearch.com/project/judge-using-sae-features-y1r1)
Work submitted to NeurIPS MechInterp Workshop (positive feedback)
Published technical writing on interpretability
Current project accepted at TAIS
Focus areas:
Interpretability and Alignment
Evaluation
Limited empirical validation against human labels
Persona-based simulation may not fully capture real preferences
Interpretability may not scale cleanly to complex evaluation setups
System provides limited improvement over baseline judges
Insights remain theoretical rather than practically deployable
Low adoption by alignment researchers
Benchmarking against human annotations
Iterative refinement with researcher feedback (via TAIS)
Focus on practical integration pathways
This work was conducted during the Apart Fellowship, where I received a stipend as part of the program. This stipend supported my time but was not project-specific funding and did not cover travel or research expenses.