This project investigates whether large language models (LLMs) can generate high-quality reviews for academic papers on platforms like OpenReview and arXiv. We’ll compare closed-source models (e.g., Gemini, Claude) with open-source models (e.g., Qwen), fine-tune local models on existing reviews, and use interpretability tools to understand what drives effective feedback.
Goal 1: Assess Review Quality
Use OpenReview papers and their human-written reviews as a benchmark.
Generate reviews with LLMs and compare them to the human reviews using BLEU, ROUGE, and human evaluation of insightfulness.
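As a rough sketch of the scoring step, the snippet below compares one LLM-generated review against its human-written reference with BLEU and ROUGE. It assumes the sacrebleu and rouge-score packages; the example strings and the score_review helper are placeholders, not final evaluation code.

```python
# Hedged sketch: score one generated review against a human reference.
import sacrebleu
from rouge_score import rouge_scorer

def score_review(generated: str, human_reference: str) -> dict:
    """Return BLEU and ROUGE-1/ROUGE-L F1 for a single generated review."""
    bleu = sacrebleu.corpus_bleu([generated], [[human_reference]]).score
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(human_reference, generated)
    return {
        "bleu": bleu,
        "rouge1_f": rouge["rouge1"].fmeasure,
        "rougeL_f": rouge["rougeL"].fmeasure,
    }

# Example usage with placeholder review texts:
print(score_review(
    "The method is novel but the evaluation lacks ablations.",
    "Novel approach; however, the experiments omit ablation studies.",
))
```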
Goal 2: Compare Model Types
Evaluate reviews from closed-source (Gemini, Claude) and open-source (Qwen, Llama) models on identical papers.
Identify trade-offs in performance, cost, and accessibility.
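A minimal sketch of how the head-to-head comparison could be run, assuming each provider (or a local server hosting Qwen/Llama, e.g. vLLM or Ollama) exposes an OpenAI-compatible chat endpoint. The model names, base URLs, and API key below are illustrative placeholders, not the final setup.

```python
# Hedged sketch: send the same review prompt to several models for side-by-side comparison.
from openai import OpenAI

ENDPOINTS = {
    "closed-example": {"base_url": "https://api.example-provider.com/v1", "model": "closed-model-name"},
    "open-example":   {"base_url": "http://localhost:8000/v1",            "model": "qwen-local"},
}

PROMPT_TEMPLATE = (
    "You are a peer reviewer. Write a structured review (summary, strengths, "
    "weaknesses, questions) for the following paper:\n\n{paper_text}"
)

def collect_reviews(paper_text: str) -> dict:
    """Return {endpoint name: generated review} for one paper."""
    reviews = {}
    for name, cfg in ENDPOINTS.items():
        client = OpenAI(base_url=cfg["base_url"], api_key="YOUR_KEY")  # placeholder key
        resp = client.chat.completions.create(
            model=cfg["model"],
            messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(paper_text=paper_text)}],
        )
        reviews[name] = resp.choices[0].message.content
    return reviews
```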
Goal 3: Enhance Reviews via Fine-Tuning
Fine-tune an open-source model (e.g., Qwen) on OpenReview review data.
Measure improvements in review quality post-fine-tuning.
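One possible shape for the fine-tuning step, sketched as LoRA fine-tuning via Hugging Face datasets/peft/trl. The model id, data file, single "text"-field format, and hyperparameters are assumptions, and exact trainer argument names can vary across trl versions.

```python
# Hedged sketch: LoRA fine-tuning of an open model on (paper, review) pairs from OpenReview.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Each JSONL row is assumed to hold one "text" field:
# "<paper abstract/body>\n\n### Review:\n<human review>"
train_data = load_dataset("json", data_files="openreview_pairs.jsonl", split="train")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed HF model id; swap for the chosen checkpoint
    train_dataset=train_data,
    peft_config=lora,
    args=SFTConfig(
        output_dir="qwen-review-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
```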
Goal 4: Interpret Review Drivers
Apply sparse autoencoders to pinpoint paper elements (e.g., abstract, methods) most influencing LLM-generated reviews.
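To make the interpretability plan concrete, here is a minimal sparse-autoencoder sketch over cached LLM activations. The dimensions, L1 penalty weight, and random stand-in activations are illustrative; in practice the inputs would be activations harvested from a chosen transformer layer while the model reads different paper sections.

```python
# Hedged sketch: a minimal sparse autoencoder (SAE) over cached activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps the latent code non-negative; with the L1 term it becomes sparse.
        latents = torch.relu(self.encoder(x))
        recon = self.decoder(latents)
        return recon, latents

def sae_loss(recon, latents, x, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the latent code.
    return ((recon - x) ** 2).mean() + l1_coeff * latents.abs().mean()

# Toy training step on random vectors standing in for cached residual-stream activations.
sae = SparseAutoencoder(d_model=768, d_hidden=8 * 768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 768)  # placeholder batch of activations
opt.zero_grad()
recon, latents = sae(acts)
loss = sae_loss(recon, latents, acts)
loss.backward()
opt.step()
```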
Computational Resources: Rent GPU instances for training and inference ($3,000-$5,000)
API Costs: Calls to LLM providers ($1,000-$2,000)
Dissemination: Conference fees or publication costs ($500-$1,000)
Cost of living in a medium-cost-of-living (MCOL) area: $3,000-$6,000 per month
Dmitrii Magas, ML SWE @ Lyft, https://eamag.me/
Similar projects: https://openreview-copilot.eamag.me/, https://eamag.me/2024/Automated-Paper-Classification
LLMs fail to match human review depth.
Fine-tuning yields minimal gains.
Interpretability tools lack actionable insights.