Research on AI-Powered Peer Review: Evaluating LLMs for Academic Feedback

Science & technology

Dmitrii

Not funded (Grant)
$0 raised

Project summary

This project investigates whether large language models (LLMs) can generate high-quality reviews for academic papers on platforms like OpenReview and arXiv. We’ll compare closed-source models (e.g., Gemini, Claude) with open-source models (e.g., Qwen), fine-tune local models on existing reviews, and use interpretability tools to understand what drives effective feedback.
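
To make the generation step concrete, here is a minimal sketch of prompting a model to review a paper through an OpenAI-compatible chat endpoint (which hosted providers and local servers such as vLLM-served Qwen can both expose); the model name, prompt wording, and generate_review helper are illustrative assumptions, not a fixed design. Pointing the same helper at different base URLs is one simple way to compare closed- and open-source models on identical papers.

# Minimal sketch: generate a paper review via an OpenAI-compatible chat API.
# Model name, prompt, and helper are placeholders, not the project's final design.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY; base_url can point at a local server instead

REVIEW_PROMPT = (
    "You are an expert reviewer for a machine learning venue. Write a structured "
    "review (summary, strengths, weaknesses, questions, rating) for this paper:\n\n{paper}"
)

def generate_review(paper_text: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REVIEW_PROMPT.format(paper=paper_text)}],
        temperature=0.2,
    )
    return response.choices[0].message.content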

What are this project's goals and how will you achieve them?

  • Goal 1: Assess Review Quality

    • Use OpenReview papers and their human-written reviews as a benchmark.

    • Generate reviews with LLMs and compare them to human reviews using BLEU, ROUGE, and human evaluations for insightfulness (a scoring sketch follows this list).

  • Goal 2: Compare Model Types

    • Evaluate reviews from closed-source (Gemini, Claude) and open-source (Qwen, Llama) models on identical papers.

    • Identify trade-offs in performance, cost, and accessibility.

  • Goal 3: Enhance Reviews via Fine-Tuning

    • Fine-tune an open-source model (e.g., Qwen) on OpenReview review data (a fine-tuning sketch follows this list).

    • Measure improvements in review quality post-fine-tuning.

  • Goal 4: Interpret Review Drivers

    • Apply sparse autoencoders to pinpoint paper elements (e.g., abstract, methods) most influencing LLM-generated reviews (a minimal SAE sketch follows this list).
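
For Goal 1, a minimal scoring sketch, assuming the rouge-score and sacrebleu packages and that each LLM review is already paired with its human counterpart as plain text; BLEU and ROUGE only measure surface overlap, so the human insightfulness ratings are collected separately.

# Minimal sketch: surface-overlap metrics between an LLM review and a human review.
# Assumes `pip install rouge-score sacrebleu`; inputs are plain strings.
import sacrebleu
from rouge_score import rouge_scorer

def overlap_scores(generated: str, human: str) -> dict:
    rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge_f1 = {name: s.fmeasure for name, s in rouge.score(human, generated).items()}
    bleu = sacrebleu.sentence_bleu(generated, [human]).score
    return {"bleu": bleu, **rouge_f1}

print(overlap_scores(
    "The paper proposes a retrieval-augmented reviewing pipeline...",
    "This submission introduces a retrieval-based review assistant...",
))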
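
For Goal 3, a sketch of parameter-efficient fine-tuning with Hugging Face transformers and peft; the Qwen checkpoint, LoRA hyperparameters, and data format are assumptions to be revisited, and the actual training loop (e.g., trl's SFTTrainer over prompt/review pairs) is omitted.

# Minimal sketch: attach LoRA adapters to an open-source model before supervised
# fine-tuning on (paper, review) pairs. Checkpoint and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training itself (e.g., trl's SFTTrainer over prompt/review pairs) is omitted here.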
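
For Goal 4, a generic sparse autoencoder in PyTorch as a starting point; the dictionary width, sparsity penalty, and which layer's activations to train on are open design choices rather than settled details of this proposal.

# Minimal sketch: a sparse autoencoder over cached model activations, trained with an
# L1 sparsity penalty. Sizes and the penalty weight are illustrative only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))
        return self.decoder(features), features

sae = SparseAutoencoder(d_model=4096, d_dict=16384)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(64, 4096)  # stand-in for cached LLM activations
reconstruction, features = sae(activations)
loss = nn.functional.mse_loss(reconstruction, activations) + 1e-3 * features.abs().mean()
loss.backward()
optimizer.step()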

How will this funding be used?

  • Computational Resources: Rent GPU instances for training and inference ($3,000-$5,000)

  • API usage: Calls to LLM providers ($1,000-$2,000)

  • Dissemination: Conference fees or publication costs ($500-$1,000)

  • Cost of living in a medium-cost-of-living (MCOL) area: $3,000-$6,000 per month

Who is on your team and what's your track record on similar projects?

Dmitrii Magas, ML SWE @ Lyft, https://eamag.me/

Similar projects: https://openreview-copilot.eamag.me/, https://eamag.me/2024/Automated-Paper-Classification

What are the most likely causes and outcomes if this project fails? (premortem)

  • LLMs fail to match human review depth.

  • Fine-tuning yields minimal gains.

  • Interpretability tools lack actionable insights.

Comments

Similar projects

  • Jim Maar, "Implicit planning in LLMs Paper": Reproducing the Claude poetry planning results quantitatively (Technical AI safety, $1K raised)

  • Laurence Aitchison, "Bayesian modelling of LLM capabilities from evals" ($32K raised)

  • Francisco Carvalho, "Survey for LLM Self-Knowledge and Coordination Practices" (Science & technology, AI governance, $0 raised)

  • Scott Viteri, "Attention-Guided-RL for Human-Like LMs": Compute Funding (Technical AI safety, $3.1K raised)

  • Francisco Carvalho, "Survey for LLM Self-Knowledge and Coordination Practices" ($10K raised)

  • Dan Elton, "The Metascience Observatory": AI pipelines to conduct metascience research at scale, mapping how reproducibility and rigor vary across fields, countries, institutions, and journals (ACX Grants 2025, $25K raised)

  • Kabir Kumar, "AI-Plans.com" (Science & technology, Technical AI safety, AI governance, $5.37K raised)