Research on AI-Powered Peer Review: Evaluating LLMs for Academic Feedback

Science & technology

Dmitrii

Not funded (Grant)
$0 raised

Project summary

This project investigates whether large language models (LLMs) can generate high-quality reviews for academic papers on platforms like OpenReview and arXiv. We’ll compare closed-source models (e.g., Gemini, Claude) with open-source models (e.g., Qwen), fine-tune local models on existing reviews, and use interpretability tools to understand what drives effective feedback.
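
To make the generation step concrete, here is a minimal sketch of prompting a model to review a paper through an OpenAI-compatible chat endpoint (which hosted providers and local servers such as vLLM-served Qwen can both expose); the model name, prompt wording, and generate_review helper are illustrative assumptions, not a fixed design. Pointing the same helper at different base URLs is one simple way to compare closed- and open-source models on identical papers.

# Minimal sketch: generate a paper review via an OpenAI-compatible chat API.
# Model name, prompt, and helper are placeholders, not the project's final design.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY; base_url can point at a local server instead

REVIEW_PROMPT = (
    "You are an expert reviewer for a machine learning venue. Write a structured "
    "review (summary, strengths, weaknesses, questions, rating) for this paper:\n\n{paper}"
)

def generate_review(paper_text: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REVIEW_PROMPT.format(paper=paper_text)}],
        temperature=0.2,
    )
    return response.choices[0].message.content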

What are this project's goals and how will you achieve them?

  • Goal 1: Assess Review Quality

    • Use OpenReview papers and their human-written reviews as a benchmark.

    • Generate reviews with LLMs and compare them to human reviews using BLEU, ROUGE, and human evaluations for insightfulness (a scoring sketch follows this list).

  • Goal 2: Compare Model Types

    • Evaluate reviews from closed-source (Gemini, Claude) and open-source (Qwen, Llama) models on identical papers.

    • Identify trade-offs in performance, cost, and accessibility.

  • Goal 3: Enhance Reviews via Fine-Tuning

    • Fine-tune an open-source model (e.g., Qwen) on OpenReview review data (a fine-tuning sketch follows this list).

    • Measure improvements in review quality post-fine-tuning.

  • Goal 4: Interpret Review Drivers

    • Apply sparse autoencoders to pinpoint paper elements (e.g., abstract, methods) most influencing LLM-generated reviews (a minimal SAE sketch follows this list).
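
For Goal 1, a minimal scoring sketch, assuming the rouge-score and sacrebleu packages and that each LLM review is already paired with its human counterpart as plain text; BLEU and ROUGE only measure surface overlap, so the human insightfulness ratings are collected separately.

# Minimal sketch: surface-overlap metrics between an LLM review and a human review.
# Assumes `pip install rouge-score sacrebleu`; inputs are plain strings.
import sacrebleu
from rouge_score import rouge_scorer

def overlap_scores(generated: str, human: str) -> dict:
    rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge_f1 = {name: s.fmeasure for name, s in rouge.score(human, generated).items()}
    bleu = sacrebleu.sentence_bleu(generated, [human]).score
    return {"bleu": bleu, **rouge_f1}

print(overlap_scores(
    "The paper proposes a retrieval-augmented reviewing pipeline...",
    "This submission introduces a retrieval-based review assistant...",
))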
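
For Goal 3, a sketch of parameter-efficient fine-tuning with Hugging Face transformers and peft; the Qwen checkpoint, LoRA hyperparameters, and data format are assumptions to be revisited, and the actual training loop (e.g., trl's SFTTrainer over prompt/review pairs) is omitted.

# Minimal sketch: attach LoRA adapters to an open-source model before supervised
# fine-tuning on (paper, review) pairs. Checkpoint and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training itself (e.g., trl's SFTTrainer over prompt/review pairs) is omitted here.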
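
For Goal 4, a generic sparse autoencoder in PyTorch as a starting point; the dictionary width, sparsity penalty, and which layer's activations to train on are open design choices rather than settled details of this proposal.

# Minimal sketch: a sparse autoencoder over cached model activations, trained with an
# L1 sparsity penalty. Sizes and the penalty weight are illustrative only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))
        return self.decoder(features), features

sae = SparseAutoencoder(d_model=4096, d_dict=16384)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(64, 4096)  # stand-in for cached LLM activations
reconstruction, features = sae(activations)
loss = nn.functional.mse_loss(reconstruction, activations) + 1e-3 * features.abs().mean()
loss.backward()
optimizer.step()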

How will this funding be used?

  • Computational Resources: Rent GPU instances for training and inference ($3,000-$5,000)

  • API usage: Calls to LLM providers ($1,000-$2,000)

  • Dissemination: Conference fees or publication costs ($500-$1,000)

  • Cost of living in a medium-cost-of-living (MCOL) area: $3,000-$6,000 per month

Who is on your team and what's your track record on similar projects?

Dmitrii Magas, ML SWE @ Lyft, https://eamag.me/

Similar projects: https://openreview-copilot.eamag.me/, https://eamag.me/2024/Automated-Paper-Classification

What are the most likely causes and outcomes if this project fails? (premortem)

  • LLMs fail to match human review depth.

  • Fine-tuning yields minimal gains.

  • Interpretability tools lack actionable insights.

Comments

Similar projects

  • Jim Maar, "Implicit planning in LLMs Paper": Reproducing the Claude poetry planning results quantitatively (Technical AI safety, $1K raised)

  • Laurence Aitchison, "Bayesian modelling of LLM capabilities from evals" ($32K raised)

  • Francisco Carvalho, "Survey for LLM Self-Knowledge and Coordination Practices" (Science & technology, AI governance, $0 raised)

  • Scott Viteri, "Attention-Guided-RL for Human-Like LMs": Compute Funding (Technical AI safety, $3.1K raised)

  • Francisco Carvalho, "Survey for LLM Self-Knowledge and Coordination Practices" ($10K raised)

  • Dan Elton, "The Metascience Observatory": AI pipelines to conduct metascience research at scale, mapping how reproducibility and rigor vary across fields, countries, institutions, and journals (ACX Grants 2025, $25K raised)

  • Kabir Kumar, "AI-Plans.com" (Science & technology, Technical AI safety, AI governance, $5.37K raised)