This project investigates whether large language models (LLMs) can generate high-quality reviews for academic papers on platforms like OpenReview and arXiv. We’ll compare closed-source models (e.g., Gemini, Claude) with open-source models (e.g., Qwen), fine-tune local models on existing reviews, and use interpretability tools to understand what drives effective feedback.
Goal 1: Assess Review Quality
Use OpenReview papers and their human-written reviews as a benchmark.
Generate reviews with LLMs and compare them to the human reviews using BLEU, ROUGE, and human evaluation of insightfulness.
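As a rough sketch of the scoring step, the snippet below compares one LLM-generated review against its human-written reference with BLEU and ROUGE. It assumes the sacrebleu and rouge-score packages; the example strings and the score_review helper are placeholders, not final evaluation code.

```python
# Hedged sketch: score one generated review against a human reference.
import sacrebleu
from rouge_score import rouge_scorer

def score_review(generated: str, human_reference: str) -> dict:
    """Return BLEU and ROUGE-1/ROUGE-L F1 for a single generated review."""
    bleu = sacrebleu.corpus_bleu([generated], [[human_reference]]).score
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(human_reference, generated)
    return {
        "bleu": bleu,
        "rouge1_f": rouge["rouge1"].fmeasure,
        "rougeL_f": rouge["rougeL"].fmeasure,
    }

# Example usage with placeholder review texts:
print(score_review(
    "The method is novel but the evaluation lacks ablations.",
    "Novel approach; however, the experiments omit ablation studies.",
))
```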
Goal 2: Compare Model Types
Evaluate reviews from closed-source (Gemini, Claude) and open-source (Qwen, Llama) models on identical papers.
Identify trade-offs in performance, cost, and accessibility.
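A minimal sketch of how the head-to-head comparison could be run, assuming each provider (or a local server hosting Qwen/Llama, e.g. vLLM or Ollama) exposes an OpenAI-compatible chat endpoint. The model names, base URLs, and API key below are illustrative placeholders, not the final setup.

```python
# Hedged sketch: send the same review prompt to several models for side-by-side comparison.
from openai import OpenAI

ENDPOINTS = {
    "closed-example": {"base_url": "https://api.example-provider.com/v1", "model": "closed-model-name"},
    "open-example":   {"base_url": "http://localhost:8000/v1",            "model": "qwen-local"},
}

PROMPT_TEMPLATE = (
    "You are a peer reviewer. Write a structured review (summary, strengths, "
    "weaknesses, questions) for the following paper:\n\n{paper_text}"
)

def collect_reviews(paper_text: str) -> dict:
    """Return {endpoint name: generated review} for one paper."""
    reviews = {}
    for name, cfg in ENDPOINTS.items():
        client = OpenAI(base_url=cfg["base_url"], api_key="YOUR_KEY")  # placeholder key
        resp = client.chat.completions.create(
            model=cfg["model"],
            messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(paper_text=paper_text)}],
        )
        reviews[name] = resp.choices[0].message.content
    return reviews
```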
Goal 3: Enhance Reviews via Fine-Tuning
Fine-tune an open-source model (e.g., Qwen) on OpenReview review data.
Measure improvements in review quality post-fine-tuning.
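One possible shape for the fine-tuning step, sketched as LoRA fine-tuning via Hugging Face datasets/peft/trl. The model id, data file, single "text"-field format, and hyperparameters are assumptions, and exact trainer argument names can vary across trl versions.

```python
# Hedged sketch: LoRA fine-tuning of an open model on (paper, review) pairs from OpenReview.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Each JSONL row is assumed to hold one "text" field:
# "<paper abstract/body>\n\n### Review:\n<human review>"
train_data = load_dataset("json", data_files="openreview_pairs.jsonl", split="train")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed HF model id; swap for the chosen checkpoint
    train_dataset=train_data,
    peft_config=lora,
    args=SFTConfig(
        output_dir="qwen-review-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
```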
Goal 4: Interpret Review Drivers
Apply sparse autoencoders to pinpoint paper elements (e.g., abstract, methods) most influencing LLM-generated reviews.
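To make the interpretability plan concrete, here is a minimal sparse-autoencoder sketch over cached LLM activations. The dimensions, L1 penalty weight, and random stand-in activations are illustrative; in practice the inputs would be activations harvested from a chosen transformer layer while the model reads different paper sections.

```python
# Hedged sketch: a minimal sparse autoencoder (SAE) over cached activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps the latent code non-negative; with the L1 term it becomes sparse.
        latents = torch.relu(self.encoder(x))
        recon = self.decoder(latents)
        return recon, latents

def sae_loss(recon, latents, x, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the latent code.
    return ((recon - x) ** 2).mean() + l1_coeff * latents.abs().mean()

# Toy training step on random vectors standing in for cached residual-stream activations.
sae = SparseAutoencoder(d_model=768, d_hidden=8 * 768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 768)  # placeholder batch of activations
opt.zero_grad()
recon, latents = sae(acts)
loss = sae_loss(recon, latents, acts)
loss.backward()
opt.step()
```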
Computational Resources: Rent GPU instances for training and inference ($3,000-$5,000)
API Costs: Calls to LLM providers ($1,000-$2,000)
Dissemination: Conference fees or publication costs ($500-$1,000)
Cost of living in a medium-cost-of-living (MCOL) area: $3,000-$6,000 per month
Dmitrii Magas, ML SWE @ Lyft, https://eamag.me/
Similar projects: https://openreview-copilot.eamag.me/, https://eamag.me/2024/Automated-Paper-Classification
LLMs fail to match human review depth.
Fine-tuning yields minimal gains.
Interpretability tools lack actionable insights.