The Raspberry project aims to create an initial open-source dataset of 1,000 chain-of-thought (CoT) reasoning data points, with potential scaling to more samples through additional funding milestones. These data points will enhance the reasoning capabilities of Large Language Models (LLMs). Using a variety of methods—including multi-agent frameworks and extraction from research papers—we will generate and compile reasoning data. The dataset will be used to finetune an open-source LLM and evaluate its performance on established benchmarks such as SIMPLE Bench and AIME. This initiative seeks to democratize AI research by making the dataset, the finetuned open-source models, and the data synthesis methodology available to the broader community.
What are this project's goals? How will you achieve them?
Goals:
Create an Open-Source Finetuning Dataset: Generate an initial dataset of 1,000 chain-of-thought (CoT) query-answer pairs that exercise the "test-time compute" scaling paradigm, similar to OpenAI's Strawberry and other models such as Qwen. This will be achieved through a multi-agent approach that synthesizes diverse, high-quality reasoning examples.
Demonstrate Near-SOTA Performance on Open-Source LLMs: Finetune open-source LLMs on the dataset to show that finetuning alone, without the expensive lift of full reinforcement learning runs, can reach near-state-of-the-art performance on reasoning benchmarks. The approach assumes the underlying model is already capable enough, and that finetuning is sufficient to surface its latent reasoning ability.
Democratize Access to Reasoning Models: Make the dataset, the finetuned models, and the data synthesis methodology publicly available, empowering the broader AI community not only to research but also to build powerful and effective reasoning models.
Methods:
Data Generation: Employ two complementary approaches to create the dataset:
Synthetic Generation (Subproject: o7): Built on the AgentForge framework and inspired by the Dignity project, o7 leverages advanced multi-prompt reasoning techniques such as chain-of-thought (CoT), reflection, and theory of mind capabilities. It employs a structured, multi-agent workflow to tackle complex problems systematically, producing high-quality query-answer pairs in JSONL format. By breaking down tasks into logical steps and ensuring transparency through explainable AI, o7 generates data that captures diverse problem-solving strategies.
(https://github.com/DataBassGit/o7)
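To make the JSONL output concrete, here is a minimal sketch of what a single query-answer pair could look like. The field names (query, chain_of_thought, answer) are illustrative placeholders rather than o7's actual schema, which is defined in the repository above.

```python
import json

# Hypothetical record layout for one CoT training example; o7's real schema
# (field names, metadata) may differ.
example = {
    "query": "A train departs at 3:40 pm and arrives at 6:10 pm. How long is the trip?",
    "chain_of_thought": [
        "From 3:40 pm to 5:40 pm is 2 hours.",
        "From 5:40 pm to 6:10 pm is another 30 minutes.",
        "Total travel time is 2 hours + 30 minutes = 2 hours 30 minutes.",
    ],
    "answer": "2 hours and 30 minutes",
}

# JSONL stores one JSON object per line, which keeps large datasets streamable.
with open("raspberry_cot.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```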
Extraction from Research Papers (Subproject: Papers-to-CoT Pipeline): This automated pipeline processes academic papers sourced from arXiv.org, transforming them into high-quality CoT training data. The 10-stage pipeline includes paper profiling, reasoning extraction, multi-stage refinement, and rigorous quality scoring. It efficiently scales dataset creation while ensuring diversity, verifiability, and expert-level reasoning. This method captures domain-specific expertise from over 60 academic disciplines, enriching the dataset with real-world complexity and depth.
(https://github.com/thehunmonkgroup/raspberry-paper-to-cot-pipeline)
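As a rough illustration of the staged design, the sketch below compresses the flow into three illustrative stages with placeholder logic; the real implementation has ten stages with its own names, prompts, and storage (see the repository above).

```python
# Hypothetical, heavily simplified sketch of the staged paper-to-CoT flow.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class CoTRecord:
    paper_id: str
    question: str
    reasoning: list[str] = field(default_factory=list)
    answer: str = ""
    quality_score: float = 0.0

def profile_paper(paper_text: str) -> dict:
    """Early stage (illustrative): decide whether a paper is worth processing."""
    return {"suitable": len(paper_text.split()) > 500}

def extract_cot(paper_id: str, paper_text: str) -> CoTRecord:
    """Middle stages (illustrative): LLM calls would extract question, reasoning, answer."""
    return CoTRecord(paper_id, question="...", reasoning=["..."], answer="...")

def score_quality(record: CoTRecord) -> CoTRecord:
    """Final stages (illustrative): rubric-based grading assigns a keep/discard score."""
    record.quality_score = 0.0  # would be set by an LLM grader in practice
    return record

def run_pipeline(paper_id: str, paper_text: str) -> CoTRecord | None:
    """Chain the stages, dropping papers that fail early checks."""
    if not profile_paper(paper_text)["suitable"]:
        return None
    return score_quality(extract_cot(paper_id, paper_text))
```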
Quality Assurance: Use grading techniques, such as rubrics, to assess the quality of the data samples and enhance their coherence and usability. Each dataset entry undergoes multiple quality control checkpoints to ensure consistency and reliability.
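For illustration, rubric-based grading can be as simple as a weighted score over a few criteria with a keep/discard threshold. The criteria, weights, and threshold below are hypothetical, not the project's actual rubric:

```python
# Minimal sketch of rubric-based grading; criteria, weights, and threshold are
# illustrative placeholders.
RUBRIC = {
    "correctness": 0.4,      # final answer is verifiably right
    "step_coherence": 0.3,   # each step follows from the previous one
    "completeness": 0.2,     # no hidden leaps in the reasoning chain
    "clarity": 0.1,          # readable, unambiguous wording
}

def rubric_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-1) into a weighted total."""
    return sum(RUBRIC[name] * criterion_scores.get(name, 0.0) for name in RUBRIC)

def passes_qc(criterion_scores: dict[str, float], threshold: float = 0.8) -> bool:
    """Keep a sample only if its weighted score clears the threshold."""
    return rubric_score(criterion_scores) >= threshold

# Example: per-criterion scores would come from an LLM grader or a human spot check.
print(passes_qc({"correctness": 1.0, "step_coherence": 0.9,
                 "completeness": 0.8, "clarity": 0.7}))
```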
Model Finetuning: Finetune multiple LLMs—including open-source models like Llama and closed-source models like OpenAI GPT-4 and Claude—using the curated dataset. While we will finetune both types, only the finetuned open-source models can be published. We will compare baseline reasoning and math performance against their finetuned versions to assess the dataset's effectiveness.
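As a sketch of what the open-source finetuning step might look like, the snippet below uses parameter-efficient LoRA adapters via Hugging Face transformers and peft. The base model name, hyperparameters, and dataset path are placeholders, and the final training loop is omitted:

```python
# Minimal LoRA finetuning sketch using Hugging Face transformers, datasets, and
# peft. The base model, hyperparameters, and file path are placeholders; some
# base models also require gated access.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-3.1-8B"  # placeholder open-source base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA freezes the base weights and trains small adapter matrices, which keeps
# the cost of finetuning far below full-parameter training.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
)

# Render each JSONL record into a single training string (prompt + reasoning + answer).
dataset = load_dataset("json", data_files="raspberry_cot.jsonl")["train"]

def to_text(example):
    steps = "\n".join(example["chain_of_thought"])
    return {"text": f"Question: {example['query']}\n{steps}\nAnswer: {example['answer']}"}

dataset = dataset.map(to_text)
# From here, a standard supervised finetuning loop (for example trl's SFTTrainer)
# trains on the "text" field, and the adapter weights are saved for evaluation.
```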
Performance Evaluation: Evaluate the finetuned models against baseline performance on established benchmarks, such as SIMPLE Bench and AIME, to measure improvements in reasoning capabilities and quantify the gains achieved through finetuning.
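The comparison itself reduces to measuring per-benchmark accuracy before and after finetuning and reporting the delta. A minimal sketch, with a placeholder exact-match scorer standing in for the real benchmark harnesses and made-up numbers purely to show the report format:

```python
# Illustrative before/after comparison; the exact-match scorer stands in for the
# real SIMPLE Bench and AIME harnesses, and the numbers below are made up.
def accuracy(predictions: list[str], gold: list[str]) -> float:
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return correct / len(gold)

def report(benchmark: str, baseline: float, finetuned: float) -> None:
    print(f"{benchmark}: baseline={baseline:.1%} finetuned={finetuned:.1%} "
          f"delta={finetuned - baseline:+.1%}")

# Placeholder numbers purely to show the report format, not real results.
report("AIME (subset)", baseline=0.10, finetuned=0.16)
```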
Deployment and Delivery:
The outputs of this project—including the dataset, synthesis methodology, finetuned open-source models, and supporting code—will be published under an open-source MIT license. They will be made freely available on:
GitHub: The primary repository for all code, datasets, and documentation, ensuring accessibility and ease of collaboration.
Kaggle: A platform for hosting the dataset and analysis, facilitating access for researchers and practitioners who rely on data-driven insights.
This ensures transparency, reproducibility, and scalability, demonstrating our commitment to fostering collaboration and democratizing access to advanced AI reasoning research.
How will this funding be used?
The funding will be primarily allocated to LLM inference costs required for generating the training data and finetuning the model. This includes expenses associated with computational resources and accessing higher-capacity models for data generation. All code development and implementation efforts are being contributed voluntarily by the team members, highlighting our dedication to making this project a reality.
Who is on your team? What's your track record on similar projects?
Team Members:
John Smith - Infosys Analyst
Ansel Anselmi - Computer Science Engineer
Chad Phillips - Systems Engineer
Track Record:
AgentForge: Developed a low-code framework for rapid development and testing of AI-powered autonomous agents and cognitive architectures.
(https://github.com/DataBassGit/AgentForge)
Dignity: Built a Discord chatbot using advanced retrieval augmented generation, incorporating reflection and multi-prompt chain-of-thought techniques.
(https://github.com/anselale/Dignity)
ACE Framework: Pioneered a six-layer cognitive architecture focusing on ethics-first AI development.
(https://github.com/daveshap/ACE_Framework)
ETHOS: Created an AI alignment analysis architecture that won a hackathon; designed to evaluate and ensure AI systems align with intended purposes.
(https://lablab.ai/event/autonomous-gpt-agents-hackathon/cogark/ethos)
LLM Workflow Engine: Created a CLI and workflow manager for LLMs, facilitating the design of AI pipelines.
(https://github.com/llm-workflow-engine/llm-workflow-engine)
Open-Source Contributions: Participated in numerous other open-source AI projects, enhancing tools and frameworks in the community.
Collaborative Expertise:
This team has worked together extensively across multiple projects, building a strong collaborative dynamic and a shared understanding of what it takes to execute complex AI initiatives. Experience developing AI tools, frameworks, and cognitive systems, including AgentForge, ETHOS, and the ACE Framework, directly informs Raspberry's methodology, particularly the design of robust multi-agent frameworks, data synthesis strategies, and performance evaluation pipelines, and positions the team to deliver a scalable, impactful dataset and model.
What are the most likely causes and outcomes if this project fails?
Potential Causes of Failure:
Insufficient Funding: Limited resources may constrain the quantity or quality of generated data. We have already invested significant time, effort, and hundreds of dollars of our own money into the project and have made substantial progress. However, funding for LLM inference tokens remains the largest bottleneck to further progress, as generating high-quality synthetic data is computationally expensive.
Data Quality Challenges: Ensuring the dataset meets the high standards required for effective model training can be challenging. Funding can facilitate the development of more robust data cleaning and integrity pipelines, which will help scale the system to generate arbitrarily large amounts of synthetic data. While this approach reduces the need for human labor, it still requires access to sufficient tokens for data generation.
Model Performance Limitations: The finetuned models may not exhibit significant improvements due to unforeseen technical challenges. We anticipate potential limitations, especially with smaller models, but aim to demonstrate statistically significant gains in reasoning ability. With additional data and funding, we could also finetune larger models, increasing the likelihood of meaningful results.
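One way to check for a statistically significant improvement is a paired bootstrap over per-item benchmark results, resampling questions with replacement and estimating how often the finetuned model fails to beat the baseline. The sketch below is illustrative; the resample count and toy data are placeholders:

```python
import random

# Paired bootstrap over benchmark items; `baseline` and `finetuned` are 0/1
# correctness lists aligned on the same questions. Illustrative only.
def paired_bootstrap_p(baseline: list[int], finetuned: list[int], resamples: int = 10_000) -> float:
    """Estimate how often a resampled finetuned score fails to exceed the baseline."""
    n = len(baseline)
    not_better = 0
    for _ in range(resamples):
        idx = [random.randrange(n) for _ in range(n)]
        if sum(finetuned[i] for i in idx) <= sum(baseline[i] for i in idx):
            not_better += 1
    return not_better / resamples

# Toy data: 1 = correct answer, 0 = incorrect, per benchmark question.
base = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0] * 10
tuned = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1] * 10
print(paired_bootstrap_p(base, tuned))  # a small value suggests a genuine improvement
```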
Contingencies and Merits of the Project:
Even if the project does not significantly improve model reasoning capabilities, it remains highly meritorious for the following reasons:
Reusable Data Synthesis Pipelines: The data synthesis pipelines developed in this project will be valuable tools for a variety of AI research initiatives, providing a scalable method for generating synthetic data.
Open Sharing of Resources: All synthetic data, along with the associated code and documentation, will be publicly available on platforms like Kaggle. This ensures that other teams can learn from and build upon our work, advancing the broader research community.
Learning Through Challenges: Understanding what does not work can be as valuable as discovering what does. By sharing our findings, we can save other teams time and resources, enabling the community to focus on more promising avenues for improving AI reasoning capabilities.
How much money have you raised in the last 12 months, and from where?
We have not raised any funds in the last 12 months. The project has been entirely self-funded and built on volunteer time, demonstrating our commitment and seriousness about its success.
With Minimum Funding ($2,500): This funding will enable us to generate an initial dataset of 1,000 high-quality CoT reasoning data points using a combination of synthetic generation and extraction from research papers. With these resources, we will finetune an open-source LLM and conduct preliminary evaluations of improvements in reasoning capabilities. The results will provide a proof of concept, demonstrating the potential of this methodology and paving the way for further scaling.
With Full Funding ($10,000): We can expand the dataset, enhance diversity and complexity, perform more extensive finetuning, and conduct thorough evaluations using a broader set of benchmarks.