TimeAlign v2 builds on my NeurIPS ER and Responsible Foundation Models workshop paper to create a maintained, contamination-aware evaluation toolkit for small and resource-constrained foundation models. It unifies temporal screening, near-duplicate removal, and calibration and risk-coverage reporting into one reproducible workflow that runs on 16 GB GPUs.
The goal is to make clean evaluation a default practice rather than a custom research script. In the NeurIPS paper, contamination inflated benchmark accuracy by 74.5 percentage points, while a clean evaluation ran in under eight hours with less than two percent overhead, showing that contamination correction is both urgent and practical.
Paper link: https://openreview.net/pdf?id=t5N0fM3Y5T
Goal 1: Make clean evaluation simple and repeatable.
Ship a stable command-line interface and configs that reproduce the TimeAlign pipeline with temporal cutoffs, shingle-based near-duplicate detection, calibration, and risk-coverage metrics.
Goal 2: Create a standard contamination-card artifact.
Each run will output a structured summary of dataset versions, screening sources, items removed, and representative overlaps.
Goal 3: Test generality and reliability.
Replicate results on multiple small models and quantization settings, integrate with at least one evaluation harness, and publish complete artifacts.
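As an illustration of Goal 2, a contamination card could be sketched as a small serialisable record. This is a minimal sketch and not the final schema: the class names, field names, and every value in the example (dataset name, version, counts, IDs, overlap score) are hypothetical placeholders chosen only to mirror the four elements listed above (dataset versions, screening sources, items removed, representative overlaps).

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Overlap:
    """One representative benchmark/corpus overlap (hypothetical fields)."""
    benchmark_item_id: str
    source_doc_id: str
    jaccard: float


@dataclass
class ContaminationCard:
    """Structured summary of one screening run (hypothetical schema)."""
    dataset_versions: dict[str, str]
    screening_sources: list[str]
    items_total: int
    items_removed: int
    representative_overlaps: list[Overlap] = field(default_factory=list)

    @property
    def removal_rate(self) -> float:
        # Guard against an empty benchmark.
        return self.items_removed / self.items_total if self.items_total else 0.0

    def to_json(self) -> str:
        payload = asdict(self)  # nested dataclasses become plain dicts
        payload["removal_rate"] = round(self.removal_rate, 4)
        return json.dumps(payload, indent=2, sort_keys=True)


# Placeholder values for illustration only; not results from any real run.
card = ContaminationCard(
    dataset_versions={"example_benchmark": "v1.0"},
    screening_sources=["example_corpus_snapshot"],
    items_total=1000,
    items_removed=37,
    representative_overlaps=[Overlap("item-0001", "doc-42", 0.91)],
)
card_json = card.to_json()
```

Emitting the card as sorted, indented JSON keeps it diff-friendly, so two runs on different dataset snapshots can be compared line by line.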
Execution plan:
I will lead design and quality control and contract two part-time freelancers: one ML-tooling engineer for integration, packaging, and CI, and one data or pipeline engineer for screening and reproducibility. This ensures production-quality delivery without single-person dependence.
The total requested amount is 46,000 USD, with a minimum funding threshold of 22,000 USD.
Freelancers (17,000 USD): Two part-time contractors for about 300 hours total. One focuses on ML tooling and CI, the other on data screening and reproducibility.
Cloud compute credits (13,000 USD): GPU, API, and storage costs for replications using providers such as RunPod or Lambda. Cloud credits keep costs proportional to usage.
My runway (11,000 USD): Four to five months of full-time work.
Continuous integration and hosting (2,000 USD): Persistent dataset snapshots, artifact hosting, and CI infrastructure for reproducible runs.
Outreach and adoption (1,000 USD): User onboarding and limited travel to promote adoption.
Buffer and fees (1,500 USD)
Funding Scenarios:
At 46,000 USD (Full Goal): I deliver the full scope: polished CLI, Contamination Cards, deep harness integrations, and extensive model replications.
At 22,000 USD (Minimum): I will reduce freelancer hours by ~50% and limit compute usage. I will prioritize shipping the core CLI and Contamination Cards over broader model replications and integrations.
Team
Independent researcher (project lead) with part-time freelance engineering and documentation support.
Track record
TimeAlign (NeurIPS ER Workshop 2025) introduced the contamination-aware evaluation pipeline with temporal screening, shingle-Jaccard decontamination, and quantization-aware calibration, showing large contamination inflation and reproducible, resource-efficient evaluations.
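For readers unfamiliar with the two screening steps, the sketch below shows one way temporal screening and shingle-Jaccard decontamination can be combined. It is a simplified illustration, not the TimeAlign implementation: the function names, the word-level shingle size `k`, the `threshold` value, the item fields, and the rule "drop items published on or before the cutoff" are all assumptions made for this example.

```python
from datetime import date


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Word-level k-shingles of a lowercased, whitespace-tokenised text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def screen(items, corpus, cutoff: date, k: int = 3, threshold: float = 0.8):
    """Split item IDs into (kept, removed).

    An item is removed if it was published on or before the cutoff
    (assumed temporal rule) or if it near-duplicates any corpus document.
    """
    corpus_shingles = [shingles(doc, k) for doc in corpus]
    kept, removed = [], []
    for item in items:
        too_old = item["published"] <= cutoff
        item_shingles = shingles(item["text"], k)
        duplicate = any(
            jaccard(item_shingles, cs) >= threshold for cs in corpus_shingles
        )
        (removed if too_old or duplicate else kept).append(item["id"])
    return kept, removed


# Toy data for illustration only.
corpus = ["the quick brown fox jumps over the lazy dog"]
items = [
    {"id": "a", "text": "The quick brown fox jumps over the lazy dog",
     "published": date(2025, 1, 1)},   # near-duplicate of a corpus document
    {"id": "b", "text": "an entirely different question about tides",
     "published": date(2023, 1, 1)},   # published before the cutoff
    {"id": "c", "text": "an entirely different question about tides",
     "published": date(2025, 1, 1)},   # passes both screens
]
kept, removed = screen(items, corpus, cutoff=date(2024, 6, 1))
```

This pairwise comparison is quadratic in the number of items and documents; a production pipeline would more plausibly use an index such as MinHash, which this sketch does not attempt.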
Risks
Integration may take longer than expected.
The tool could be correct but inconvenient to use.
Shingle-based matching might miss paraphrased or cross-lingual leakage.
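The last risk can be made concrete with a toy example. The sketch below uses a simplified word-level shingle-Jaccard score (an assumption for illustration, not the TimeAlign code) and made-up sentences: a verbatim copy is flagged, while a paraphrase of the same question shares no three-word shingle with it.

```python
def word_shingles(text, k=3):
    """Set of word-level k-shingles after lowercasing."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets, 0.0 when both are empty."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


original = "What is the capital city of France"
verbatim_copy = "what is the capital city of france"
paraphrase = "Which city serves as the capital of France"

copy_score = jaccard(word_shingles(original), word_shingles(verbatim_copy))
paraphrase_score = jaccard(word_shingles(original), word_shingles(paraphrase))
```

Here `copy_score` is 1.0 and `paraphrase_score` is 0.0, so any positive similarity threshold would catch the copy and miss the paraphrase.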
Outcomes
Even with limited adoption, the project will yield a fully documented, reproducible implementation of the NeurIPS pipeline and open artifacts for future work. The short timeline, the use of two freelancers, and the existing codebase reduce overall risk.
0 USD. No prior funding for this specific project.
February 2026: Set up the project infrastructure and hire freelancers. Reorganise the existing TimeAlign codebase to prepare for new components.
March 2026: Develop the contamination-card module and connect temporal screening to the main pipeline. Enable continuous integration and start small model replications.
April 2026: Add calibration and risk-coverage features. Complete documentation, publish the release, and involve early users for testing and feedback.
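To indicate what the April calibration and risk-coverage features would report, here is a minimal pure-Python sketch of two standard metrics: a risk-coverage curve and expected calibration error (ECE) with equal-width bins. The function names, the bin count, and the toy inputs are assumptions for illustration; the sketch does not cover the quantization-aware aspects of the paper.

```python
def risk_coverage(confidences, correct):
    """Return (coverage, risk) points, answering the most confident items first.

    Coverage is the fraction of items answered; risk is the error rate
    among the answered items.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    points, errors = [], 0
    for n, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((n / len(order), errors / n))
    return points


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |average confidence - accuracy|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Toy predictions for illustration only.
confidences = [0.95, 0.85, 0.65, 0.55]
correct = [True, True, False, True]

curve = risk_coverage(confidences, correct)
ece = expected_calibration_error(confidences, correct)
```

On this toy input the curve starts at full accuracy for the most confident quarter of items and ends at a risk of 0.25 at full coverage.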