TimeAlign v2: contamination-aware evals for small models (16GB GPUs)

Project Summary

TimeAlign v2 builds on my NeurIPS ER and Responsible Foundation Models workshop paper to create a maintained, contamination-aware evaluation toolkit for small and resource-constrained foundation models. It unifies temporal screening, near-duplicate removal, and calibration and risk-coverage reporting into one reproducible workflow that runs on 16 GB GPUs.

The goal is to make clean evaluation a default practice rather than a custom research script. In the NeurIPS paper, contamination inflated benchmark accuracy by 74.5 percentage points, while a clean evaluation ran in under eight hours with less than two percent overhead, showing that contamination correction is both urgent and practical.

Paper link: https://openreview.net/pdf?id=t5N0fM3Y5T

What are this project’s goals? How will you achieve them?

Goal 1: Make clean evaluation simple and repeatable.

Ship a stable command-line interface and configs that reproduce the TimeAlign pipeline with temporal cutoffs, shingle-based near-duplicate detection, calibration, and risk-coverage metrics.
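To make the screening step concrete, here is a minimal sketch of a temporal cutoff combined with shingle-Jaccard near-duplicate detection. It is an illustration under stated assumptions, not the TimeAlign implementation: the EvalItem fields, the screen function, the 5-word shingle size, and the 0.5 threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvalItem:
    """A benchmark item. Field names are hypothetical, chosen for this sketch."""
    item_id: str
    text: str
    published: date  # date the item first appeared publicly


def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def screen(
    items: list[EvalItem],
    reference_texts: list[str],
    cutoff: date,
    k: int = 5,
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> tuple[list[EvalItem], list[tuple[EvalItem, str]]]:
    """Split items into (kept, removed-with-reason).

    Assumption: an item published on or before the model's training cutoff
    may have been seen in training, so it is removed (temporal screen).
    Remaining items are removed if their shingle set overlaps any reference
    text at or above `threshold` (near-duplicate screen).
    """
    ref_sets = [shingles(t, k) for t in reference_texts]
    kept: list[EvalItem] = []
    removed: list[tuple[EvalItem, str]] = []
    for item in items:
        if item.published <= cutoff:
            removed.append((item, "temporal"))
            continue
        item_set = shingles(item.text, k)
        best = max((jaccard(item_set, r) for r in ref_sets), default=0.0)
        if best >= threshold:
            removed.append((item, "near_duplicate"))
        else:
            kept.append(item)
    return kept, removed


# Toy data for illustration only.
items = [
    EvalItem("a", "What is the capital of France and when was it founded", date(2023, 1, 10)),
    EvalItem("b", "the quick brown fox jumps over the lazy dog near the river bank", date(2025, 3, 1)),
    EvalItem("c", "Explain why small quantized models can lose calibration under distribution shift", date(2025, 4, 2)),
]
reference = ["The quick brown fox jumps over the lazy dog near the river bank today"]
kept, removed = screen(items, reference, cutoff=date(2024, 6, 1))
```

In this toy run, item "a" predates the cutoff, item "b" shares almost all of its shingles with the reference text, and only item "c" survives. A pairwise comparison like this scales poorly to large reference corpora; an approximate index such as MinHash would be the usual substitute, which is outside the scope of this sketch.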

Goal 2: Create a standard contamination-card artifact.

Each run will output a structured summary of dataset versions, screening sources, items removed, and representative overlaps.
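As a rough illustration of what such a card could look like, the sketch below serializes the four fields named above to JSON. The schema is an assumption for discussion, not a finalized format: the class names, field names, and every value in the example are hypothetical placeholders rather than results from any run.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Overlap:
    """One representative overlap between an eval item and a screening source."""
    item_id: str
    source: str
    jaccard: float


@dataclass
class ContaminationCard:
    """Hypothetical card schema covering the fields listed in the proposal."""
    dataset_versions: dict[str, str]   # dataset name -> version tag
    screening_sources: list[str]       # corpora the items were screened against
    items_total: int                   # items before screening
    items_removed: dict[str, int]      # removal reason -> count
    representative_overlaps: list[Overlap] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the card, including nested overlaps, to stable JSON."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values for illustration only.
card = ContaminationCard(
    dataset_versions={"example-benchmark": "v1.0"},
    screening_sources=["example-web-snapshot"],
    items_total=100,
    items_removed={"temporal": 3, "near_duplicate": 2},
    representative_overlaps=[Overlap("item-0042", "example-web-snapshot", 0.9)],
)
card_json = card.to_json()
```

Sorting keys keeps the output byte-stable across runs, which makes cards easy to diff and to check into version control alongside results.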

Goal 3: Test generality and reliability.

Replicate results on multiple small models and quantization settings, integrate with at least one evaluation harness, and publish complete artifacts.

Execution plan:

I will lead design and quality control and contract two part-time freelancers: one ML-tooling engineer for integration, packaging, and CI, and one data or pipeline engineer for screening and reproducibility. This ensures production-quality delivery without single-person dependence.

How will this funding be used?

The total requested amount is 46,000 USD, with a minimum funding threshold of 22,000 USD.

Freelancers (17,000 USD): Two part-time contractors for about 300 hours total. One focuses on ML tooling and CI, the other on data screening and reproducibility.

Cloud compute credits (13,000 USD): GPU, API, and storage costs for replications using providers such as RunPod or Lambda. Cloud credits keep costs proportional to usage.

My runway (11,000 USD): Four to five months of full-time work.

Continuous integration and hosting (2,000 USD): Persistent dataset snapshots, artifact hosting, and CI infrastructure for reproducible runs.

Outreach and adoption (1,000 USD): User onboarding and limited travel to promote adoption.

Buffer and fees (1,500 USD)

Funding Scenarios:

At 46,000 USD (Full Goal): I will deliver the full scope: polished CLI, Contamination Cards, deep harness integrations, and extensive model replications.

At 22,000 USD (Minimum): I will reduce freelancer hours by about 50% and limit compute usage. I will prioritize shipping the core CLI and Contamination Cards over broader model replications and integrations.

Who is on your team? What’s your track record on similar projects?

Team

Independent researcher (project lead) with part-time freelance engineering and documentation support.

Track record

TimeAlign (NeurIPS ER Workshop 2025) introduced the contamination-aware evaluation pipeline with temporal screening, shingle-Jaccard decontamination, and quantization-aware calibration, showing large contamination inflation and reproducible, resource-efficient evaluations.

What are the most likely causes and outcomes if this project fails?

Risks

Integration may take longer than expected.

The tool could be correct but inconvenient to use.

Shingle-based matching might miss paraphrased or cross-lingual leakage.

Outcomes

Even with limited adoption, the project will yield a fully documented, reproducible implementation of the NeurIPS pipeline and open artifacts for future work. The short timeline, the two freelancers, and the existing codebase reduce overall risk.

How much money have you raised in the last 12 months, and from where?

0 USD. No prior funding for this specific project.

Timeline (starting February 2026)

February 2026: Set up the project infrastructure and hire freelancers. Reorganize the existing TimeAlign codebase to prepare for new components.

March 2026: Develop the contamination-card module and connect temporal screening to the main pipeline. Enable continuous integration and start small model replications.

April 2026: Add calibration and risk-coverage features. Complete documentation, publish the release, and involve early users for testing and feedback.
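For readers unfamiliar with the calibration and risk-coverage metrics referred to throughout, the following is a minimal sketch of the two standard quantities: expected calibration error with equal-width bins, and a risk-coverage curve obtained by answering the most confident items first. The function names, the 10-bin default, and the toy inputs are assumptions for illustration; the toolkit's actual metric definitions may differ.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def risk_coverage_curve(
    confidences: list[float], correct: list[bool]
) -> list[tuple[float, float]]:
    """Return (coverage, risk) pairs, adding items in order of decreasing confidence.

    Coverage is the fraction of items answered; risk is the error rate
    among the answered items.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    curve: list[tuple[float, float]] = []
    errors = 0
    for rank, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        curve.append((rank / len(order), errors / rank))
    return curve


# Toy predictions for illustration only.
confidences = [0.95, 0.85, 0.75, 0.65]
correct = [True, True, False, True]
ece = expected_calibration_error(confidences, correct)
curve = risk_coverage_curve(confidences, correct)
```

On the toy inputs, the model makes no errors at 50% coverage and reaches a 25% error rate at full coverage, which is the kind of trade-off a risk-coverage report is meant to surface.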