Suav Tech is a research-driven company developing AI safety evaluations. Our flagship projects include the 4X Benchmark, which tests LLM-powered agents for strategic long-term planning and resource management in 4X gaming scenarios, and the Novel-Domain Benchmark, which evaluates frontier knowledge-creation capabilities. In addition, we will build and open-source additional tooling for UK AISI's Inspect framework.
Our benchmarks aim to give developers, policymakers, and risk analysts clear insight into AI systems' capabilities, limitations, and progress, enabling safer, more accountable AI deployment across high-stakes domains.
Our broad goal is to expand the third-party evaluations ecosystem: to increase the diversity and quality of the evals used to assess model capabilities and alignment, and to ensure they are reliable and legible as tools for monitoring AI development. We aim to do this by developing new high-quality benchmarks and extending existing open-source eval frameworks.
Concretely, our current plans are:
Build four open-source assets
4X-Civ Benchmark – 200 reproducible Civilization VI settings defined by discrete and continuous difficulty variables, spanning 300-1,500 turns (≈8-20 h of expert play); probes long-horizon planning, multi-vector diplomacy, and resource management in an information-rich world.
Novel-Domain Benchmark – 75-100 tasks set in wholly new constructed scientific/logical fields; each solution would occupy an expert researcher for several days, with review taking ≈2-3 h.
Inspect-Interactive Toolkit – merges OmniParser-v2 GUI parsing, vision-based scoring, and an agent-environment loop into UK AISI's Inspect framework to autograde multimodal interactive tasks (see the sketch after this list).
Complex-Task Agent Scaffolding – prototypes retrieval-augmented memory, integrations with classical heuristic planners, and deliberation loops to gauge imminent capability jumps on the new benchmarks.
Evaluate multiple frontier LLMs (GPT-4o, o3, o4-mini, Claude 3.7 Sonnet, Gemini 2.0 Flash and 2.5 Pro, Llama 4) on these benchmarks
Publish a research report/paper analyzing scaling trends and scaffolding gains of the LLMs on these benchmarks, notable model behaviours, and the conclusions that can be drawn from them, and describing potential further developments and speculative trends.
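To make the toolkit and scaffolding items more concrete, below is a minimal sketch of how an interactive task with a simple deliberation-loop solver might be registered with Inspect. It assumes a recent inspect_ai API; the task content, the deliberation_loop solver, its rounds parameter, and the use of model_graded_qa as a stand-in scorer are illustrative placeholders, not the actual benchmark or toolkit code.

```python
# A minimal sketch, assuming a recent inspect_ai API; the task content, the
# deliberation_loop solver, and the stand-in scorer are illustrative only.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.model import ChatMessageUser
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import Generate, TaskState, solver


@solver
def deliberation_loop(rounds: int = 2):
    """Ask the model to critique and revise its own plan a fixed number of times."""
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        for _ in range(rounds):
            state = await generate(state)
            state.messages.append(
                ChatMessageUser(content="Critique the plan above and revise it.")
            )
        return await generate(state)
    return solve


@task
def civ_planning_probe():
    # One toy sample standing in for a reproducible Civilization VI setting.
    return Task(
        dataset=[
            Sample(
                input="You lead a civilization at turn 50 with two cities ...",
                target="A 30-turn plan that secures a defensible third city.",
            )
        ],
        solver=deliberation_loop(),
        scorer=model_graded_qa(),  # the toolkit would swap in GUI/vision scoring
    )
```

A task like this can then be run against a given model from the Inspect CLI (e.g. `inspect eval civ_planning_probe.py --model openai/gpt-4o`); the planned toolkit would replace the text-only generate calls with an OmniParser-based agent-environment loop and vision scoring.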
Since the project is large in scope, we will likely raise funding from multiple sources, and the estimates below reflect the overall budget; the funding received specifically via Manifund may end up distributed differently across these heads, or across a subset of them. As this is a for-profit project, funds raised via Manifund will be rolled over and invested into Suav Tech in the form of a SAFE.
Key budget estimates:
Total Budget: $400k-$1M+ - estimated spending over the first financial year; we aim to cover a significant portion via commercial contracts
Talent costs - 70% - salaries and hiring costs for ML engineers and researchers
Compute - 15% - API and cloud credits for running evaluations with proprietary and open-source models
Organizational Expenses and Runway - 15% - coworking meetups, conference travel and expenses, taxes, general support infrastructure, and other minor heads
The project is being led by Amritanshu Prasad:
Experience working with Equistamp (primarily on Inspect projects for UK AISI) and METR
Member of working group on International AI Governance at the Alva Myrdal Centre for Nuclear Disarmament, Uppsala University
Experience working on technical AI safety, AI policy and applied ML projects
The rest of the team is made up of four engineers with experience in ML research and startups. We are also advised by Dr. Sophia Hatz, Associate Professor at Uppsala University and lead of the working group on International AI Governance at the Alva Myrdal Centre for Nuclear Disarmament.
The most likely causes of this project failing are:
Lack of Funds: Inability to hire necessary talent or acquire sufficient compute for running experiments.
Scope Creep: Continuously expanding the project to build more tooling features or run more samples and benchmarks could leave the final output in limbo.
Models are unexpectedly good/bad at our planned benchmarks: This would severely limit the benchmarks' practical utility for differentiating models from one another.
Duplication of effort: Other teams may already be building similar things and publish before us, making our work redundant.
Talent bottleneck: We are unable to find suitable talent who want to work with us full time.
The most likely outcome if this project fails is that we don't publish anything, and this ends up being a counterfactually useless use of funds. There do not seem to be any negative externalities.
We haven't formally raised any money yet, but we are in talks with a few funders. Any updates regarding grants or investments will be posted here.