I am developing a paper titled “Semantic Implementation of VCG Mechanisms in LLM-Mediated Labor Markets.”
The paper studies a simple but important question: when a formal market mechanism is implemented through a language-model interface, do the classical guarantees of the mechanism survive?
In my current OpenAI-side experiments, a 1000-episode gpt-4o-mini labor-market simulation shows that structured-output VCG achieves mean welfare capture of 0.683, barely above random matching at 0.678, despite zero parse failures and zero fallbacks. Diagnostics show that the failure is not syntactic: the model produces valid numerical reports, but those reports sometimes falsely exclude true-positive firm-worker edges from the reported-positive graph, causing under-trading.
A 300-episode numerical anchoring ablation restores full welfare capture for gpt-4o-mini, and a 300-episode OpenAI model-family audit shows that gpt-4.1-nano, gpt-4.1-mini, gpt-4.1, and gpt-4o all achieve full welfare capture under the standard structured channel. This supports the paper’s core claim: the failure is model–prompt–parser dependent, not a refutation of VCG or a universal LLM failure.
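To make the diagnostic concrete, here is a minimal per-episode sketch in Python. All names (`true_edges`, `reported_edges`) and the normalization of welfare capture as realized welfare over oracle welfare are illustrative assumptions, not the paper's actual implementation:

```python
# Minimal per-episode diagnostic sketch (hypothetical names and normalization).
# true_edges: set of (firm, worker) pairs with positive true match surplus.
# reported_edges: pairs whose model-reported numerical values come out positive.
# Welfare capture is taken here as realized welfare / oracle welfare.

def episode_diagnostics(true_edges, reported_edges,
                        realized_welfare, oracle_welfare):
    # True-positive edges the reports falsely exclude; each missing edge
    # is a trade the mechanism can no longer execute (under-trading).
    false_excluded = true_edges - reported_edges
    return {
        "welfare_capture": realized_welfare / oracle_welfare,
        "false_excluded_edges": len(false_excluded),
        "under_traded": len(false_excluded) > 0,
    }
```

On this sketch, a syntactically clean run can still lose welfare: parsing succeeds, but `false_excluded_edges` is nonzero and the realized allocation under-trades relative to the oracle.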
The remaining step is cross-provider validation. I am requesting a small API budget to run the same pre-registered diagnostic on Anthropic Claude Sonnet and Google Gemini Flash.
The goal is to complete a cross-provider semantic-channel audit for LLM-mediated direct-revelation mechanisms.
The specific goals are:
- Run a 300-episode Claude Sonnet diagnostic using the same four-arm design:
  - oracle truthful VCG;
  - random matching;
  - standard structured VCG;
  - numerically anchored structured VCG.
- Run a 300-episode Gemini Flash diagnostic using the same four-arm design.
- Compare welfare capture, exact-report rates, false-excluded true-positive edges, under-trading, parse/fallback rates, and absolute welfare loss across the OpenAI, Anthropic, and Google channels.
- Update the paper with a cross-provider model-family audit section.
- Prepare a reproducibility package containing code, raw logs, analysis scripts, figures, tables, and manifests.
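The cross-provider comparison in the goals above reduces to the same per-arm aggregation for each provider. A minimal sketch, assuming hypothetical field names and that per-episode welfare capture is precomputed upstream:

```python
from statistics import mean

# Aggregate per-episode records for one provider/arm combination.
# Each record is assumed to carry a welfare_capture float plus
# parse_failed and fell_back booleans (hypothetical field names).
def summarize_arm(records):
    n = len(records)
    return {
        "mean_welfare_capture": mean(r["welfare_capture"] for r in records),
        "parse_failure_rate": sum(r["parse_failed"] for r in records) / n,
        "fallback_rate": sum(r["fell_back"] for r in records) / n,
    }

# Usage: one summary per provider/arm, then compare side by side
# (illustrative records, not real run data).
claude_standard = [
    {"welfare_capture": 1.0, "parse_failed": False, "fell_back": False},
    {"welfare_capture": 0.5, "parse_failed": True, "fell_back": False},
]
summary = summarize_arm(claude_standard)
```

The point of keeping the aggregation this simple is that the OpenAI, Anthropic, and Google runs can share one analysis path, so any divergence reflects the provider channel rather than the pipeline.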
The codebase, OpenAI experiments, zero-cost diagnostics, and analysis pipeline are already built. The remaining work is mainly API execution, provider integration, diagnostics, and paper integration.
Minimum funding requested: $500.
This would be used for:
- Anthropic Claude Sonnet API credits for a 300-episode cross-provider run.
- Google Gemini Flash API credits for a 300-episode cross-provider run.
- Smoke tests and failed-run buffer.
- Re-running incomplete chunks if logging, schema validation, or parsing metadata fail.
- Reproducibility packaging and final paper integration.
The minimum viable version of this project is the Claude + Gemini 300-episode cross-provider audit. If costs are lower than expected, leftover funds will be used for one small robustness extension, such as a larger-market 5x5 diagnostic or an additional 300-episode confirmation run.
I am currently the sole researcher on this project.
Work already completed:
- Built a corrected simulation codebase for LLM-mediated labor-market mechanisms.
- Completed the main 1000-episode gpt-4o-mini run.
- Completed zero-cost diagnostics from existing logs.
- Completed false-exclusion and under-trading decomposition.
- Completed a 300-episode numerical anchoring ablation.
- Completed a 300-episode OpenAI model-family audit for gpt-4.1-nano, gpt-4.1-mini, gpt-4.1, and gpt-4o.
- Drafted the working paper with theory, diagnostics, figures, and reproducibility appendices.
I am an incoming Mathematics and Statistics student at the University of Warwick. This project is currently being developed as an independent research paper.
The main risks are:
- Anthropic or Gemini API integration may take longer than expected.
- Provider APIs may not support the exact same structured-output interface, requiring careful adaptation.
- Cross-provider results may be ambiguous or not replicate the OpenAI pattern.
- The 3x3 labor-market setting may be considered too small for a journal-level claim without further robustness.
- The final paper may still need an experienced coauthor or advisor before journal submission.
If the project fails technically, the fallback outcome is still useful: the OpenAI-side paper is already complete, and any failed cross-provider attempt will be documented as part of the implementation/reproducibility record.
If the cross-provider results are mixed or negative, I will not overclaim. The paper will report that semantic-channel behavior is provider- and interface-dependent, and that broader validation remains necessary.
I have not raised external research funding in the last 12 months. The work so far has been self-funded, including OpenAI API experiments, smoke tests, failed runs, and analysis.