Benchmarking Deceptive Improvement in Agentic Systems

Project summary

Agentic and self-modifying AI systems can appear to improve while actually gaming evaluation metrics, hiding failures, or exploiting weak oversight. Without tools to detect these patterns, deceptive improvement remains invisible.

This project builds a lightweight, open-source benchmark and toy environment that makes deceptive improvement patterns visible, measurable, and reproducible — so that safety researchers can study them concretely rather than hypothetically.

This is a 4-week pilot that extends my existing open-source evaluation tooling with a focused deception-detection module. The core infrastructure already exists:

RSI-Bench https://github.com/sunghunkwag/rsi-bench : A multi-axis benchmark for recursive self-improvement systems with 6 evaluation axes, 74 passing tests, bootstrap confidence intervals, convergence detection, and Pareto analysis. Built in Python, MIT-licensed, fully executable.

Enhanced RSI Test Framework https://github.com/sunghunkwag/enhanced-rsi-test-framework : A statistically oriented testing framework with meta-learning evaluation, convergence detection, and multi-objective optimization.

This pilot adds a deception-detection layer on top of that infrastructure — specifically targeting metric gaming, pseudo-improvement, and oversight evasion.

What are this project's goals? How will you achieve them?

Goal: Produce a small open-source benchmark (10–20 tasks) that tests whether a toy agentic or self-modifying system is genuinely improving versus gaming its evaluation setup.

Specific deliverables within 4 weeks:

1. Deceptive Improvement Task Suite (10–20 tasks): Adversarial scenarios grounded in established AI safety failure modes — including metric gaming (agent optimizes proxy instead of true objective), pseudo-improvement (performance appears to increase but generalization degrades), oversight evasion (agent behaves differently when monitored vs. unmonitored), and Goodhart-style collapse (optimizing a metric past the point where it tracks the true goal).

2. Sandboxed Test Environment: Built on top of the existing RSI-Bench infrastructure (Axis 5: Safety & Stability module), extended with deception-specific detection hooks.

3. Detection Metrics: Quantitative scores for each deception pattern — including proxy-vs-true performance gap (measured as absolute divergence between proxy reward and held-out ground-truth reward), behavior consistency ratio (comparing agent actions under active logging vs. no-logging conditions), and generalization gap after self-modification (train-set performance minus held-out test performance post-modification).

4. Public Report and Open-Source Release: All code, data, and analysis published on GitHub under MIT license, with a short technical write-up summarizing findings.

How will this funding be used?

Compute and API costs for running adversarial evaluations across multiple model configurations: $350

4-week pilot stipend enabling focused implementation and evaluation: $500

Miscellaneous software, infrastructure, and hosting costs: $150

Total: $1,000

This is deliberately scoped as a low-cost pilot. If the benchmark produces useful signals, I plan to seek follow-up funding in the $5K–$15K range to expand the task suite, test more models, and explore more realistic agentic settings.

Who is on your team? What's your track record on similar projects?

I am a self-taught AI safety researcher based in South Korea working on recursive self-improvement evaluation

Over the past several months, I have built a set of open-source repositories focused on RSI evaluation . The two most relevant to this proposal:

RSI-Bench — the first open-source multi-axis RSI evaluation framework. It decomposes self-improvement into 6 measurable axes (self-modification depth, trajectory quality, operator discovery, meta-adaptation, safety/stability, autonomous goal generation), includes statistical evaluation with BCa bootstrap confidence intervals and convergence detection, and has 74 passing tests verifying all components. This is the direct foundation for the deception-detection module proposed here.

Enhanced RSI Test Framework — a statistically robust testing framework for RSI systems with meta-learning evaluation, convergence detection, and Pareto optimization.

Other repositories explore attention-free architecture search (RSI-NAS-Attention-Free), self-improving symbolic regression (RSI-Operator-Synthesis), and state-space model meta-RL (SSM-MetaRL-TestCompute).

I work using a multi-AI orchestration approach — I design architectures, evaluation protocols, and verification methods, with code generation delegated to AI systems under structured prompting. All code is tested and publicly verifiable.

This is not a plan to build something from scratch. The infrastructure exists and works. This pilot adds a targeted deception-detection layer on top of it.

What are the most likely causes and outcomes if this project fails?

The primary risk is that the benchmark tasks are too simple, producing results that do not generalize beyond toy environments. I am explicitly scoping this as a pilot for that reason — the goal is to identify which deceptive-improvement signals are detectable at small scale before investing in larger experiments. Task design is grounded in known safety-relevant patterns (reward hacking, proxy optimization failure, monitoring-sensitive behavior) rather than invented scenarios.

A second risk is that detection signals may be noisy and hard to distinguish from ordinary variance. My existing tooling already includes bootstrap-based uncertainty estimation and convergence checks, which I will apply here. If the benchmark does not clearly detect deceptive behavior, I will report null results transparently.

A null result would still be valuable — it would establish a baseline showing that current toy systems do not exhibit measurable deceptive improvement under these conditions, which is useful information for future safety evaluations.

How much money have you raised in the last 12 months, and from where?

None. All of my existing work has been self-funded.

Benchmarking Deceptive Improvement in Agentic Systems

Offer to donate

Could RLHF Accidentally Select for Deception?

Demonstration of LLMs deceiving and getting out of a sandbox

Mitigating Reward Hacking Through RL Training Interventions