Grant from ACX Grants 2025
Some unknown fraction of published research is based on fabricated data. It’s hard to estimate precisely how widespread the problem is, but there are troubling indications, such as this paper by Bik et al., which found that 4% of papers containing Western blots included duplicated images. Notably, that study was conducted by the authors in their own free time, which points to a larger issue: there is very little funding available for detecting fraud. That’s unfortunate, since fraud imposes large social costs, both by squandering grant funding that could have gone to legitimate researchers and by polluting the scientific literature with fake results.
To tackle the problem of data fabrication, I'm building 'copy-paste-detective'. It’s inspired by two cases of data fabrication that made the news in recent years: one by Nobel prize winner Thomas Südhof and one by spider ecologist Jonathan Pruitt. Both cases involved publicly available datasets containing entire blocks of copy-pasted data, the easiest-to-detect type of fraud imaginable. My idea was to write software that could identify these data anomalies and then run it on randomly selected datasets to see what would turn up.
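To give a rough sense of the approach (a minimal sketch, not the actual copy-paste-detective code; the function name, window size, and example values are mine), copy-pasted blocks can be found by hashing every run of consecutive rows and flagging runs that occur more than once:

```python
# Sketch: flag runs of consecutive rows that appear verbatim more than once.
from collections import defaultdict

def find_duplicated_blocks(rows, window=3):
    """Return groups of starting indices where `window` consecutive rows repeat."""
    seen = defaultdict(list)
    for i in range(len(rows) - window + 1):
        # Hashable key for the block of rows starting at index i
        key = tuple(tuple(r) for r in rows[i:i + window])
        seen[key].append(i)
    return [locs for locs in seen.values() if len(locs) > 1]

# Toy dataset: rows 0-2 reappear verbatim at rows 5-7
data = [
    [1.2, 3.4], [5.6, 7.8], [9.0, 1.1],
    [2.2, 2.3], [4.4, 4.5],
    [1.2, 3.4], [5.6, 7.8], [9.0, 1.1],
]
print(find_duplicated_blocks(data))  # [[0, 5]]
```

In real spreadsheets the window size would need tuning, and exact matches on long runs of measured floating-point values are already suspicious on their own.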
As of October 2025 I have analyzed around 180 recently published spreadsheets from Dryad (a repository of research data) using the software. Here are its first results, all discovered and reported by me (username: Campanula hypopolia) on Pubpeer, a public forum for post-publication peer review:
- "Drought decreases carbon flux..." by Ge et al. Surprisingly, the authors agreed to publish the original lab data after pressure from the journal editor, which I could then use as evidence against them. A retraction seems inevitable.
- "Dual drivers of plant invasions..." by Wang et al. The authors are issuing a correction while providing no explanation for the incriminating patterns in their data.
- "Extreme Warming Coordinately Shifts Root and Leaf Traits..." by Zhang et al. The senior author has issued an erratum while insisting that no deliberate fabrication took place.
In addition, I’ve found seven more similar cases that I'm in the process of reporting together with some new contributors.
Finding so much fabricated data in a random sample of only 180 papers was surprising, to say the least. The best explanation for why it’s so easy to find is that no one else is looking for it: there is, after all, basically zero funding available for combating research fraud. This neglect creates an opportunity for my project to have an outsized positive impact, since there’s so much low-hanging fruit that nobody else is picking.
I now have a direct line of sight to finding ~1000 cases of never-before-seen data fraud within a year.
To achieve this goal, I would need to analyze all 20,000 articles on Dryad that fit my inclusion criteria. This wouldn’t require any new code, but it would take a lot of analysis time. If the rate of data anomalies across the entire repository matches the first batch’s, around 5% of all papers, that would translate to 1000 cases. At 2 hours of manual analysis per positive match plus 1 hour spent writing Pubpeer comments and emails, this will take around 3000 hours. The manual workload will clearly limit the number of cases I can process, so I'm getting help from volunteers with solid science expertise.
That’s just the beginning. There are many other repositories of scientific Excel data (Zenodo, OSF, etc.) that I will write integrations for. Based on my first results, I’ve also identified some promising new strategies for finding more fraud:
1) Programmatically identify cells that are near-perfect duplicates of other cells but with a single tweaked digit. When manually inspecting the data, I’ve found this pattern in 5 of the likely-fraud cases I’ve investigated. Fraudsters love, and I mean love, changing a single digit of a number they’ve just copy-pasted.
2) Create an algorithm that asks an LLM whether any column has a mathematical relationship with other columns, then checks whether that relationship holds for every value in the column. I have seen two cases so far where the researchers fabricated one of the columns but forgot to make the other columns consistent with the fabrication.
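Strategy 1 could be sketched like this (a hypothetical illustration, not the project's actual code; the function names and example values are mine): compare the string forms of cell values and flag pairs that differ in exactly one character position:

```python
# Sketch: find pairs of numeric cells whose decimal representations are
# identical except for a single digit.
from itertools import combinations

def one_digit_apart(a, b):
    """True if str(a) and str(b) have equal length and differ in exactly one position."""
    sa, sb = str(a), str(b)
    if len(sa) != len(sb):
        return False
    return sum(ca != cb for ca, cb in zip(sa, sb)) == 1

def suspicious_pairs(values):
    """All pairs of values in a column that are one tweaked digit apart."""
    return [(a, b) for a, b in combinations(values, 2) if one_digit_apart(a, b)]

print(suspicious_pairs([3.14159, 3.14259, 2.71828]))  # [(3.14159, 3.14259)]
```

A production version would need to control for false positives, since genuinely independent measurements can also differ by one digit; the signal is many such pairs clustering in one dataset.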
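For strategy 2, once an LLM has proposed a candidate relationship, the verification step itself is deterministic. A minimal sketch, assuming the LLM suggested that a total column should equal the sum of two other columns (the column names and data here are invented for illustration):

```python
# Sketch: verify a candidate relationship (total == mass_a + mass_b) row by row,
# using a relative tolerance to allow for rounding in the published data.
import math

def relation_holds(col_a, col_b, col_total, rel_tol=1e-6):
    """Return a per-row list of booleans: does a + b match the reported total?"""
    return [math.isclose(a + b, t, rel_tol=rel_tol)
            for a, b, t in zip(col_a, col_b, col_total)]

mass_a = [1.0, 2.5, 4.0]
mass_b = [0.5, 0.5, 1.0]
total  = [1.5, 3.0, 9.9]   # last row is inconsistent: a possible fabrication signal
print(relation_holds(mass_a, mass_b, total))  # [True, True, False]
```

Rows where the relationship fails are exactly the situation described above: one column was altered without keeping the derived column consistent.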
This will be a lot of work, and I won’t be able to accomplish all of it while working a demanding full-time job. The money I've raised so far allows me to quit my job as a Lead backend engineer and focus all my time on sleuthing from January 2026 onward.
I have over 8 years of experience as a software engineer. I currently work as a Lead backend engineer at a Dutch publisher of sports magazines, where I lead projects to build new features and onboard new magazines onto our platform. That matters because a common theme in previous successful anti-fraud efforts is that their creators built software that could scale to the millions of articles that have been published.
I have also shown the persistence needed to analyze each case of anomalous data while still giving the authors a fair shake. For the "Ge et al" paper, I did the painstaking work of reconstructing the spreadsheet and recreating the graphs from the raw lab data, thereby proving the case against the authors. While that level of effort is not sustainable for every article, it shows that I have the grit to do the work even when it’s tedious.