
Calibration City

EA Community Choice · Forecasting

wasabipesto

Active grant
$8,864 raised
$10,000 funding goal


Project summary

I’ve been working on Calibration City, a site for prediction market calibration and accuracy analysis. I want the site to be useful for experienced prediction market users as well as for people who have never heard of them before.

Example user questions we aim to answer include:

  • I'm interested in sports: how good is Manifold at predicting games a week in advance? Do other sites have a better track record?

  • This PredictIt market is trading at 90¢ but has less than 2000 shares in volume. How often does a market like that end up being wrong?

  • I’m worried about the accuracy of markets that won’t resolve for a long time. What is the typical accuracy of a market over a year away from resolution?

What have you done so far?

Calibration City is currently live! We completed the MVP in January 2024, with additional features landing in February and March. We integrate data from Kalshi, Manifold, Metaculus, and Polymarket, covering over 130,000 markets, and the site has had over 300 visitors in the past month.

There are currently two main visualizations: calibration and accuracy. The calibration page shows a standard calibration plot for each supported platform. The user can choose how markets are sorted into bins along the x-axis (by the market probability at a specific point, or a time-weighted average). They can also apply weighting to each market based on values such as the market volume, length, or number of traders. Users can filter the total set of markets used for analysis based on keyword, category, duration, volume, or other features. Is Polymarket consistently overconfident? Underconfident? What about on long-term markets?
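
To make the binning and weighting options concrete, here is a minimal sketch of the calibration computation in Python. The record fields, the 20-bin default, and volume-only weighting are illustrative simplifications, not the site's actual schema or settings:

```python
from dataclasses import dataclass

@dataclass
class Market:
    prob_at_midpoint: float  # market probability at the chosen point in time
    resolution: float        # 1.0 if the market resolved YES, 0.0 if NO
    volume: float            # example weighting value

def calibration_points(markets: list[Market], num_bins: int = 20,
                       weight_by_volume: bool = False) -> list[tuple[float, float]]:
    """Sort markets into probability bins and compute the (weighted)
    average resolution per bin. Perfect calibration means each bin's
    average resolution equals the bin's predicted probability."""
    bins: list[list[Market]] = [[] for _ in range(num_bins)]
    for m in markets:
        idx = min(int(m.prob_at_midpoint * num_bins), num_bins - 1)
        bins[idx].append(m)
    points = []
    for idx, group in enumerate(bins):
        if not group:
            continue
        weights = [m.volume if weight_by_volume else 1.0 for m in group]
        avg = sum(w * m.resolution for w, m in zip(weights, group)) / sum(weights)
        points.append(((idx + 0.5) / num_bins, avg))  # one (x, y) point on the plot
    return points
```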

The accuracy plot allows users to directly compare different factors’ effects on market accuracy. In addition to the standard filters and binning options, the user can select a factor such as the market date, total trade volume, market length, or number of traders. With this additional axis, users can learn how (or if) those factors actually impact market accuracy. Does higher trade volume really increase accuracy? If so, by how much? What about more recent markets?
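
The core of that comparison can be sketched the same way: compute each market's Brier score, bucket markets by the chosen factor, and average per bucket. Again, the data shapes here are illustrative assumptions rather than the site's actual internals:

```python
def brier_score(prob: float, outcome: float) -> float:
    # Squared error between the forecast probability and the 0/1
    # resolution; lower is better, and always guessing 50% scores 0.25.
    return (prob - outcome) ** 2

def accuracy_by_factor(markets: list[tuple[float, float, float]],
                       edges: list[float]) -> dict[int, float]:
    """markets: (probability, resolution, factor_value) triples, where
    factor_value might be trade volume, market length, or trader count.
    Buckets markets by factor_value against sorted bucket edges and
    returns the mean Brier score per bucket (one point on the plot)."""
    buckets: dict[int, list[float]] = {}
    for prob, outcome, value in markets:
        idx = sum(value >= e for e in edges)  # bucket index, 0..len(edges)
        buckets.setdefault(idx, []).append(brier_score(prob, outcome))
    return {idx: sum(s) / len(s) for idx, s in sorted(buckets.items())}
```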

The beginner-friendly introduction page is a Socratic-style dialog introducing the reader to basic concepts of forecasting before introducing the premise of the site. The resources page lists the current capabilities of the site, answers common questions about the data gathering, and lists a few community resources for further reading. A simple list page displays all markets in the sample, useful for locating outliers or trends over similar markets.

Calibration City was awarded $3,500 from the Manifold Community Fund, the highest of any project submitted. It was recently mentioned in Nuño Sempere's forecasting newsletter for June 2024.

What do you have planned next?

My next big goal is to address one of the biggest problems with naive calibration comparison: different platforms predict different things. Some platforms automatically create dozens of markets in the style of “Will X metric be in range Y at time Z?” every day, while other platforms have far fewer markets with longer timespans and more uncertainty. The analysis you currently see on Calibration City can be very useful, but it’s unfair to calculate the calibration score of each platform and compare them directly.

In order to address this, we need to classify markets into narrow questions, such as “Who will win the 2024 US presidential election?” or “Will a nuclear weapon be detonated in 2024?”. We can find all markets across all platforms that predict the relevant outcome, check the resolution criteria to make sure they’re essentially equivalent, and then compare those with a relative Brier score that rewards markets that were correct earlier. Once we have a corpus of these questions and their constituent markets, we can calculate a score for each platform in each category and fairly compare them.
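
As a rough illustration, one plausible formulation of such a score (not necessarily the exact rule the project will adopt) averages each market's Brier score over its lifetime, so a market that converged on the right answer earlier scores better, then compares platforms within each linked question:

```python
def time_averaged_brier(prob_series: list[tuple[float, float]], outcome: float) -> float:
    """prob_series: (days_before_close, probability) samples. Averaging
    the Brier score across the market's lifetime rewards markets that
    were correct earlier, not just at the end."""
    return sum((p - outcome) ** 2 for _, p in prob_series) / len(prob_series)

def relative_scores(group: dict[str, tuple[list[tuple[float, float]], float]]) -> dict[str, float]:
    """group: {platform: (prob_series, outcome)} for one linked question.
    A platform's relative score is its time-averaged Brier minus the
    group mean, so negative means better than its peers on this question."""
    absolute = {plat: time_averaged_brier(series, outcome)
                for plat, (series, outcome) in group.items()}
    mean = sum(absolute.values()) / len(absolute)
    return {plat: score - mean for plat, score in absolute.items()}
```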

I plan to do this classification primarily with GPT-4, starting with smaller samples and building a corpus from there. A fair amount of human effort will still be necessary to identify variations in resolution criteria and other edge cases. Once we have the dataset I can build a scorecard or dashboard that fairly compares each platform in each category, allowing users to definitively answer which market platform is most accurate in each field.
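
A minimal sketch of what that classification step might look like with the OpenAI Python client. The prompt, the model string, and the function shape are illustrative assumptions, not the project's actual pipeline:

```python
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

client = OpenAI()

def classify_market(title: str, criteria: str, questions: list[str]) -> str:
    """Ask the model which canonical question (if any) a market predicts.
    Returns the matched question text or 'NONE'. Edge cases in resolution
    criteria would still need human review, as noted above."""
    prompt = (
        "Canonical questions:\n"
        + "\n".join(f"- {q}" for q in questions)
        + f"\n\nMarket title: {title}\nResolution criteria: {criteria}\n\n"
        "Reply with the exact canonical question this market predicts, "
        "or NONE if it matches none of them."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```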

Some of my other planned features for this project include:

  • Integrate data from more sites, such as PredictIt, Futuur, and Insight Predictions

  • Get more data from the sites we do monitor, such as market volume from Polymarket

  • Easily share visualizations with a link or export a summary card for social sharing

  • Natively support advanced market types such as multiple-choice or numeric/date markets

  • Generate individual user calibration plots with the same methodology that we use for platforms

  • Create an easy-to-use cross-platform bot framework for arbitrage or reactive betting

  • Have a dashboard of live markets with comparisons/discrepancies across platforms

  • Provide an estimated probability spread for live markets based on similar past markets

How will this funding be used?

The primary use of this funding will be as compensation for my time. In addition, some planned features will incur direct costs:

  • Classifying over 130,000 markets with GPT-4 in order to find matches

  • VPN connections for platforms that restrict users based on location

  • Additional compute server capacity for increased load

Who is on your team?

I’m wasabipesto - you may recognize me from the Manifold discord. You can find my contact information and other projects over at my website.

I have a full-time job but I enjoy working on projects like this in my spare time. I am not typically paid for hobby projects so I work on whatever interests me at the moment. Funding from this grant would compensate me for my time and incentivize me to work on additional features when I would otherwise be unproductive or working on other projects.

Calibration City is fully open-source on GitHub and open to community contribution. You can see the live data used by the site for your own analysis at https://api.calibration.city/.
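
For example, pulling a slice of that data for your own analysis might look like the sketch below. The endpoint path and query parameters are hypothetical, since the API's routes aren't documented in this post; check the API itself for the real ones:

```python
import requests

BASE = "https://api.calibration.city"

# Hypothetical route and parameters -- substitute the real ones from
# the API's own documentation.
resp = requests.get(f"{BASE}/markets", params={"platform": "manifold", "limit": 10})
resp.raise_for_status()
for market in resp.json():
    print(market)
```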

What other funding are you or your project getting?

I received retroactive funding for this project from the Manifold Community Fund. I don’t receive any ongoing funding for this project.

Comments (13) · Donations (19)

wasabipesto

about 17 hours ago

Progress update

Brier.fyi is live! 🎉

Calibration City was great, but we needed new data and a new way to convey our insights. Now we have Brier.fyi - a more intuitive, guided exploration of prediction market accuracy.

Go check it out first, and then come back here for the progress update!

Calibration City 🌇

When we started this proposal, we already had a good MVP. We had a data pipeline to ingest data from the public APIs of Kalshi, Manifold, Metaculus, and Polymarket, but it was brittle and missed some data. We were able to do some cool things with that data, and we learned a lot. Most importantly, we showed that prediction markets are pretty well calibrated! Over time I added some features like customizing the bin method, weighting the averages, advanced filtering, and a simple Brier score analysis. However, I was always reluctant to actually show the overall Brier score of any particular platform or category, since they are so fundamentally different that it would be misleading. Instead, I fell back to showing general trends and comparisons.

Matching 🔥

I really wanted to be able to answer the questions “How accurate are prediction markets in general?” and “Which market platform is most accurate?”

In order to do that, I needed to be able to compare apples to apples. My proposal for this project was to group identical markets from each platform, then grade the markets in those groups against each other. And that’s what we did! Right now we have 931 linked markets and I’m adding more every day. With these scores we can finally answer who’s the most accurate!

The results:

  • On average, prediction markets are pretty accurate! One month before close, 62% of markets were already within 30% of the correct resolution, representing a Brier score of 0.09.

  • No prediction market platform is a clear winner on all topics! Kalshi technically leads this score, but by only a few percent. However, looking at each question category shows that most platforms have a few niches where they shine - Kalshi and Polymarket are good at sports, while Metaculus is best at scientific topics. See all of the scores on the platforms page.
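
A quick arithmetic check of the first result above: the Brier score is the mean squared error between the forecast probability and the 0/1 outcome, so an average score of 0.09 corresponds to a root-mean-square error of exactly 0.3, which lines up with the "within 30% of the correct resolution" framing:

```python
import math

# An average Brier score of 0.09 implies a root-mean-square forecast
# error of sqrt(0.09) = 0.3, i.e. roughly "within 30%" of the outcome.
print(math.sqrt(0.09))  # 0.3

# Example: a market at 85% one month out that resolves YES scores
print((0.85 - 1.0) ** 2)  # 0.0225 -- well under the 0.09 average
```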

New Features 💡

In addition to the market matching, I had some improvements I already wanted to make to the site:

  • We now have metrics for the market volume and an estimate for the number of traders on Polymarket.

  • We decompose multiple-choice markets into binary markets, now allowing us to score almost all markets on each platform!
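
A sketch of what that decomposition means in practice, with illustrative field names: each multiple-choice market becomes one binary market per outcome, and only the winning outcome resolves YES:

```python
def decompose_multiple_choice(question: str, outcomes: dict[str, float],
                              winner: str) -> list[dict]:
    """Turn one multiple-choice market into independent binary markets,
    one per outcome, each carrying that outcome's probability."""
    return [
        {
            "question": f"{question}: {outcome}?",
            "probability": prob,
            "resolution": 1.0 if outcome == winner else 0.0,
        }
        for outcome, prob in outcomes.items()
    ]

# Example: a three-way race becomes three scoreable binary markets.
binaries = decompose_multiple_choice(
    "Who wins the race", {"Alice": 0.5, "Bob": 0.3, "Carol": 0.2}, winner="Alice",
)
```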

Additionally, I was noticing some issues with Calibration City I wanted to address:

  • The previous extract-transform-load data pipeline took a long time to run and failed often, which led to me not refreshing the database for months at a time. Now the entire thing is automated and much more resilient, allowing us to gather new markets every single day.

  • Over time the site became slower and slower, and it didn’t always load correctly on the first visit. Caching didn’t work quite right, so there was always a lot of load on the server. The new site is completely static, cached properly, and loads instantly with all data. It’s also much easier to develop, and visitors can easily grab the data for their own experiments.

  • The old site was often cited as proof that markets are accurate without explanation or context, leaving visitors confused unless the person who linked it also gave an explanation. The primary calibration chart looks neat, but doesn’t really mean anything unless you already know about calibration. The Introduction page was supposed to be a remedy for that, but basically nobody has read it. In response, every chart and visualization on the new site has some sort of explanation of what the chart means, and most also have the results and context presented in a way new users can understand. We also address the primary question most users have - “How accurate are prediction markets?” - right at the very beginning.

  • There was a split between users trying to prove that prediction markets in general are great versus those trying to prove that their favorite site is the best. The old site had enough data that you could try to prove either one, but it wasn’t built with that in mind. In the new site I kept both viewpoints front and center and tried to answer each honestly and directly.

Wrap-Up 🏁

I still have a lot of work to do here, but I’m closing this project because I think I’ve completed my main goal. My focus now is getting user feedback, making things crystal clear, and matching more markets together. My roadmap lives alongside the project code on GitHub, and both are open to community contribution.

To all of my donors: thank you for your contributions and kind words. Without your encouragement Brier.fyi would not have happened. Feel free to get in touch with me anytime. I’ll be at Manifest next weekend if anyone wants to say hello!

donated $5,100

Austin Chen

about 10 hours ago

@wasabipesto brier.fyi looks fantastic, great work! This is the kind of core infrastructural work that helps inform expert forecasters, and also makes forecasting more understandable to newcomers. Visual design-wise, the diagrams look great, too! I've donated an additional $5k in recognition of the time & effort you've put into making this happen.

Also some quick feedback:

  • The home page is quite wordy atm -- I'd suggest radically trimming or collapsing text from the home page, and letting the visualizations speak for themselves more. Could take inspiration from PlasticList perhaps; they put most of the explanation on other pages.

  • Also, suggest shorter line lengths for readability, capping around 100 chars per line. See also these typography tips.

  • I'm slightly peeved at how low Manifold currently ranks, but I'm hoping it's a kick in the pants for the Manifold team to figure out why and how to make Manifold more accurate, and also a signal to users about which markets are less trustworthy.


wasabipesto

about 9 hours ago

@Austin Thank you for the advice and the generous donation! I think those points are spot-on - you aren't the first to bring up how wordy it is. Both of those links are phenomenal resources, I have a few ideas on how I can make it better already.

donated $110

Nathan Young

12 days ago

I often reference this, and would like to give like $5 a month (cough @Austin cough) but failing that I'll give $100 on occasion.

This tweet, a screenshot of Calibration City, has been seen 45,000 times.

donated $110

Nathan Young

12 days ago

Sorry: https://x.com/NathanpmYoung/status/1924404103276323318

donated $2,000

Ryan Kidd

9 months ago

Main points in favor of this grant

  1. I think prediction markets are a great forecasting mechanism and accurate forecasts are an essential component of good decision-making. I regularly consult Manifold, Metaculus, etc. for decision-relevant forecasts. Establishing the accuracy of these platforms seems crucial for widespread adoption of prediction markets in institutional decision-making.

  2. I’m excited by the potential for Calibration City to track the accuracy of AI-specific forecasts, to aid AI safety and improve planning for transformative AI. I strongly encourage wasabipesto to create an interface tracking the accuracies of predictions about AI capabilities and AGI company developments.

Donor's main reservations

  1. It’s possible that this tool doesn’t increase trust in or uptake of prediction markets in decision-making because the interface or the underlying concepts are too abstract. However, even so, it might prove useful to some individual decision-makers or research projects.

  2. It’s possible that the AI questions I am most interested in calibrating on belong to a class of long-horizon predictions that is not well-represented by the calibration of short-horizon, closed markets.

Process for deciding amount

I decided to fund this project $2k somewhat arbitrarily. I wanted to leave room for other donors, and I didn’t view it as being as impactful in expectation as other $5k+ projects I’ve funded.

Conflicts of interest

I don't believe there are any conflicts of interest to declare.


wasabipesto

9 months ago

@RyanKidd Thank you for the contribution and the kind words! I agree AI forecasting is very important, and it's therefore one of the primary topic areas I intend to feature on the site. I also think that the most important questions in that area will be long-horizon, and future accuracy may not be reflected by past performance, but I'm sure there is still plenty to learn.

donated $40

Sasha Cooper

9 months ago

My partner and I made notes on all of the projects in the EACC initiative, and thought this was one of the more convincing entries among some really strong competition.

Our quick and dirty notes:

They: Something distinct in the prediction market field

He: Product ready = big plus, and doing something distinct (much more a fan of this than making new forecasting alternative tools)

donated $110

Nathan Young

9 months ago

I asked @wasabipesto to sign up here because calibration city is something I wanted to exist and they built it! So I wanted to reward them for that.

donated $110

Nathan Young

12 days ago

@NathanYoung How does carry work for impact certificates?

donated $20

David Glidden

9 months ago

The world needs to better understand the importance of calibration - let's help it go mainstream!

donated $50

nikki

9 months ago

The Manifold community yearns for per-user calibration!

donated $50

Osnat Katz

9 months ago

I think this is cool and beautiful and I look forward to seeing more of Calibration City