I recently ran a Turing Test with GPT-4 at turingtest.live. We collected around 6,000 games from ~2,000 participants. There's a preprint of results from the first 2,000 games here: https://arxiv.org/abs/2310.20216. The full dataset is under review (one prompt gets 49.7% after 855 games).
While the TT has important drawbacks as a test of intelligence, I think it's important as a test of deception per se. Can alert and adversarial users detect an LLM vs a human in a 5 minute text-only conversation? Which prompts and models work best? Which interrogation strategies work best? I think these are important and interesting questions to answer from a safety and sociological perspective. Plus lots of people reported finding the game very fun and interesting to play!
Games cost around $0.30 to run with GPT-4. We don't have specific funding for the project and have been using a limited general experiment funding pot. The site gained popularity and we went through $500 in December, so we decided to shut it down temporarily. Ideally, I'd like to revive it in 2024, but I'd need some dedicated funding to do this. If you'd like to test out the interface, you can do so here: turingtest.live/ai_game (please don't share this link widely though!)
As well as getting a better estimate on the success of existing models and allowing more people to play the game, there are a variety of additional questions we'd like to ask.
1. Prompts: We've tried around 60 prompts and there's a lot of variance. I'd be keen to generate more and see how well they do. A priori it seems very likely there are better prompts than the ones we've tried.
2. Temperature. We've varied temperature a bit, but not very systematically. It would be useful to try the same prompt at a variety of temperatures.
3. Auxiliary infrastructure. Models often fail due to a lack of real-time info. We could address this through browsing/tool-use. They also often make silly errors, which we might be able to address through double-checking and/or CoT scratchpads (see the sketch after this list).
4. User-generated prompts. It would be lovely to let users generate and test their own prompts, but you probably need at least 30-50 games to reliably test a prompt. We would need a good ratio of games played to prompts created, a decent userbase, and some funding to do this well.
5. Other models. I'm planning to include another couple of API model endpoints (e.g. Claude), which should be relatively easy to do. A lot of the feedback on Twitter was from e/acc folks who want to see open-source/non-RLHF models tested, and that seems right to me too. We could probably run some 7B models for < $2/hr and bigger ones for something like $5-10/hr (though I haven't tested this). Some fiddling with the infrastructure would be needed for this. We also might experiment with only running the game for 1-2 hrs/day, to minimise server uptime & maximise concurrent human users.
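To give a concrete flavour of items 2 and 3, here's a minimal sketch of a temperature sweep with an optional double-checking pass. It assumes the current OpenAI Python client; the prompt text, model name, and check wording are placeholders rather than anything the site actually uses.

```python
# Minimal sketch (not the site's code): run the same witness prompt at several
# temperatures, with an optional "double-check" pass that rewrites obvious slips.
# Assumes the OpenAI Python client (openai>=1.0); prompt and model are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

WITNESS_PROMPT = "You are a person chatting casually in a text game..."  # placeholder

def witness_reply(message: str, temperature: float, double_check: bool = False) -> str:
    """Generate one witness reply; optionally verify it before sending."""
    draft = client.chat.completions.create(
        model="gpt-4",
        temperature=temperature,
        messages=[
            {"role": "system", "content": WITNESS_PROMPT},
            {"role": "user", "content": message},
        ],
    ).choices[0].message.content

    if double_check:
        # Second pass: catch silly errors or giveaway phrasing and rewrite if needed.
        draft = client.chat.completions.create(
            model="gpt-4",
            temperature=temperature,
            messages=[
                {"role": "system", "content": (
                    "Check this chat reply for factual slips or anything that sounds "
                    "non-human. Return a corrected version, or the original if it's fine."
                )},
                {"role": "user", "content": draft},
            ],
        ).choices[0].message.content
    return draft

# Same message, same prompt, a range of temperatures.
for t in (0.2, 0.5, 0.8, 1.0, 1.2):
    print(t, witness_reply("hey, whereabouts are you from?", temperature=t))
```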
Essentially, my goal would be to make some of these improvements, run several thousand more games, and publish the results.
I am a PhD student in cognitive science at UCSD. I've implemented the first version of this site and written a paper on the results. I'm pretty familiar with the literature on the Turing Test and I've implemented a range of similar experiments over the last 4 years of my PhD.
I'll also be working with my advisor, Ben Bergen, a professor in the department who has a proven track-record of successful cognitive science research across his career (https://pages.ucsd.edu/~bkbergen/).
Website: https://camrobjones.com
Twitter: @camrobjones
Github: camrobjones
Linkedin: https://linkedin.com/in/camrobjones
~$5,000. At $0.30/game this would buy us ~16,000 games. Some additions like browsing and double-checking might increase game cost. Most likely we would use a decent part of this to run servers for open-source models (e.g. $5/hr * 2hr/day * 7 days * 8 weeks = $560).
Site: turingtest.live
Demo: turingtest.live/ai_game (please don't share widely).
preprint: https://arxiv.org/abs/2310.20216
Running ~5000 games in < 3 months: 95%
Building out auxiliary infrastructure: 90%
Building out OS model infrastructure: 85%
Running ~10000 games in < 3 months: 80%
Finding a prompt/setup that reliably "passes" (I don't know whether this counts as 'success', but it's an interesting outcome; by "passes" I mean significantly > 50% success*): 40%.
* We discuss this a lot more in the preprint. This seems like the least-worst benchmark to me; there's a toy illustration of the significance check below.
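For anyone curious what "significantly > 50%" would involve in practice, here's a toy illustration using a one-sided exact binomial test on raw pass counts, plugging in the 49.7%-over-855-games figure mentioned above. This is just a sketch; the preprint's actual analysis may differ.

```python
# Toy illustration: is a pass rate significantly above chance (50%)?
# One-sided exact binomial test; the preprint's actual analysis may differ.
from scipy.stats import binomtest

games = 855
passes = round(0.497 * games)  # ~425 games judged "human"

result = binomtest(passes, games, p=0.5, alternative="greater")
print(f"pass rate = {passes / games:.3f}, one-sided p = {result.pvalue:.3f}")
# 49.7% over 855 games isn't close to significantly above 50%, which is why each
# prompt needs a reasonably large number of games before calling it a "pass".
```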
Chris Leong
6 months ago
This is a cool project that might help improve the conversation around these issues.
Some people might be worried about hype, but there's already so much hype, the harms are likely marginal.
You may want to consider linking people to an AI Safety resource if you think your site may get a lot of traffic; then again, you might not if you think that would make people more suspicious of the results.
Another option to consider would be an ad-supported model. I'm not suggesting Google AdWords, but you might be able to find an AI company to sponsor you.
Tom O’Haire
7 months ago
I’m glad to see this reach the threshold.
@camrobjones where will be the best place to monitor progress and see results?
Dony Christie
7 months ago
This sounds really cool, and it's the only AI-related ACX Grants cert project I could evaluate as having some legible chance at an impact, potentially a viral one. It has already had some success apparently and just needs more funding. The Turing Test is a pretty fundamental concept in AI lore, and we should have at least one running.
Dominic de Bettencourt
7 months ago
The Turing Test is definitely the most publicly well-known test of AI abilities; it has always seemed strange to me that the Loebner Prize stopped being awarded in 2020, right before AI started to reach a level where it could potentially get close to passing. I think something like this should definitely exist. I remember playing with it a bit when it was initially released and it was pretty cool.
camrobjones
7 months ago
@dominic Thanks very much, Dominic! I'm glad you had a chance to try it out and I appreciate the support!
Alyssa Riceman
7 months ago
This is neat! I'm not hugely expecting it to move the needle of popular understanding of AI deceptiveness very much, but the possibility of its doing so strikes me as sufficiently non-negligible that it nonetheless seems worth tossing some money at just in case.
Harvey Powers
7 months ago
Similar reasoning to my support here. Cool project, and please share the outcome/data if possible. @Alyssa @camrobjones
Anton Makiievskyi
7 months ago
Oh, what a cool project!
A few questions:
1. Who does the job of the human witness in this test? How do you make sure that there is a human online when someone wants to play the "interrogation game"?
2. Have you applied for OpenAI or Claude credits?
3. How about asking users to input their own ChatGPT API key to play?
In any case, I'm happy to offer money to get this project over a minimum funding bar.
Can we expect an update here after a month or two? If it goes well, I will likely be glad to provide more funding.
camrobjones
7 months ago
@AntonMakiievskyi Thanks so much Anton! I really appreciate the support.
1. Participants are randomly assigned to be witnesses or interrogators. The lack of humans online is a definite issue, as there were periods where a player would be repeatedly matched with AI if no humans were online. I'm considering only making the game available for ~1hr a day to maximise the density of humans online while keeping costs down (see the sketch at the end of this reply).
2. I applied for OpenAI credits but didn't hear back. I'll try Claude & OpenAI again.
3. This could be a good backstop if we run out of credits again. I'm a little nervous about handling the data but I'm sure there's a secure way to do this.
Yes, I will probably take a couple of weeks to make changes to the site and then hopefully update just before we relaunch. Thanks again and let me know if you have more questions or would like to chat more.
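To illustrate the trade-off in 1 (purely a toy sketch with made-up names and timings, not the site's actual matching code): when an interrogator joins, you'd prefer to pair them with a waiting human witness, falling back to the AI only after a short wait.

```python
# Toy sketch of the matching trade-off (made-up names/timings, not the site's code):
# prefer a waiting human witness, fall back to the AI witness after a short wait.
import queue

human_witness_queue = queue.Queue()  # IDs of humans currently waiting to be witnesses

def assign_witness(wait_seconds: float = 20.0) -> tuple[str, str]:
    """Return (witness_type, witness_id) for a newly matched interrogator."""
    try:
        witness_id = human_witness_queue.get(timeout=wait_seconds)
        return ("human", witness_id)
    except queue.Empty:
        # No human turned up in time: this interrogator plays against the AI witness.
        return ("ai", "gpt-4")
```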
Austin Chen
8 months ago
I really like that Cam has already built & shipped this project, and it appears to have gotten viral traction and had to be shut down due to costs; rare qualities for a grant proposal! The project takes a very simple premise and executes well on it; playing with the demo makes me want to poke at the boundaries of AI, and made me a bit sad that it was just an AI demo (no chance to test my discernment skills); I feel like I would have shared this with my friends, had this been live.
Research on AI deception capabilities will be increasingly important, but I also like that Cam created a fun game that interactively helps players think a bit about how far the state of the art has come, especially with the proposal to let users generate prompts too!