This is TechCrunch.
When a company releases a new artificial intelligence video generator, it's not long before someone uses it to make a video of actor Will Smith eating spaghetti. It's become something of a meme as well as a benchmark: a test of whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself even parodied the trend in an Instagram post back in February.
Will Smith and pasta is just one of several bizarre, unofficial benchmarks that took the AI community by storm last year. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British programmer created a platform where AIs play games like Pictionary and Connect Four against each other.
It's not like there aren't more academic tests of an AI's performance, so why did the weirder ones blow up?
For one, many of the industry's standard AI benchmarks do not tell the average person very much. Companies often cite their AI's ability to answer questions on math olympiad exams or figure out plausible solutions to PhD-level problems. Yet most people use chatbots for things like responding to emails and basic research.
Crowdsourced industry measures are not necessarily better or more informative. Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI performs on particular tasks, like creating a web app or generating an image.
But raters tend not to be representative. Most come from AI and tech industry circles and cast their votes based on personal, hard-to-pin-down preferences. Ethan Mollick, a professor of management at Wharton, recently pointed out another problem with many AI industry benchmarks in a post on X: they don't compare a system's performance to that of the average person.
As he put it, the fact that there aren't 30 different benchmarks from different organizations in medicine, law, advice quality, and so on is a real shame, as people are using systems for these things regardless.
Weird AI benchmarks like Connect Four matches, Minecraft builds, and Will Smith eating spaghetti certainly aren't empirical, or even all that generalizable. Just because an AI nails the Will Smith test doesn't mean it will render, say, a burger very well. One expert on AI benchmarks suggested that the AI community focus on the downstream impacts of AI rather than its abilities in narrow domains. That's sensible, but weird benchmarks probably aren't going away anytime soon. Not only are they entertaining (who doesn't like watching AI build Minecraft castles?),
but they're also easy to understand. And as Max Zeff wrote recently, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing. The only real question is which odd new benchmarks will go viral in 2025. TechCrunch has an AI-focused newsletter; you can sign up to get it in your inbox every Wednesday.