This is TechCrunch.
When a company releases a new artificial intelligence video generator, it's not long before someone uses it to make a video of actor Will Smith eating spaghetti. It's become something of a meme as well as a benchmark: a test of whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself even parodied the trend in an Instagram post back in February.
Will Smith and pasta is just one of several bizarre, unofficial benchmarks that took the AI community by storm last year. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British programmer created a platform where AIs play games like Pictionary and Connect Four against each other.
It's not like there aren't more academic tests of an AI's performance, so why did the weirder ones blow up?
For one, many of the industry's standard AI benchmarks do not tell the average person very much. Companies often cite their AI's ability to answer questions on math olympiad exams or figure out plausible solutions to PhD-level problems. Yet most people use chatbots for things like responding to emails and basic research.
Crowdsourced industry measures are not necessarily better or more informative. Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI performs on particular tasks, like creating a web app or generating an image.
But raters tend not to be representative. Most come from AI and tech industry circles and cast their votes based on personal, hard-to-pin-down preferences. Ethan Mollick, a professor of management at Wharton, recently pointed out another problem with many AI industry benchmarks in a post on X: they don't compare a system's performance to that of the average person.
As he put it, the fact that there aren't 30 different benchmarks from different organizations in medicine, law, advice quality, and so on is a real shame, as people are using systems for these things regardless.
Weird AI benchmarks like Connect Four matches, Minecraft builds, and Will Smith eating spaghetti certainly aren't empirical, or even all that generalizable. Just because an AI nails the Will Smith test doesn't mean it will render, say, a burger very well. One expert on AI benchmarks suggested that the AI community focus on the downstream impacts of AI rather than its abilities in narrow domains. That's sensible, but weird benchmarks probably aren't going away anytime soon. Not only are they entertaining (who doesn't like watching AI build Minecraft castles?),
but they're also easy to understand. And as Max Zeff wrote recently, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing. The only real question is which odd new benchmarks will go viral in 2025. TechCrunch has an AI-focused newsletter; you can sign up to get it in your inbox every Wednesday.