
Google DeepMind unveils a new video model to rival Sora

2024/12/18

TechCrunch Industry News

People
DeepMind
Topics
DeepMind: Veo 2 is a next-generation video-generating AI model that surpasses OpenAI's Sora in both resolution and duration, producing 4K videos that run longer than 2 minutes. Veo 2 improves on physics simulation, camera control, and image clarity: it models motion, fluid dynamics, and the properties of light more realistically, and generates clearer, sharper images and video. While Veo 2 has made notable progress, challenges remain, such as maintaining consistency and coherence across long videos and rendering intricate details and fast motion. DeepMind is collaborating with artists and producers to refine its video generation models and tools, and is working to address the model's ethical issues, such as deepfakes and copyright. DeepMind acknowledges that Veo 2's training data comes from publicly available videos and maintains that training on public data is fair use. It is working with creators and partners toward shared goals and actively gathering feedback to improve its models and tools.

Eli Collins: Veo 2 will become available through Google's Vertex AI developer platform and will be integrated into the Google ecosystem. Google will keep iterating on Veo 2 based on user feedback and will fold its updated capabilities into compelling use cases across the Google ecosystem. Google's indemnity policy does not apply until Veo 2 is generally available. To mitigate the risk of deepfakes, DeepMind uses its proprietary watermarking technology, SynthID, to embed invisible markers into the frames Veo 2 generates. Veo 2's training data consists of high-quality video-description pairings.

Deep Dive

Key Insights

What is Veo 2 and how does it compare to OpenAI's Sora?

Veo 2 is Google DeepMind's next-generation video-generating AI, capable of creating 2-minute-plus clips in resolutions up to 4K (4096x2160 pixels). This is 4x the resolution and over 6x the duration of OpenAI's Sora, which can produce up to 1080p, 20-second clips. However, in Google's experimental tool VideoFX, Veo 2 videos are currently capped at 720p and 8 seconds.

What are the key improvements in Veo 2 compared to its predecessor?

Veo 2 features an improved understanding of physics and camera controls, producing clearer footage with sharper textures, especially in scenes with movement. It can more realistically model motion, fluid dynamics, and properties of light like shadows and reflections. Additionally, it offers enhanced camera positioning and movement for capturing objects and people from different angles.

What are the limitations of Veo 2 in video generation?

Veo 2 struggles with coherence and consistency over long durations, particularly with complex prompts. Character consistency, intricate details, and fast, complex motions remain challenging. The model also exhibits issues like lifeless eyes in animations, physically impossible facades, and blending of pedestrians and backgrounds.

How is DeepMind addressing ethical concerns around Veo 2's training data?

DeepMind uses prompt-level filters to mitigate risks like regurgitation of training data and employs its proprietary watermarking technology, SynthID, to embed invisible markers in Veo 2-generated frames. However, the lab does not offer a mechanism for creators to remove their works from existing training sets, maintaining that training on public data is fair use.

What role do creators play in the development of Veo 2?

DeepMind collaborates with creators like Donald Glover and The Weeknd to understand their creative processes and refine its video generation models. Feedback from these collaborations informed the development of Veo 2, and DeepMind continues to work with trusted testers and creators to improve the model.

What other AI model upgrades did Google DeepMind announce alongside Veo 2?

Google DeepMind announced upgrades to Imagen 3, its commercial image generation model. The new version creates brighter, better-composed images in styles like photorealism, impressionism, and anime. It also follows prompts more faithfully and renders richer details and textures. UI updates to ImageFX include chiplets for key terms in prompts, allowing users to iterate on what they've written or select auto-generated descriptors.

Chapters
Google DeepMind's Veo 2 boasts higher resolution and longer video generation capabilities compared to OpenAI's Sora, although current implementations have limitations. Future plans include wider availability via Vertex AI and integration into the Google ecosystem.
  • Veo 2 generates longer videos (2+ minutes) at higher resolution (4K) than Sora.
  • Currently available in Google's VideoFX tool with limitations on resolution and duration.
  • Future release on Vertex AI and integration into Google products planned.

Shownotes Transcript


This is TechCrunch. This episode is brought to you by Factor.

Notice how the days are shorter but your to-do lists aren't? Here's a trick: Factor. From breakfast to dinner and anything in between, Factor has easy, nutritious options to keep you fueled and feeling your best. My box at Factor is on its way and it could not get here sooner. I'm so excited because you get to choose from six menu preferences to help you manage calories, maximize protein intake, or avoid meat, or simply eat a well-balanced diet.

Whether you like routine or you enjoy mixing things up, Factor has you covered with 35 different delicious meals every week and over 60 additional convenience options you can add to your box like keto cookies, pressed juices, and smoothies.

Don't let shorter days slow you down. Stay energized with America's number one ready-to-eat meal delivery service. Head to factormeals.com slash 50TCIndustry and use code 50TCIndustry to get 50% off your first box plus free shipping. That's code 50TCIndustry at factormeals.com slash 50TCIndustry to get 50% off your first box plus free shipping while your subscription is active.

Google DeepMind, Google's flagship AI research lab, wants to beat OpenAI at the video generation game, and it might just, at least for a little while. On Monday, DeepMind announced Veo 2, a next-gen video-generating AI and the successor to Veo, which powers a growing number of products across Google's portfolio. Veo 2 can create 2-minute-plus clips in resolutions up to 4K (4096x2160 pixels).

Notably, that's 4x the resolution and over 6x the duration OpenAI's Sora can achieve. It's a theoretical advantage for now, granted. In Google's experimental video creation tool, VideoFX, where Veo 2 is now exclusively available, videos are capped at 720p and 8 seconds in length. Sora can produce up to 1080p, 20-second clips.

VideoFX is behind a waitlist, but Google says it's expanding the number of users who can access it this week. Eli Collins, VP of Product at DeepMind, also told TechCrunch that Google will make Veo 2 available via its Vertex AI developer platform as the model becomes ready for use at scale.

"Over the coming months, we will continue to iterate based on feedback from users," Collins said, "and we will look to integrate Veo 2's updated capabilities into compelling use cases across the Google ecosystem. We expect to share more updates next year."

Like Veo, Veo 2 can generate videos given a text prompt, e.g., "a car racing down a freeway," or text and a reference image. So what's new in Veo 2? Well, DeepMind says the model, which can generate clips in a range of styles, has an improved understanding of physics and camera controls, and produces clearer footage.
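For a concrete picture of text-prompted generation: Veo 2 was not exposed through a public API at the time of this announcement, but the sketch below shows what such a call could look like in the style of Google's later google-genai Python SDK. Treat the model identifier, config fields, and polling flow as assumptions, not confirmed details of Veo 2's API.

```python
# Hypothetical sketch of a text-to-video call, modeled on the
# google-genai Python SDK. Model name and config fields are assumptions;
# Veo 2 was not generally available via API when this aired.
import time

from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

operation = client.models.generate_videos(
    model="veo-2.0-generate-001",          # assumed identifier
    prompt="a car racing down a freeway",  # the article's own example prompt
    config=types.GenerateVideosConfig(
        number_of_videos=1,
        duration_seconds=8,   # matches the VideoFX cap described above
        aspect_ratio="16:9",
    ),
)

# Video generation is a long-running job, so poll until it finishes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("freeway.mp4")
```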

By clearer, DeepMind means textures and images in clips are sharper, especially in scenes with a lot of movement. As for the improved camera controls, they enable Veo 2 to position the virtual camera in the videos it generates more precisely and to move that camera to capture objects and people from different angles.

DeepMind also claims that Veo 2 can more realistically model motion, fluid dynamics, and properties of light, such as shadows and reflections. That includes different lenses and cinematic effects, DeepMind says, as well as nuanced human expression.

DeepMind shared a few cherry-picked samples from Veo 2 with TechCrunch last week. For AI-generated videos, they looked pretty good. Exceptionally good, even. Veo 2 seems to have a strong grasp of refraction and tricky liquids, like maple syrup, and a knack for emulating Pixar-style animation. But despite DeepMind's insistence that the model is less likely to hallucinate elements like extra fingers or unexpected objects, Veo 2 can't quite clear the uncanny valley.

Note the lifeless eyes in the video of the cartoon dog-like creature embedded in the text version of this article, and the weirdly slippery road in the footage of a car driving, plus the pedestrians and the background blending into each other and the buildings with physically impossible facades.

Collins admitted that there's work to be done. "Coherence and consistency are areas for growth," he said. "Veo can consistently adhere to a prompt for a couple minutes, but it can't adhere to complex prompts over long horizons."

"Similarly, character consistency can be a challenge," he continued. "There's also room to improve in generating intricate details, fast and complex motions, and continuing to push the boundaries of realism." DeepMind is continuing to work with artists and producers to refine its video generation models and tooling, Collins added.

"We started working with creatives like Donald Glover, The Weeknd, d4vd, and others since the beginning of our Veo development to really understand their creative process and how technology could help bring their vision to life," Collins said. "Our work with creators on Veo 1 informed the development of Veo 2, and we look forward to working with trusted testers and creators to get feedback on this new model."

Veo 2 was trained on lots of videos. That's how AI models work: provided with example after example of some form of data, the models pick up on patterns in that data that allow them to generate new data. DeepMind won't say exactly where it scraped the videos to train Veo 2, but YouTube is one possible source.

Google owns YouTube, and DeepMind previously told TechCrunch that Google models like Veo may be trained on some YouTube content. "Veo has been trained on high-quality video-description pairings," Collins said. "A video-description pair is a video and associated description of what happens in that video."
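DeepMind hasn't published its data schema, but a video-description pairing is easy to picture. Below is a hypothetical illustration of one training record; every field name is invented for clarity, not taken from DeepMind.

```python
from dataclasses import dataclass

@dataclass
class VideoDescriptionPair:
    """Hypothetical training record: one clip plus a natural-language
    description of what happens in it. DeepMind has not published its
    actual schema; these fields are illustrative only."""
    video_uri: str               # where the clip is stored
    description: str             # what happens in the video
    duration_seconds: float
    resolution: tuple[int, int]  # (width, height)

# One made-up pair, echoing the article's example prompt.
pair = VideoDescriptionPair(
    video_uri="gs://example-bucket/clips/freeway_001.mp4",
    description="A car races down a freeway, camera tracking alongside.",
    duration_seconds=8.0,
    resolution=(1280, 720),
)
```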

While DeepMind, through Google, hosts tools to let webmasters block the lab's bots from extracting training data from their websites, DeepMind doesn't offer a mechanism to let creators remove works from its existing training sets. The lab and its parent company maintain that training models using public data is fair use, meaning that DeepMind believes it isn't obligated to ask permission from data owners.

Not all creatives agree, particularly in light of studies estimating that tens of thousands of film and TV jobs could be disrupted by AI in the coming years. Several AI companies, including the eponymous startup behind the popular AI art app Midjourney, are in the crosshairs of lawsuits accusing them of infringing on artists' rights by training on content without consent.

"We're committed to working collaboratively with creators and our partners to achieve common goals," Collins said. "We continue to work with the creative community and people across the wider industry, gathering insights and listening to feedback, including those who use VideoFX."

Thanks to the way today's generative models behave when trained, they carry certain risks, like regurgitation, where a model generates a mirror copy of its training data. DeepMind's solution is prompt-level filters, including for violent, graphic, and explicit content.
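DeepMind hasn't described how its prompt-level filters work, so the sketch below is only a minimal illustration of where such a gate sits in a generation pipeline, using a naive keyword check; a production filter would almost certainly rely on learned classifiers instead.

```python
# Naive illustration of a prompt-level filter. DeepMind's actual filters
# are undisclosed and surely more sophisticated; the point is only that
# the check runs before the prompt ever reaches the video model.
BLOCKED_TERMS = {"graphic violence", "gore"}  # placeholder category list

def is_prompt_allowed(prompt: str) -> bool:
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

def generate_video(prompt: str) -> None:
    if not is_prompt_allowed(prompt):
        raise ValueError("Prompt rejected by safety filter")
    # ...hand the prompt to the model only after it passes the gate...

generate_video("a car racing down a freeway")  # passes the filter
```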

Google's indemnity policy, which provides a defense for certain customers against allegations of copyright infringement stemming from the use of its products, won't apply to Veo 2 until it's generally available, Collins said. To mitigate the risk of deepfakes, DeepMind says it's using its proprietary watermarking technology, SynthID, to embed invisible markers into the frames Veo 2 generates. However, like all watermarking tech, SynthID isn't foolproof.
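SynthID's algorithm is proprietary, and a toy scheme can't capture its robustness to edits and re-encoding. Still, the general idea of an imperceptible, machine-readable mark in each frame can be illustrated with a simple least-significant-bit watermark; this is emphatically not how SynthID works.

```python
# Toy invisible watermark: write a bit pattern into the least significant
# bit of each frame's blue channel. Invisible to viewers, trivially
# machine-readable, and (unlike SynthID) not robust to compression.
import numpy as np

def embed_watermark(frame: np.ndarray, bits: np.ndarray) -> np.ndarray:
    marked = frame.copy()
    blue = marked[..., 2].reshape(-1)  # flattened copy of the blue channel
    n = min(blue.size, bits.size)
    blue[:n] = (blue[:n] & 0xFE) | (bits[:n] & 1)
    marked[..., 2] = blue.reshape(marked.shape[:2])
    return marked

def extract_watermark(frame: np.ndarray, n: int) -> np.ndarray:
    return frame[..., 2].reshape(-1)[:n] & 1

# Round-trip check on a random 720p RGB frame.
frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
payload = np.random.randint(0, 2, size=64, dtype=np.uint8)
assert np.array_equal(
    extract_watermark(embed_watermark(frame, payload), 64), payload
)
```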

In addition to Veo 2, Google DeepMind this morning announced upgrades to Imagen 3, its commercial image generation model. A new version of Imagen 3 is rolling out to users of ImageFX, Google's image-generating tool, beginning Monday. It can create brighter, better-composed images and photos in styles like photorealism, impressionism, and anime, per DeepMind.

This upgrade to Imagen 3 also follows prompts more faithfully and renders richer details and textures, DeepMind wrote in a blog post provided to TechCrunch.

Rolling out alongside the model are UI updates to ImageFX. Now, when users type prompts, key terms in those prompts will become chiplets with a drop-down menu of suggested, related words. Users can use the chips to iterate on what they've written or select from a row of auto-generated descriptors beneath the prompt.