Veo 2 is Google DeepMind's next-generation video-generating AI, capable of creating 2-minute-plus clips in resolutions up to 4K (4096x2160 pixels). This is 4x the resolution and over 6x the duration of OpenAI's Sora, which can produce up to 1080p, 20-second clips. However, in Google's experimental tool VideoFX, Veo 2 videos are currently capped at 720p and 8 seconds.
Veo 2 features an improved understanding of physics and camera controls, producing clearer footage with sharper textures, especially in scenes with movement. It can more realistically model motion, fluid dynamics, and properties of light like shadows and reflections. Additionally, it offers enhanced camera positioning and movement for capturing objects and people from different angles.
Veo 2 struggles with coherence and consistency over long durations, particularly with complex prompts. Character consistency, intricate details, and fast, complex motions remain challenging. The model also exhibits issues like lifeless eyes in animations, physically impossible facades, and blending of pedestrians and backgrounds.
DeepMind uses prompt-level filters to mitigate risks like regurgitation of training data and employs its proprietary watermarking technology, SynthID, to embed invisible markers in Veo 2-generated frames. However, the lab does not offer a mechanism for creators to remove their works from existing training sets, maintaining that training on public data is fair use.
DeepMind collaborates with creators like Donald Glover and The Weeknd to understand their creative processes and refine its video generation models. Feedback from these collaborations informed the development of Veo 2, and DeepMind continues to work with trusted testers and creators to improve the model.
Google DeepMind also announced upgrades to Imagen 3, its commercial image generation model. The new version creates brighter, better-composed images in styles like photorealism, impressionism, and anime. It also follows prompts more faithfully and renders richer details and textures. UI updates to ImageFX include chiplets for key terms in prompts, allowing users to iterate or select auto-generated descriptors.
This is TechCrunch.
Google DeepMind, Google's flagship AI research lab, wants to beat OpenAI at the video generation game, and it might just, at least for a little while. On Monday, DeepMind announced Veo 2, a next-gen video-generating AI and the successor to Veo, which powers a growing number of products across Google's portfolio. Veo 2 can create 2-minute-plus clips in resolutions up to 4K (4096x2160 pixels).
Notably, that's 4x the resolution and over 6x the duration OpenAI's Sora can achieve. It's a theoretical advantage for now, granted. In Google's experimental video creation tool, VideoFX, where Veo 2 is now exclusively available, videos are capped at 720p and 8 seconds in length. Sora can produce up to 1080p, 20-second-long clips.
VideoFX is behind a waitlist, but Google says it's expanding the number of users who can access it this week. Eli Collins, VP of Product at DeepMind, also told TechCrunch that Google will make Veo 2 available via its Vertex AI developer platform as the model becomes ready for use at scale.
"Over the coming months, we will continue to iterate based on feedback from users," Collins said, "and we will look to integrate Veo 2's updated capabilities into compelling use cases across the Google ecosystem. We expect to share more updates next year."
Like Veo, Veo 2 can generate videos given a text prompt, e.g., a car racing down a freeway, or text and a reference image. So what's new in Veo 2? Well, DeepMind says the model, which can generate clips in a range of styles, has an improved understanding of physics and camera controls, and produces clearer footage.
By clearer, DeepMind means textures and images in clips are sharper, especially in scenes with a lot of movement. As for the improved camera controls, they enable Veo 2 to position the virtual camera in the videos it generates more precisely and to move that camera to capture objects and people from different angles.
DeepMind also claims that Veo 2 can more realistically model motion, fluid dynamics, and properties of light, such as shadows and reflections. That includes different lenses and cinematic effects, DeepMind says, as well as nuanced human expression.
DeepMind shared a few cherry-picked samples from Veo 2 with TechCrunch last week. For AI-generated videos, they looked pretty good. Exceptionally good, even. Veo 2 seems to have a strong grasp of refraction and tricky liquids, like maple syrup, and a knack for emulating Pixar-style animation.
But despite DeepMind's insistence that the model is less likely to hallucinate elements like extra fingers or unexpected objects, Veo 2 can't quite clear the uncanny valley. Note the lifeless eyes in the video of the cartoon dog-like creature embedded in the text version of this article, and the weirdly slippery road in the footage of a car driving, also embedded there, plus the pedestrians and the background blending into each other and the buildings with physically impossible facades.
Collins admitted that there's work to be done. "Coherence and consistency are areas for growth," he said. "Veo can consistently adhere to a prompt for a couple minutes, but it can't adhere to complex prompts over long horizons.
"Similarly, character consistency can be a challenge. There's also room to improve in generating intricate details, fast and complex motions, and continuing to push the boundaries of realism." DeepMind is continuing to work with artists and producers to refine its video generation models and tooling, added Collins.
"We started working with creatives like Donald Glover, The Weeknd, d4vd, and others since the beginning of our Veo development to really understand their creative process and how technology could help bring their vision to life," Collins said. "Our work with creators on Veo 1 informed the development of Veo 2, and we look forward to working with trusted testers and creators to get feedback on this new model."
Veo 2 was trained on lots of videos. Provided with example after example of some form of data, AI models pick up on patterns in the data that allow them to generate new data. DeepMind won't say exactly where it scraped the videos to train Veo 2, but YouTube is one possible source.
Google owns YouTube, and DeepMind previously told TechCrunch that Google models like Veo may be trained on some YouTube content. "Veo has been trained on high-quality video description pairings," Collins said. "Video description pairs are a video and associated description of what happens in that video."
While DeepMind, through Google, hosts tools to let webmasters block the lab's bots from extracting training data from their websites, DeepMind doesn't offer a mechanism to let creators remove works from its existing training sets. The lab and its parent company maintain that training models using public data is fair use, meaning that DeepMind believes it isn't obligated to ask permission from data owners.
Not all creatives agree, particularly in light of studies estimating that tens of thousands of film and TV jobs could be disrupted by AI in the coming years. Several AI companies, including the eponymous startup behind the popular AI art app Midjourney, are in the crosshairs of lawsuits accusing them of infringing on artists' rights by training on content without consent.
"We're committed to working collaboratively with creators and our partners to achieve common goals," Collins said. "We continue to work with the creative community and people across the wider industry, gathering insights and listening to feedback, including those who use VideoFX."
Thanks to the way today's generative models behave when trained, they carry certain risks, like regurgitation, which refers to when a model generates a mirror copy of training data. DeepMind's solution is prompt-level filters, including for violent, graphic, and explicit content.
Google's indemnity policy, which provides a defense for certain customers against allegations of copyright infringement stemming from the use of its products, won't apply to Veo 2 until it's generally available, Collins said. To mitigate the risk of deepfakes, DeepMind says it's using its proprietary watermarking technology, SynthID, to embed invisible markers into frames Veo 2 generates. However, like all watermarking tech, SynthID isn't foolproof.
In addition to Veo 2, Google DeepMind this morning announced upgrades to Imagen 3, its commercial image generation model. A new version of Imagen 3 is rolling out to users of ImageFX, Google's image-generating tool, beginning Monday. It can create brighter, better-composed images and photos in styles like photorealism, impressionism, and anime, per DeepMind.
"This upgrade to Imagen 3 also follows prompts more faithfully and renders richer details and textures," DeepMind wrote in a blog post provided to TechCrunch. Rolling out alongside the model are UI updates to ImageFX.
Now, when users type prompts, key terms in those prompts will become chiplets with a drop-down menu of suggested, related words. Users can use the chips to iterate on what they've written or select from a row of auto-generated descriptors beneath the prompt.