
Beyond Uncanny Valley: Breaking Down Sora

2024/2/24

a16z Podcast

People
Anjney Midha
Stefano Ermon
Topics
@Anjney Midha: The speed with which Sora appeared and the quality of its videos were shocking, far exceeding expectations. It marks a major breakthrough in AI video generation, though the technology is still early and has a lot of room to improve. Anjney also discusses Sora's training costs and how widespread AI-generated video may become: while training costs are currently high and borne mainly by large companies, they are likely to fall as the technology advances, and inference costs should drop as model distillation and compression improve. He digs into the challenge of acquiring and captioning high-quality video data, the different strategies startups and large companies may take toward video data, and how human-in-the-loop pipelines can combine manual and automatic labeling to caption video efficiently. Finally, he explores how larger context windows affect the flexibility of video models, and how techniques such as attention-based methods, embeddings, ring attention, flash attention, and state space models can be used to extend a video model's context window.

@Stefano Ermon: Sora's success comes from combining diffusion models with the transformer architecture. Diffusion models are more stable and easier to train than GANs, and they can exploit a very deep computation graph at inference time without paying a heavy price during training. Stefano explains in detail why video generation is harder than text or image generation: higher compute costs, limited high-quality publicly available video datasets, and content that is more complex than images. He walks through the techniques Sora likely uses, including a transformer-based architecture, latent encodings that compress the data for efficiency, and possibly synthetic data to improve training-data quality. He attributes the model's ability to generate long, coherent videos to high-quality training data and to the model learning concepts such as physics and object permanence, which it may pick up because that knowledge helps it compress and predict video data. Looking ahead, he expects other companies to reach similar performance, though OpenAI may stay in the lead; he also argues that larger context windows are very useful for video understanding and generation, and that a range of techniques can extend them. Finally, he discusses what AI video generation means on the road to general artificial intelligence: high-quality video models can act as world simulators and provide valuable knowledge for building agents that interact with the real world.


Transcript


Yeah, honestly, I was very, very surprised. I know the two of us often talk about how quickly the field is moving, and it's hard to keep track of all the things that are happening, but I was not expecting a model this good to come out so soon.

We've generally converged on it being a question of when, not if.

I thought it was maybe six months out, a year out. And so I was shocked when I saw those videos. The quality of the videos, the length, the ability to generate sixty-second videos: I was really amazed.

This is obviously the worst that this technology will ever be. We're at the earliest stages of progress here.

I always found that that is one of the secret weapons of diffusion models, and why they are so effective in practice.

If you were to ask many people at the beginning of 2024 when we would get high-fidelity, believable AI-generated video, most would have said that we were years away. But on February 15, OpenAI surprised the world with examples from its new model, Sora, bringing those predictions down from years to weeks. And of course, the emergence of this model, with its impressive modeling of physics and its videos of up to sixty seconds, has spurred much speculation around not only how this was accomplished, but also how it happened so soon.

And although OpenAI has stated that the model uses a transformer-based diffusion model, the results have been so good that some have even questioned whether explicit 3D modeling or a game engine was involved. So naturally, we decided to bring in an expert, sitting down with a16z general partner Anjney Midha and professor of computer science at Stanford, Stefano Ermon, whose group pioneered some of the earliest diffusion models and their applications in generative AI. Of course, these approaches lay the foundation for the very diffusion models deployed in Sora, not to mention other household names like ChatGPT and others.

And perhaps most importantly, Stefano has been working on generative AI for more than a decade, long before many of us had even an inkling of what was to come. So throughout this conversation, Stefano breaks down why video has historically been much harder than its text and image counterparts, how a model like Sora might work, and what all of this means for the road ahead.

And of course, if you want to stay updated on all things AI, make sure to go check out a16z.com/ai. Enjoy.

As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.

This is a conversation I've wanted to have with you for a while. We've been talking about flavors of this conversation for a long time, but given how quickly things are heating up in the space, it seemed like a good time for us to check in and see how the assumptions we've been talking about, about the future of diffusion models and video models, are tracking. You're the world's expert in this area of research, so I think it would be great to start with your lab's involvement in the origins.

I'm excited to be here. Hi everyone, my name is Stefano and I'm a professor of computer science at Stanford, working in AI.

And I've actually been working in generative AI for more than ten years, way before these things were cool. I teach a class at Stanford on deep generative models, something I started back in, I think, 2018, and I think it was the first in the world on this topic.

And yeah, I encourage you to check it out. The course is called CS236; there is a lot of material there if you want to dig deeper into how these methods work. And I've been doing research on generative models for a long time, as you mentioned, with my former student Yang Song, who is now at OpenAI.

We did some of the early work on diffusion models, or score-based models as we used to call them back then, back in 2019. At the time, generative models of images, video, audio, these kinds of continuous data modalities, were really dominated by GANs, generative adversarial networks. And we were really the first to show that it was actually possible to beat GANs at their own game using this new class of generative models called diffusion models, where we essentially generate content, generate images, by starting from pure noise and progressively denoising it with a neural network until we turn it into a beautiful sample. We developed a lot of the theory behind these models: how to train them, how to do score matching, a lot of the initial architectures.

And some of those choices are still around today. And I think that work really kick-started a lot of the exciting things we're seeing today around diffusion models: Stable Diffusion, and Sora, of course.
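
To make the "start from pure noise and progressively denoise it" recipe concrete, here is a minimal, self-contained sketch of DDPM-style ancestral sampling. The noise predictor `model(x, t)` and the linear beta schedule are illustrative assumptions for the sake of the example, not a description of any particular released system.

```python
# Minimal sketch of diffusion sampling: start from Gaussian noise and
# repeatedly apply a learned noise predictor to denoise it, step by step.
# `model(x, t)` (a network trained to predict the added noise) is a placeholder.
import torch

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    betas = torch.linspace(beta_start, beta_end, T)     # noise added at each step
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)           # cumulative signal retained
    return betas, alphas, alpha_bars

@torch.no_grad()
def sample(model, shape, T=1000):
    betas, alphas, alpha_bars = make_schedule(T)
    x = torch.randn(shape)                              # pure noise
    for t in reversed(range(T)):                        # walk the schedule backwards
        eps_hat = model(x, torch.full((shape[0],), t))  # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise         # one reverse-diffusion update
    return x
```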

And in addition to that early work on the foundations of diffusion models, we worked on a number of other aspects of these models, like DDIM, a pretty widely used, efficient sampling procedure that allows you to generate images very quickly without losing too much quality. With my student Chenlin Meng, who is a co-founder of Pika Labs, we developed SDEdit, one of the first methods to do controllable generation, to generate images based on sketches and things like that. And so I'm excited to be here today and discuss what's coming next.
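
A hedged sketch of what a DDIM-style deterministic sampler looks like, for readers who want to see why it is fast: by estimating the clean sample and jumping along a short subsequence of timesteps, you can replace a thousand denoising steps with a few dozen. `model` and `alpha_bars` are the same illustrative placeholders as in the previous sketch.

```python
# Sketch of deterministic DDIM-style sampling over a short subsequence of
# timesteps; far fewer network evaluations than full ancestral sampling.
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bars, num_steps=50):
    T = len(alpha_bars)
    ts = torch.linspace(T - 1, 0, num_steps).long()      # e.g. 50 of the 1000 steps
    x = torch.randn(shape)
    for i in range(len(ts) - 1):
        t, t_prev = int(ts[i]), int(ts[i + 1])
        eps_hat = model(x, torch.full((shape[0],), t))
        # estimate the clean sample, then re-noise it to the earlier timestep
        x0_hat = (x - torch.sqrt(1 - alpha_bars[t]) * eps_hat) / torch.sqrt(alpha_bars[t])
        x = torch.sqrt(alpha_bars[t_prev]) * x0_hat + torch.sqrt(1 - alpha_bars[t_prev]) * eps_hat
    return x
```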

So given all of that work, all of the experience you've had in really the ground truth of diffusion models and their limitations, what was your reaction to seeing a model like Sora come out last week?

Yeah, honestly, I was very, very surprised. I know the two of us often talk about how quickly the field is moving; it's hard to keep track of all the things that are happening.

And I was not expecting a model this good to come out so soon. I mean, I don't think there was anything fundamentally impossible about it, and I knew it was coming.

It was just a matter of time, of more research, more investment, more people working on these things. But I was not expecting something that good to happen so soon. I thought it was maybe six months out, a year out. And so I was shocked when I saw those videos. The quality of the videos, the length, the ability to generate sixty-second videos: I was really amazed.

Yeah. I do think every time we've talked in the past, we've generally converged on it being a question of when, not if, we'd see video generation get this good.

And so it's reassuring to hear you were as surprised as I was when it came out. Before we get into the details of what some of the breakthroughs were on this timeframe, maybe we can just spend a few minutes on a video diffusion 101 for folks who may not be as familiar with these models. Why has video diffusion been so much more complex than text or image generation? And what have historically been the main blockers to making it work?

Yeah, that's a great question.

At a very high level, you can think of a video as just a collection of images. And so really, the first challenge you have to deal with is that you're generating multiple images at the same time, and the compute cost you need to process N images is at least N times larger than what you would pay if you just wanted to process one of them at a time. And basically, this means a lot more compute, a lot more memory; it's just much more expensive to train a large-scale model on video data. The other challenge is the data challenge.

I think a lot of the success we've seen in diffusion models for images was partially due to the availability of publicly available data, datasets like LAION, large-scale image-and-caption datasets scraped from the internet, which were made available so people could use them to train large-scale models. I think we don't quite have that for video. There is a lot of video data, but the quality is kind of a mixed bag, and we don't have a good way to filter it or screen it, and there's not a go-to dataset that is available and that everybody is using to train models. So I'm guessing some of the innovations that went into the Sora model were actually around just selecting good-quality data to train the model on. Captions are also hard to get for video. I mean, the video data is out there, but getting good labels, good descriptions of what's happening in the videos, is challenging, and you need those if you want good control over the kind of content you generate with these models.

And then there is also the challenge that video content is just more complex; there is more going on. If you think about a sequence of images, as opposed to just one, there are complex relationships between the frames: there is physics, there is object permanence. In principle, a high-capacity model with enough compute and enough data can learn these things, but it was always an empirical question: how much data are you going to need? How much compute are you going to need? When is it going to happen? Is the model really going to discover all these high-level concepts and statistics of the data? And it was surprising to see that it's doing so well.

You just laid out very clearly what the general obstacles have been, both on architecture, on datasets, and on representations of the world through video as a format. Since the release came out last week, there has been a lot of speculation around how this model achieves such impressive results.

Some folks are even speculating that there might be a game engine or 3D model, some sort of explicit 3D modeling, involved in the inference pipeline. But OpenAI's technical report describes the approach by saying that they trained a text-conditional diffusion model jointly on videos and images of different durations and resolutions, and then applied a transformer architecture on spacetime patches of video and image latent codes. So could you break that down in layman's terms, for folks who might not be as familiar with scaling laws and what's going on here?

Sure. Yeah, I can try, knowing that there is certainly some secret sauce here; I can only try to read between the lines of what they said in the release. The idea of training on videos and images jointly is nothing new. It seems like one technical difference they are hinting at is the use of a transformer architecture for the backbone, for the denoiser, for the score model.

People have often used convolutional architectures, going back to the days when Yang initially started using U-Nets as the score model, which was actually a key innovation that really enabled a lot of the success on images. People have largely ported those kinds of architectures over to video data as well, because they make sense: we expect a lot of convolutional structure, so a convolutional architecture seems like a good idea.

It seems like they moved on to a purely transformer-based architecture, probably following the work by Saining Xie and William Peebles, who did some of the initial work in this space on developing good transformer, ViT-based architectures for diffusion models. It's possible that that gives you better scaling with respect to compute and data and just happens to work better. But they are also referring to latent codes.

So it seems like it's unlikely that they're working directly in pixel space. Working on latent representations was one of the key innovations behind Stable Diffusion, latent diffusion: the idea of first compressing the data into a latent representation that is a little bit smaller, a little bit more compact. As you'd expect, there is a lot of redundancy if you think about the different frames in a video, so it might be possible to compress it, almost losslessly, into a lower-dimensional representation. And if you can do that, then you can train on this lower-dimensional representation, and you get a much better trade-off in terms of the compute you have to pay and the memory you need to process the data.

So they might have figured out a better way to encode video into a semantically meaningful, kind of low-dimensional latent space. I would say this doesn't rule out the possibility that they've used game engines or 3D models to generate training data. I mean, as we discussed before, I think the quality of the training data is really crucial. And it's possible that they've used synthetic data generated by game engines or 3D models to generate the kind of data they want to see, where there is a lot of motion, and where they can probably use the internals of the engine to get very good captions about what's happening in the video, so they get a very good match between the text and the videos they're trying to generate. So, yes, it does seem like it's a purely data-driven approach, but it's possible that they used other pipelines to generate synthetic data.
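
As a rough illustration of the latent-diffusion recipe being described here (compress first, then run diffusion in the smaller space), this is a hedged sketch of one training step. The `encoder` and `denoiser` modules, the 5D latent shape, and the noise schedule are assumptions made for the example; none of this is Sora's actual pipeline.

```python
# Sketch of one latent-diffusion training step: encode video to a compact
# latent, noise it at a random timestep, and train the denoiser to predict
# that noise (denoising score matching). All modules are placeholders.
import torch

def train_step(encoder, denoiser, optimizer, video, alpha_bars):
    with torch.no_grad():
        z = encoder(video)                             # (B, T, c, h, w): far smaller than pixels
    B = z.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))        # random noise level per sample
    a = alpha_bars[t].view(B, 1, 1, 1, 1)
    eps = torch.randn_like(z)
    z_t = torch.sqrt(a) * z + torch.sqrt(1 - a) * eps  # forward noising in latent space
    loss = ((denoiser(z_t, t) - eps) ** 2).mean()      # simple noise-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```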

And when you contrast the diffusion-transformer approach they took with the prior generations and families of generative models, whether that was recurrent nets, GANs, or just vanilla autoregressive transformers, why is it that a diffusion model here seemed, again, to be uniquely the best tool for the job?

I think, prior to diffusion models, people were using GANs, generative adversarial networks. They're pretty good, they're pretty flexible, but one challenge is that they're very unstable to train. And so that was actually one of the main reasons we developed diffusion models in the first place.

We wanted the flexibility of basically being able to train a neural network as the backbone, but with a more principled, statistical way of training the models, one that gives you stable training losses, where you can just keep training and the model keeps getting better and better. Autoregressive models also have the property that they are trying to compress the data, and with enough capacity and enough compute, in principle they can do a pretty good job at anything, including video.

They just tend to be very, very slow, because you have to generate one token at a time, and if you think of video, there are a lot of tokens, so they have not been the model of choice for that reason. Diffusion models, on the other hand, can generate basically all the tokens essentially in parallel, and so they can be much faster, and that's one of the reasons they are preferred, I think, for these kinds of modalities. Another, more philosophical reason is that if you think about a diffusion model, in some sense, at inference time you have access to a very deep computation graph, where essentially you can apply the neural network a thousand times. Or if you take a continuous-time perspective, like a differential-equation perspective, you can even have an infinitely deep kind of computation graph that you can use to generate content. At the same time, you don't have to unroll the entire computation graph at training time, because the models are trained by score matching, and that way of training, of trying to make the model better and better, never requires you to pay a huge price at training time. And so I always found that that is one of the secret weapons of diffusion models, and why they are so effective in practice: they allow you to use a lot of resources at inference time without having to pay that price during training.
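
One way to see why training never has to unroll that deep inference-time graph: the denoising score matching loss is evaluated one noise level at a time, on independently corrupted samples. A standard form of the objective from the diffusion literature (quoted here as general background, not as anything Sora-specific) is

$$
\min_\theta \; \mathbb{E}_{t \sim \mathcal{U}\{1,\dots,T\}}\, \mathbb{E}_{x_0 \sim p_{\mathrm{data}}}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \Big[ \lambda(t)\, \big\| \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\big) - \epsilon \big\|_2^2 \Big],
$$

where $\bar{\alpha}_t$ is the cumulative noise schedule and $\lambda(t)$ a weighting. Each term only asks the network to denoise a single corrupted sample at a single noise level, so gradients never flow through the thousand-step sampling chain; that chain only appears at inference time.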

And to our earlier point about why this breakthrough happened so much faster than many expected, it sounds like the stability of the diffusion-transformer model, and being able to swap out, essentially, training time for inference time, which is much cheaper, much more parallelizable, much more efficient, was a big contributor to compressing the training timelines here.

Yeah, and then there is the question of what the backbone is, right? That could be a convolutional network, that could be a state space model, that could be a transformer.

And there, I think we're still just scratching the surface in terms of what works best, and all sorts of combinations are possible, right? You can build autoregressive models that are convolutional, autoregressive models that are based on transformers, autoregressive models that are based on state space architectures. And similarly, you can build diffusion models that are based on convolutional architectures, which is what people have tended to do. And now it seems like maybe OpenAI is pushing towards using just a transformer as the backbone in a diffusion model.

And I'm starting to see people experiment with state space models, which might allow for very long context, for example. So I think there is an exciting space of different kinds of combinations that we can try, and that might give us better scaling, better properties, and really get us to the kind of quality we want to see from these models.

One of the most elegant parts of the transformer backbone architecture is that it works really well with the idea of tokenization, right? In language models, so much of the scaling-laws work that allowed models like GPT-3 and 4 to be developed so quickly and generalize to all kinds of tasks was that the process of tokenizing language allows almost like a translation into a format that the models can understand, across many, many different types of language, whether that's good old-fashioned English, or code, or health records, or, in some cases, multilingual datasets.

And so the beauty of tokenization is that it's this one-size-fits-all process of turning language data into a format that the transformer backbone really understands well and is able to learn on. It seems like there was a similar key unlock here around how visual data was broken down into small patches, right? They essentially tokenized image and video data into this intermediate representation of a patch. Did this approach create meaningfully better output than other models we've seen in the past?

Yes, it's a great question. Honestly, I don't know the answer. Tokenization makes a lot of sense for discrete data; I'm less of a fan of tokenizing or patchifying images and video and audio.

You have to do it if you want to use a transformer architecture, but it makes less sense to me, just because the data is continuous, the patches are somewhat arbitrary, and you're losing some of the structure by going through tokenization. But you kind of have to do it if you want to use transformers. And transformers are great because they scale well, we have very good implementations, and they are very hardware-friendly. So maybe it's the way to go, the bitter lesson once again. But it feels like what they have is some kind of latent representation.

And maybe once you go to a latent space, then tokenizing it makes more sense, because you've already lost a lot of the structure. And so it might be a combination of both. I mean, it seems like they might have access to a very good latent space where they get rid of a lot of the redundancy that exists in natural data.

Especially in videos, two frames next to each other are very similar, right? There is a lot of redundancy in videos. And if they got rid of some of that through a clever encoding scheme, and then applied tokenization on top, I think that starts to make more sense and makes things more scalable: less compute, less memory, just better.
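
To make the patch idea concrete, here is a small sketch of cutting a (latent) video tensor into spacetime patches and flattening each one into a token for a transformer. The tensor layout and patch sizes are arbitrary illustrative choices, not the ones used by any particular model.

```python
# Sketch: turn a (B, T, C, H, W) latent video into a sequence of spacetime
# patch tokens, each covering pt frames by a ph x pw spatial block.
import torch

def spacetime_patchify(video, pt=2, ph=16, pw=16):
    B, T, C, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(B, T // pt, pt, C, H // ph, ph, W // pw, pw)
    x = x.permute(0, 1, 4, 6, 2, 5, 7, 3)        # (B, T/pt, H/ph, W/pw, pt, ph, pw, C)
    return x.reshape(B, -1, pt * ph * pw * C)    # (B, num_tokens, token_dim)

tokens = spacetime_patchify(torch.randn(1, 16, 4, 32, 32))
print(tokens.shape)  # torch.Size([1, 32, 2048]): 8 * 2 * 2 tokens, each of dim 2*16*16*4
```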

We've seen so many text-to-video models, but very few were able to actually generate longer-form videos, right? More than a few seconds. And there were often issues with temporal coherence and consistency, even across those short-form generations of three to five seconds. And here we've got a model that is able to do minute-long, sixty-second generations, and arguably some of the long-form generations are actually dramatically better than the short-form ones.

From Sora, you start to see these emergent properties of temporal coherence in the sixty-second clips, right? What's going on there? What are they doing differently that enables these videos to have such amazing continuity over long lengths, and temporal coherence of the subjects across those lengths?

I think that was the most surprising aspect of Sora: the ability to generate long content that is coherent and consistent and beautiful. I think that was the part that really amazed me, because I know it's very hard to do, because, exactly as you said, you have to keep track of a lot of things to make everything consistent. And the model doesn't know what's important to keep track of and what's not.

And somehow the model they've trained seems to be able to do it. And it's not entirely surprising: at the end of the day, these models are trained to essentially compress the training data. So if you have high-quality training data, data that is, of course, consistent with physics and has the right properties we would expect a real, natural, good-quality video to have, then in order to compress that data as effectively as possible, the model should learn about physics, should learn about object permanence, should learn about 3D geometry, all of that. What's surprising is that there are many other kinds of spurious correlations that the model could pick up on instead, and what was surprising to me is that it really seems able to learn some of the right structure.

We don't know why. I think it's one of the mysteries of deep learning. It's probably a combination of the training data, the right architectures, and scale. But it was amazing.

And the 3D properties in their videos emerge without any explicit inductive biases for 3D objects, right? They are purely a phenomenon of scale. What does that mean? Is physics an emergent property?

Well, it's not inconceivable. At the end of the day, physics is a framework that helps you understand the world, that helps you make better predictions. Like, if I understand Newton's laws, I can make predictions about what's going to happen if I drop an object. It's a very simple formula that allows me to make a lot of different predictions.

So if I'm being asked to compress a lot of videos, if I know Newton's laws, if I know some physics, I can probably do a better job at predicting what the next frame is going to look like, right? And at the end of the day, even though these models are trained by score matching, it is possible to relate, in a very formal sense, the training objective that we are using to train the diffusion model to a compression-based objective: literally just trying to compress video data, in this case, as much as you can.
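
For readers who want the formal version of that compression link, one standard result from the score-based diffusion literature (general background, not anything published about Sora specifically) is a bound roughly of the form

$$
\mathbb{E}_{p_{\mathrm{data}}}\big[-\log p_\theta(x_0)\big] \;\le\; \frac{1}{2} \int_0^T g(t)^2\, \mathbb{E}\Big[ \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|_2^2 \Big]\, dt \;+\; C,
$$

where $g(t)$ is the diffusion coefficient of the forward noising process and $C$ collects terms that do not depend on $\theta$. Since negative log-likelihood in nats is an achievable code length, pushing the weighted score-matching error down is, up to constants, pushing down the number of bits the model needs to describe a video; structure like physics or object permanence that improves prediction improves that bound.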

And so it's quite possible that knowing something about physics, knowing something about camera views and the 3D structure of objects and object permanence, that these kinds of properties are helpful in compressing data, because they reveal structure that is useful for making predictions, which means you can compress the data better. What's exciting is that this emerges just by training the model, right?

You could imagine other kinds of spurious correlations that exist in the training data, but they are not as useful or as predictive as Newton's laws, or a real physics understanding of the scenes and what's going on. And it's hard to tell what exactly is going on inside. It's possible that there is no real understanding of physics, but it's certainly very effective, and at the end of the day, maybe that's enough.

What you seem to be pointing out there is, if the data that these models are trained on is kind of like their diet, and you are what you eat, in a sense, then if you're eating a ton of physics, you end up better at physics as a model. So how should we begin to explain other emergent properties?

There was a clip that they shared called Bling Zoo, which was a single-prompt generated video that had multiple transitions baked in without any editing, multiple camera shots. It almost seemed like somebody had manually stitched together different camera angles, right? How should we explain that? Is that just essentially mimicking what it had seen in the training data, or is there something else going on there?

I imagine that, yes, if you have trained it on high-quality video data where you can see these transitions across different kinds of shots, then what these models do is try to understand what all the training videos have in common, what the high-level structure of these videos is, and they try to replicate that. A sufficiently good model might understand that the videos in the training set tend to have this structure where they transition across different views and shots and combine them in interesting ways, and it is able to replicate that. Again, what's magic here is that it's provably an impossible task in general, right? There are so many other ways of interpolating between the things you see in the training set, and most of them are wrong.

They are generalizations that you don't want to see. And somehow these neural networks are able to find the interpolations, the generalizations, that are the ones we want, the ones that make sense. They discover the kind of structure we want the model to replicate, as opposed to the structure that is just there by chance and that we don't want it to pick up. And that's what's amazing, and mostly unexplained at this point. We do not understand why this happens.

Right. So we're here now in early 2024, and the answer to when video models will get good enough to cross the uncanny valley has just arrived, right? We've just gotten to that point. So if we look ahead now, Sora is still in beta, but there are several other AI-generated video efforts out there, right? Realistically, how expensive do you think it's going to be to generate AI video at any kind of consumer scale, or readily available scale?

I'm sure this release from OpenAI has set off a lot of competitors racing to catch up, and I'm sure we'll see developments coming from all the various competitors in this space. I think there are the training costs, which are huge.

I'm sure they used thousands of GPUs to train Sora, and that scale was a big part of the success we've seen. And so it's definitely going to be out of reach for academics.

But there are going to be industrial players who have the resources to try to compete with them, to try to replicate what they did or achieve similar results in a different way. The good news is that now we have a proof of existence: we have a system that can do it. And so I think it's a lot easier to try to catch up; there's a lot less uncertainty about whether it would even be possible, because now we have an example.

It's feasible, and a lot of people will make the right investments. I don't know how long it's going to take, whether it's going to be six months or twelve months, but somebody will come up with similar performance, I would imagine, as we've seen in other areas where people do catch up eventually. The question is how far ahead OpenAI will be by then: how much better will their system be in six months, or in twelve months? That's hard to say.

The other question I think you're hinting at is inference: how expensive is it going to be to serve these models and provide video generation on demand to users, or personalized videos, all these really cool applications that could emerge from a really good video generation model. Again, there I'm pretty optimistic, especially because the underlying architecture is a diffusion model. Once you have a potentially big, expensive, large, clunky model that can generate high-quality results,

there has been a lot of success in distilling these models down into smaller ones that are almost as capable but way faster. So I'm pretty optimistic that once we get to high enough quality, it's going to be possible to get systems that can serve similar-quality results in a much less expensive way. So I'm pretty excited to see the kinds of crazy use cases people come up with once this technology becomes available.
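
One concrete and widely used version of that distillation idea is progressive distillation: a student denoiser is trained so that a single one of its steps matches two steps of a frozen teacher sampler, halving the step count each round. The sketch below uses assumed callables (`teacher_step`, `student`) mapping `(x_t, t_from, t_to)` to the sample at `t_to`; it illustrates the general technique and is not a claim about how any specific product is served.

```python
# Sketch of a progressive-distillation-style update: the student learns to
# cover two teacher steps with one of its own.
import torch

def distill_step(student, teacher_step, optimizer, x_t, t, t_mid, t_prev):
    with torch.no_grad():
        x_mid = teacher_step(x_t, t, t_mid)             # teacher: first short step
        x_target = teacher_step(x_mid, t_mid, t_prev)   # teacher: second short step
    x_pred = student(x_t, t, t_prev)                    # student: one step covering both
    loss = ((x_pred - x_target) ** 2).mean()            # match the teacher's two-step result
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```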

When we talk about compute, training and inference are a huge part of the calculus here, but there's the other bucket of costs that comes from the datasets required to actually train these models and get scaling laws to work. And in terms of training data for language models, there are already billions of data points that these labs can get across the web.

But for video, a lot of the data, like you were saying earlier, even if it exists, it's not particularly well labeled or captioned. And so how do you think video model teams will overcome that challenge? We've seen recently that Reddit agreed to a deal to license its data to Google for sixty million dollars. Do you think we'll see video production studios begin to license their content?

That's a good question. First of all, it would probably depend a little bit on whether you're thinking of startups versus more established industry players. I think startups might be willing to move fast and break things, maybe be less concerned about copyright, and just scrape data from the internet, train models, and worry about licensing later.

Larger players, their legal teams are very, very worried about big lawsuits, and so they'll want something that is properly licensed. It's going to be interesting to see whether, as you said, the studios, the people who currently own the content, will be willing to license it, because this could be an existential threat to their entire business model, whereas I can see Reddit licensing their stuff because it's maybe not as much of an existential threat.

The other thing you mentioned is labeling. It's a great point; it used to be a big challenge. I'm pretty optimistic, though, that people will be able to set up human-in-the-loop pipelines. We've seen great success in vision-language models, even video-language models. They are maybe not good enough to provide high-quality captions out of the box, but I could imagine that they could drastically speed up a human-in-the-loop pipeline, where they provide suggestions that can get fixed or improved by human labelers. So I'm pretty optimistic on the captioning front, that we'll be able to find pretty scalable solutions to that.

Is the logical path there to start with a human-in-the-loop implementation, build a working model of annotation there, and then ultimately move towards synthetic captioning?

It seems like that's the direction people are exploring and using, and there's been a lot of success in enhancing captions in synthetic ways using LLMs. So yeah, that would be my guess. I mean, honestly, I don't know to what extent the bottleneck is on the captioning, as opposed to just having access to good, high-quality videos in the first place. Even getting the raw, high-quality video data is non-trivial, and my understanding is that that's actually one of the things you would have to solve first.

Gotcha. Why don't we switch gears a little bit to the usage of these models? One of the things that consumers and creators have really found useful with language models, with generative models, is context windows, right? The bigger the context window, the more flexibility there is for the input.

You can give it more detail, more context. And there's been exponential progress on the language side in a very short amount of time: we've gone from tiny context windows to millions of tokens' worth of context. In video, do you expect a similar approach? Or is there some fundamental limitation?

I was actually just reading the Gemini 1.5 paper, and they were talking about these very long contexts: a million tokens, ten million. And actually, one of the applications they mentioned is video summarization, video understanding, like trying to process an hour-long video.

It seems like these very long contexts are going to be very useful for solving a variety of video processing and video understanding tasks. And therefore, I would be very surprised if they don't also end up being very useful for video generation. In fact, it's entirely possible that that's already a component that played a role in the OpenAI system, and that part of the reason they are able to generate long videos is because they can handle very long contexts and they are able to scale transformers to very long sequence lengths. It's entirely possible that that was part of the secret sauce. So again, I'm excited about either these attention-based ways to scale up the context, whether it's embeddings or ring attention, or clever, hardware-focused implementations like flash attention that can be scaled to very long contexts, or more research into things like state space models, which people are starting to use also in the context of diffusion models, and which might allow you to get to better, longer context. And so I think it's going to be an interesting space to watch, for sure.
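
As a toy illustration of why state-space-style models are attractive for long context: the whole history is folded into a fixed-size state, so the cost of processing a sequence grows linearly with its length rather than quadratically as with full attention. The sketch below is a plain linear recurrence with made-up dimensions, not any particular published architecture.

```python
# Toy linear state-space recurrence: h_t = A h_{t-1} + B u_t, y_t = C h_t.
# One pass over the sequence: O(T) time, with a fixed-size state.
import torch

def ssm_scan(A, B, C, u):
    """A: (n, n), B: (n, d_in), C: (d_out, n), u: (T, d_in) -> (T, d_out)."""
    h = torch.zeros(A.shape[0])
    outputs = []
    for u_t in u:                     # the state h summarizes everything seen so far
        h = A @ h + B @ u_t
        outputs.append(C @ h)
    return torch.stack(outputs)

# e.g. a 10,000-step sequence with a 64-dimensional state
y = ssm_scan(torch.eye(64) * 0.99, torch.randn(64, 8), torch.randn(4, 64), torch.randn(10_000, 8))
print(y.shape)  # torch.Size([10000, 4])
```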

Well, looking further out at the timelines and breakthroughs: this is obviously the worst that this technology will ever be, almost definitionally, right? We're at the earliest stages of progress here. Why did people underestimate the timeline so far? And what does that mean about what's to come next, and how quickly it's going to come?

Yeah, this is a really hard question. And I think I was wrong with my predictions on how long it was going to take to get to the point we're at right now with Sora. When things are moving exponentially fast, it's very hard to make predictions, and the errors can be pretty big. But it's going to be exciting, for sure.

When you look at video generation as a breakthrough in the broader journey of scaling laws towards generalizable artificial intelligence, whether you want to call it ASI or AGI, how do you quantify this advancement in that broader journey?

I'm very excited about this, because I tend to think of a better generative model as a pretty natural world model. I mean, as we were discussing before, in order to be able to generate videos of that quality, there must be some level of understanding of physics, of object permanence, of 3D structure. There must be a lot of knowledge that is somehow latent in these models.

And I'm excited about all the different ways in which we might be able to extract this knowledge and use it in different applications. Especially when we think about embodied agents, agents that really interact with the real world, I think the kind of knowledge that is embedded in these video models, knowledge that has been extracted by essentially watching a lot of videos, is going to be very, very useful.

And the other thing is, I think it's going to be pretty complementary to the kind of knowledge that, for example, is baked into an LLM. You can learn a lot about the world by reading books, essentially, but there is just a lot of experience that you can only get by seeing it.

It's more similar to the kind of experience you get as a child: you just walk around the world, you see things, and you learn about how the world works, just through your eyes, through video, essentially, right? So that's the kind of experience we are feeding, through video, to a diffusion model, and the kind of knowledge we might be able to extract from that is therefore going to be, I think, pretty useful.

I think the title of the blog post that they put out was "Video generation models as world simulators."

I think there is a lot of promise there. If you can do it at the pixel level, then in a sense you've solved the harder version of the problem, right? You can do a lot of things with that.

Whether you're building autonomous vehicles, whether you're building robots, or whether you just want a general agent that understands how the world works and combines the knowledge gained by crawling the internet with what it can see in the real world, I think it's all going to be pretty exciting.

Well, arguably, we would not be here as an industry if it wasn't for your lab, so it's exciting to see how your view of where the field was always going to go has accelerated.

I know we could spend hours talking about all the tricks that were required, from a research perspective, to get here and where we're going to go, but we're going to wrap up for today. Thank you so much for making the time. I'm sure we'll have more to talk about soon.

Thank you.

If you liked this episode, if you made it this far, help us grow the show: share it with a friend, or if you're feeling really ambitious, you can leave us a review at ratethispodcast.com/a16z. Candidly, producing a podcast can sometimes feel like you're just talking into a void. And so if you did like this episode, if you like any of our episodes, please let us know. We'll see you next time.