The following is a conversation with Yann LeCun, his third time on the podcast. He is the chief AI scientist at Meta, a professor at NYU, a Turing Award winner, and one of the seminal figures in the history of artificial intelligence. He and Meta AI have been big proponents of open-sourcing AI development and have been walking the walk by open-sourcing many of their biggest models, including Llama 2 and eventually Llama 3.
Also, Yann has been an outspoken critic of those people in the AI community who warn about the looming danger and existential threat of AGI. He believes that AGI will be created one day, but it will be good. It will not escape human control, nor will it dominate and kill all humans.
At this moment of rapid AI development, this happens to be a somewhat controversial position, and so it's been fun seeing Yann get into a lot of intense and fascinating discussions online, as we do in this very conversation. And now a quick few-second mention of each sponsor. Check them out in the description.
It's the best way to support this podcast. We've got Hidden Layer for securing your AI models, Element for electrolytes, Shopify for shopping for stuff online, and AG1 for delicious health. Choose wisely, my friends.

Also, if you want to get in touch with me, for whatever reason, maybe to work with our amazing team, go to lexfridman.com/contact. And now onto the full ad reads. Never any ads in the middle. I try to make these interesting. I don't know why I'm talking like this, but I am. There is a staccato nature to it.
Speaking of the staccato, and me playing a bit of piano. Anyway, if you skip these ads, please still check out the sponsors. We love them. I love them. Enjoy their stuff.
Maybe you will too. This episode is brought to you by a sponsor that's on theme, in context. See what I did there?

Since this is Yann LeCun, artificial intelligence, machine learning, one of the seminal figures in the field, of course you're going to have a sponsor that's related to artificial intelligence: Hidden Layer. They provide a platform that keeps your machine learning models secure.
There are many ways to attack machine learning models, large language models, all the stuff we talk about with Yann. There's a lot of really fascinating work, not just large language models, but the same for video, video prediction, tokenization where the tokens are in the space of concepts, in the space of, literally, letters, symbols. JEPA, V-JEPA, all that stuff they're open-sourcing, all the stuff they're publishing on, just really incredible. But that said, all of those models have security holes in ways that we can't even anticipate or imagine at this time.
And so you want good people to be trying to find those security holes, trying to be one step ahead of the people who are trying to attack. So here, especially when a company is relying on these models, you need to have a person in charge of saying, yeah, this model that you got from this place has been tested, has been secured, whether that place is Hugging Face or any other kind of repository or model-zoo kind of place.
I think the more and more we rely on large language models, or just AI systems in general, the more the security threats that are always going to be there become dangerous and impactful. So protect yourself. Visit hiddenlayer.com/lexfridman to learn more about how Hidden Layer can accelerate your AI adoption in a secure way.
This episode is also brought to you by Element, a thing I drink throughout the day. Drinking it now. When I'm on a podcast, you'll sometimes see me with a mug with a clear liquid in there that looks like water. In fact, it is not simply water.
It is water mixed with Element, watermelon salt flavor. What I do is I take one of those Powerade 28-fluid-ounce bottles, fill it up with water,
one packet of Element, water and salt, shake it up, put it in the fridge. That's it. I reuse the bottles and drink from a mug, or sometimes straight from the bottle. Either way, delicious.
Good for you, especially if you're doing fasting, especially if you do low-carb kinds of diets, which I do. You can get a sample pack for free with any purchase.
Try it at drinkLMNT.com/lex. This episode is also brought to you by Shopify, as I take a drink of Element. It is a platform designed for anyone, even me, to sell stuff anywhere with a great-looking store.
Mine is a basic one. I like a really minimal one. You can check it out if you go to lexfridman.com/store. I have shirts on there, if that's your thing.
It was so easy to set up. I imagine there are, like, a million features they have that can make it look better, all kinds of extra stuff you can do with the store.
But I use the basic thing, and the basic thing is pretty and good. I like basic, I like minimal. And they integrate with a lot of third-party apps, including what I use, which is on-demand printing.
So you can buy the shirt on Shopify, but it is printed and shipped by another company that I always keep forgetting, but I think it's called Printful, or I think it's Printify. I'm not sure. It doesn't matter. I think there are several integrations. You can check it out yourself.
For me, it works. I'm using the most popular one, Printful, I think it's called. Anyway, I look forward to your letters correcting me on my pronunciation. Shopify is great.
I'm a big fan of the good side of the machinery of capitalism: selling stuff on the internet, connecting people to the thing that they want, or rather the thing that would make their life better, both in advertisement and e-commerce and shopping in general. I'm a big believer that when that's done well, your life, legitimately, in the long term, becomes better.
So whatever system can connect one human to the thing that makes their life better is great. And I believe that Shopify is sort of a platform that enables that kind of system. You can sign up for a one-dollar-per-month trial period at shopify.com/lex.
That's shopify.com/lex, all lowercase. Go to shopify.com/lex to take your business to the next level today. This episode is also brought to you by AG1, an all-in-one daily drink to support better health and peak performance.
It is delicious. It is nutritious. And I ran out of words that rhyme with those two. Actually, let me use a large language model to figure out what rhymes with delicious. Words that rhyme with delicious include: ambitious, specious, capricious,
suspicious. So there you have it. Anyway, I drink it twice a day. I also put it in the fridge, and sometimes in the freezer.
So it gets, I guess, a little bit frozen, just like a little bit, just a little bit frozen. You get that, like, slushy consistency. I do that too sometimes. It's freaking delicious. It's delicious no matter what.
Warm, it's delicious. Cold, it's delicious. Lightly frozen, it's incredible. And of course it covers, like, the basic multivitamin foundation of what I think of as a good diet.
So it's just a great multivitamin. That's the way I think about it. So all the crazy stuff I do, the physical challenges, the mental challenges, all of that,
at least I've got AG1 there. They'll give you one month's supply of fish oil when you sign up at drinkag1.com/lex.
This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Yann LeCun.
You've had some strong statements, technical statements, about the future of artificial intelligence recently, throughout your career actually, but recently as well. You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like Llama 2 and 3 soon, and so on. How do they work, and why are they not going to take us all the way?
For a number of reasons. The first is that there are a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world; the ability to remember and retrieve things, persistent memory; the ability to reason; and the ability to plan. Those are four essential characteristics of intelligent systems or entities: humans, animals. LLMs can do none of those, or they can only do them in a very primitive way.
They don't really understand the physical world, they don't really have persistent memory, they can't really reason, and they certainly can't plan. And so, you know, if you expect the system to become intelligent without having the possibility of doing those things, you're making a mistake. That is not to say that autoregressive LLMs are not useful.
They're certainly useful. We can build a whole ecosystem of applications around them, of course we can, but as a path towards human-level intelligence, they're missing essential components. And then there is another tidbit or fact that I think is very interesting. Those LLMs are trained on enormous amounts of text, basically the entirety of all publicly available text on the internet, right? That's typically on the order of 10^13 tokens. Each token is typically two bytes, so that's 2 x 10^13 bytes as training data. It would take you or me 170,000 years to just read through this at eight hours a day. So it seems like an enormous amount of knowledge, right, that those systems can accumulate.

But then you realize it's really not that much data. If you talk to developmental psychologists, they tell you a four-year-old has been awake for 16,000 hours in his or her whole life, and the amount of information that has reached the visual cortex of that child in four years is about 10^15 bytes. And you can compute this by estimating that the optic nerve carries about 20 megabytes per second, roughly. And so 10^15 bytes for a four-year-old, versus 2 x 10^13 bytes for 170,000 years' worth of reading. What that tells you is that through sensory input we see a lot more information than we do through language, and that despite our intuition, most of what we learn and most of our knowledge is through our observation of and interaction with the real world, not through language. Everything that we learn in the first years of life, and certainly everything that animals learn, has nothing to do with language.
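As a sanity check, the back-of-envelope arithmetic behind those numbers can be written out. All inputs are the rough estimates quoted in the conversation (20 MB/s for the optic nerve, 16,000 waking hours, two bytes per token); the reading speed of 5 tokens per second is my own assumption to recover the quoted reading time.

```python
# Back-of-envelope check of the numbers quoted above.  All inputs are
# the rough estimates stated in the conversation, not measurements.
text_bytes = 2 * 10**13                 # ~10^13 tokens at ~2 bytes per token

optic_nerve_bytes_per_s = 20 * 10**6    # optic nerve: ~20 MB/s (rough estimate)
awake_hours = 16_000                    # a four-year-old's total waking hours
visual_bytes = optic_nerve_bytes_per_s * awake_hours * 3600

print(f"visual input: {visual_bytes:.2e} bytes")          # ~1.2e15, i.e. ~10^15
print(f"visual / text ratio: {visual_bytes / text_bytes:.0f}x")

# Reading time for the text corpus at an assumed ~5 tokens per second:
reading_seconds = 10**13 / 5
reading_years = reading_seconds / (8 * 3600 * 365)        # reading 8 hours a day
print(f"reading time: {reading_years:,.0f} years")        # ~190,000: same order
                                                          # as the 170,000 quoted
```

So the sensory stream into a four-year-old's visual cortex alone is roughly fifty times the byte count of the entire LLM training corpus.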
So it would be good to maybe push against some of the intuition behind what you're saying. So it is true there are several orders of magnitude more data coming into the human mind, much faster, and the human mind is able to learn very quickly from that, filter the data very quickly. You know, somebody might argue, with your comparison between sensory data versus language: language is already very compressed.
It already contains a lot more information than the bits it takes to store it, if you compare it to visual data. So there's a lot of wisdom in language, the words, the way we stick them together. It already contains a lot of information. So is it possible that language alone already has enough wisdom and knowledge in there to be able to construct, from that language, a world model, an understanding of the world, an understanding of the physical world that you're saying LLMs lack?

So it's a big debate among philosophers, and also AI scientists: whether intelligence needs to be grounded in reality. I'm clearly in the camp that, yes, intelligence cannot appear without some grounding in some reality. It doesn't need to be, you know, physical reality.
It could be simulated, but the environment is just much richer than what you can express in language. Language is a very approximate representation of our percepts and our mental models, right? I mean, there are a lot of tasks that we accomplish where we manipulate a mental model of the situation at hand, and that has nothing to do with language: everything that's physical, mechanical, whatever.
When we build something, when we accomplish a task, a model task, you know, grabbing something, we plan our action sequences, and we do this by essentially imagining the result of the outcome of a sequence of actions that we might take. And that requires mental models that don't have much to do with language. And that, I would argue, is most of our knowledge: it's derived from that interaction with the physical world.
So a lot of my colleagues who are more interested in things like computer vision are really in that camp, that AI needs to be embodied, essentially. And then other people coming from the NLP side, or maybe some other motivation, don't necessarily agree with that. And philosophers are split as well. And the complexity of the world is hard to imagine. It's hard to represent all the complexities that we take completely for granted in the real world, that we don't even imagine require intelligence, right? This is the old Moravec's paradox, from the pioneer of robotics, Hans Moravec, who said, you know, how is it that with computers it seems to be easy to do high-level, complex tasks like playing chess and solving integrals and doing things like that, whereas the things we take for granted that we do every day, like, I don't know, learning to drive a car, or, you know, grabbing an object, we can't do with computers? And, yeah, you know, we have LLMs that can pass the bar exam, so they must be smart, but then they can't learn to drive in 20 hours like any 17-year-old. They can't learn to clear out the dinner table and fill up the dishwasher like any 10-year-old can learn in one shot. Why is that? Like, you know, what are we missing? What type of learning or reasoning architecture, or whatever, are we missing that prevents us from, you know, having level-five self-driving cars and domestic robots?
Can a large language model construct a world model that does know how to drive and does know how to fill up the dishwasher, but it just doesn't know how to deal with visual data at this time? So it can operate in a space of concepts.
So yeah, that's what a lot of people are working on. So the short answer is no. And the more complex answer is you can use all kinds of tricks to get an LLM to basically digest visual representations of images, or video, or audio for that matter. And a classical way of doing this is you train a vision system in some way, and we have a number of ways to train vision systems, either supervised, semi-supervised, self-supervised, all kinds of different ways, that will turn any image into a high-level representation.
Basically, a list of tokens that are really similar to the kind of tokens that a typical LLM takes as an input, and then you just feed that to the LLM, in addition to the text, and you just expect the LLM, during training, to kind of be able to use those representations to help make decisions. I mean, there's been work along those lines for quite a long time, and now you see those systems, right? I mean, there are LLMs that have some vision extension, but they're basically hacks, in the sense that those things are not trained end-to-end to handle, to really understand the world. They're not trained with video, for example. They don't really understand intuitive physics, at least not at the moment.

So you don't think there's something special about intuitive physics, about sort of common-sense reasoning about the physical space, about physical reality, that to you is a giant leap that LLMs are just not able to do?
We're not going to be able to do this with the type of LLMs that we are working with today. And there are a number of reasons for this, but the main reason is the way LLMs are trained: you take a piece of text, you remove some of the words in the text, you mask them, you replace them by blank markers, and you train a gigantic neural net to predict the words that are missing.
And if you build this neural net in a particular way, so that it can only look at words that are to the left of the one it's trying to predict, then what you have is a system that basically is trying to predict the next word in a text, right? So then you can feed it a text, a prompt, and you can ask it to predict the next word. It can never predict the next word exactly.
And so what it's going to do is produce a probability distribution over all the possible words in a dictionary. In fact, it doesn't predict words, it predicts tokens that are kind of subword units. And so it's easy to handle the uncertainty in the prediction
there, because there is only a finite number of possible words in the dictionary, and you can just compute a distribution over them. Then what the system does is pick a word from that distribution. Of course, there's a higher chance of picking words that have a higher probability within that distribution. So you sample from that distribution to actually produce a word.
And then you shift that word into the input, and so that allows the system now to predict the second word, right? And once you do this, you shift it into the input, etc. That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs, but we just call them LLMs. And there is a difference between this kind of process and a process by which, before producing a word, when you talk, when you and I talk, you and I are bilingual, we think about what we're going to say, and it's relatively independent of the language in which we're going to say it. When we talk about, I don't know, let's say a mathematical concept or something, the kind of thinking that we're doing and the answer that we're planning to produce is not linked to whether we're going to say it in French or Russian or English.
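The autoregressive loop described above (predict a distribution over tokens, sample one, shift it into the input, repeat) can be sketched in a few lines. This is purely a toy: `next_token_distribution` is a made-up stand-in for a trained network, and the six-word vocabulary is invented for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_distribution(context):
    """Stand-in for the trained network: returns a probability
    distribution over the vocabulary given the context so far."""
    rng = np.random.default_rng(abs(hash(tuple(context))) % 2**32)
    logits = rng.normal(size=len(vocab))
    p = np.exp(logits)
    return p / p.sum()              # softmax: probabilities summing to 1

def generate(prompt, n_tokens, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(prompt)
    for _ in range(n_tokens):
        p = next_token_distribution(tokens)
        tok = rng.choice(vocab, p=p)  # sample: higher-probability tokens more likely
        tokens.append(str(tok))       # shift the sampled token into the input
    return tokens

print(generate(["the"], 5))
```

The point of the sketch is the loop structure itself: there is no planning stage anywhere, just one sampled token after another.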
Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction that goes before language and maps onto language?
It's certainly true for a lot of the thinking that we do.

Is that obvious to you? Like, you're saying your thinking is the same in French as it is in English?
Yeah, pretty much.

Pretty much? How flexible are you? Like, if there's a probability distribution?
Well, it depends on what kind of thinking, right? If it's, like, producing puns, I get much better in French than in English about that.
No, but is there an abstract representation of puns? Like, is your humor an abstract representation? Like, when you tweet, and your tweets are sometimes a little bit spicy, is there an abstract representation in your brain of a tweet before it maps onto English?
There is an abstract representation of imagining the reaction of a reader to that text.
You start with laughter and then figure out how to make that happen?
Or figure out, like, a reaction you want to cause, and then figure out how to say it so that it causes that reaction. But that's still really close to language. Think about a mathematical concept, or imagining something you want to build out of wood, or something like that, right? The kind of thinking you're doing there has absolutely nothing to do with language, really. It's not like you necessarily have an internal monologue in any particular language. You're imagining mental models of the thing, right? I mean, if I ask you to imagine what this water bottle will look like if I rotate it 90 degrees, that has nothing to do with language.
And so clearly there is a more abstract level of representation in which we do most of our thinking, and we plan what we're going to say, if the output is, you know, uttered words, as opposed to the output being muscle actions, right? We plan our answer before we produce it. And LLMs don't do that. They just produce one word after the other, instinctively, if you want. It's a bit like the, you know, subconscious actions, where you're distracted.
You're doing something, you're completely concentrated, and someone comes to you and asks you a question, and you kind of answer the question. You don't have time to think about the answer, but the answer is easy, so you don't need to pay attention. You sort of respond automatically. That's kind of what an LLM does, right? It doesn't think about its answer, really. It retrieves it, because it's accumulated a lot of knowledge, so it can retrieve some things, but it's going to just spit out one token after the other without planning the answer.
But you're making that sound very simple: just one token after the other. One-token-at-a-time generation is bound to be simplistic, but if the world model is sufficiently sophisticated, then the most likely thing that it generates, one token at a time, as a sequence of tokens, is going to be a deeply profound thing.
Okay, but then that assumes that those systems actually possess an internal world model.
So it really goes to the... I think the fundamental question is, can you build a really complete world model? Not complete, but one that has a deep understanding of the world?
Yeah. So, can you build this, first of all, by prediction? And the answer is probably yes.
Can you build it by predicting words? And the answer is most probably no, because language is very poor in terms of... weak, or low bandwidth, if you want. There's just not enough information there.
So building world models means observing the world and understanding why the world is evolving the way it is. And then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take. So a world model really is: here is my idea of the state of the world at time t, here is an action I might take, what is the predicted state of the world at time t plus one? Now, that state of the world does not need to represent everything about the world. It just needs to represent enough that's relevant for this planning of the action, but not necessarily all the details. Now, here is the problem. You're not going to be able to do this with generative models. So a generative model that's trained on video, and we've tried to do this for ten years: you take a video, show a system a piece of video, and then ask it to predict the remainder of the video, basically predict what's going to happen.

One frame at a time? Doing the same thing as sort of the autoregressive LLMs do, but for video?
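The world-model-plus-planning idea described here (predict the state at time t+1 from the state and action at time t, then plan by imagining rollouts of candidate action sequences) can be sketched schematically. The one-dimensional "dynamics" below are entirely made up for illustration; the point is the interface, not the model.

```python
# Minimal sketch of the world-model interface described above:
# predict(state, action) -> next state, then plan by rolling out
# imagined action sequences and picking the best outcome.

def predict(state, action):
    """World model: state at time t plus an action gives state at t+1.
    Toy dynamics: a point on a line moved by the action amount."""
    return state + action

def plan(state, candidate_sequences, goal):
    """Pick the action sequence whose imagined final state is closest to the goal."""
    def rollout(s, actions):
        for a in actions:          # imagine the consequences of each action in turn
            s = predict(s, a)
        return s
    return min(candidate_sequences,
               key=lambda seq: abs(rollout(state, seq) - goal))

best = plan(state=0, candidate_sequences=[[1, 1, 1], [2, 2, 2], [-1, 2, 0]], goal=5)
print(best)   # [2, 2, 2]: its imagined final state, 6, lands closest to the goal 5
```

Nothing is ever executed in the world during planning; all three sequences are evaluated purely in imagination, via the predictor.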
Either one frame at a time or a group of frames at a time. But yeah, a large video model, if you want. The idea of doing this has been floating around for a long time, and at FAIR, some of my colleagues and I have been trying to do this for about ten years.
And you can't really do the same trick as with LLMs, because, you know, with LLMs, as I said, you can't predict exactly which word is going to follow a sequence of words, but you can predict a distribution over words. Now, if you go to video, what you would have to do is predict a distribution over all possible frames in a video, and we don't really know how to do that properly. We do not know how to represent distributions over high-dimensional continuous spaces in ways that are useful. And therein lies the main issue. And the reason we can't do this is because the world is incredibly more complicated and richer, in terms of information, than text. Text is discrete; video is high-dimensional and continuous, with a lot of details in it. So if I take a video of this room, the video is, you know, a camera panning around, there is no way I can predict everything that's going to be in the room as I pan around. The system cannot predict what's going to be in the room as the camera is panning. Maybe it's going to predict: this is a room where there is a light, and there is a wall, and it looks like that. It can't predict what the painting on the wall looks like, or what the texture of the couch looks like, certainly not the texture of the carpet. So there's no way it can predict all those details.
So one way to possibly handle this, which we've been working on for a long time, is to have a model that has what's called a latent variable. And the latent variable is fed to a neural net, and it's supposed to represent all the information about the world that you don't perceive yet, and that you need to augment the system with for the prediction to do a good job at predicting pixels, including the, you know, fine texture of the carpet and of the couch, and the painting on the wall. That has been a complete failure, essentially.
And we tried lots of things. We tried just straight neural nets, we tried GANs, we tried VAEs, all kinds of regularized autoencoders. We tried many things. We also tried those kinds of methods to learn good representations of images or video that could then be used as input to, for example, an image classification system. And that also has basically failed. Like, all the systems that attempt to predict missing parts of an image or a video from a corrupted version of it, basically: take an image or a video, corrupt it or transform it in some way, and then try to reconstruct the complete video or image from the corrupted version, and then hope that internally the system will develop good representations of images that you can use for object recognition, segmentation, whatever it is. That has been essentially a complete failure. And it works really well for text. That's the principle that is used for LLMs, right?
So where's the failure exactly? Is it that it's very difficult to form a good representation of an image, like a good embedding of all the important information in the image? Is it in terms of the consistency from image to image to image that forms the video? Like, if you were to do a highlight reel of all the ways you've failed, what would that look like?
Okay, so the reason this doesn't work is... first of all, I have to tell you exactly what doesn't work, because there is something else that does work. So the thing that does not work is training the system to learn representations of images by training it to reconstruct a good image from a corrupted version of it. Okay, that's what doesn't work.
And we have a whole slew of techniques for this that are, you know, variants of denoising autoencoders. Something called MAE, developed by some of my colleagues at FAIR: masked autoencoder. So it's basically like, you know, the LLMs or things like this, where you train the system by corrupting text, except you corrupt images, and you train a gigantic neural net to reconstruct. The features you get are not good. And you know they're not good, because if you now train the same architecture, but you train it supervised, with labeled data, with textual descriptions of images, etc., you do get good representations, and the performance on recognition tasks is much better than if you do this self-supervised pretraining.
So the architecture is good?
The architecture is good, the structure of the encoder is good. But the fact that you train the system to reconstruct images does not lead it to produce good generic features of images.
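The reconstruction objective being criticized has this general shape: corrupt the input (for example by masking patches, as in MAE-style training), encode it, decode it, and penalize pixel-level error against the original. The sketch below uses untrained random linear maps purely to make the shape of the objective concrete; it is not the actual MAE implementation.

```python
import numpy as np

# Schematic of the reconstruction objective: corrupt -> encode -> decode,
# then score against the ORIGINAL input, pixel by pixel.
rng = np.random.default_rng(0)
D, H = 16, 8                                # "pixel" dim and representation dim
W_enc = rng.normal(size=(H, D)) / np.sqrt(D)  # stand-in encoder (untrained)
W_dec = rng.normal(size=(D, H)) / np.sqrt(H)  # stand-in decoder (untrained)

def corrupt(x, mask_frac=0.5):
    """Mask out roughly half of the input, MAE-style."""
    keep = rng.random(x.shape) > mask_frac
    return x * keep

def reconstruction_loss(x):
    z = W_enc @ corrupt(x)                   # representation of the corrupted input
    x_hat = W_dec @ z                        # decode back to pixel space
    return float(np.mean((x_hat - x) ** 2))  # penalizes every pixel, every detail

x = rng.normal(size=D)
print(reconstruction_loss(x))
```

The key feature of this objective, and the criticism above, is the last line of `reconstruction_loss`: the system is forced to spend capacity on every unpredictable pixel-level detail of the original.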
Only trained in a self-supervised way?

Self-supervised by reconstruction, yeah. Okay, so the alternative is... the alternative is joint embedding.

What is joint embedding? What are these joint-embedding architectures that you're so excited about?
Okay, so now, instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image, you take the corrupted or transformed version, you run them both through encoders, which in general are identical, but not necessarily, and then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one.
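That joint-embedding predictive step can be sketched schematically, with untrained random linear maps standing in for the two encoders and the predictor (illustrative only; real training uses deep networks plus the anti-collapse machinery discussed next in the conversation).

```python
import numpy as np

# Schematic of a joint-embedding predictive step: encode the full input
# and a corrupted view, then predict the full-input representation from
# the corrupted one.  The loss lives in embedding space, not pixel space.
rng = np.random.default_rng(0)
D, H = 16, 8
W_target = rng.normal(size=(H, D)) / np.sqrt(D)  # encoder for the full input
W_context = W_target.copy()                      # encoder for the corrupted view
W_pred = rng.normal(size=(H, H)) / np.sqrt(H)    # predictor in representation space

def jepa_loss(x):
    x_corrupt = x * (rng.random(x.size) > 0.5)   # mask half of the input
    s_target = W_target @ x                      # representation of the full input
    s_context = W_context @ x_corrupt            # representation of the corrupted one
    s_pred = W_pred @ s_context                  # predict target from context
    return float(np.mean((s_pred - s_target) ** 2))  # error measured between
                                                     # embeddings, never pixels

print(jepa_loss(rng.normal(size=D)))
```

Compare with the reconstruction objective above: nothing here ever decodes back to pixels, so the encoder is free to throw away unpredictable detail.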
Okay, so joint embedding, because you're taking the full input and the corrupted version, or transformed version, and running them both through encoders, you get a joint embedding. And then you're saying, can I predict the representation of the full one from the representation of the corrupted one? And I call this a JEPA, so that means Joint Embedding Predictive Architecture, because there's joint embedding and there is this predictor that predicts the representation of the good guy from the bad guy. And the big question is, how do you train something like this? And until five years ago, six years ago, we didn't have particularly good answers for how you train those things, except for one, called contrastive learning. The idea of contrastive learning is you take a pair of images that are, again, an image and a corrupted version, or degraded version somehow, or transformed version of the original one, and you train the predicted representation to be the same as that of the original. If you only do this, the system collapses.
It basically completely ignores the input and produces representations that are constant. So the contrastive methods avoid this. And those things have been around since the early nineties; I had a paper on this in 1993. The idea is you also show pairs of images that you know are different, and then you push away the representations from each other.
So you say, not only do representations of things that we know are the same need to be the same, or should be similar, but representations of things that we know are different should be different. And that prevents the collapse. But it has some limitations. And there's a whole bunch of techniques that have appeared over the last six, seven years that can revive this type of method, some of them from FAIR, some of them from Google and other places.
But there are limitations to those contrastive methods. What has changed in the last three, four years is that now we have methods that are non-contrastive. So they don't require those negative contrastive samples of images that we know are different. You train them only with images that are different versions, or different views, of the same thing, and you rely on some other tricks to prevent the system from collapsing. And we have half a dozen different methods for this now.
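A toy version of the contrastive objective just described: pull representations of two views of the same thing together, and push representations of known-different things apart, up to a margin. The embeddings and the margin value are made-up numbers, purely for illustration. Non-contrastive methods drop the negative term entirely and prevent collapse by other means instead, for example by regularizing the statistics of the embeddings.

```python
import numpy as np

def contrastive_loss(z_a, z_pos, z_neg, margin=1.0):
    # pull: two views of the SAME thing should have similar embeddings
    pull = np.sum((z_a - z_pos) ** 2)
    # push: embeddings of DIFFERENT things should be at least `margin` apart;
    # the hinge prevents collapse to a constant representation
    push = max(0.0, margin - np.linalg.norm(z_a - z_neg))
    return pull + push

z_a   = np.array([1.0, 0.0])   # embedding of an image
z_pos = np.array([0.9, 0.1])   # embedding of another view of the same image
z_neg = np.array([0.0, 1.0])   # embedding of a different image

print(contrastive_loss(z_a, z_pos, z_neg))
```

Without the `push` term, the loss is minimized by mapping every input to the same constant vector, which is exactly the collapse described above.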
So what is the fundamental difference between joint-embedding architectures and LLMs? Can a JEPA take us to AGI? Whether we shouldn't say that... you don't like the term AGI, and we'll probably argue. I think every single time I've talked to you, we've argued about the G in AGI.

Yes.

I get it, I get it. We'll probably continue to argue about it. It's great. You like AMI, because you like French, and "ami" is, I guess, friend in French.

Yes. And AMI stands for Advanced Machine Intelligence.

Right. But either way, can JEPA take us towards that advanced machine intelligence?

Well, so it's the first step. Okay, so first of all, what's the difference with generative architectures like LLMs? So LLMs, or vision systems that are trained by reconstruction, generate the inputs. They generate the original input that is non-corrupted, non-transformed, right? So you have to predict all the pixels.
And there is a huge amount of resources spent in the system to actually predict all the pixels, all the details. In a JEPA, you're not trying to predict all the pixels. You're only trying to predict an abstract representation of the inputs. And that's much easier in many ways.
So what the JEPA system, when it's being trained, is trying to do is extract as much information as possible from the input, but yet only extract information that is relatively easily predictable. Okay? So there are a lot of things in the world that we cannot predict.
Like, for example, if you have a self-driving car driving down the street or a road, there may be trees around the road, and it could be a windy day. So the leaves on the trees are kind of moving in kind of semi-chaotic, random ways that you can't predict, and you don't care. You don't want to predict them.
So what you want is for your encoder to basically eliminate all those details. It will tell you there are moving leaves, but it's not going to keep the details of exactly what's going on.
And so when you do the prediction in representation space, you don't have to predict every single pixel of every leaf. Not only is that a lot simpler, but it also allows the system to essentially learn an abstract representation of the world, where what can be modeled and predicted is preserved, and the rest is viewed as noise and eliminated by the encoder. So it lifts the level of abstraction of the representation. If you think about it, this is something we do absolutely all the time.
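As a contrived little sketch of that difference (toy numbers, nothing like an actual JEPA, and the "noise cancellation" is rigged on purpose), imagine a frame with stable "sky" pixels and chaotic "leaf" pixels, and an encoder that keeps only coarse region averages:

```python
def encoder(frame):
    # Keep only coarse region averages; the per-pixel "leaf" detail
    # is abstracted away by the encoder.
    size = 4
    return [sum(frame[i:i + size]) / size for i in range(0, len(frame), size)]

def make_frame(noise):
    # 8 stable "sky" pixels plus 8 "leaf" pixels whose chaotic detail is
    # rigged to cancel within each region (a contrived toy, of course).
    sky = [0.5] * 8
    leaves = [0.2 + (noise if i % 2 == 0 else -noise) for i in range(8)]
    return sky + leaves

frame_t = make_frame(0.05)    # leaves at time t
frame_t1 = make_frame(0.15)   # different wind at time t+1, same tree

# A generative model must predict every pixel, leaf noise included.
pixel_error = sum((a - b) ** 2 for a, b in zip(frame_t, frame_t1)) / 16

# A JEPA-style model predicts only the abstract representation.
za, zb = encoder(frame_t), encoder(frame_t1)
repr_error = sum((a - b) ** 2 for a, b in zip(za, zb)) / len(za)
```

Here `pixel_error` is nonzero (the leaf detail is unpredictable) while `repr_error` is essentially zero: the encoder has eliminated exactly the details that could not have been predicted anyway.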
Whenever we describe a phenomenon, we describe it at a particular level of abstraction. We don't always describe every natural phenomenon in terms of quantum field theory, right? That would be impossible. So we have multiple levels of abstraction to describe what happens in the world,
starting from quantum field theory, to atomic theory and molecules, to chemistry and materials, and all the way up to concrete objects in the real world and things like that. So we can't just model everything at the lowest level. And that's what the idea of JEPA is really about: learning abstract representations in a self-supervised manner.
And doing it hierarchically as well. So that, I think, is an essential component of an intelligent system. And in language, we can get away without doing this, because language is already, to some level, abstract, and has already eliminated a lot of information that is not predictable. So we can get away without doing the joint embedding, without lifting the abstraction level, by directly predicting words.
So joint embedding: it's still generative, but generative in this abstract representation space. And you're saying language, we were lazy with language, because we already got the abstract representation for free. And now we have to zoom out, think about generally intelligent systems, and deal with the full mess of physical reality. And to do that, you do have to take this step of jumping from the full, rich, detailed reality to an abstract representation of that reality, based on which you can reason and all that kind of stuff.
Right. And the thing is, those self-supervised algorithms that learn by prediction, even in representation space, learn more concepts if the input data you feed them is more redundant. The more redundancy there is in the data, the more they're able to capture its internal structure.
And so there is much more redundancy and structure in perceptual input, sensory input like vision, than there is in text, which is not as redundant. This is back to the question you were asking a few minutes ago. Language might represent more information, really, because it's already compressed. You're right about that, but that means it's also less redundant, and so self-supervised learning will not work as well on it.

Is it impossible to join the two: the self-supervised training on visual data and the self-supervised training on language data? There is a huge amount of knowledge there. Even though you talk down about those 10^13 tokens, those 10^13 tokens represent the entirety, or a large fraction, of what we humans have figured out: the stuff we talk about, the stuff we write, the contents of all the books and articles, the full spectrum of human intellectual creation. So is it possible to join those two together?
Well, eventually yes, but I think if we do this too early, we run the risk of being tempted to cheat. And in fact, that's what people are doing at the moment. With vision-language models, we're basically cheating. We're using language as a crutch to help the deficiencies of our vision systems learn good representations from images and video. And the problem with this is that we might improve our vision-language systems a bit,
I mean our language models, by feeding them images, but we're not going to get to the level of even the intelligence, or level of understanding of the world, of a cat or a dog, which don't have language. They don't have language, and they understand the world much better than any LLM. They can plan really complex actions and sort of imagine the result of a bunch of actions. How do we get machines to learn that before we combine it with language? Obviously, if we combine this with language, it's going to be a winner. But before that, we have to focus on how we get systems to learn how the world works.
So this kind of joint-embedding predictive architecture, for you, that's going to be able to learn something like common sense: something like what a cat uses to predict how to mess with its owner most optimally by knocking a thing over.
That's the hope. In fact, the techniques we're using are non-contrastive. So not only is the architecture non-generative, the learning procedures we're using are also non-contrastive. We have two sets of techniques. One set is based on distillation, and there are a number of methods that use this principle: one by DeepMind called BYOL, a couple by FAIR, one called VICReg and another one called I-JEPA. And VICReg, I should say, is not a distillation method, actually, but I-JEPA and BYOL certainly are. And there's another one also called DINO, also produced at FAIR. The idea of those things is that you take the full input, say an image.
You run it through an encoder, which produces a representation. Then you corrupt or transform the input, run it through essentially what amounts to the same encoder, with some minor differences, and then train a predictor (sometimes the predictor is very simple, sometimes it doesn't exist) to predict the representation of the first, uncorrupted input from the corrupted input. But you only train the second branch; you only train the part of the network that is fed with the corrupted input. The other network, you don't train. But since they share the same weights, when you modify the first one, it also modifies the second one. And with various tricks, you can prevent the system from collapsing, a collapse of the type I was explaining before, where the system basically ignores the input. So that works very well. The two techniques we developed at FAIR, DINO and I-JEPA, work really well for that.
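Here is a toy sketch of just the weight-sharing part of those distillation methods, in the spirit of BYOL / DINO / I-JEPA but a made-up miniature, not any lab's actual code: the teacher branch is never trained directly; it is an exponential moving average of the student branch.

```python
def ema_update(teacher_w, student_w, momentum=0.99):
    # The teacher gets no gradients; it just slowly tracks the student.
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_w, student_w)]

student = [0.0, 0.0]
teacher = [0.0, 0.0]
target = [1.0, -2.0]  # stand-in for wherever gradient descent pulls the student

for step in range(2000):
    # Pretend an optimizer nudged the student (only the student) toward target.
    student = [s + 0.01 * (t - s) for s, t in zip(student, target)]
    teacher = ema_update(teacher, student)
```

After enough steps the teacher ends up where the student went without ever receiving a gradient itself; the collapse-prevention tricks mentioned above are omitted entirely here.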
So what kind of data are we talking about here?

Several scenarios. One scenario is: you take an image and corrupt it by changing the cropping, for example, changing the size a little bit, maybe changing the orientation, blurring it, changing the colors, doing all kinds of horrible things to it.
But basic horrible things.
Basic horrible things that degrade the quality a little bit and change the framing, crop the image. And in some cases, in the case of I-JEPA, you don't need to do any of this. You just mask some parts of it, right? You just remove some regions, a big block essentially, and then run the whole thing through the encoders, and train the entire system, encoder and predictor, to predict the representation of the good one from the representation of the corrupted one. So I-JEPA doesn't need to know that it's an image, for example, because the only thing it needs to know is how to do this masking.
Whereas with DINO, you do need to know it's an image, because you need to do things like geometric transformations and blurring and things like that, which are really image-specific. A more recent version of this that we have is called V-JEPA. It's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video and you mask a whole chunk of it. And what we mask is actually kind of a temporal tube: a whole segment of each frame in the video, over the entire video.
And that tube is statically positioned throughout the frames?

The tube, yeah, typically sixteen frames or so, and we mask the same region over the entire sixteen frames. It's a different one for every video, obviously. And then, again, we train that system so as to predict the representation of the full video from the partially masked video. That works pretty well. It's the first system we have that learns good representations of video, so that when you feed those representations to a supervised classifier head, it can tell you what action is taking place in the video with pretty good accuracy. So that's the first time we get something of that quality.
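The tube masking itself is simple to picture. Here is a toy version with made-up dimensions (the real V-JEPA patch and mask geometry differs): the same spatial block is hidden in every one of the clip's frames.

```python
FRAMES, H, W = 16, 8, 8
MASK = None  # sentinel for a masked patch

def make_clip():
    # Each "pixel" here just records (frame, row, col) for clarity.
    return [[[(t, r, c) for c in range(W)] for r in range(H)]
            for t in range(FRAMES)]

def mask_tube(clip, top, left, size):
    # Mask the same (top:top+size, left:left+size) block in every frame,
    # forming a tube through time.
    masked = [[row[:] for row in frame] for frame in clip]
    for frame in masked:
        for r in range(top, top + size):
            for c in range(left, left + size):
                frame[r][c] = MASK
    return masked

clip = make_clip()
masked = mask_tube(clip, top=2, left=3, size=4)

# Every frame has the same 4x4 region hidden.
masked_per_frame = [sum(p is MASK for row in f for p in row) for f in masked]
```

The training target is then the representation of the full `clip`, predicted from the representation of `masked`.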
That's a good test that a good representation is formed. I mean, yes, there's something to this.
Yeah. We also have preliminary results that seem to indicate that the representation allows our system to tell whether a video is physically plausible or completely impossible, because some object disappeared, or an object suddenly jumped from one location to another, or changed shape or something.

So it's able to capture some physics-based constraints about the reality represented in the video, about the appearance and disappearance of objects?
That's right.
Okay, but can this actually get us to the kind of world model that understands enough about the world to be able to drive a car?
Possibly. It's going to take a while before we get to that point, but there are systems already, robotic systems, that are based on this idea. What you need for this is a slightly modified version of it. Imagine that you have a complete video, and what you're doing to this video is shifting it in time towards the future, so you only see the beginning of the video but not the latter part of the original, or you just mask the second half of the video, for example. And then you train a JEPA system of the type I described to predict the representation of the full video from the shifted one. But you also feed the predictor with an action: for example, the wheel is turned ten degrees to the right, or something like that.
So if it's, say, a dash cam in a car and you know the angle of the wheel, you should be able to predict, to some extent, what's going to happen to what you see. You're not going to be able to predict all the details of the objects that appear in the view, obviously, but at an abstract representation level, you can probably predict what's going to happen. So now what you have is an internal model that says: here is my idea of the state of the world at time t, here is an action I'm taking, and here is a prediction of the state of the world at time t+1, t plus delta t, t plus two seconds, whatever it is. If you have a model of this type, you can use it for planning. So now you can do what LLMs cannot do, which is plan what you're going to do so as to arrive at a particular outcome or satisfy a particular objective. And you can have a number of objectives. I can predict that if I have an object like this and I open my hand, it's going to fall, right? And if I push it with a particular force on the table, it's going to move.
If I push the table itself, it's probably not going to move with the same force. So we have this internal model of the world in our minds, which allows us to plan sequences of actions to arrive at a particular goal. And so now, if you have this world model, you can imagine a sequence of actions, predict what the outcome of that sequence of actions is going to be, measure to what extent the final state satisfies a particular objective, like moving the bottle to the left of the table, and then plan a sequence of actions that will minimize this objective. And at runtime, we're not talking about learning; we're talking about inference time, right? So this is planning, really. And in optimal control, this is a very classical thing. It's called model predictive control.
You have a model of the system you want to control that can predict the sequence of states corresponding to a sequence of commands, and you plan a sequence of commands so that, according to your world model, the end state of the system will satisfy an objective that you fix. This is the way rocket trajectories have been planned since computers have been around, so since the early sixties, essentially.
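That recipe (roll the model forward under candidate action sequences, score the predicted end state against the objective, keep the best sequence) can be written down in a few lines. This is a deliberately tiny toy: one-dimensional state and exhaustive search, whereas real MPC uses far better optimizers and a learned or physical model.

```python
import itertools

def world_model(state, action):
    # Toy dynamics: the action directly nudges the state.
    return state + action

def plan(start, goal, horizon=4, actions=(-1.0, 0.0, 1.0)):
    best_seq, best_cost = None, float("inf")
    # Exhaustive search over action sequences; real MPC would use
    # gradient-based or sampling-based optimization instead.
    for seq in itertools.product(actions, repeat=horizon):
        state = start
        for a in seq:
            state = world_model(state, a)
        cost = abs(state - goal)  # objective: distance to the goal state
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

seq, cost = plan(start=0.0, goal=3.0)
```

The key point is that all of this happens at inference time: nothing is learned during planning; the world model is only queried.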
So yes, model predictive control. But you also often talk about hierarchical planning. Can hierarchical planning emerge from this somehow?
Well, no, you would have to build a specific architecture to allow for hierarchical planning. Hierarchical planning is absolutely necessary if you want to plan complex actions. If I want to go from, let's say, New York to Paris, the example I use all the time, and I'm sitting in my office at NYU, the objective that I need to minimize is my distance to Paris, at a high level, a very abstract representation of my location.
I would have to decompose it into two subgoals. First one: go to the airport. Second one: catch a plane to Paris. Okay, so my subgoal is now going to the airport. My objective function is my distance to the airport.
How do I go to the airport? Well, I have to go into the street and hail a taxi, which you can do in New York. Okay, now I have another subgoal: get down to the street.
What does that mean? Going to the elevator, going down the elevator, walking out to the street. How do I go to the elevator? I have to stand up from my chair, open the door of my office, go to the elevator, push the button.
How do I get up from my chair? You can imagine going all the way down to basically what amounts to millisecond-by-millisecond muscle control. Okay?
And obviously, you're not going to plan your entire trip from New York to Paris in terms of millisecond-by-millisecond muscle control. First, that would be incredibly expensive, but it would also be completely impossible, because you don't know all the conditions of what's going to happen: how long it's going to take to catch a taxi, or to get to the airport with traffic. You would have to know exactly the condition of everything to be able to do this planning, and you don't have that information. So you have to do this hierarchical planning, so that you can start acting and then sort of replan as you go. And nobody really knows how to do this in AI. Nobody knows how to train a system to learn the appropriate multiple levels of representation so that hierarchical planning works.
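The decomposition above can be caricatured as a lookup table of hand-written subgoals. This is purely illustrative: the open problem being described is precisely that nobody knows how to learn these levels rather than hand-code them.

```python
# Hand-written subgoal hierarchy (illustrative, not a real planner).
PLANS = {
    "go to Paris": ["go to the airport", "catch a plane to Paris"],
    "go to the airport": ["go down to the street", "hail a taxi"],
    "go down to the street": ["stand up from chair", "walk to elevator",
                              "ride elevator down", "walk out the door"],
    # Below this level we'd hit muscle control, which can't be planned
    # in advance; it has to be handled reactively at execution time.
}

def next_primitive(goal):
    # Expand only the first subgoal at each level: lazy, depth-first
    # refinement rather than planning the whole trip in full detail.
    steps = [goal]
    while goal in PLANS:
        goal = PLANS[goal][0]
        steps.append(goal)
    return steps

path = next_primitive("go to Paris")
```

Only the first step at each level is refined; the rest stays abstract until conditions become known, which is the "act, then replan as you go" idea.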
Doesn't something like that already emerge? So, like, can you use an LLM, a state-of-the-art LLM, to get you from New York to Paris by doing exactly the kind of detailed decomposition that you just did? Which is: can you give me a list of ten steps I need to do to get from New York to Paris? And then, for each of those steps, can you give me a list of ten steps for how I make that step happen? And for each of those, can you give me another list of ten steps, until you're down to the level of individual muscles? Maybe not that far, but down to whatever level you can actually act upon using your mind.
Right. So there are a lot of questions implied by this. The first thing is: LLMs will be able to answer some of those questions, down to some level of abstraction, under the condition that they've been trained with similar scenarios in their training set.
They would be able to answer all of those questions, but some of them may be hallucinated, meaning non-factual.
Yeah, true. I mean, they will probably produce some answer, except they're not going to be able to really produce the millisecond muscle control of how you stand up from your chair, right? But down to some level of abstraction, you can describe things in words.
They might be able to give you a plan, but only under the condition that they've been trained to produce those kinds of plans. They're not going to be able to plan for situations they've never encountered before. Basically, they're going to have to regurgitate a template they've been trained on.
But where, just for the example of New York to Paris, is it going to start getting into trouble? At which layer of abstraction do you think it'll start to fail? Because I can imagine almost every single part of that, an LLM will be able to answer somewhat accurately, especially when you're talking about New York and Paris, major cities.

So certainly an LLM would be able to solve that problem if you fine-tune it for it. And so I can't say that an LLM cannot do this. It can do this if you train it for it, there's no question, down to a certain level, where things can be formulated in terms of words. But if you want to go down to, like, how you climb down the stairs, or just stand up from your chair, in terms of words, you can't do it. You need the experience of the physical world, which is much higher-bandwidth than what you can express in words, in human language.
So everything we've been talking about in the joint-embedding space, is it possible that that's what we need for the interaction with physical reality, on the robotics front, and then the LLMs are the thing that sits on top of it for the bigger reasoning, like the fact that I need to book a plane ticket, and I need to know how to go to the website, and so on?

Sure. And a lot of plans that people know about, that are relatively high-level, are actually learned. Most people don't invent plans by themselves. We have some ability to do this, of course, obviously, but most plans that people use are plans they've been trained on: they've seen other people use those plans, or they've been told how to do things. You can't invent them from scratch. Take a person who has never heard of airplanes, and ask them: how do you go from New York to Paris?
They're probably not going to be able to decompose the whole plan unless they've seen examples of it before. So certainly, LLMs are going to be able to do this. But then, how do you link this with the lower level of actions? That needs to be done with things like JEPA, which basically lift the abstraction level of the representation without attempting to reconstruct every detail of the situation. That's why we need JEPAs.
I would love to sort of linger on your skepticism around autoregressive LLMs. One way I would like to test that skepticism: everything you say makes a lot of sense, but if I apply everything you said today, and in general, to, let's say, three years ago, I wouldn't have been able to predict the success of LLMs. So does it make sense to you that autoregressive LLMs are able to be so damn good?
yes.
Can you explain your intuition? Because if I were to take your wisdom and intuition at face value, I would say there's no way autoregressive LLMs, one token at a time, would be able to do the same things they're doing.

No, there is one thing that autoregressive LLMs, or LLMs in general, not just the autoregressive ones but including the BERT-style bidirectional ones, are exploiting, and it's self-supervised learning. And I've been a very, very strong advocate of self-supervised learning for many years. So those things are an incredibly impressive demonstration that self-supervised learning actually works. The idea didn't start with BERT, but BERT was really kind of a good demonstration of this.
So the idea that you take a piece of text, you corrupt it, and you train a gigantic neural net to reconstruct the parts that are missing, that has produced an enormous amount of benefits. It allowed us to create systems that understand language, systems that can translate hundreds of languages in any direction, systems that are multilingual, so it's a single system that can be trained to understand hundreds of languages and translate in any direction, and produce summaries, and then answer questions and produce text. And then there's a special case of it, which is the autoregressive trick, where you train the system to not elaborate a representation of the text from looking at the entire text, but to only predict a word from the words that come before. And you do this by constraining the architecture of the network, and that's what you build an autoregressive LLM from.
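The autoregressive trick can be illustrated with the crudest possible stand-in for an LLM, plain bigram counts, just to show the constraint that the prediction may only condition on preceding words, never the full text:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# "Train": count which word follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(prefix):
    # Condition only on the prefix (here, crudely, just its last word);
    # the model never sees the words that come after.
    return follows[prefix[-1]].most_common(1)[0][0]

print(predict_next(["the"]))  # "cat", since "cat" follows "the" twice vs "mat" once
```

A real LLM conditions on the whole prefix through a causally masked transformer rather than one bigram table, but the information constraint is the same.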
So there was a surprise many years ago with what's called decoder-only LLMs, systems of this type that just try to produce words from the previous ones. And the fact that when you scale them up, when you train them on lots of data and make them really big, they tend to really understand more about language, that was kind of a surprise. And that surprise occurred quite a while back, with work from Google, Meta, OpenAI, going back to the GPT kind of work, generative pre-trained transformers.
You mean, like, GPT-2? Like, there's a certain place where you start to realize scaling might actually keep giving us an emergent benefit?

Yeah, there was work from various places. But if you want to place it in the GPT timeline, that would be around GPT-2, yeah.
Well, just because you said it so charismatically; you said so many words. But self-supervised learning, yes. But again, the same intuition you're applying to say that autoregressive LLMs cannot have a deep understanding of the world: if we just apply that same intuition, does it make sense to you that they're able to form enough of a representation of the world to be damn convincing, essentially passing the original Turing test with flying colors?
Well, we're fooled by their fluency, right? We just assume that if a system is fluent in manipulating language, then it has all the characteristics of human intelligence. But that impression is false. We're really fooled by it.

What do you think Alan Turing would say, without understanding anything, just hanging out with it?

Alan Turing would decide that the Turing test is a really bad test. Okay? This is what the AI community has decided many years ago: that the Turing test was a really bad test of intelligence.
What would Hans Moravec say about the large language models?
Hans Moravec would say that the Moravec paradox still applies. Okay, okay.
But don't you think he would be really impressed?
No, of course, everybody would be impressed. But it's not a question of being impressed or not. It's a question of knowing what the limits of those systems are. Again, they are impressive.
They can do a lot of useful things. There's a whole industry being built around them. They're going to make progress. But there are a lot of things they cannot do.
And we have to realize what they cannot do, and then figure out how we get there. And I'm saying this from basically ten years of research on the idea of self-supervised learning; actually, that goes back more than ten years.
But the idea of self-supervised learning is basically capturing the internal structure of a set of inputs without training the system for any particular task: learning representations. You know, the conference I co-founded fourteen years ago is called the International Conference on Learning Representations. That's the entire issue that deep learning is dealing with, right? And it's been my obsession for almost forty years now. So learning representations is really the thing. For the longest time, we could only do this with supervised learning.
And then we started working on what we used to call unsupervised learning, reviving the idea of unsupervised learning in the early 2000s with Geoff Hinton. Then we discovered that supervised learning actually works pretty well if you can get a lot of data, and so the whole idea of unsupervised, self-supervised learning took a back seat for a bit. And then I kind of tried to revive it in a big way, starting in 2014, basically when we started FAIR, really pushing to find new methods to do self-supervised learning, for text and for images and for video and audio. And some of that work has been incredibly successful. I mean, the reason we have multilingual translation systems, and things to do content moderation on Meta, for example on Facebook, that are multilingual, that understand whether a piece of text is hate speech or not, is due to that progress.
Using self-supervised learning for NLP, combining it with transformer architectures and so on: that's the big success of self-supervised learning. We had similar success in speech recognition with a system called wav2vec, which is also a joint-embedding architecture, by the way, trained with contrastive learning. And that system can produce speech recognition systems that are multilingual with mostly unlabeled data, and that only need a few minutes of labeled data to actually do speech recognition. That's amazing. We have systems now, based on those combinations of ideas, that can do real-time translation of hundreds of languages into each other, speech to speech.

Speech to speech? Even including, which is fascinating, languages that don't have written forms.

That's right, they're spoken only. The speech units are discrete, but it's speech. Textless NLP, we used to call it. So, yeah, incredible success there.
And then, for ten years, we tried to apply this idea to learning representations of images by training a system to predict videos, learning intuitive physics by training a system to predict what's going to happen in the video. And we tried and tried and failed and failed, with generative models, with models that predict pixels. We could not get them to learn good representations of images. We could not get them to learn good representations of videos. We tried many times. We published lots of papers on it.
They kind of sort of worked, but not really great. It only started working when we abandoned the idea of predicting every pixel and basically just did the joint embedding, predicting in representation space. That works.
So there's ample evidence that we're not going to be able to learn good representations of the real world using generative models. So I'm telling people: everybody's talking about generative AI; if you're really interested in human-level AI, abandon the idea of generative AI.

Okay. But you really think it's possible to get far with the joint-embedding representation? So there's common-sense reasoning, and then there's high-level reasoning. Okay, let me not use the word "reasoning," but the kind of stuff that LLMs are able to do seems fundamentally different from the common-sense reasoning we use to navigate the world. It seems like we're going to need both.
Would you be able to get, with the joint-embedding, JEPA-type approach, looking at video, would you be able to learn, let's see, how to get from New York to Paris, or how to understand the state of politics in the world? Right? These are things where various humans generate a lot of language and opinions, in the space of language, but don't represent in a clearly compressible visual way.
Right. Well, there are a lot of situations that might be difficult for a purely language-based system to know about. Okay, you can probably learn from reading text, the entirety of the publicly available text in the world, that I cannot get from New York to Paris by snapping my fingers. That's not going to work, right? But there are probably a lot of more complex scenarios of this type which an LLM may never have encountered, and may not be able to determine whether they're possible or not. So that link from the low level to the high level: the thing is that the high level that language expresses is based on the common experience of the low level, which LLMs currently do not have. When we talk to each other, we know we have a common experience of the world; a lot of it is similar. And LLMs don't have that.

But it is present, you see: humans have a common experience of the world in terms of the physics of how gravity works and stuff like this. And that common knowledge of the world, I feel like, is there in the language. We don't explicitly express it, but we have a huge amount of text.
You're going to get this stuff between the lines. In order to form a consistent world model, you're going to have to understand how gravity works, even if you don't have an explicit explanation of gravity. Though in the case of gravity there is an explicit explanation; there's gravity in Wikipedia. But the stuff that we think of as common-sense reasoning, I feel like, to generate language correctly, you're going to have to figure that out. Now, you could say, as you have, that there isn't enough text. Okay, so
you don't think so?

No, I agree with the first part of what you just said, which is that to be able to have high-level common sense, you need to have the low-level common sense to build it on top of, and that's not there in LLMs, which are purely trained from text. But the other statement I would not agree with: the claim that implicit in all the language in the world is the underlying reality. There's a lot about the underlying reality which is not expressed in language.

Is that obvious to you?

Yeah, totally. So, like, all the conversations we have. Okay, there's the dark web, meaning all the private conversations, stuff like this, which is much, much larger, probably, than what's available, what LLMs are trained on.
You don't need to communicate the stuff that is common.
But the humor, all of it? No, you do. Like, you don't need to, but it comes through. If I accidentally knock this over, you'll probably make fun of me, and in the content of you making fun of me will be an explanation of the fact that cups fall. And, you know, gravity works in this way, and you'll have some very vague information about what kinds of things explode when they hit the ground.
And then maybe you'll make a joke about entropy, or something like this: we'll never be able to reconstruct this again. Like, okay, you make a little joke like this, and there will be trillions of other jokes.
And from the jokes, you can piece together the fact that gravity works and mugs can break and all this kind of stuff. You don't need to see it. It'll be very inefficient; it's easier to, like, knock the thing over. But I feel like it would all be there if you have enough of that data.
I just think that most of the information of this type that we have accumulated when we were babies is just not present in text, in any description, essentially.

And the sensory data is a much richer source for getting that kind of world model.
I mean, that's 16,000 hours of wake time of a four-year-old, and 10^15 bytes going through vision, just vision, right? There is a similar bandwidth through touch, and a little less through audio. And then text, language, doesn't come in until, like, a year into life. And by the time you are nine months old, you've learned about gravity.
You know about inertia, you know about stability, you know about the distinction between animate and inanimate objects. By eighteen months, you know about why people want to do things, and you help them if they can't. There are a lot of things that you learn mostly by observation, really, not even through interaction. In the first few months of life, babies don't really have any influence on the world; they can only observe, right? And you accumulate a gigantic amount of knowledge just from that. So that's what we're missing from current AI systems.
I think in one of your slides you have this nice plot that is one of the ways you show that LLMs are limited. I wonder if you could talk about hallucinations from your perspective: why hallucinations happen with large language models, and why, and to what degree, that is a fundamental flaw of large language models.
Right. So, because of the autoregressive prediction, every time an LLM produces a token or a word, there is some level of probability for that word to take you out of the set of reasonable answers. And if you assume, which is a very strong assumption, that those errors are independent across the sequence of tokens being produced, what that means is that every time you produce a token, the probability that you stay within the set of correct answers decreases.
Like you said, there's the assumption that if there's a nonzero probability of making a mistake, which there appears to be, then there is going to be a kind of drift.
Yes. And that drift is exponential; it's like errors accumulate, right? So the probability that an answer will be nonsensical increases exponentially with the number of tokens.
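To make the drift argument concrete, here is a minimal sketch. The error rate is hypothetical, and it assumes the per-token errors are independent, which Yann himself flags as a very strong assumption:

```python
def p_still_correct(per_token_error: float, n_tokens: int) -> float:
    """Probability the whole answer stays within the set of reasonable
    sequences, if each token independently avoids error with
    probability (1 - per_token_error)."""
    return (1.0 - per_token_error) ** n_tokens

# Even a tiny, hypothetical 1% per-token error rate compounds fast.
for n in (10, 100, 1000):
    print(n, p_still_correct(0.01, n))
```

Under these toy numbers, the chance of a fully sensible answer decays from about 0.90 at 10 tokens to under 0.0001 at 1000 tokens, which is the exponential decrease described above.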
Is that obvious to you, by the way? Well, mathematically speaking, maybe, but isn't there a kind of gravitational pull towards the truth? Because, on average, hopefully, the truth is well represented in the training set.
No, it's basically a struggle against the curse of dimensionality. So the way you can correct for this is that you fine-tune the system by having it produce answers for all kinds of questions that people might come up with. And people are people, so a lot of the questions that they have are very similar to each other.
So we can probably cover, you know, 80 percent or whatever of the questions that people will ask by, you know, collecting data, and then we fine-tune the system to produce good answers for all of those things. And it's probably going to be able to learn that, because it's got a lot of capacity to learn. But then there is, you know, the enormous set of prompts that you have not covered during training.
And that set is enormous. Within the set of all possible prompts, the proportion of prompts that have been used for training is absolutely tiny. It's a tiny, tiny, tiny subset of all possible prompts.
And so the system will behave properly on the prompts that it's been either trained, pre-trained, or fine-tuned on. But then there is an entire space of things that it cannot possibly have been trained on, because the number is just gigantic. So whatever training the system has been subjected to, to produce appropriate answers, you can break it by finding a prompt that will be outside of the set of prompts it's been trained on, or things that are similar, and then it will just spew complete nonsense.
When you say prompt, do you mean that exact prompt, or do you mean a prompt that is, in many parts, very different? Is it that easy to ask a question or to say a thing that hasn't been said before on the internet?
I mean, people have come up with things where, like, you put essentially a random sequence of characters in a prompt, and that's enough to kind of throw the system into a mode where, you know, it's going to answer something completely different than it would have answered without this. So that's a way to jailbreak the system, basically, to get it to go outside of its conditioning, right?
That's a very clear demonstration of it, of course, but, you know, that goes outside of what it's designed to do, right? If you actually stitch together reasonably grammatical sentences, is it that easy to break it?
Yeah, some people have done things like, you know, you write a sentence in English, or you ask a question in English, and it produces a perfectly fine answer. And then you just substitute a few words by the same words in another language, and all of a sudden the answer is complete nonsense.
Yes. So I guess what I'm saying is, what fraction of prompts that humans are likely to generate are going to break the system?
So the problem is that there is a long tail. Yes. This is an issue that a lot of people have realized, you know, in social networks and stuff like that, which is there's a very, very long tail of things that people will ask. And you can fine-tune the system for the 80 percent or whatever of the things that most people will ask. But then this long tail is so large that you're not going to be able to fine-tune the system for all the conditions.
And in the end, the system ends up being kind of a giant lookup table, right, essentially, which is not really what you want. You want systems that can reason, and certainly ones that can plan. So the type of reasoning that takes place in an LLM is very, very primitive.
And the reason you can tell it's primitive is because the amount of computation that is spent per token produced is constant. So if you ask a question, and that question has an answer in a given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated. It's, you know, the size of the prediction network, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens. That's it.
And so, essentially, it doesn't matter if the question being asked is simple to answer, complicated to answer, or impossible to answer because it's undecidable or something; the amount of computation the system will be able to devote to the answer is constant, or is proportional to the number of tokens produced in the answer, right? This is not the way we work. The way we reason is that when we're faced with a complex problem or a complex question, we spend more time trying to solve it and answer it, right, because it's more difficult.
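A back-of-the-envelope sketch of that point, with all numbers hypothetical; it uses the common rule of thumb of roughly 2 FLOPs per parameter per generated token, and ignores attention's length-dependent term:

```python
def generation_flops(n_params: int, n_tokens: int) -> int:
    """Rough forward-pass cost of a decoder-only model: about
    2 FLOPs per parameter per generated token."""
    return 2 * n_params * n_tokens

# A hypothetical 7B-parameter model spends the same compute on a
# 100-token answer whether the question is trivial or undecidable.
trivial = generation_flops(7_000_000_000, 100)
impossible = generation_flops(7_000_000_000, 100)
assert trivial == impossible  # cost depends only on answer length
```

The budget is fixed by model size and answer length, never by question difficulty, which is the asymmetry with human reasoning described above.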
There's a prediction element. There's an iterative element, where you're adjusting your understanding of a thing by going over it, over and over. There's a hierarchical element, and so on. Does this mean it's a fundamental flaw of LLMs? Or does this mean that there's more to
that question?
Now you're asking a question and answering it yourself; you're behaving like an LLM. No, that is just the low-level world model on top of which we can then build some of these kinds of mechanisms, like you said: persistent long-term memory, or reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model.
Okay, whether it's difficult or not, the near future will tell, because a lot of people are working on reasoning and planning abilities for dialogue systems. I mean, even if we restrict ourselves to language, just having the ability to plan your answer before you answer, in terms that are not necessarily linked with the language you're going to use to produce the answer, right?
So this idea of the mental model that allows you to plan what you're going to say before you say it, that is very important. I think there are going to be a lot of systems over the next few years that are going to have this capability. But the blueprint of those systems will be extremely different from autoregressive LLMs.
So it's the same difference as the difference between what psychologists call System 1 and System 2 in humans, right? So System 1 is the type of task that you can accomplish without deliberately, consciously thinking about how you do it. You just do it.
You've done it enough times that you can just do it unconsciously, right, without thinking about it. If you're an experienced driver,
you can drive without really thinking about it, and you can talk to someone at the same time, or listen to the radio, right? If you are a very experienced chess player, you can play against a non-experienced chess player without really thinking either; you just recognize the pattern and you play, right? That's System 1. So all the things that you do instinctively without really having to deliberately plan and think about it. And then there are all the tasks where you need to plan.
So if you are a not-so-experienced chess player, or you are experienced but you play against another experienced chess player, you think about all kinds of options, right? You think about it for a while, right? And you are much better if you have time to think about it than if you play blitz with limited time.
And so this type of deliberate planning, which uses your internal world model, that's System 2. This is what LLMs currently cannot do. So how do we get them to do this, right? How do we build a system that can do this kind of planning, or reasoning, that devotes more resources to complex problems than to simple problems?
And it's not going to be autoregressive prediction of tokens. It's going to be more something akin to inference of latent variables in, you know, what used to be called probabilistic models, or graphical models, and things of that type. So the principle is like this: the prompt is like the observed variables.
And what the model does is basically a measure: it can measure to what extent an answer is a good answer for a prompt, okay? So think of it as some gigantic neural net, but it's got only one output. And that output is a scalar number, which is, let's say, zero if the answer is a good answer for the question, and a large number if the answer is not a good answer for the question.
Imagine you had such a model. If you had such a model, you could use it to produce good answers. The way you would do it is, you know, produce the prompt, and then search through the space of possible answers for one that minimizes that number. That's called an energy-based model.
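A toy illustration of that inference-as-search idea. The energy function here is entirely made up (word overlap), just to show the shape of the procedure: score each candidate answer with a scalar and return the minimizer.

```python
def energy(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for the scalar-output network:
    low energy means the (prompt, answer) pair is compatible."""
    overlap = len(set(prompt.split()) & set(answer.split()))
    return 1.0 / (1 + overlap)

def infer(prompt: str, candidates: list) -> str:
    # Inference is an optimization: pick the answer with minimal energy.
    return min(candidates, key=lambda a: energy(prompt, a))

best = infer("why do cups break when they fall",
             ["cups break because gravity pulls them down",
              "the sky is blue"])
print(best)
```

A real system would not enumerate candidates like this; the point is only that generation becomes a search for the answer the critic scores best.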
But that energy-based model would need a model constructed
by the LLM. Well, so really, what you would need to do would be to not search over possible strings of text that minimize that energy. What you would do instead is do this in abstract representation space. So in, sort of, the space of abstract thoughts, you would elaborate a thought, right, using this process of minimizing the output of your model, okay, which is just a scalar. It's an optimization process, right?
So now the way the system produces its answer is through optimization, by, you know, minimizing an objective function, basically, right? And this is inference we're talking about; we're not talking about training. The system has been trained already.
So now we have an abstract representation of the thought of the answer, a representation of the answer. We feed that to basically an autoregressive decoder, which can be very simple, that turns this into a text that expresses the thought. Okay? So that, in my opinion, is the blueprint of future dialogue systems: they will think about their answer, plan their answer by optimization before turning it into text. And that is Turing-complete. Can you explain
exactly what the optimization problem there is? Like, what's the objective function? Just linger on it. You kind of briefly described it, but over what space are you optimizing?
The space of representations, abstract representations. Abstract, so you have an abstract representation inside the system. The prompt goes through an encoder, produces a representation, perhaps goes through a predictor that predicts a representation of the answer, of the proper answer.
But that representation may not be a good answer, because there might be some complicated reasoning you need to do, right? So then you have another process that takes the representation of the answer and modifies it so as to minimize a cost function that measures to what extent the answer is a good answer for the question. Now, we sort of ignore for the moment the issue of how you train that system to measure whether an answer is a good answer for a question.
But suppose such a system could be created. What's the process? This kind of search-like process?
It's an optimization process. You can do this if the entire system is differentiable: that scalar output is the result of, you know, running the representation of the answer through some neural net. Then, by gradient descent, by backpropagating gradients, you can figure out how to modify the representation of the answer.
So that's still gradient-based?
It's gradient-based inference. So now you have a representation of the answer in abstract space. Now you can turn it into text, right? And the cool thing about this is that the representation now can be optimized through gradient descent, but it's also independent of the language in which you're going to express the answer.
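A minimal numerical sketch of gradient-based inference. Everything here is a toy: a quadratic energy with a hand-written gradient stands in for the frozen network. The key point is that it is the latent answer representation z, not the weights, that gets optimized:

```python
def energy(z, target):
    """Toy differentiable energy: squared distance between the latent
    answer z and a 'good' representation the critic prefers."""
    return sum((zi - ti) ** 2 for zi, ti in zip(z, target))

def grad(z, target):
    # Analytic gradient of the quadratic energy with respect to z.
    return [2 * (zi - ti) for zi, ti in zip(z, target)]

target = [1.0, -2.0, 0.5]     # hidden optimum of the toy energy
z = [0.0, 0.0, 0.0]           # initial guess, e.g. from a predictor
lr = 0.1
for _ in range(100):          # inference loop: refine z, not weights
    z = [zi - lr * gi for zi, gi in zip(z, grad(z, target))]

print(energy(z, target))      # close to zero after descent
```

The refined z would then be handed to a decoder to render as text in whatever language is wanted.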
Great. So you're operating in the space of abstract representations. I mean, this goes back to the joint embedding, right? That it's better to work in the space of, I don't know, the space of concepts versus the space of concrete sensory information, right? Okay. But can't LLMs do something like reasoning, which is what we're talking about?
Well, not really, only in a very simple way. I mean, basically you can think of those things as doing the kind of optimization I was talking about, except they optimize in the discrete space, which is the space of possible sequences of tokens. And they do this optimization in a horribly inefficient way, which is: generate a lot of hypotheses, and then select the best ones.
And that's incredibly wasteful in terms of computation, because you basically have to run your LLM for every possible generated sequence, and it's incredibly wasteful. So it's much better to do an optimization in continuous space, where you can do gradient descent, as opposed to generating tons of things and then selecting the best. You just iteratively refine your answer to go towards the best, right? That's much more efficient. But you can only do this in continuous spaces with differentiable functions.
You're talking about the reasoning, like, the ability to think deeply, or to reason deeply. How do you know what is an answer that's better
or worse,
based on deep reasoning? So then we are
asking the question of, conceptually, how do you train an energy-based model, right? So an energy-based model is a function with a scalar output, just a number.
You give it two inputs, x and y, and it tells you whether y is compatible with x or not. x you observe; let's say it's a prompt, an image, a video, whatever. And y is a proposal for an answer, a continuation of the video,
you know, whatever. And it tells you whether y is compatible with x. And the way it tells you that y is compatible with x is that the output of that function will be zero if y is compatible with x, and it will be a positive number, non-zero, if y is not compatible with x.
Okay, so how do you train a system like this? At a completely general level: you show it pairs of x and y that are compatible, a question and the corresponding answer, and you train the parameters of the big neural net inside to produce zero. Okay. Now, that doesn't work by itself, because the system might decide, "I'm just going to say zero for everything."
So now you have to have a process to make sure that, for a wrong y, the energy will be larger than zero. And there you have two options. One is contrastive methods.
So a contrastive method is: you show an x and a bad y, and you tell the system, "Well, give a high energy to this." Like, push up the energy, right? Change the weights in the neural net that computes the energy so that it goes up. So that's contrastive methods. The problem with this is, if the space of y is large, the number of such contrastive samples you're going to have to show is gigantic. But people do this.
They do this when you train a system with RLHF. Basically, what you're training is what's called a reward model, which is basically an objective function that tells you whether an answer is good or bad. And that's basically exactly what this is.
So we already do this to some extent. We're just not using it for inference; we're just using it for training.
Now, there is another set of methods which are non-contrastive, and I prefer those. And those non-contrastive methods basically say: okay, the energy function needs to have low energy on pairs of x and y that are compatible, that come from your training set. How do you make sure that the energy is going to be higher everywhere else? And the way you do this is by having a regularizer, a criterion, a term in your cost function that basically minimizes the volume of space that can take low energy. And the precise way to do this, there are all kinds of different specific ways to do it, depending on the architecture. But that's the basic principle: if you push down the energy function for particular regions in the (x, y) space, it will automatically go up in other places, because there's a limited volume of space that can take low energy, okay, by the construction of the system, or by the regularizing function.
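A schematic sketch of the two training recipes with scalar energies. Both loss functions are illustrative only, not the specific losses of any published method:

```python
def contrastive_loss(e_pos, e_neg, margin=1.0):
    """Push the good pair's energy down and an explicitly sampled
    bad pair's energy up, until separated by `margin` (hinge)."""
    return e_pos + max(0.0, margin - e_neg)

def regularized_loss(e_pos, low_energy_volume, weight=0.5):
    """No negative samples: push the good pair's energy down while a
    regularizer shrinks the total volume of low-energy space, so
    energy rises everywhere else by construction."""
    return e_pos + weight * low_energy_volume

# A bad pair already above the margin contributes no penalty...
assert contrastive_loss(e_pos=0.0, e_neg=2.0) == 0.0
# ...but a low-energy bad pair is penalized, forcing its energy up.
assert abs(contrastive_loss(e_pos=0.0, e_neg=0.2) - 0.8) < 1e-12
```

The practical difference being pointed at: the contrastive recipe needs a supply of bad y samples, which becomes hopeless when the space of y is huge, while the regularized recipe sidesteps negatives entirely.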
We've been talking very generally, but what is a good x and a good y? What is a good representation of x and y? Because when we're talking about language, if you just take language directly, that presumably is not good. So there has to be some kind of abstract representation
of ideas, yeah. So you can do this with language directly, by just, you know, x is a text and y is the continuation of that text. Or x is a question, y is the answer.
But you're saying that's not going to take us... I mean, that's going
to do what LLMs are doing. Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside of the system there is a latent variable, let's call it z, that you can manipulate so as to minimize the output energy, then that z can be viewed as a representation of a good answer that you can translate into a y that is a good answer.
So this kind of system could be trained in a very similar way?
Very similar way. But you have to have this way of preventing collapse, of ensuring that, you know, there is high energy for things you don't train it on. And currently it's very implicit in LLMs; it's done in a way that people don't realize is being done, but it is being done. It's due to the fact that when you give
a high probability to a word, automatically you give low probability to other words, because you only have a finite amount of probability to go around, right, that has to sum to one. So when you minimize the cross-entropy or whatever, when you train your LLM to predict the next word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words. Now, indirectly, that gives high probability to sequences of words that are good, and low probability to sequences of words that are bad, but it's very indirect. And it's not obvious why this actually works at all, because you're not doing it on the joint probability of all the symbols in the sequence; you're just doing it on a kind of factorization of that probability in terms of conditional probabilities over successive tokens.
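A small numeric sketch of that implicit mechanism, using toy logits over a three-token vocabulary: because a softmax must sum to one, raising the correct token's probability necessarily lowers every other token's.

```python
import math

def softmax(logits):
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 1.0, 1.0]   # three candidate next tokens, initially tied
before = softmax(logits)

logits[0] += 2.0           # training pushes up the correct token's logit
after = softmax(logits)

assert after[0] > before[0]                           # correct token rose...
assert after[1] < before[1] and after[2] < before[2]  # ...others fell
assert abs(sum(after) - 1.0) < 1e-9                   # still sums to one
```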
So how do you do this for visual data?
So we've been doing this with JEPA architectures, basically. There, the compatibility between two things is, you know: here's an image or a video,
and here is a corrupted, shifted, or transformed version of that image or video, or a masked one, okay? And then the energy of the system is the prediction error of the representation: the predicted representation of the good thing versus the actual representation of the good thing, right? So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. So this system will tell you whether this is a good pair or not.
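A schematic of that energy computation, with toy stand-ins for the encoder and predictor. The real JEPA networks are learned; here they are trivial functions just to show the wiring:

```python
def encoder(x):
    """Toy stand-in for the learned representation encoder."""
    return [xi * 0.5 for xi in x]

def predictor(s):
    """Toy stand-in for the learned predictor (identity here)."""
    return s

def jepa_energy(x, y):
    """Energy = prediction error in representation space: predict the
    clean input's representation from the corrupted view y."""
    predicted = predictor(encoder(y))
    actual = encoder(x)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

image = [1.0, 2.0, 3.0]
slightly_corrupted = [1.1, 2.0, 2.9]
unrelated = [9.0, -4.0, 0.0]

assert jepa_energy(image, slightly_corrupted) < jepa_energy(image, unrelated)
```

Note the prediction error is computed between representations, not between pixels, which is the distinguishing feature of the joint embedding approach.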
Given an image and a corrupted version of it, it will give you low energy if those two things are, effectively, one a corrupted version of the other. It will give you a high energy if the two images
are different. Which gives you a really nice, compressed representation of
of visual reality. And we know it does, because then we use those representations as inputs to
a classification system, for example. Well, so to summarize, you recommend, in the spicy way that only Yann LeCun can: you recommend that we abandon generative models in favor of joint embedding architectures? Yes. Abandon autoregressive generation? Yes. Abandon probabilistic models in favor of energy-based models, as we talked about? And abandon contrastive methods in favor of regularized methods?
And let me ask you about this: you've been, for a while, a critic of reinforcement learning. Yes. So the last recommendation is that we abandon RL in favor of model-predictive control, as you were talking about, and only use RL when planning doesn't yield the predicted outcome, and we use RL in that case to adjust the world model or the critic.
Yes. So you mentioned RLHF, reinforcement learning with human feedback. Why do you still hate reinforcement learning? I don't hate
reinforcement learning, and I think it should not be abandoned completely. But I think its use should be minimized, because it's incredibly inefficient in terms of samples. And so the proper way to train a system is to first have it learn good representations of the world, and world models, from mostly observation, maybe a little bit of interaction.
And then steer based on that. If the representations are good, then the adjustments
should be minimal. Yeah. Now, there are two things you can use. If you've learned a world model, you can use the world model to plan a sequence of actions to arrive at a particular objective.
You don't need RL, unless the way you measure whether you succeed might be inexact. Your idea of whether you're going to fall from your bike might be wrong, or whether the person you're fighting with in MMA is going to do one thing and then do something else.
So there are two ways you can be wrong: either your objective function does not reflect the actual objective you want to optimize, or your world model is inaccurate, right? So the prediction you were making about what was going to happen in the world is inaccurate. And if you want to adjust your world model while you are operating in the world, or adjust your objective function, that is basically in the realm of RL. This is what RL deals with, to some extent, right?
So, adjust your world model. And the way to adjust your world model, even in advance, is to explore parts of the space where you know that your world model is inaccurate. That's called curiosity, basically, or play, right? When you play, you kind of explore parts of the state space that, you know, you don't want to do for real, because it might be dangerous, but you can adjust your world model without killing yourself, basically. So that's what you want to use RL for. When it comes time to learning a particular task, you already have all the good representations;
you already have your world model, but you need to adjust it for the situation at hand. That's when you use RL.
Why do you think RLHF works so well? This reinforcement learning with human feedback: why did it have such a transformational effect on large language models?
So what's had the transformational effect is human feedback. There are many ways to use it, and some of it is just purely supervised,
actually. It's not really RL.
So it's the HF. And then there are various ways to use human feedback. So you can ask humans to rate answers, multiple answers that are produced by your model. And then what you do is you train an objective function to predict that rating.
And then you can use that objective function to predict, you know, whether an answer is good, and you can backpropagate gradients through this to fine-tune your system so that it only produces highly rated answers. Okay, so that's one way. So that's, in RL terms, training with a reward model, right?
So, something, a small neural net that estimates to what extent an answer is good, right? That's very similar to the objective I was talking about earlier for planning, except now it's not used for planning; it's used to fine-tune your system. I think it would be much more efficient to use it for planning, but currently it's used to fine-tune the parameters of the system. Now, there are several ways to do this.
You know, some of them are supervised. You just, you know, ask a human person, "What is a good answer for this?" Right? Then they just type the answer. I mean, there are lots of ways that those systems are being adjusted.
Now, a lot of people have been very critical of the recently released Google Gemini 1.5 for essentially, in my words, I could say, being super woke.
Well, in the negative connotation of that word. There are some almost hilariously absurd things that it does, like it modifies history, like generating images of a Black George Washington, or, perhaps more seriously, something that you commented on on Twitter, which is refusing to comment on, or generate images of, or even descriptions of, Tiananmen Square or the Tank Man, one of the most legendary protest images in history. Of course, these images are highly censored by the Chinese government, and therefore everybody started asking questions about what the process of designing these LLMs is, what the role of censorship in these is, and all that kind of stuff. So you commented on Twitter, saying that open source is the answer, essentially. So can you explain?
I actually made that comment on just about every social network I can, and I've made that point multiple times in various forums. Here's my point of view on this: people can complain that AI systems are biased, and they generally are biased by the distribution of the training data that they've been trained on, which reflects biases in society. And that is potentially offensive to some people, or potentially not. And some techniques to de-bias then become offensive to some people, because of, you know, historical incorrectness and things like that. And so you can ask the question, you can ask two questions.
The first question is: is it possible to produce an AI system that is not biased? And the answer is: absolutely not. And it's not because of technological challenges, although there are technological challenges to that. It's because bias is in the eye of the beholder. Different people may have different ideas about what constitutes bias, for a lot of things. I mean, there are facts that are indisputable, but there are a lot of opinions, or things that can be expressed in different ways. And so you cannot have an unbiased system; that's just an impossibility. And so what's the answer to this? The answer is the same answer that we found in liberal democracies about the press: the press needs to be free and diverse.
We have free speech for a good reason: it's because we don't want all of our information to come from a unique source, because that's opposite to the whole idea of democracy, and the progress of ideas, and even science, right? In science, people have to argue for different opinions, and science makes progress when people disagree and they come up with an answer, and a consensus forms, right? And it's true in all democracies around the world. So there is a future, which is already happening, where every single one of our interactions with the digital world will be mediated by AI systems, AI assistants, right? We're going to have smart glasses.
You can already buy them from Meta, the Ray-Ban Meta, where, you know, you can talk to them, and they are connected with an LLM, and you can get answers to any question you have. Or you can be looking at a monument, and there is a camera in the glasses, and you can ask it, "What can you tell me about this building or this monument?" You can be looking at a menu in a foreign language, and the thing will translate it for you. Or we can do real-time translation if we speak different languages. So a lot of our interactions with the digital world are going to be mediated by those systems in the near future. You know, increasingly, the search engines that we're going to use are not going to be search engines; they're going to be dialogue systems that we just ask a question, and it will answer and then point you to perhaps the appropriate references for it.
But here is the thing: we cannot afford those systems to come from a handful of companies on the West Coast of the US, because those systems will constitute the repository of all human knowledge, and we cannot have that be controlled by a small number of people, right? It has to be diverse, for the same reason the press has to be diverse.
So how do we get a diverse set of AI assistants? It's very expensive and difficult to train a base model, right? A base LLM at the moment; you know, in the future it might be something different. But at the moment, that's an LLM. So only a few companies can do this properly. And if some of those top systems are open source, anybody can use them; anybody can fine-tune them.
If we put in place some systems that allow any group of people, whether they are individual citizens, groups of citizens, government organizations, NGOs, companies, whatever, to take those open-source AI systems and fine-tune them for their own purpose, on their own data, then we're going to have a very large diversity of different AI systems that are specialized for all of those things, right? So, I'll tell you, I talked to the French government quite a bit, and the French government will not accept that the digital diet of all their citizens be controlled by three companies on the West Coast of the US.
That's just not acceptable. It's a danger to democracy, regardless of how well-intentioned those companies are, right? And it's also a danger to local culture, to values, to language, right? I was talking with the founder of Infosys in India.
He's funding a project to fine-tune LLaMA 2, the open-source model produced by Meta, so that LLaMA 2 speaks all 22 official languages of India. It's very important for people in India. I was talking to a former colleague of mine, Moustapha Cissé, who used to be a scientist at FAIR, and then moved back to Africa and created a research lab for Google in Africa, and now has a new startup.
And what he's trying to do is basically have an LLM that speaks the local languages in Senegal, so that people can have access to medical information, because they don't have access to doctors; there's a very small number of doctors per capita in Senegal. I mean, you can't have any of this unless you have open platforms. So with open-source platforms, you can have AI systems that are not only diverse in terms of political opinions or things of that type, but in terms of language, culture, value systems, political opinions, and technical abilities in various domains.
And you can have an industry, an ecosystem of companies, that fine-tune those open-source systems for vertical applications in industry, right? I don't know, a publisher has dozens of books, and they want to build a system that allows a customer to just ask a question about the content of any of those books; you need to train on their proprietary data, right? There's a company, we have one within Meta, it's called Metamate, and it can answer any question about internal stuff about the company. Very useful. A lot of companies want this, right? A lot of companies want this not just for their employees, but also for their customers, to take care of their customers. So the only way you're going to have an AI industry, the only way you're going to have AI systems that are not uniquely biased, is if you have open-source platforms on top of which any group can build specialized systems. So the inevitable direction of history is that the vast majority of AI systems will be built on top of open-source platforms.
So that's a beautiful vision. So, meaning a company like Meta or Google should take only minimal fine-tuning steps after building the foundation, pre-trained model? As few steps as possible?
Basically.
Can Meta afford to do that?
No.
So, I don't know if you know this, but companies are supposed to make money somehow.
And open source is, like, giving it away. I don't know if you saw it, but Mark made a video, Mark Zuckerberg, a very sexy video, talking about 350,000 Nvidia H100s.
The math there, just for the GPUs, that's 100 billion, plus infrastructure for training everything. So I'm no business guy, but how do you make money on that? So the vision you paint is a really powerful one, but how is it possible to make money?
Okay, so you have several business models, right? The business model that Meta is built around is: you offer a service, and the financing of that service is either through ads or through business customers.
So for example, if you have an LLM that can help a mom-and-pop pizza place by talking to its customers, so the customers can just order a pizza and the system will just ask them, like, what toppings do you want, what size, the business will pay for that. Okay, that's one model. And otherwise, you know, if it's a system on the more kind of classical services, it can be ad-supported, or there are several models. But the point is, if you have a big enough potential customer base, and you need to build that system anyway for them, it doesn't hurt you to actually distribute it in open source.
Again, I'm no business guy, but if you release the open source model, then other people can do the same kind of task and compete on, basically, providing fine-tuned models for businesses. Is that the bet that Meta is making? Well, I'm a huge fan of all this, but is the bet that Meta is making, like, "we will do a better job of it"?
Well, no. The bet is more, we already have a huge user base and customer base, right? So it's going to be useful to them; whatever we offer them is going to be useful.
And there is a way to derive revenue from this. And it doesn't hurt that we provide that system, or the base model, right, the foundation model, in open source for others to build applications on top of it. If those applications turn out to be useful for our customers, we can just buy it from them. It could be that they will improve the platform. In fact, we see this already.
I mean, there are literally millions of downloads of LLaMA 2, and thousands of people who have provided ideas about how to make it better. So, you know, this clearly accelerates progress, to make the system available to a wide community of people, and there are literally dozens of businesses who are building applications with it. So Meta's ability to derive revenue from this technology is not impaired by the distribution of base models in open source. The fundamental
criticism that Gemini is getting is that, as you pointed out, it's on the West Coast. Just to clarify, we're currently on the East Coast, where, I would suppose, the Meta AI headquarters would be. So there are strong words about the West Coast. But I guess the issue that happens is, I think it's fair to say that most tech people have a political affiliation with the left wing; they lean left.
And so the problem that people are criticizing Gemini with is that, in that fine-tuning process that you mentioned, their ideological lean becomes obvious. Is this something that could be escaped? You're saying open source is the only way. Have you witnessed this kind of ideological lean that makes engineering difficult? No.
I don't think it has to do... I don't think the issue has to do with the political leaning of the people designing those systems. It has to do with the acceptability, or the political leanings, of their customer base or audience, right? A big company cannot afford to offend too many people.
So they're going to make sure that whatever product they put out is "safe," whatever that means. And it's very possible to overdo it. And it's also impossible to do it properly for everyone; you're not going to satisfy everyone.
So that's what I said before: you cannot have a system that is unbiased, that is perceived as unbiased by everyone. You push it one way, and one set of people is going to see it as biased; then you push it the other way, and another set of people is going to see it as biased. And then, in addition to this, there's the issue of, if you push the system perhaps too far in one direction, it's going to be non-factual, right? You're going to have, you know, Black Nazi soldiers in...
Yeah, so we mentioned the image generation of Black Nazi soldiers, which is not historically accurate.
And can be offensive for some people as well, right? So, you know, it's going to be impossible to produce systems that are unbiased for everyone. So the only solution that I see is diversity.
And diversity in the full meaning of that word: diversity in every possible way. Yeah. Marc Andreessen just tweeted about this today; let me do a TL;DR. The conclusion is, only startups and open source can avoid the issue that he's highlighting with big tech. He's asking, can big tech actually field generative AI products?
One: ever-escalating demands from internal activists, employee mobs, crazed executives, broken boards, pressure groups, extremist regulators, government agencies, the press, in quotes, "experts," and so on, corrupting the output. Two: the constant risk of generating a bad answer, or drawing a bad picture, or rendering a bad video; who knows what it's going to say or do at any moment? Three: legal exposure, product liability, slander, election law, many other things, and so on.
Anything that makes Congress mad. Four: continuous attempts to tighten the grip on acceptable output degrade the model, like, how good it actually is, in terms of usability and pleasantness to use and effectiveness, all that kind of stuff. And five: publicity of bad text, images, or video actually puts those examples into the training data for the next version, and so on.
So he just highlights how difficult this is, apart from all kinds of people being unhappy. As you said, you can't create a system that makes everybody happy.
Yes.
So if you're going to do the fine-tuning yourself and keep it closed source, essentially the problem there is then trying to minimize the number of people who are going to be unhappy.
Yeah.
And you're saying that's almost impossible to do, right, and the better way is to do open source.
Basically, yeah. I think Marc is right about a number of the things he lists that indeed scare large companies. You know, investigations is one of them; legal liability; making things that get people to hurt themselves or hurt others. Companies are really careful about not producing things of this type, because they don't want to hurt anyone, first of all, and then they want to preserve their business. So it's essentially impossible for systems like this
that inevitably formulate political opinions, and opinions about various things that may be political or not, but that people may disagree about: moral issues, questions about religion, or cultural issues that people from different communities will disagree with in the first place. So there's only a relatively small number of things that people will agree on, basic principles. But beyond that, if you want those systems to be useful, they will necessarily have to offend a number of people, inevitably.
And so, to you, open source is better than a closed service, right? And open source enables diversity.
That's right. Open source enables diversity.
That's going to be a fascinating world, where, if it's true, the open source world, if Meta leads the way and creates this kind of open source foundation model world, there's going to be, like, governments will have a fine-tuned model, and then potentially, you know, people that vote left and right will have their own model and preference to be able to choose. And it will potentially divide us even more, but that's on us humans; we get to figure it out. Basically, the technology enables humans to human more effectively, and all the difficult ethical questions that humans raise, it will just leave it up to us to figure out.
Yeah, I mean, there are some limits
to what, you know... the same way there are limits to free speech, there have to be some limits to the kind of stuff that those systems might be authorized to produce, some guardrails. So, I mean, that's one thing I've been interested in, which is, in the type of architecture that we were discussing before, where the output of the system is the result of an inference to satisfy an objective: that objective can include guardrails, and we can put guardrails in open source systems. I mean, if we eventually have systems that are built with this blueprint, we can put guardrails in those systems that guarantee that there is sort of a minimum set of guardrails that make the system non-dangerous and non-toxic, et cetera. You know, basic things that everybody would agree on. And then the fine-tuning that people will add, or the additional guardrails that people will add, will kind of cater to their community, whatever it is.
And the fine-tuning will be more about the gray areas of what is hate speech, what is dangerous, and things like that.
I mean, different value systems.
I mean, but still, even with the objective of, like, how to build a bioweapon, for example, I think something you've commented on, or at least there's a paper where a collection of researchers is trying to understand the social impacts of these LLMs. And I guess one threshold that's nice is, like, does the LLM make it any easier than a search would, like a Google search would?
Right. So the increasing number of studies on this seems to point to the fact that it doesn't help. So having an LLM doesn't help you design or build a bioweapon or a chemical weapon, if you already have access to a search engine and a library. The increased information you get, or the ease with which you get it, doesn't really help you. That's the first thing. The second thing is, it's one thing to have a list of instructions for how to make a chemical weapon, for example, or a bioweapon; it's another thing to actually build it. And it's much harder than you might think, and an LLM will not help you with that.
In fact, you know, nobody in the world, not even countries, uses bioweapons, because most of the time they have no way to protect their own populations against them. So they're too dangerous, actually, to ever use, and they're in fact banned by international treaties. Chemical weapons are different; they're also banned by treaties, but it's the same problem: they're difficult to use in situations where it doesn't turn against the perpetrators. But you could ask Elon Musk: like, I can give you a very precise list of instructions for how you build a rocket engine, and even if you have a team of fifty engineers that are really experienced building them, you're still going to have to blow up a dozen of them before you get one that works. And, you know, it's the same with chemical weapons or bioweapons, or things like that. It requires expertise in the real world that an LLM is not going to help you with.
And it requires even the common-sense expertise that we've been talking about, which is how to take language-based instructions and materialize them in the physical world, which requires a lot of knowledge that's not in the instructions.
Yeah, exactly.
A lot of biologists have posted on this, actually, in response to those things, saying, like, do you realize how hard it is to actually do the lab work? Like, you know, this is not trivial.
Yeah, and that's where Moravec's paradox comes to light once again. Just to linger on LLaMA: Mark announced that LLaMA 3 is coming out eventually. I don't think there's a release date, but what are you most excited about? First of all, LLaMA 2, that's already out there, and maybe the future LLaMA 3, 4, 5, 6, 10; just the future of open source under Meta?
Well, a number of things. So there's going to be, like, various versions of LLaMA that are improvements of previous LLaMAs: bigger, better, multimodal, things like that. And then, in future generations, systems that are capable of planning, that really understand how the world works, maybe are trained from video, so they have some world model.
Maybe, you know, capable of the type of reasoning and planning I was talking about earlier.
Like, how long is that going to take? Like, when is the research that is going in that direction going to sort of feed into the product line, if you want, of LLaMA?
I don't know. I can't tell you.
And there are, you know, a few breakthroughs that we have to basically go through before we can get there. But you'll be able to monitor our progress, because we publish our research, right? So, you know, last week we published the V-JEPA work, which is sort of a first step towards training systems from video, and then the next step is going to be world models based on kind of this type of idea, training from video. There's similar work at DeepMind also taking place, and also at UC Berkeley, on world models from video. A lot of people are working on this; I think a lot of good ideas are appearing.
My bet is that those systems are going to be JEPA-like; they're not going to be generative models. And we'll see what the future will tell. There's really good work by, for instance, a gentleman called Danijar Hafner, who is now at DeepMind, who has worked on models of this type that learn representations and then use them for planning, or for learning tasks by reinforcement learning. And a lot of work at Berkeley by Pieter Abbeel, Sergey Levine, and a bunch of other people of that type that I'm collaborating with, actually, in the context of some grants with my NYU group, and then collaborations also through Meta, because part of that work is associated with Meta, with FAIR. So I think it's very exciting. I'm super excited about... I haven't been that excited about the direction of machine learning and AI since, you know, ten years ago, when FAIR was started, and before that, thirty-five years ago, when we were working on convolutional nets and the early days of neural nets. So I'm super excited, because I see a path towards potentially human-level intelligence, with systems that can understand the world, remember, plan, reason. There is a set of ideas to make progress there that might have a chance of working, and I'm really excited about this. What I like is that, you know, somehow we can get onto a good direction and perhaps succeed before my brain turns to a white sauce, or before I need to retire.
Yeah, yeah. Are you excited... is it beautiful to you, just the amount of GPUs involved, the whole training process on this much compute? Just zooming out, just looking at Earth: humans together have built these computing devices and are able to train this one brain. Then we open source it, like giving birth to this open source brain trained on this gigantic compute system. There's just the details of how to train on that, how to build the infrastructure and the hardware, the cooling, all of this kind of stuff. Or is most of your excitement still in the theory aspect of it, meaning, like, the software?
Well, I used to be a hardware guy many years ago. Yes, decades ago.
Hardware has improved a little bit.
A little bit, yeah.
I mean, certainly scale is necessary but not sufficient.
Absolutely.
So we certainly need computation, and we're still far, in terms of compute power, from what we would need to match the compute power of the human brain. This may occur in the next couple of decades, but we're still some ways away. And certainly, in terms of power efficiency, we're really far. So there's a lot of progress to make in hardware. You know, right now a lot of the progress is... I mean, there's a bit coming from silicon technology, but a lot of it is coming from architectural innovation, and quite a bit is coming from more efficient ways of implementing the architectures that have become popular, basically combinations of transformers and convnets, right? So there's still some ways to go until we're going to saturate; we're going to have to come up with new principles, new fabrication technology, new basic components, perhaps based on principles different from classical digital CMOS.
Interesting.
So you think, in order to build AMI, we potentially might need some hardware innovation, too?
Well, if we want to make it ubiquitous, yeah, certainly, because we're going to have to reduce the power consumption. A GPU today, right, is half a kilowatt to a kilowatt. The human brain is about twenty-five watts, and a GPU is way below the power of the human brain; you'd need something like 100,000 or a million of them to match it. So we are off by a huge factor here.
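As a rough back-of-the-envelope check on that "huge factor," using the figures mentioned in the conversation (an assumed ~25 W for the brain, ~700 W per GPU as a midpoint of "half a kilowatt to a kilowatt," and the low end of the 100,000-to-a-million GPU estimate):

```python
# Back-of-envelope power comparison; all numbers are rough assumptions
# taken from the conversation, not precise measurements.
BRAIN_WATTS = 25          # approximate power draw of the human brain
GPU_WATTS = 700           # midpoint of "half a kilowatt to a kilowatt"
GPUS_TO_MATCH = 100_000   # low end of the estimate to match brain compute

cluster_watts = GPU_WATTS * GPUS_TO_MATCH     # total power of such a cluster
efficiency_gap = cluster_watts / BRAIN_WATTS  # factor vs. the brain

print(f"Cluster power: {cluster_watts / 1e6:.0f} MW")        # 70 MW
print(f"Efficiency gap: ~{efficiency_gap:,.0f}x the brain")  # ~2,800,000x
```

Even at the low end of the estimate, the gap is on the order of millions, which is the sense in which power efficiency is "really far" off.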
You often say that AGI is not coming soon, meaning, like, not this year, not the next few years, potentially farther away. What's your basic intuition behind that?
So, first of all, it's not going to be an event. The idea somehow, which, you know, is promoted by science fiction and Hollywood, that somebody is going to discover the secret, the secret to AGI, or human-level AI, or AMI, whatever you want to call it, and then turn on a machine, and then we have AGI: that's just not going to happen. It's not going to be an event; it's going to be gradual progress. Are we going to have systems that can learn from video how the world works and learn good representations? Yeah. Before we get them to the scale and performance that we observe in humans, it's going to take quite a while; it's not going to happen in one day.
Are we going to get systems that have large amounts of associative memory, so they can remember stuff? Yeah, but same thing: it's not going to happen tomorrow.
I mean, there are some basic techniques that need to be developed. We have a lot of them, but, you know, getting this to work together in a full system is another story. Are we going to have systems that can reason and plan, perhaps along the lines of the objective-driven AI architectures that I described before? Yeah, but before we get this to work properly, it's going to take a while.
And before we get all those things to work together, and then, on top of this, have systems that can learn, like, hierarchical planning, hierarchical representations, systems that can be configured for a lot of different situations at hand, the way the human brain can... you know, all of this is going to take at least a decade and probably much more, because there are a lot of problems that we're not seeing right now, that we have not encountered, and so we don't know if there is an easy solution within this framework. So, you know, it's not just around the corner. I mean, I've been hearing people for the last twelve, fifteen years claiming that AGI is just around the corner, and being systematically wrong. And I knew they were wrong when they were saying it; I called their bullshit.
Why do you think people have been saying that? First of all, from the beginning, from the birth of the term "artificial intelligence," there has been an eternal optimism that's perhaps unlike other technologies. Is it Moravec's paradox? Is that the explanation for why people are so optimistic about AGI?
I don't think it's just Moravec's paradox. Moravec's paradox is a consequence of realizing that the world is not as easy as we think. So, first of all, intelligence is not a linear thing that you can measure with a scalar, with a single number. You know, can you say that humans are smarter than orangutans? In some ways, yes, but in some ways orangutans are smarter than humans, in a lot of domains that allow them to survive in the forest, for example.
So IQ is a very limited measure of intelligence. True intelligence is bigger than what IQ, for example,
measures. Well, IQ can measure approximately something for humans, because humans come in relatively uniform form, right? But it only measures one type of ability that may be relevant for some tasks but not others. And then, if you're talking about other intelligent entities for which the basic things that are easy to them are very different, then it doesn't mean anything.
So intelligence is a collection of skills, and an ability to acquire new skills efficiently, right? And the collection of skills that a particular intelligent entity possesses, or is capable of learning quickly, is different from the collection of skills of another one. And because it's a multidimensional thing, the set of skills is a high-dimensional space, you can't measure it, and you cannot compare two things as to whether one is more intelligent than the other. It's multidimensional.
So you push back against what are called AI doomers a lot. Can you explain their perspective, and why you think they're wrong?
Okay, so AI doomers imagine all kinds of catastrophe scenarios of how AI could escape our control and basically kill us all, and that relies on a whole bunch of assumptions that are mostly false. So the first assumption is that the emergence of superintelligence could be an event: that at some point we're going to figure out the secret, and we'll turn on a machine that is superintelligent.
And because we've never done it before, it's going to take over the world and kill us all. That is false. It's not going to be an event. We're going to have systems that are, like, as smart as a cat, that have all the characteristics of human-level intelligence, but their level of intelligence would be like a cat, or a parrot, maybe, or something. And then we're going to work our way up to make those things more intelligent.
And as we make them more intelligent, we're also going to put some guardrails in them, and learn how to put in guardrails so they behave properly. And we're not going to do this with just one; it's not going to be one effort. It's going to be lots of different people doing this, and some of them are going to succeed at making intelligent systems that are controllable and safe and have the right guardrails. And if some other AI goes rogue, then we can use the good ones to go against the rogue ones. So it's going to be my smart AI police against your rogue AI. So we're not going to be exposed to, like, a single rogue AI that's going to kill us all; that's just not happening. Now, there is another fallacy, which is the idea that because a system is intelligent, it necessarily wants to take over. And there are several arguments that make people scared of this, which
I think are completely false as well. So one of them is: in nature, it seems to be that the more intelligent species end up dominating the others, and even, you know, extinguishing the others, sometimes by design, sometimes just by mistake. And so there is a sort of thinking by which you say, well, if AI systems are more intelligent than us, surely they're going to eliminate us, if not by design, then simply because they don't care about us. And that's just preposterous for a number of reasons. The first reason is, they're not going to be a species. They're not going to be a species that competes with us. They're not going to have the desire to dominate, because the desire to dominate is something that has to be hardwired into an intelligent system. It is hardwired in humans.
It is hardwired in baboons, in chimpanzees, in wolves, but not in orangutans. This desire to dominate, or submit, or attain status in other ways, is specific to social species. Non-social species, like orangutans, don't have it, right? And they are as smart as we are, almost.
To you, there's no significant incentive for humans to encode that into AI systems, and to the degree they do, there will be other AIs that sort of punish them for it, out-compete them.
Well, there are all kinds of incentives to make AI systems submissive to humans, right? I mean, this is the way we're going to build them, right?
And so then people say, oh, but look at LLMs. LLMs are not controllable. And they're right, LLMs are not controllable. But objective-driven AI,
systems that derive their answers by optimization of an objective, meaning they have to optimize that objective, and that objective can include guardrails. One guardrail is: obey humans. Another guardrail is: don't obey humans if it's hurting other humans.
I've heard that before somewhere, I don't remember.
Yes, maybe in the book.
Yeah, but speaking of that book, could there be unintended consequences, also, from all of this?
Oh, of course. So this is not a simple problem, right? I mean, designing those guardrails so that the system behaves properly is not going to be a simple issue for which there is a silver bullet, for which you have a mathematical proof that the system can be safe.
It's going to be a very progressive, iterative design, where we put those guardrails in, and see whether the system behaves properly. And sometimes it's going to do something that was unexpected, because the guardrail wasn't right, and we're going to correct it so that it does it right. The idea somehow that we can't get it slightly wrong, because if we get it slightly wrong, we all die, is ridiculous.
We're just going to go progressively. The analogy I've used many times is turbojet design. How did we figure out how to make turbojets so unbelievably reliable, right? I mean, those are incredibly complex pieces of hardware that run at really high temperatures for twenty hours at a time sometimes. And we can, you know, fly halfway around the world on a two-engine jetliner at near the speed of sound. Like, how incredible is this? It's just unbelievable, right? And did we do this because we invented, like, a general principle of how to make turbojets safe? No. It took decades to fine-tune the design of those systems so that they were safe.
Is there a separate group within, you know, General Electric or Snecma or whatever, that is specialized in turbojet safety? No. The design is all about safety, because a better turbojet is also a safer turbojet, a more reliable one. It's the same for AI. Do you need specific provisions to make AI safe? No, you need to make better AI systems, and they will be safe because they are designed to be more useful and more controllable.
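The "objective-driven" idea described in this exchange, where the output is whatever best satisfies a task objective that also contains hardwired guardrail costs, can be sketched very roughly as follows. Everything here (the function names, the toy one-dimensional "action," the cost values) is an illustrative assumption, not any actual system's design:

```python
# Toy sketch of objective-driven inference: the system's output is the
# candidate action minimizing a task cost plus guardrail costs.

def task_cost(action, goal):
    # How far the action is from achieving the user's goal (toy: squared error).
    return (action - goal) ** 2

def guardrail_cost(action):
    # Hardwired penalty the optimizer cannot trade away: a huge cost for
    # actions outside an allowed range stands in for "don't cause harm".
    return 0.0 if -10 <= action <= 10 else 1e9

def choose_action(goal, candidates):
    # Inference = optimization: pick the candidate with the lowest total cost.
    return min(candidates, key=lambda a: task_cost(a, goal) + guardrail_cost(a))

candidates = [c / 10 for c in range(-200, 201)]  # coarse search over actions
print(choose_action(3.0, candidates))   # goal inside guardrails -> 3.0
print(choose_action(50.0, candidates))  # goal outside -> clipped to 10.0
```

The point of the sketch is the design choice being discussed: the guardrail term is part of the objective the system must satisfy at inference time, rather than a behavior one hopes fine-tuning has instilled.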
So let's imagine a system, an AI system, that's able to be incredibly convincing and can convince you of anything. I can at least imagine such a system, and I can see such a system being weapon-like, because it can control people's minds; we're pretty gullible. We want to believe things, and you can have an AI system like that. You could see governments using that as a weapon. So do you think, if you imagine such a system, there's any parallel to something like nuclear weapons?
No.
So why? Why is that technology different? So you're saying there's going to be gradual development. Yeah, it might be rapid, but it'll be iterative, and we'll be able to respond, and so on.
So, that AI system designed by Vladimir Putin, or whatever, or his minions, is going to be trying to talk to every American to convince them to vote for whoever pleases Putin, or whatever, or rile people up against each other, as they've been trying to do. They're not going to be talking to you.
They're going to be talking to your AI assistant, which is going to be as smart as theirs, right? Because, as I said, in the future, every single one of your interactions with the digital world will be mediated by your AI assistant. So the first thing your assistant is going to ask is, is this a scam? Is this thing, like, telling me the truth?
It's not even going to be able to get to you, because it's only going to talk to your AI assistant. It's going to be like a spam filter, right? You don't even see the spam email, right? It's automatically put in a folder that you never see. It's going to be the same thing: that AI system that tries to convince you of something is going to be talking to your AI assistant, which is going to be at least as smart as it, and it's going to say, this is spam, you know; it's not even going to bring it to your attention.
So, to you, it's very difficult for any one AI system to take such a big leap ahead to where it can convince even the other AI assistants? So, like, there's always going to be this kind of race; nobody's going to be way ahead?
That's the history of the world. The history of the world is: whenever there is progress someplace, there is a countermeasure. And, you know, it's a cat-and-mouse game.
Mostly, yes. But this is why nuclear weapons are so interesting, because that was such a powerful weapon that it mattered who got it first. You know, you could imagine Hitler, Stalin, Mao getting the weapon first, and that having a different kind of impact on the world than the United States getting the weapon first. But to you, nuclear weapons... you don't imagine a breakthrough discovery, and then a Manhattan Project-like effort, for AI?
No. As I said, it's not going to be an event. It's going to be continuous progress, and whenever one breakthrough occurs, it's going to be widely disseminated really quickly, probably first within industry.
I mean, this is not a domain where government or military organizations are particularly innovative; they're, in fact, way behind. And so this is going to come from industry, and this kind of information disseminates extremely quickly. We've seen this over the last few years, right? Even if you take AlphaGo: it was reproduced within three months, even without particularly detailed information, right?
Yeah, this is an industry that's not good at secrecy.
No. But even then, just the fact that you know that something is possible makes you realize that it's worth investing the time to actually do it. You may be the second person to do it, but, you know, you'll do it. And, you know, same for all the innovations of self-supervised learning, transformers, decoder-only architectures, LLMs.
I mean, for those things, you don't need to know exactly the details of how they work to know that it's possible, because it's deployed and then it gets reproduced. And then, you know, people who work for those companies move; they go from one company to another, and the information disseminates. What makes the success of the US tech industry, and Silicon Valley in particular, is exactly that: it's because information circulates really, really quickly, and disseminates very quickly. And so, you know, the whole region is sort of ahead because of that circulation of information.
So maybe just to linger on the psychology of AI doomers: you give, in the classic Yann LeCun way, a pretty good example of what happens when a new technology comes to be. You say: an engineer says, "I invented this new thing. I call it a ballpen." And then the Twittersphere responds: "OMG,
people could write horrible things with it, like misinformation, propaganda, hate speech. Ban it now!" Then the writing doomers come in, akin to the AI doomers: "Imagine if everyone can get a ballpen. This could destroy society."
There should be a law against using ball pen to write hate speech regularly, ball pants now and then the pencil industry go this year about penally dangerous pencil writing, which is a reasonable ball pen writing states forever. Government should require a lessons for a pen manufacturer. I mean, this does seem to be part of human psychology when when IT comes up against new technology but what what deep insights can you speak to .
about this? Is it a natural fear of new technology and the impact it can have on society? People have a kind of instinctive reaction to the world they know being threatened by major transformations, either cultural phenomena or technological revolutions. They fear for their culture, they fear for their jobs.
They fear for the future of their children, and for their way of life. So any change is feared. And you see this through history: any technological revolution or cultural phenomenon was always accompanied by groups or reactions in the media that basically attributed all the current problems of society to that particular change. Electricity was going to kill everyone at some point. The train was going to be a horrible thing, because you can't breathe past fifty kilometers an hour. And so there's a wonderful website called the Pessimists Archive, which has newspaper clips of all the horrible things people imagined would arrive because of either a technological innovation or a cultural phenomenon.
There are wonderful examples of jazz or comic books being blamed for unemployment, or young people not wanting to work anymore, and things like that. And that has existed for centuries. It's knee-jerk reactions. The question is, do we embrace change, or do we resist it? And what are the real dangers, as opposed to the imagined ones?
So people worry about, I think, one thing with big tech, something we've been talking about over and over, but I think worth mentioning again. They worry about how powerful AI will be, and they worry about it being in the hands of one centralized power, just a handful of people in control. And so that's the skepticism with big tech: these companies can make a huge amount of money and control this technology, and by so doing, take advantage of, abuse, the little guy in society.
But that's exactly why we need open source platforms.
Yeah, I just wanted to drive the point home more and more.

Yes.
So let me ask you, like I said, you do get a little bit flavorful on the internet. Joscha Bach tweeted something that you LOL'd at, in reference to HAL 9000. Quote: "I appreciate your argument, and I fully understand your frustration, but whether the pod bay doors should be opened or closed is a complex and nuanced issue." You're at the head of Meta AI. You know, this is something that really worries me, that our AI overlords will speak down to us with corporate speak of this nature, and you sort of resist that with your way of being. Is this something you can comment on, sort of working at a big company: how can you avoid being overly cautious, avoid causing harm through caution?
Yeah, again, I think the answer to this is open source platforms, and then enabling a widely diverse set of people to build AI assistants that represent the diversity of cultures, opinions, languages, and value systems across the world, so that you're not bound to just be channeled into a particular way of thinking by a single AI entity. I mean, I think it's a really, really important question for society, and there's a real problem here, which is why I've been so vocal, and sometimes a little blunt about it.
To which I say: never stop, never stop.
It is because I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else. If you really want diversity of opinion from AI systems, in a future where we will all be interacting through AI systems, we need those to be diverse, for the preservation of a diversity of ideas and creeds and political opinions and whatever, and for the preservation of democracy. And what works against this is people who think that, for reasons of security, we should keep AI systems under lock and key, because it's too dangerous to put them in the hands of everybody, because they could be used by terrorists or something. That would lead to a potentially very bad future, in which all of our information diet is controlled by a small number of companies through proprietary systems.
Do you trust humans with this technology, to build systems that are, on the whole, good for humanity?
Isn't that what democracy and free speech is all about? I think so. Do you trust institutions to do the right thing? Do you trust people to do the right thing? And yeah, there are bad people who are going to do bad things, but they're not going to have superior technology to the good people.
So then it's going to be my good AI against your bad AI, right? I mean, it's the examples we were just talking about. Maybe some rogue country will build some AI system that's going to try to convince everybody to go to civil war or something, or elect a favorable ruler. But then they will have to get past our AI systems.
And an AI system with a strong Russian accent will be trying to convince us.

And doesn't put any articles in its sentences.

Well, it'll be, at the very least, absolutely comedic. Okay, so since we talked about the physical reality, I'd love to ask your vision of the future with robots in this physical reality. So many of the kinds of intelligence you've been speaking about would empower robots to be more effective collaborators with us humans.
So since Tesla's Optimus team has been showing off some progress on humanoid robots, I think it really reinvigorated the whole industry, which I think Boston Dynamics has been leading for a very, very long time. So now there's all kinds of companies: Figure AI, obviously Boston Dynamics.

Boston Dynamics.

Yeah, Unitree, but there are a lot of them. It's great. I mean, I love it. So, do you think there will be millions of humanoid robots walking around soon?
Not too soon, but it's going to happen. Like, the next decade, I think, is going to be really interesting in robots. The emergence of the robotics industry has been in the waiting for ten, twenty years, without really emerging, other than for kind of preprogrammed behavior and stuff like that.
And the issue is, again, the same kind of problem: how do we get these systems to understand how the world works, and kind of plan actions? So we can do it for really specialized tasks. And the way Boston Dynamics goes about it is basically with a lot of handcrafted dynamical models and careful planning in advance, which is very classical robotics, with a lot of innovation and a little bit of perception, but it's still not enough: they can't build a domestic robot, right? And we're still some distance away from completely autonomous level-five driving.
And we're certainly very far away from having level-five autonomous driving by a system that can train itself by driving twenty hours, like any seventeen-year-old. So until we have, again, world models, systems that can train themselves to understand how the world works, we're not going to have significant progress in robotics. So a lot of the people working on robot hardware at the moment are betting, or banking, on the fact that AI is going to make sufficient progress towards that.
And they're hoping to discover a product in it, too.

Yeah.

Before you have a really strong world model, there will be an almost-strong world model, and people are trying to find a product in a clumsy robot, I suppose, like, not a perfectly efficient robot. So there's the factory setting, where humanoid robots can help automate some aspects of the factory. I think that's a crazy difficult task, with all the safety required and all the constraints. I think the home is more interesting. But then you start to think, I think you mentioned loading the dishwasher, right?
Yeah, I suppose that's one of the main problems you're working on.
I mean, you know, tidying up the house, clearing up the table after a meal, washing the dishes, all those tasks, you know, cooking. I mean, all the tasks that in principle could be automated, but are actually incredibly sophisticated, really complicated.
But even just basic navigation around a space full of uncertainty.
That sort of works. Like, you can sort of do this now. Navigation is fine.
Well, navigation in a way that's compelling to us humans is a different thing.
Yeah, it's not there yet, necessarily. I mean, we have demos, actually, because there is a so-called embodied AI group at FAIR, and they've been, not building their own robots, but using commercial robots.
And you can tell a robot dog to go to the fridge, and it can actually open the fridge, and it can probably pick up a can in the fridge and stuff like that, and bring it to you. So it can navigate, it can grab objects, as long as it's been trained to recognize them, which vision systems work pretty well for nowadays. But it's not like a completely general robot that would be sophisticated enough to do things like clearing up the dinner table.
Yeah, to me that's an exciting future, of getting humanoid robots, robots in general, in the home more and more, because that gets humans to really directly interact with AI systems in the physical space, and in so doing, it allows us to philosophically and psychologically explore our relationships with robots. It can be really, really interesting. So I hope you make progress on the whole JEPA thing soon.

Well, I mean, I hope things work out as planned. Again, we've been kind of working on this idea of self-supervised learning from video for ten years, and only made significant progress in the last two or three.

And actually, you've mentioned that there are a lot of interesting breakthroughs that can happen without having access to a lot of compute.

Yes.

So if you're interested in doing a PhD in this kind of stuff, there are a lot of possibilities to do innovative work. So what advice would you give to an undergrad that's looking to go to grad school and do a PhD?
So basically, I have listed them already: this idea of how do you train a world model by observation. And you don't have to train necessarily on gigantic datasets. I mean, it could turn out to be necessary to actually train on large datasets to have emergent properties, like we have with LLMs. But I think there are a lot of good ideas that can be tried without necessarily scaling up.
Then there is, how do you do planning with a learned world model? If the world the system evolves in is not the physical world, but is the world of, let's say, the internet, or some sort of abstract world, where an action consists of doing a search on a search engine, or interrogating a database, or running a simulation, or calling a calculator, or solving a differential equation, how do you get a system to actually plan a sequence of actions to give the solution to a problem? So the question of planning is not just a question of planning physical actions. It could be planning actions to use tools, for a dialogue system, for any kind of intelligent system.
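The kind of planning described here, where an "action" is a tool call rather than a physical movement, can be illustrated with a toy breadth-first planner searching over tool actions. Everything in this sketch (the tool names, the numeric state, the search) is a hypothetical stand-in for illustration, not any system discussed in the conversation:

```python
from collections import deque

# Toy "tools": each action transforms a simple numeric state.
# Hypothetical stand-ins for real tools (search engine, calculator, simulator).
TOOLS = {
    "add_three": lambda x: x + 3,
    "double": lambda x: x * 2,
    "negate": lambda x: -x,
}

def plan(start: int, goal: int, max_depth: int = 6):
    """Breadth-first search for a sequence of tool calls turning start into goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions  # shortest sequence of tool names
        if len(actions) >= max_depth:
            continue
        for name, fn in TOOLS.items():
            nxt = fn(state)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [name]))
    return None  # no plan within the depth limit

# e.g. plan(1, 8) finds the sequence ["add_three", "double"]: 1 -> 4 -> 8
```

Here the action space and the search procedure are hand-specified; the open question raised in the conversation is how a learned system would plan such sequences of tool calls itself.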
And there's some work on this, but not a huge amount. Some work at FAIR, one called Toolformer, which was a while ago, and some more recent work on planning. But I don't think we have a good solution for any of this. Then there is the question of hierarchical planning. So the example I mentioned of planning a trip from New York to Paris,
that's hierarchical. But almost every action that we take involves hierarchical planning in some sense, and we really have absolutely no idea how to do this. There's zero demonstration of hierarchical planning where the various levels of representation that are necessary have been learned. We can do two-level hierarchical planning when we design the two levels.
So, for example, you have a legged robot, right? You want it to go from the living room to the kitchen. You can plan a path that avoids the obstacles, and then you can send this to a lower-level planner that figures out how to move the legs to kind of follow that trajectory, right? So that works.
But that two-level planning is designed by hand, right? We specify what the proper levels of abstraction, the representation at each level of abstraction, have to be. How do you learn this?
How do you learn hierarchical representations of action plans? You know, with convnets and deep learning, we can train the system to learn hierarchical representations of percepts. What is the equivalent when what you're trying to represent are
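The hand-designed two-level scheme described above, a coarse path planner feeding a low-level controller that follows the trajectory, can be sketched in toy form. Everything here (the grid map, the BFS path planner, the interpolating "controller" stub) is illustrative, not any particular robot's stack:

```python
from collections import deque

def high_level_path(grid, start, goal):
    """Coarse planner: BFS over free grid cells (0 = free, 1 = obstacle)."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path  # list of waypoint cells, start to goal
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None

def low_level_steps(path, steps_per_cell=4):
    """Low-level 'controller' stub: interpolate fine-grained motion targets
    between waypoints, standing in for the leg-movement planner."""
    targets = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        for i in range(1, steps_per_cell + 1):
            t = i / steps_per_cell
            targets.append((r0 + t * (r1 - r0), c0 + t * (c1 - c0)))
    return targets

# Living room at (0, 0), kitchen at (2, 3), with an obstacle wall in between.
grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
waypoints = high_level_path(grid, (0, 0), (2, 3))
motion = low_level_steps(waypoints)
```

The point of the sketch is that the abstraction boundary, waypoints on a grid versus fine-grained motion targets, is chosen by hand; learning where that boundary should sit is exactly the open problem being pointed at.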
Action plans. Yes. So you want, basically, a robot dog or humanoid robot that turns on and travels from New York to Paris all by itself.
For example.
It might have some trouble at the TSA.
Yeah, but even doing something that seems very simple, like a household task, sure, like cooking or something.
Yeah, there's a lot involved. It's a super complex task, and once again, we take it for granted. What hope do you have for the future of humanity? We're talking about so many exciting technologies, so many exciting possibilities. What gives you hope, when you look out over the next ten, twenty, fifty, a hundred years? If you look at social media, there are wars going on, there's division, there's hatred, all this kind of stuff that's also part of humanity. But amidst all that, what gives you hope?
I love that question. We can make humanity smarter with AI, okay? I mean, AI basically will amplify human intelligence. It's as if every one of us will have a staff of smart AI assistants.
They might be smarter than us. They'll do our bidding, perhaps execute tasks in ways that are much better than we could do ourselves, because they'll be smarter than us.
And so it's like everyone would be the boss of a staff of super-smart virtual people. So we shouldn't feel threatened by this any more than we should feel threatened by being the manager of a group of people, some of whom are more intelligent than us.
I certainly have a lot of experience with this, of having people working with me who are smarter than me. That's actually a wonderful thing. So having machines that are smarter than us, that assist us in all of our tasks, our daily lives, whether professional or personal, I think would be an absolutely wonderful thing. Because intelligence is the ability that is most in demand. That's really what I mean.
All the mistakes that humanity makes are because of lack of intelligence, really, or lack of knowledge, which is related. So making people smarter can only be better, for the same reason that public education is a good thing, and books are a good thing, and the internet is also a good thing, intrinsically. And even social networks are a good thing, if you run them properly. It's difficult, but you can. Because it helps the communication of information and knowledge and the transmission of knowledge. So AI is going to make humanity smarter. And the analogy I've been using is that perhaps an equivalent event in the history of humanity, to what might be provided by the widespread availability of AI assistants, is the invention of the printing press. It made everybody smarter, the fact that people could have access to books. Books were a lot cheaper than they were before.
And so a lot more people had an incentive to learn to read, which wasn't the case before. And people became smarter. It enabled the Enlightenment, right? There wouldn't have been an Enlightenment without the printing press. It enabled philosophy, rationalism, escape from religious doctrine, democracy, science.
And certainly without this, there wouldn't have been the American Revolution or the French Revolution, and so we would still be under feudal regimes, perhaps. And so it completely transformed the world, because people became smarter and could learn things. Now, it also created two hundred years of essentially religious conflicts in Europe, right? Because the first thing that people read was the Bible, and they realized that perhaps there was a different interpretation of the Bible than what the priests were telling them.
And so that created the Protestant movement, and it created the rift. And in fact, the Catholic Church didn't like the idea of the printing press, but they had no choice. And so it had some bad effects and some good effects. I don't think anyone today would say that the invention of the printing press had an overall negative effect, despite the fact that it created two hundred years of religious conflicts in Europe. Now, compare this, and I thought I was very proud of myself for coming up with this analogy, but I realized someone else came up with the same idea before me. Compare this with what happened in the Ottoman Empire. The Ottoman Empire banned the printing press for two hundred years. And it didn't ban it for all languages, only for Arabic. You could actually print books in Latin or Hebrew or whatever in the Ottoman Empire, just not in Arabic. And I thought it was because the rulers just wanted to preserve control of the population and the dogma, the religious dogma and everything. But after talking with the UAE
Minister of AI, Omar Al Olama, he told me, no, there was another reason. And the other reason was that it was to preserve the corporation of calligraphers, right? It's like an art form, you know, writing those beautiful Arabic poems or religious texts in calligraphy. And it was a very powerful corporation of scribes that basically ran a big chunk of the empire, and you couldn't put them out of business. So they banned the printing press, in part to protect that business. Now, what's the analogy for AI today? Like, who are we protecting by banning AI? Who are the people asking that AI be regulated, to protect their jobs? And of course, it's a real question, what is going to be the effect of a technological transformation like AI on the job market and the labor market. And there are economists who are much more expert at this than I am.
But when I talk to them, they tell us, you know, we're not going to run out of jobs. This is not going to cause mass unemployment. This is just going to be a gradual shift to different professions. The professions that will be hot ten or fifteen years from now, we have no idea today what they're going to be. The same way, if we go back twenty years in the past, who could have thought twenty years ago that the hottest job, even five or ten years ago, would be mobile app developer? Like, smartphones weren't even invented.
Most of the jobs of the future might be in the metaverse.
Well, it could be, yeah.
But the point is, you can't possibly predict. But you're right, I think you made a lot of strong points. And I believe that people are fundamentally good, and so AI, especially open source AI, can make them smarter, and it just empowers the goodness in humans.
in humans so I I share that feeling. okay? I think people are found good uh, and in fact, a lot of rumors are rumors because they don't think that people are fundamentally good uh, and they either don't trust people or they don't trust the institution to do the right thing so that people .
behave properly. Well, I think both you and I believe in humanity, and I think I speak for a lot of people and saying, thank you for pushing the open source movement, pushing to making both research in A I open source making available to people, and also the models themselves making that open source. Thank you for that.
And thank you for speaking your mind in such colorful and beautiful ways on the internet. I hope you never stop. You're one of the most fun people I know and get to be a fan of. So thank you for speaking to me once again, and thank you for being you.
Thank you.
Thanks.

Thanks for listening to this conversation with Yann LeCun. To support this podcast, please check out our sponsors in the description.
And now, let me leave you with some words from Arthur C. Clarke: "The only way of discovering the limits of the possible is to venture a little way past them into the impossible." Thank you for listening, and hope to see you next time.