
The Elegant Math Behind Machine Learning - Anil Ananthaswamy

2024/11/4

Machine Learning Street Talk (MLST)

People
Anil Ananthaswamy
Topics
Anil Ananthaswamy: The book explores the mathematical foundations behind modern artificial intelligence, covering the essentials of calculus, linear algebra, probability and statistics, and optimization theory. The author argues that understanding these mathematical principles is essential for using AI safely and effectively; only by understanding the math can we point out the limitations of machine learning, for example that machine learning systems are currently doing sophisticated pattern matching rather than genuine reasoning. The book also traces the history of machine learning, from the early perceptron algorithm to modern deep learning, introducing important algorithms such as k-nearest neighbors, support vector machines, and deep neural networks. The author pays particular attention to key concepts in deep learning, such as the bias-variance tradeoff, over-parameterization, and emergent behavior, and discusses the breakthrough significance of self-supervised learning. In addition, the author discusses the similarities and differences between deep learning models and human cognition, and the role of inductive priors in machine learning models. The book also covers the history and role of the backpropagation algorithm, as well as challenges such as the curse of dimensionality. In short, the book aims to help readers understand the mathematical foundations of machine learning and gain a deeper perspective on the future of AI.

Anil Ananthaswamy: The book also examines the potential risks of AI, such as employment disruption and the amplification of social biases, and stresses the importance of understanding AI's mathematical foundations for mitigating these risks. The author believes the next AI revolution will be driven by self-supervised learning, because it does not require manually labeled data and can therefore be scaled up more easily. In discussing the relationship between human cognition and AI, the author notes that although large language models exhibit abilities that resemble human reasoning in some respects, this is the result of sophisticated pattern matching, not genuine reasoning. The author also explores concepts such as agency and self-awareness, and the effects of neuropsychological conditions such as Alzheimer's disease on the sense of self. Finally, the author looks ahead to the future of deep learning, pointing out that current scaling laws for deep neural networks are empirical, and it is unclear whether they will continue to hold as systems grow. The author suggests deep learning may have computational limitations, for instance in compositional ability, but the existence of biological neural networks demonstrates that complex intelligent systems are possible, which offers inspiration for the future of deep learning.


Chapters
Anil Ananthaswamy discusses his inspiration to write about the mathematics behind machine learning, driven by his software engineering background and a desire to understand the technology from the ground up.
  • Ananthaswamy's software engineering background sparked his interest in machine learning.
  • He undertook a fellowship at MIT to teach himself coding and machine learning.
  • The beauty and elegance of the mathematical proofs in machine learning inspired him to communicate these ideas to a broader audience.

Shownotes Transcript


If you think about us humans, nobody has sat around labeling the data for us. Our brains, over evolutionary time, have learned about patterns that exist in the natural world. So given that that's how nature has done it, there's no reason to expect that the machines we build aren't also going to be powerful just because of that technique.

I honestly and sincerely believe that we can't leave the building of these AI systems to just the practitioners. We need more people in our society, whether they are science communicators, journalists, policymakers, or just really interested users of the technology, who have some math background, or who are willing to persist and learn enough of the math to make sense of why machines learn. It's only when we understand the math that we can point out that, hang on, these things are not reasoning in the way we think we are reasoning. It's because the math shows that what's happening right now is that these machines are just doing very sophisticated pattern matching.

Welcome back to MLST. We are interviewing the author of this book, Why Machines Learn, by Anil Ananthaswamy.

Anil was flying through the UK on July the seventeenth on his way to India.

He had to stop over for about twelve hours, and I invited him to come in for the MLST interview. Unfortunately, there was a schedule clash. I think I thought he was going to be here the day before, maybe the day after, and I had to get my good friend Marcus to pick him up from the airport, take him over to the studio, and ask the questions on my behalf. So I'm going to re-record the questions.

You know, unfortunate stuff like that happens, but I'm very, very pleased that I managed to get the main man in the studio even if I wasn't there. So yeah, Why Machines Learn — what is this all about? It's a really interesting, kind of pedagogical history of the field, but going into some of the underlying mathematics behind many of the approaches in machine learning. Anil is a veteran science writer.

You should look at some of the other books he's written. He's really, really good. The book is beautifully written.

I enjoyed reading it. By the way, he signed it as well, which is pretty cool. I hope you enjoy the conversation with Anil. Can you introduce yourself?

My name is Anil Ananthaswamy. I am a freelance journalist. I trained as a computer and electronics engineer. I did my bachelor's in India and my master's at the University of Washington in Seattle, and worked as a software engineer for a few years before I started feeling the urge to become a writer.

And at some point I figured out that the two things I love, science and writing, could be combined, and that I could actually become a science journalist or a science writer. So I went back to school, studied science journalism, and came to London to do an internship with New Scientist magazine. I was with them for six months doing the internship.

And that eventually led to a staff position. I was a staff writer in London, became physics news editor, then deputy news editor, and wrote for New Scientist for a long time. And while I was doing that, I started working on my books.

The first one was called The Edge of Physics. It's a travelogue-based book on cosmology and astroparticle physics, and each chapter is essentially a piece of travel writing where I go to some really extreme locations on Earth — the Atacama Desert in Chile, Lake Baikal in Siberia in peak winter, places like Antarctica all the way to the South Pole. So that book explores essentially extreme physics. The second book is called The Man Who Wasn't There, and that's an exploration of the human sense of self.

So when you ask the question "who am I?" you usually get answers from theology and philosophy, and in this book I tried to answer that question from the perspective of neuroscience and neuropsychology. The third book was Through Two Doors at Once, which is essentially the story of one single experiment called the double-slit experiment — an extremely mysterious experiment that is hard to explain with our standard way of understanding the world, and yet is very illustrative of what's happening at the quantum mechanical level. It's really a story about quantum mechanics and quantum foundations, but told through the lens of one experiment and all the variations of that experiment done over two hundred years. And finally, my latest book, the book on machine learning, is called Why Machines Learn, and it's about the mathematics that underpins modern artificial intelligence.

What inspired you to write about the elegant mathematics of machine learning? And can you give an example that you find particularly exquisite?

Writing about particle physics or cosmology or neuroscience, I never felt like that was something I could do personally, you know; it was more about understanding the science and writing about it.

But over the last few years I found myself writing more and more about machine learning, and given my software background, given that I used to be a software engineer, every time I would write stories about machine learning, the software engineer part of me woke up. I would look at those stories and get this desire to actually get back into doing a little bit of coding, to understand the technology from the ground up. So about five years ago, I did a fellowship at MIT.

It was called the Knight Science Journalism fellowship, and as part of it I decided to teach myself coding all over again. So twenty years after I had stopped doing any programming, I literally went back and sat in computer science 101-type classes with teenagers, taught myself programming again, and started building some very rudimentary machine learning systems — well, one or two small things that I learned how to do. And as part of the exploration of trying to build a deep learning system, a deep neural network based system, I got more and more interested in understanding the mathematical underpinnings, the basic theory behind machine learning. Towards the end of my fellowship COVID happened, when we were all stuck in our apartments, and I spent a good six, seven months basically stuck in an apartment by myself, both in Boston and in Berkeley, California, listening to all these machine learning lectures over and over again, essentially teaching myself. And at some point I started realizing that the mathematics that underlies machine learning is quite beautiful, and I think then the writer in me woke up saying, oh, I really need to communicate these ideas to my readers. So that's how the idea for this book came about — Why Machines Learn, which is essentially about some of the conceptual mathematical principles that underlie modern artificial intelligence. Yeah, regarding what is elegant about the mathematics of machine learning.

A lot of people say machine learning is mainly about knowing calculus and linear algebra and probability and statistics — what's particularly elegant about that? And I'm not talking about those subfields of mathematics. For me, the beauty and elegance that I found when I was learning machine learning had to do with some of the theorems and proofs that I encountered. For instance, if you go back to 1959, when the first artificial neural networks were being designed, there is a proof called the perceptron convergence theorem, and it is a very, very simple proof just based on linear algebra. It was while listening to a professor explaining it to his students at Cornell that I think I fell in love with the subject. I really felt like, okay, this is something I need to tell readers — that there is something wonderful in this whole subject. So the perceptron convergence proof is an example of what's really lovely and elegant about the mathematics of machine learning, with the caveat that things like elegance are always subjective; what I might find beautiful and elegant may not be somebody else's cup of tea. But there you go. There's also, for instance, a technique called kernel methods, which is this very, very interesting idea where you take data that exists in low dimensions and project it into high dimensions, into a much, much higher dimensional space, possibly even an infinite dimensional space. And the entire method of these kernel methods — what they do is they rely on the mathematics that needs to happen in the high dimensional space, but the computations that are done are always in the low dimensional space.

So there is a function, a kernel function, that kind of projects this data into the high dimensional space. All of your algorithm is functioning in the high dimensional space, but the actual computation happens in the lower dimensional space. And that whole process — taking low dimensional data, pushing it into high dimensions, doing what you want in those high dimensional spaces, but actually not really doing any computation in the high dimensional space —

it's really lovely when you look at it. It is quite beautiful and very powerful. So there were a lot of ideas like this that I found as I was doing my research, which almost made it very easy to come up with a lot of things to write about.
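To make the kernel idea concrete, here is a minimal Python sketch (the language and the specific degree-2 polynomial kernel are my own choices for illustration, not taken from the book): the kernel, evaluated entirely in the original two-dimensional space, gives exactly the dot product that an explicit six-dimensional feature map would have produced.

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for a 2-D vector (x1, x2):
    # (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def poly_kernel(u, v):
    # Kernel computed entirely in the original 2-D space
    return (np.dot(u, v) + 1.0) ** 2

u = np.array([0.5, -1.2])
v = np.array([2.0, 0.3])

# The two numbers agree: the kernel gives the high-dimensional dot product
# without ever constructing the high-dimensional vectors.
print(poly_kernel(u, v))          # computed in 2 dimensions
print(np.dot(phi(u), phi(v)))     # computed in 6 dimensions
```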

What basic mathematical disciplines do you find essential for machine learning?

So for me, when I wrote this book, I was thinking of people who have maybe a high school level or first-year undergraduate level mathematical education and now want to learn something about the basics of machine learning. We're not talking of people who are going to become practitioners, but basically people who need to understand machine learning at more depth than is possible if you were just to read magazine articles. For that kind of audience, I think the disciplines that you really need to come to grips with are basic calculus, some linear algebra, some elements of the basics of probability and statistics, and a little bit of optimization theory. It's not a whole lot. But when these pieces all come together, you can get a very good sense of why machines learn, why they do the things they do.

Many of the recent AI advances seem quite empirical. How important do you think the mathematical foundations are for grasping machine learning?

I think it's true that modern AI, or modern machine learning, which is essentially based on deep learning and deep neural networks — there is a lot of empirical stuff happening. People are just building things and finding out that they work this way or that way without really understanding why these algorithms work the way they do. And in order to really understand why these systems are powerful, or what the limitations are, I think the answers to those questions will actually come from figuring out the mathematical foundations of these algorithms. Right now, the way the field is, I think there's a lot more empirical evidence about the workings of these machines, and we are still struggling to figure out the exact mathematical formulation that can explain why these things work as well as they do, or for that matter, what their limits are. Because until we know all the pros and cons of these machines from the perspective of the mathematics, it's going to be hard to put upper and lower bounds on what these machines can or cannot do.

How does your book showcase the rich history of the field, you know, of machine learning beyond just deep learning?

I mean, if you ask anybody today, people on the street, what AI is, they will probably say, oh, it's ChatGPT. And yes, these large language models have made a big splash. They use a form of technology called deep neural networks, and deep learning. But that's not the whole story — the history of machine learning goes back a long way, and there is a lot of other stuff that has happened that is not about deep learning.

I mentioned earlier that the early history of artificial neural networks begins sometime in the late 1950s, early 1960s. Those were what were called single layer neural networks — essentially one layer of artificial neurons — and the algorithms that were designed were enough to train those single layer neural networks to do some tasks. But it became clear very soon that if you had a layer sandwiched between the input and the output — this sandwiched layer is called a hidden layer — if you had one or more hidden layers in your network, you could not use the algorithms that you had to train them. And the single layer neural networks, even though you could train them, couldn't really do a whole lot.

So by the end of the 1960s people had kind of given up on neural networks, thinking that these things were not going to be very useful. But machine learning research didn't stop. There was a whole range of other things happening. There were non-neural-network based ideas.

So for instance, also in the 1960s, a very powerful algorithm was analyzed mathematically: the k-nearest neighbor algorithm, which was really popular. There were techniques that had to do with using Bayes' theorem and other statistical ideas to develop algorithms that were really powerful. Probably my favorite non-neural-network based machine learning algorithm is the support vector machine. Support vector machines came about in the early nineties and kind of dominated the era before neural networks came back, for a long time. These algorithms try to find an optimal solution to some classification problem, and they also incorporate, as part of the algorithm, the kernel methods I just talked about — this idea of taking lower dimensional data and projecting it to higher dimensions, finding optimal margins in the higher dimensions, but doing your computations only in the lower dimensions. The combination of optimal margin classifiers and kernel methods made these support vector machines really powerful. So there's a whole range of stuff one can talk about that happened between, sort of, the late 1950s and early 1960s, when the first neural networks came about, and the last decade or so, when deep neural networks have come back in full force. And the book does deal with the intervening history as well, because I think the mathematical concepts that underlie those other algorithms are really crucial to understanding what is happening inside these machines in terms of how they represent data, how they see the world, and what they do in terms of manipulating the data.
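As a quick, hedged illustration of an optimal-margin classifier combined with a kernel, here is a sketch using scikit-learn (my own choice of library and dataset, not something the speaker names): an RBF-kernel support vector machine fit to a toy, non-linearly-separable two-class problem.

```python
# Requires scikit-learn: pip install scikit-learn
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy two-class problem that a straight line cannot separate
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel: the margin lives in an (implicit) high-dimensional space,
# but every computation uses kernel values in the original 2-D space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("number of support vectors:", clf.support_vectors_.shape[0])
```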

Which criteria did you use to select the algorithms and concepts that you spoke about in your book?

I had two hats on when I was trying to think of what kinds of things to put in the book. The first, probably the most important, criterion was that the algorithms were useful for demonstrating some very key mathematical idea. For instance, the k-nearest neighbor algorithm is very, very important for understanding how data is turned into vectors, how these vectors are mapped onto some high dimensional space, and how the relationship between vectors is what determines how the algorithm does its job — so I use the k-nearest neighbor algorithm to give the reader an in-depth understanding of how data gets converted into vectors and then gets embedded in these high dimensional spaces. A lot of the time I was focused on making sure that every algorithm I selected was highlighting some key aspect of something mathematical that was crucial for developing an overall picture of what the machines are doing. Again, this is subjective.

Some other person, some other writer, could have chosen a slightly different set, and you could still make the case that that other set could also be illustrative of the mathematical concepts. So after figuring out that I needed to address a particular set of mathematical concepts, I also had my writer's hat on, and the writer's hat is basically making me choose algorithms which have some sort of story behind them, to make the story engaging for the reader. So it was not enough that there was very good math underlying these algorithms, but that the development of the algorithms themselves had a story to tell.

You know, so that I could tell a story about them. And I honestly, very strongly believe that we understand things better when whatever we are understanding is anchored in stories. So it was a dual task of finding algorithms that had key mathematical elements to them, but also had substantial stories underpinning them.

What are some of the basic mathematical disciplines that need to be grasped in order to get under the hood of machine learning?

I would say calculus, absolutely — basic calculus, nothing very fancy. Linear algebra, again depending on whether you're going to be someone who is going to build these systems, or someone who's just going to be using this math to understand what's happening and not necessarily doing research or going ahead and building them.

If you are using the math to just get a sense for why these machines are doing what they're doing, then you don't really need a whole lot of it. You need to understand the concepts of vectors and matrices, and how you do manipulations of vectors and matrices. It's not very complicated stuff.

You also need some of the basics of probability and statistics — you need to understand Bayes' theorem, for instance — and again, these are not terribly difficult. A little bit of optimization theory; that sounds like a fancy phrase, optimization theory, but there are some very basic techniques that we need to understand to figure out how these machines are essentially learning. They are using certain techniques for optimizing over their parameter space. So yeah, it's not a lot of complicated math, at least for people who want to understand or peek under the hood, so to say, as you put it. Of course, if you want to build these systems and if you want to start doing research, then your mathematical chops have to get much more sophisticated.

Can you explain the bias-variance tradeoff in machine learning?

Yeah, the bias-variance tradeoff is a very classic tradeoff, and the basic idea is that when you are training a machine learning model to learn patterns that exist in the data it's shown, the model can be too simple. Let's say we are characterizing the simplicity or the complexity of the model in terms of the number of tunable parameters it has — the different knobs that you can turn to determine what the model does.

So if the model has too few parameters, then when it's being fed data and asked to figure out the patterns or correlations that exist in the data, it's going to underfit the data — it won't do a good job of figuring out what the patterns are — and such simple models that are underfitting the data are said to have high bias. But then you can start making the model more complex — again, with complexity here just proxied by the number of parameters the model has — and as you keep increasing the number of parameters, there comes a point where the model starts overfitting the data. If the data has a lot of noise in it, for instance, it's actually going to fit all the noise.

It's as if a simple model might have drawn a straight line through the data that you have, but a very complex model is going to draw a very squiggly curve touching every data point that you have, some of which could be just noise — so you essentially end up overfitting the data. When you have a complex model that overfits the data, you are in the high variance regime, right? So if you're now testing how the model is doing on the training data — how much error does it make when you give it the training data and ask it to fit the training data — when you're on the high bias side, the training error is pretty high.

It's making a fair amount of error even on the training data. But as the complexity of the model keeps increasing and you're moving towards higher variance, the model starts fitting the data really well until it overfits it. So on the high variance side, you basically now have zero error on the training data. But what's interesting here is that there is a certain amount of data that you hold out from the machine.

You don't show the machine a certain amount of data — let's call it the test data. And when you test the machine that is being trained on this held-out test data, then in the beginning, on the high bias side, it will still make a lot of error on the test data.

Then as the model gets more and more complex, the error that you're making on the test data starts falling. But at some point, when the model is starting to overfit the training data, the error that you are making on the test data starts to rise again. So it's almost like there's one curve that is just going asymptotically down to zero, which is the training error.

But there's another curve which is kind of bowl shaped: it comes down to a minimum and then starts rising again. And that's essentially the bias-variance curve. You want your models to be in the Goldilocks zone, where you are making a low enough error on the training data, but your error on the test data is also at its minimum. That's the tradeoff: you don't want to overfit the data, and you don't want a model that is too simple.
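A minimal numerical sketch of that tradeoff (a toy example of my own, not one from the book): fit polynomials of increasing degree to noisy data and, typically, training error falls monotonically while test error traces the bowl shape he describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a smooth underlying function
def f(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 30)
y_train = f(x_train) + 0.3 * rng.normal(size=x_train.size)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + 0.3 * rng.normal(size=x_test.size)

for degree in [1, 3, 5, 9, 15]:
    # Model "complexity" proxied by polynomial degree (number of parameters)
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```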

What is the role of over-parameterization in deep learning models? And can you explain the last chapter in your book, which was "Terra Incognita"?

So this bias-variance curve that I just talked about — as you're making the model more and more complex, it's getting more and more parameters, in the sense that the number of parameters in the model is increasing.

And as it happens in deep neural networks, what has been noticed is that the number of parameters that the model has far outstrips the number of instances of training data. Standard machine learning theory — which is what the bias-variance curve we just talked about is based on — says that as you over-parameterize, as your number of model parameters becomes much, much larger than the number of instances of training data, you should essentially overfit the training data. You should be in that regime where you're overfitting, and so the loss that you make on your test data should keep rising. And it turns out that that's sort of not what happens in deep learning.

We don't have a good theory for why that's the case. Deep learning systems, deep neural networks, seem to be flouting some of the accepted norms of standard machine learning theory. So even though they are heavily over-parameterized,

they do well on the held-out test data. This is called an ability to generalize — the generalization error that they have is actually low. So they are showing a capacity to generalize despite being over-parameterized. And the honest answer is we don't know why that's the case.

That's why, in my book, I call this aspect of deep learning systems "terra incognita". It's not a term I came up with; it was something that one of the researchers I was talking to said. He basically talked of — I just mentioned the bias-variance curve — standard machine learning systems kind of living in the region of that standard bias-variance curve. With deep learning systems, as it happens, your training error keeps falling and goes to zero, and your test error reaches its maximum at the point where the training error reaches zero. At that point the machine learning system is said to have interpolated the training data. But then what they noticed is that if you keep going, the test error starts falling again. And there is a portion of the curve now which is kind of unknown territory. We don't really know why the machine learning system — or in this particular case the deep learning system, the deep neural network — behaves in that manner, and that part of the bias-variance curve, which is also called double descent, is terra incognita basically because we don't know why it's doing that.
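Double descent can be glimpsed even in very small experiments. Here is a hedged numpy sketch (a toy random-features setup of my own, not from the book): with a minimum-norm least-squares fit, test error often spikes near the interpolation threshold (number of features roughly equal to number of training points) and then tends to fall again as the model is made wildly over-parameterized.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, W):
    # Random ReLU features: phi(x) = max(0, W^T x)
    return np.maximum(0.0, X @ W)

n_train, n_test, d = 40, 200, 5
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true

for n_feat in [5, 10, 20, 40, 80, 160, 640, 2560]:
    W = rng.normal(size=(d, n_feat)) / np.sqrt(d)
    Phi_tr = random_features(X_train, W)
    Phi_te = random_features(X_test, W)
    # Minimum-norm least-squares fit (interpolates once n_feat >= n_train)
    coef = np.linalg.pinv(Phi_tr) @ y_train
    train_err = np.mean((Phi_tr @ coef - y_train) ** 2)
    test_err = np.mean((Phi_te @ coef - y_test) ** 2)
    print(f"{n_feat:5d} features  train MSE {train_err:7.3f}  test MSE {test_err:7.3f}")
```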

How does your book address the apparent contradiction between the statistical principles underlying traditional machine learning and this crazy world that we now live in with these over-parameterized deep learning models?

I don't think we have a mathematical understanding of the apparent success of deep neural networks, even though they are heavily over-parameterized, right? The empirical data certainly requires more mathematical theory to explain why that's happening. We don't know the answer to that.

So I don't think my book reconciles the two. It basically points out that there is standard machine learning theory, which tells you that this is how machines that learn should work, but you also know, just from the empirical results that we have about deep neural networks, that they're not behaving the same way. So the last chapter of my book essentially serves this up as a mystery — not a profound mystery; I think people have some clues as to what's happening, but really the formal mathematics is still lacking about why that's the case. So I wouldn't say the book reconciles them; it just hopefully does a good job of explaining what the situation is and telling the reader that we have literally entered unknown territory with these deep neural networks.

What are your thoughts on self-supervised learning? So for example ChatGPT, where we just train a model on the data itself, using the data as the label.

I think self-supervised learning was a really big breakthrough in machine learning, because until then we used the other type of learning, which is supervised learning, where humans had to annotate the data and tell the machine what the data meant. Supervised learning is limited by the fact that we need human input to annotate all the data, and that is very, very expensive. So your ability to have extremely large datasets that the machine can analyze is severely restricted because of cost. And also, when humans annotate data and give labels to the data, or categorize the data, the kinds of things machines learn by looking at the data and then trying to match the patterns that exist in the data to the human-supplied labels is a very restrictive kind of learning.

It's learning something very particular. For instance, if you had a bunch of images of cows and a bunch of images of dogs that humans had labeled as cows or dogs, and the machine learning system was trying to figure out, you know, this is an image of a cow and this is an image of a dog, it might just pick up the fact that most of the cows are always in fields. So it might completely ignore the fact that there's a cow there; as long as it sees some grass it says, oh, that's the image of a cow — and dogs maybe are mostly indoors, or whatever. And so the kinds of things it might pick up in order to match the patterns that exist in the data to human-supplied labels might be very counterproductive.

It might be doing exactly the wrong thing, or it might be doing things that are not particularly useful. So self-supervised learning was a very interesting breakthrough, because essentially the entire technique relies on this idea that you can take a piece of data — humans don't have to label it as anything.

Humans are not involved in the mix. All you do is take, let's say, an image, and you mask a portion of it — say fifty percent of the image. You feed the masked image to the machine learning system and ask it to predict the entire image, the unmasked image. You implicitly know what that unmasked image should be, because you had it on the input side. But when you're asking the machine to complete the entire image by filling in the masked portion, in the beginning it's going to make errors, right? It's going to come up with some nonsense. But you know what the right solution is, because you always had the actual input in the first place. So you can tell the machine, oh, you've made an error, and this is how much error you made — go and tune your parameters so that you are a little bit closer in your prediction the next time around. You do this iteratively, over and over again, until the machine figures out how to take some masked image and generate the full image. In doing so, it learns features about the image that maybe wouldn't have been possible with supervised learning, because here there's no label it is trying to match. It's actually trying to understand the structure, the statistical structure, of the image itself. And something similar happens with language.

That's the kind of thing ChatGPT is doing, right? You take a sentence and you mask the last word of the sentence and ask it to predict the last word. It's going to make an error in the beginning, but you know what the last word is because you had that sentence in the first place, so you take the amount of error it makes and tune the parameters of the model in such a way that if you give it the same sentence again and ask it to predict the same missing word, it will make an error again, but it will get slightly better. You do this over and over again for that sentence until it gets it right. Now imagine doing this for every sentence on the internet, and before you know it, it has learned the statistical structure of human-written language. And then after that, no matter what sentence you give it with a masked word, it knows how to predict the next word, right? So the amazing part about self-supervised learning is that it can be easily automated — there's almost no human intervention here — and the machine is really learning some very sophisticated statistical structures that are inherent in the data.
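A toy illustration of the self-supervised recipe he describes (a deliberately tiny sketch of my own, nothing like a real LLM): the "labels" are just the next words of the raw text itself, and the "model" is a simple count-based next-word table.

```python
from collections import Counter, defaultdict

text = ("the cat sat on the mat . the dog sat on the rug . "
        "the cat chased the dog .")
words = text.split()

# Self-supervision: each word serves as the "label" for the word before it.
# No human annotation is involved; the targets come from the data itself.
next_word_counts = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    next_word_counts[current][following] += 1

def predict_next(word):
    # Predict the most frequent next word seen during "training"
    counts = next_word_counts.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))   # e.g. 'cat' or 'dog', whichever appeared more often
print(predict_next("sat"))   # 'on'
```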

Do you think the future is supervised or unsupervised?

So these are not my words; these are words that come from Alexei Efros at UC Berkeley, and he has very authoritatively said that the revolution will not be supervised — basically implying, well, not even implying, explicitly saying, that the AI revolution will be unsupervised. Again, one obvious reason is that supervised learning requires human intervention, in the sense that humans have to label the data, they have to annotate the data.

And that's just not going to be possible at scale. You can do it for small datasets, even reasonably large datasets, but really to keep scaling up is going to be impossible. But also, the kinds of things that a self-supervised system learns are very different from a supervised system. So there's a richness to the learning that's happening in self-supervised systems.

But for me, probably the biggest philosophical reason to think that the revolution is going to be self-supervised is that, if you think about us humans, nobody has sat around labeling the data for us. Our brains, over evolutionary time, have learned about patterns that exist in the natural world and have figured out how to help the body do its thing: move towards food, away from predators, towards prey, find a mate, find food — all these things have happened in a non-supervised manner. And yes, of course, over the course of the developmental stages of a child, parents do supervise their kids and we do some form of supervised learning, but that's a very small part of how humans learn. Much of what we have learned over evolutionary time, and much of what we learn even as we grow, is self-supervised or unsupervised. So given that that's how nature has done it, there is no reason to expect that the machines we build are not also going to be powerful just because of that technique.

Why does stochastic gradient descent work so well given the complexity of the optimization problem?

Well, again, this is one of those things where we have empirical evidence that stochastic gradient descent works; exactly why it works so effectively in optimizing deep neural networks is still an open question. There has been some work suggesting that the reason stochastic gradient descent works is because it acts as an implicit regularizer — I can never say that word properly, regularizer. The reason it might be working is that, automatically, as part of the optimization process, it's effectively pruning the number of parameters, making the model simpler so that it doesn't overfit, and hence it finds the necessary optimum.

But there has also been work showing that deep neural networks will still find an optimal or near-optimal solution even without stochastic gradient descent. So it doesn't seem like there is something particular about this regularizing effect of stochastic gradient descent that is responsible for its efficacy. So again, the honest answer here is that it's an open question. We know it works; we know it works amazingly well even when it shouldn't. It seems like such an ad hoc thing to be doing, and yet it works beautifully. It is of course very efficient — much faster than using pure gradient descent — but the exact reasons behind its efficacy are still not clear.
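For concreteness, a minimal sketch of minibatch stochastic gradient descent on a linear model (my own toy setup, not from the book): each step uses the gradient of the loss on a small random batch rather than on the full dataset, which is what makes it both cheap and noisy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + noise
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)            # parameters to learn
lr, batch_size = 0.05, 32

for step in range(2000):
    idx = rng.choice(n, size=batch_size, replace=False)     # random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)           # gradient on the batch only
    w -= lr * grad                                           # noisy descent step

print("distance to true weights:", np.linalg.norm(w - w_true))
```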

Can you explain the curse of dimensionality?

So when you think of something like the k-nearest neighbor algorithm, what that algorithm does is it turns data into vectors and plots them in some high dimensional space. Let's say we have ten-by-ten images — a thousand ten-by-ten images of cats and a thousand ten-by-ten images of dogs. A ten-by-ten image is a hundred pixels, and you can imagine each pixel as grayscale.

Then each pixel has a value between zero and two fifty-five. So each image can be turned into a vector that is a hundred numbers long, and that vector can be plotted in a hundred-dimensional space — one pixel along one axis. And what will happen, more or less, is that all the vectors representing cats will end up in one region of that high dimensional space and all the vectors representing dogs will end up in a different part of the high dimensional space.

And then when you have a new image, and you don't know whether it's a cat or a dog, you turn that image into a vector, plot it in the same high dimensional space, and see: oh, is it closer to dogs or is it closer to cats? If it's closer to dogs, you call this new image a dog; if it's closer to cats, you call it a cat, right? This procedure depends on the central idea that vectors that are alike are near each other in this high dimensional space — that vectors representing similar things are near each other in this high dimensional space.

So the new image — say it depicts a dog — if you plot it in that space, it should be close to other dogs in that space. Now, one funny thing happens when you move to higher and higher dimensions: let's say the image was, I don't know, a million pixels. So now you're operating with a vector which has a million elements, and so you are in a million dimensional space.

And it turns out that the idea that similar things are closer in these high dimensional spaces than things that are not similar — that whole idea falls apart as you start moving into higher and higher dimensions. And that is the curse of dimensionality. The very metric that you use in order to compare vectors starts falling apart, because in these high dimensional spaces everything is just about as far away from everything else. So the notion of similarity — that two things are similar because they're close to each other — doesn't work anymore. That, in a sense, is the curse of dimensionality: as your data becomes higher and higher dimensional, you cannot use some of these algorithms that rely on the notion of similarity via some distance metric between the vectors.
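Here is a small numpy sketch (my own illustration) of the distance-concentration effect he describes: as the dimension grows, the nearest and farthest random points become nearly the same distance away, so "closeness" stops being informative.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000, 10000]:
    points = rng.uniform(size=(500, d))     # 500 random points in the unit cube
    query = rng.uniform(size=d)             # a query point, like the new image
    dists = np.linalg.norm(points - query, axis=1)
    # Relative gap between the farthest and nearest neighbor shrinks as d grows
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:5d}: relative distance contrast = {contrast:.3f}")
```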

Can you explain the concept of emergence in language models? And why do you think it's a little bit of a slippery concept and challenging to explain?

Emergent behavior has probably garnered more attention than it deserves. I mean, the term seems to suggest something mysterious and magical is happening, and it refers to this idea that as large language models like ChatGPT started getting bigger and bigger, they started demonstrating behaviors that weren't observed in smaller models. In a sense, that's all emergence is.

It's basically saying that there's a certain kind of task that you asked a smaller model like GPT-2 to perform and it failed, but then you build a larger model like GPT-3 or 3.5 or GPT-4 — nothing fundamentally changed in the underlying mathematics or the underlying architecture of these large language models, there's nothing different about the way they are trained, everything is the same.

All that has happened is these models have been scaled up — they've become bigger, they've seen more data. But the fundamental mathematics underlying the training, the fundamental architecture that underpins these neural networks, hasn't changed. And yet when these things get bigger, you take the same problem that you gave to GPT-2, which could not solve it,

and you give that problem now to GPT-3.5 or GPT-4, and it solves it. And that behavior is being called emergent behavior.

It's emerging simply because you're making something bigger. It's certainly not magical, you know. Of course, these systems have become bigger.

They've seen more data, so they're able to do much, much more sophisticated pattern matching. They're able to learn much more sophisticated correlations that exist in the data. So it's not surprising that they're going to do things that the smaller models couldn't. But it's not like it's some kind of behavior that cannot be explained.

The term emergence seems to suggest something mysterious, and it's not — depending on how you use the word. You can just define it simply as saying, okay, all it is is behavior that a smaller model couldn't do, and now that behavior is being observed in a larger model, and that describes it correctly. If emergence is simply the fact that certain capabilities arise as you make the model bigger, mainly because it has seen more data,

and because it just has a larger number of parameters and hence is able to process the data in ways the smaller model couldn't — if you just look at it that way, then there's nothing to be skeptical about; it just makes sense that that would be the case. But if you want to use the term to imply something that is absolutely not understood — I mean, yes, there are aspects of why this happens that are still being worked out mathematically,

but if you throw a sheen of mystery around it, then I think I would be skeptical. It's not a sudden appearance of some ability in a large language model; it is a very gradual ability that emerges.

Also, one of the things to notice is that we build GPT-2, which has a certain number of parameters, and then we build GPT-3, which has an order of magnitude more. And when we test GPT-3, we see some behavior which wasn't present in GPT-2, and we think that that's a startling transition, that something just happened between these two things.

But the fact is that we didn't build the models in between — GPT-3 has, let's say, ten times more parameters than GPT-2. We didn't build models that were twice as big, then twice as big again. We just went from something that had one set of parameters to something that has ten times more. But if you had built the intermediate stages also and checked the behavior, you probably would have seen a gradually increasing ability, not the sudden step change that seems to come about. So in that sense, again, it's not emergence in any magical sense, where it just appears suddenly; it is a very gradual process.

How do deep learning models compare with human cognition?

I think we have to be really careful comparing deep learning models to human cognition or human cognitive abilities. There are models that people have started developing that model, for instance, the human visual system or the human auditory system, even the olfactory system, and they are the best models we have to date of what might be happening in the brain. But they are not exact models.

They're not telling us exactly what's happening in the brain. They recapitulate some of the behaviors that we see in our biological systems, whether it's the human brain or other primate brains, but are they replicating the exact mechanisms that are there in our nervous system, in our brains? Absolutely not. For instance, most of these deep learning models are what are called feed-forward.

You have input coming in on one side and the information just flows from the input to the output. There is no recurrence. So for instance, if you have neurons in the tenth layer, the outputs of those neurons don't feed back to the tenth layer or to layers nine, eight, seven, and earlier.

So the output of the tenth layer has to move forward; it has to go to the eleventh and twelfth. Our brains are not like that. In fact, the recurrent connections probably outnumber the feed-forward connections in the brain.

So there are a lot of feedback loops in the brain, and the current models we have do not have this kind of recurrence. So however close these deep learning models seem to be to what might be happening in our brains, they lack very obvious architectural details, so they can't be telling us exactly what's happening. That said, they are the best we have right now, and they are definitely shedding light on how our brains might be processing information.

How do inductive priors work in machine learning models? So things like symmetry invariance, permutation invariance, and stuff like that.

So inductive priors are essentially information that we can somehow incorporate into the architecture of the deep neural network, based on ideas we have about how certain kinds of information need to be processed. For example, if you take things like convolutional neural networks, they were inspired by what we understand about the human visual system, or the primate visual system. And we know that there is a certain hierarchy involved in the way our visual system processes the information that's coming in.

There's a certain amount of processing that has to do with identifying low-level features of images. For instance, looking at, say, a cup, the visual system is identifying the edges, the curves, the shapes, the texture, before it puts it all together and says, this is a cup. And this is happening in stages. There's also invariance built into the human visual system. So for instance, if there is an edge detector in our visual system, that edge can be anywhere in the visual field and the visual system should still be capable of detecting it, or the edge can be tilted and you should still be able to detect that it's an edge.

So there's rotational invariance, translational invariance. And we've taken these ideas that we learned from observing the animal visual system and incorporated them into the designs of deep neural networks. That's how the first convolutional nets came about.

So these were the inductive priors, so to say. We had prior information about what these networks should be doing that was baked into the architecture of the system. There are other examples of this, where we build into the architecture of the system prior knowledge about what we think we need in order to make sense of the data.
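A small sketch (my own, in numpy) of the translation property that a convolutional layer bakes in: the same edge filter is slid over every position, so if the input pattern shifts, the filter's response shifts with it rather than having to be relearned at each position.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    # Naive 'valid' cross-correlation: one shared filter slid over every position
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Vertical-edge detector: a hand-crafted "prior" about what matters in images
edge_filter = np.array([[1.0, -1.0],
                        [1.0, -1.0]])

image = np.zeros((8, 8))
image[:, 3] = 1.0                      # a vertical line at column 3
shifted = np.roll(image, 2, axis=1)    # same line, shifted right by 2 columns

resp = correlate2d_valid(image, edge_filter)
resp_shifted = correlate2d_valid(shifted, edge_filter)

# The response to the shifted image equals the shifted response (translation
# equivariance falls out of weight sharing; boundary columns are zero here,
# so the circular shift matches exactly).
print(np.allclose(np.roll(resp, 2, axis=1), resp_shifted))
```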

Can you explain the backprop algorithm and its history?

The backpropagation algorithm is probably one of those algorithms that I personally found quite elegant. It is a significant part of my book, and it is also a very significant part of why deep learning and deep neural networks have succeeded so brilliantly. The basic idea behind backpropagation is very straightforward.

Again, if you go back to the late 1950s, early 1960s, we just had single layer neural networks. You provided the neural network an input, it produced an output, and then you figured out whether the network made an error by comparing the output with the expected output — you calculate an error.

And based on that error, you just modify the strengths of the connections of the neurons, the weights of the neurons. Those algorithms worked as long as there was just a single layer. The moment you put another layer between the output and the input, the so-called hidden layer, the algorithms couldn't work anymore.

And the reason was that every time the network made an error, you calculated the loss it made on its prediction, and you then had to figure out — there is this problem of credit assignment — how much of that error that the network has made should be apportioned to each of the weights of the network, right? If it's just a single layer, then it's easy to take that loss and apportion it to the weights of the single layer. But the moment you have a hidden layer, it was very hard to figure out how to back propagate, or move backwards from the output stage back to the input stage, and allocate to each weight its responsibility for the error the network made.
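To show what "apportioning the error backwards" looks like, here is a minimal sketch (my own, in numpy) of the chain rule applied to a tiny two-layer network: the gradient of the loss is pushed from the output layer back through the hidden layer, so every weight gets its share of the blame.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 3 inputs -> 4 hidden units (sigmoid) -> 1 output (linear)
x = rng.normal(size=3)
y_target = 1.0
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass
z1 = W1 @ x + b1
h = sigmoid(z1)
y_pred = W2 @ h + b2
loss = 0.5 * (y_pred - y_target) ** 2

# Backward pass: chain rule, layer by layer, from the output back to the input
dloss_dy = y_pred - y_target            # dL/dy_pred
dW2 = np.outer(dloss_dy, h)             # credit assigned to the output weights
db2 = dloss_dy
dh = W2.T @ dloss_dy                    # error propagated back to the hidden layer
dz1 = dh * h * (1 - h)                  # through the sigmoid nonlinearity
dW1 = np.outer(dz1, x)                  # credit assigned to the hidden weights
db1 = dz1

# One gradient-descent step using the back-propagated gradients
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss before step:", loss.item())
```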

And this was something that Frank Rosenblatt, who came up with the perceptron algorithm in 1959, was aware of. In his 1961 book, Principles of Neurodynamics, he identified this problem: look, the moment we have a multilayer neural network, you're going to have this problem of having to back propagate errors from the output side all the way back to the input side, so that every weight in your network is adjusted accordingly. He just didn't know how to do it; he had identified the problem. Also in the nineteen sixties, there were aeronautical and electrical engineers who were building control systems for controlling the trajectory of rockets — Henry Kelley and, I forget, Arthur Bryson, I think, so the algorithm is called the Kelley-Bryson method. They had some form of this backpropagation algorithm, even though it wasn't called that, to be able to design systems that could help control the trajectory of rockets as they're going into space.

I think in 1962 Stuart Dreyfus came up with the use of the chain rule in calculus to make the Kelley-Bryson algorithm better. So these elements were slowly falling into place, and then sometime, I think in 1967, there was a Japanese researcher, Amari, who also figured out some aspects of the backpropagation algorithm. Again, none of these were very well fleshed out, but the bits and pieces were falling into place. There's a whole history of this topic on Jürgen Schmidhuber's website that one can go look at, where he also mentions, for instance, Seppo Linnainmaa, who comes up with — I think that would have been 1970 — the code necessary for efficient backpropagation. In 1974 Paul Werbos was doing his PhD at Harvard.

He develops what can be called the closest version of the modern backpropagation algorithm for his PhD thesis, which had more to do with behavioral sciences — it wasn't really addressing neural networks. So all of this stuff was happening.

But the real breakthrough happens in 1986, when Rumelhart, Hinton, and Williams published their paper — just a three or four page paper in Nature — about the backpropagation algorithm. So now, finally, this algorithm was being talked of specifically for training neural networks, and for neural networks with hidden layers.

Not only did they kind of formalize the algorithm, they also pointed out that if you use this algorithm to train multilayer neural networks, they learn certain kinds of things about the data. They identified what they call feature learning, or representation learning. They could identify what kinds of things the neural networks were learning because you use this backpropagation algorithm.

So finally, in 1986, I think people woke up to the fact that, okay, there's this formal thing. And rightfully or wrongfully, a lot of the credit is given to, in this case, Geoff Hinton, because he's currently regarded as one of the main people behind the backpropagation algorithm. But even he would say that, look, if Rumelhart had been alive,

he would be the guy getting all the credit. And not just that — Hinton acknowledges that there is a large history to this algorithm.

They were just the people who kind of put it all together and made it, sort of, palatable to the neural network community. But the ideas predated them by decades.

Do machine learning models reason? And if they do reason, how do you think their reasoning is different to ours?

Not really, if you think of reasoning as what we do as humans. We have this ability to learn something about how to solve a problem in a particular domain.

Not only do we learn how to solve the task, we are capable of abstracting the principles involved in solving the task, and then we are able to transfer those principles, using symbolic language like mathematics or just language, to reason about or solve problems in some other domain entirely. That kind of symbolic thinking is not what machine learning models are doing, right? Machine learning models are essentially very, very sophisticated pattern matching machines. So they can detect patterns in data that humans might even miss.

So they're very good at that. And it's true that there's a large class of problems that can be solved if you are a very good pattern matching machine — if you can identify correlations between inputs and outputs, and sophisticated, distant correlations at that, that might be sufficient for solving a large class of problems, and that's currently what's happening with these machines. So depending on what questions you ask them, if the questions are of the type that only require the machine to reach deep into its understanding of the statistical correlations that exist in the data, and it can solve the problem, it will seem like reasoning when you look at the answer.

But it's not reasoning in the way we think of human reasoning. Nonetheless, depending on where you set the bar as to what constitutes reasoning, you could say machines are reasoning, but only in a very limited sense, right? These machines right now, machine learning systems, are essentially very, very sophisticated correlation machines.

What do you think readers will take home from your exploration of the mathematical foundations of machine learning?

I would hope that readers of Why Machines Learn are going to come away appreciative of what I think is the fairly elegant math that underlies or underpins machine learning — these machines learn because the math makes it possible. So I would like them to gain an appreciation of all these goings-on under the hood, so to speak, the math that makes it possible. The math helps us visualize and conceptualize how machines are, quote unquote, thinking — they're not really thinking, but you know what I mean. So by understanding the math, we really do get a glimpse into how machines might be processing information. The other, I think more important, part for me is that I honestly and sincerely believe that we can't leave the building of these AI systems to just the practitioners, to just the people who are building them today.

We need more people in our society, whether they're science communicators, journalists, policymakers, or just really interested users of the technology, who have some math background, or people who are just willing to persist and learn enough of the math to make sense of why machines learn, in order to be able to appreciate — you know, we are making these machines quite powerful, and the power comes from the algorithms we design and the math that makes the algorithms work. So understanding the math is going to tell us about how powerful these things are going to get, but it's also going to tell us about the limitations, right? It's only when we understand the math that we can point out that, hang on, these things are not reasoning in the way we think we are reasoning. It's because the math clearly shows that what's happening right now is that these machines are just doing very sophisticated pattern matching.

So ChatGPT hallucinates; sometimes it gets the answer right, sometimes it gets the answer wrong. Do you think that affects its reliability and utility in real world situations? And I guess as an extension of that, do these models understand, and what would it mean for them to understand?

Yeah, it's true that the LLMs are always hallucinating. I think the term hallucinating has often, colloquially, been used only when LLMs get things wrong. But if you look at the way LLMs function, everything that they are doing is essentially hallucination, and I think that word really loses its meaning if you realize that that's just how they work. Essentially, given a piece of text, they're producing the next most likely word to follow that text.

They append that word to the original piece of text, and they predict the next most likely word, and then the next most likely word, and so on, until they produce an end-of-text token and the whole thing stops. So each stage is a probabilistic statement about what is the most likely word to follow the text it has already been given. It doesn't matter whether the answer is right or wrong; the process is always the same. It just so happens that when the LLM is big enough, the probabilities that it is internally generating in order to make its best guess about what should come next get better and better, so the answers can start looking like the LLM is reasoning or the LLM is thinking.

But the process, whether it's getting it wrong or whether it's getting it right, is always the same. So given that they are using the same process, so-called hallucination, to come up with answers that are either right or wrong, it's really hard to know when the answers they are producing are correct and when they are wrong. It almost requires a human expert to be able to look over what the LLM is producing in order to ensure that it is producing the correct output.
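
To make the loop he is describing concrete, here is a minimal sketch of autoregressive generation, using a toy bigram model in place of a real LLM. The corpus, function names, and sampling scheme are all illustrative assumptions of mine, not anything from the book or the conversation:

```python
import random
from collections import defaultdict, Counter

# Toy "language model": bigram counts over a tiny corpus stand in for an LLM's
# learned distribution. A real LLM predicts tokens with a neural network.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_distribution(prev_word):
    """Probability of each candidate next word given the previous word."""
    counts = following[prev_word]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def generate(prompt_word, end_token=".", max_len=12):
    """The autoregressive loop: sample a likely next word, append it, repeat
    until an end token appears. Right or wrong, the procedure never changes."""
    text = [prompt_word]
    while text[-1] != end_token and len(text) < max_len:
        dist = next_word_distribution(text[-1])
        words, probs = zip(*dist.items())
        text.append(random.choices(words, weights=probs)[0])
    return " ".join(text)

print(generate("the"))  # e.g. "the cat sat on the mat ."
```

The point of the sketch is only the shape of the loop (predict, append, predict again); scale and a learned neural model are what make the real thing look so capable.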

Now, there will always be certain tasks that an LLM can be asked to do where most of what it does, even if it gets things a little bit wrong, is still pretty amazing. For instance, when you are doing basic coding, these LLMs can be extremely good assistants. They can generate so much code, so fast, that a lot of your basic coding is already done.

And if you have enough expertise, you can look it over very quickly and make sure it's doing what it's supposed to do. So they can be very good assistants as long as the human who is using them has enough expertise to be able to tell right from wrong. But are they actually understanding what they're producing? This is a matter of huge debate.

It really depends on what you define as understanding, where you set the bar for what constitutes a semantic understanding of language. And depending on where you set the bar, LLMs either clear it, at times, or they fail miserably. If you define understanding in a way that only humans will ever be able to answer those questions, LLMs will probably fail them miserably.

But there are certain things that LLMs do that are just as good as what humans can do, because the notion of understanding is set at that level. So this is a question of semantics, and I would say the debate is still playing out.

What is your definition of intelligence? And do you think that deep learning models are intelligent?

Well, intelligence is a really, really difficult term to define. I don't think I even try defining it in my book. Most people who write about it try not to define it. But I think the reason why it's hard to define is because intelligence means different things in different contexts, right? The kind of intelligence that a dog needs to have to function in its environment is very different from the kind of intelligence an elephant might need, or a whale might need, or, for that matter, humans. Each particular type of intelligence is the outcome of having a particular kind of body that has to navigate its environment and function in its cultural context or social context or whatever it might be. And as long as the nervous system and the brain and the body, all taken together, are capable of helping the body function in its environment to peak capacity, you would say that that system is intelligent for that purpose. So it's hard to come up with just an abstract notion of intelligence that applies across the board. So if you think of intelligence like that, are AI systems intelligent? Again, it's a matter of how you're defining the task.

There are certain tasks where, let's say, all you're doing is trying to play chess with the machine, without really knowing how the machine is doing it, and you're defining the ability to win a game of chess as the kind of intelligence that is necessary to play chess. Then yes, machines are intelligent.

They can beat pretty much anyone when it comes to playing chess or so many other games. This is not about what's happening under the hood; it's just about looking at the behavior and saying, is the behavior manifesting the kind of intelligence that is required to achieve the goal? So to me this is a slippery slope. You can define it however you want, and in some cases the machines will be termed intelligent, in other cases absolutely not.

So we have to be very careful about how we use this term. There is certainly no such thing as a completely general intelligence that somehow abstracts away all notions of intelligence and decouples it from the bodies in which we function. It may be possible at some point, but I don't think we're there.

Do large language models have agency? And what does that mean?

Agency, from the perspective of humans, is this feeling we have of being agents of our actions, right? So if I were to pick up a mug of coffee, I have an implicit feeling that I willed that action into existence and that I am the agent of the action.

And there is an internal sensation of being someone who is directing this body's actions in the world and also being the recipient of the experiences, right? So we just have that feeling of being agents. Now, do our AI systems at this point have a sense of agency, or are they agents? We can certainly build robotic systems that model themselves as agents in the world.

That's very different from saying that the robot has a sense of agency, that it feels a way about itself, the way we do about ourselves. I would say that at this point we can certainly build robotic systems that act as agents in the world, but I don't think anyone would really claim that they have an internal sense of agency. Those are two separate things, and we are a long way from having robots that can claim to internally feel that they're agents.

Who was responsible for the deep learning revolution?

We talked about how the backpropagation algorithm came about in the mid-1980s and became a big deal because that's what allowed us to train deep neural networks, neural networks that had more than one hidden layer. But it wasn't enough, even though we now had the mathematics to train deep neural networks.

We couldn't do anything particularly effective with them, because at that time, in the mid-to-late 1980s and even through the 1990s, the amount of data that we had, that we needed to train these neural networks, was very small. We just did not have enough data. And that had to change.

That did change somewhere around 2007, 2008 onwards. One of the first big datasets that came about was the ImageNet dataset, which was, I forget, millions of images, all annotated by humans, lots and lots of different categories of images. So we finally had a very, very large dataset on which to train these neural networks. So we had the backpropagation algorithm in place.

We had a large dataset in place. The one other thing that was missing: the training of a neural network is computationally extremely expensive. It takes an awfully long time to train these things, and around 2010 people started noticing that instead of training these neural networks using CPUs, central processing units, there's a much better way to train them.

And that is to use graphics processing units, which were actually designed for gaming. They were not built and designed for training neural networks, but people realized that they could co-opt GPUs to train these systems much faster. So it was a combination of the backpropagation algorithm, which was fairly old by then, the advent of really large amounts of training data, and the ability to use graphics processing units for training, all coming together. And I think it was 2012 when the first deep neural network, named AlexNet, finally broke through and showed how it could do image recognition better than anything else that existed before.

I know your book is a tiny bit connectionist-leaning, though not entirely. But what do you think about some of the other methods in AI, like, for example, symbolic methods, evolutionary methods, biomimetic methods?

So, my book, I'm not sure I would say it's connectionist; it's a book on machine learning. The history starts with the history of connectionism, with the perceptron algorithm, Rosenblatt, and the Widrow-Hoff least mean squares algorithm, which are both algorithms used for training single-layer neural networks. But then there is a whole intervening history of machine learning which has nothing to do with connectionism: ideas from the naive Bayes classifier, the optimal Bayes classifier, the k-nearest neighbour algorithm, support vector machines.

All of these, or principal component analysis, which is a statistical method that can be used for unsupervised learning, etcetera: all these are very important and are non-connectionist. But yes, it's true that the latter half of the book focuses on the recent developments, and by recent I mean the last two decades, where the focus shifted back to neural networks. Is it Hinton-centric? Hinton is a character in one of the chapters.

I mean, the backpropagation algorithm is really about the Rumelhart, Hinton, Williams paper, so Hinton is front and center in that chapter, and then he reappears in the chapter on convolutional neural networks because of AlexNet, which was his team's breakthrough. Those were kind of unavoidable milestones. I don't think there's anything more about him in the book, because the book is really about machine learning. It really doesn't deal with symbolic AI. By symbolic AI, I'm assuming you're talking about the kind of AI that preceded machine learning, which these days is often called good old-fashioned AI. The problem with symbolic AI was that, while it was very good at what it did, it couldn't learn about patterns that exist in data by simply examining the data.

So it required a lot of human effort to make it work. It was very brittle. But the ideas from symbolic AI are really going to be very important if we are going to get machines to reason. And I do think that the things that are coming now are going to combine the abilities of deep learning systems to learn about patterns that exist in data.

And on the back end, we might have symbolic architectures that allow us to reason about those patterns in ways that we humans seem to be capable of. So I don't think these should be thought of as either/or; the systems are going to be put together in ways that we don't quite know yet how to do fully, but there are already ongoing attempts. The entire field is called neurosymbolic AI, where you're taking the connectionist approach and the symbolic approach and putting them together.

So I'm for that, I think, if it helps achieve systems that can actually do the kind of abstract reasoning that humans can. Biomimetic and evolutionary algorithms, searching over the space of possibilities, which is what evolutionary algorithms do so well, will also be a part of searching for architectures of deep neural networks that work better than others. Biomimicry is already in place. I mean, in convolutional neural networks,

the inductive biases that go into building convolutional neural networks are already inspired by what we think of as our visual system, the human visual system. And even in artificial neural networks, the artificial neuron is very, very loosely inspired by what a biological neuron is. So biomimicry is already an integral part of how things happen.

That's only going to get more and more important. For instance, we need to figure out why our brains are so much more energy efficient than artificial neural networks. Deep neural networks of today consume ridiculous amounts of energy to do something that is still way less than what our brains are capable of, and our brains are doing this with some twenty watts of power.

And part of the reason, one of the reasons, not the entire reason, is that our neurons are not firing all the time. They are what are called spiking neurons. Inputs come into the neuron, the neuron does some computation, and every now and then it will send out what is called a spike. That's a very different kind of functioning from what is actually happening in artificial neural networks today. So if we get inspired by these spiking neurons in biological systems and learn how to build them in hardware, let's say we build spiking neurons in hardware and we figure out how to train them and how to do inference with them in hardware, well, that would be a huge leap in terms of energy efficiency, and that would very much be a biomimetic idea.
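
To illustrate the contrast he is drawing, here is a minimal sketch of a leaky integrate-and-fire unit, the simplest textbook model of a spiking neuron. All the constants and the function name are illustrative assumptions of mine, not anything from the book or the conversation:

```python
import numpy as np

def leaky_integrate_and_fire(input_current, dt=1e-3, tau=0.02,
                             v_rest=0.0, v_threshold=1.0, v_reset=0.0):
    """Minimal leaky integrate-and-fire neuron.

    The membrane potential leaks toward its resting value and is charged by the
    input current. The neuron emits a spike (a 1) only when the potential
    crosses threshold, then resets, so it stays silent most of the time.
    """
    v = v_rest
    spikes = []
    for i_t in input_current:
        # Euler step of tau * dv/dt = -(v - v_rest) + i_t
        v += dt * (-(v - v_rest) + i_t) / tau
        if v >= v_threshold:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
    return np.array(spikes)

# Drive the neuron with a constant current for one simulated second.
current = np.full(1000, 1.2)
spike_train = leaky_integrate_and_fire(current)
print(spike_train.sum(), "spikes in 1s of simulated time")
```

Unlike an artificial neuron in a deep network, which produces a real-valued output on every forward pass, this unit is active only at occasional threshold crossings, which is where much of the hoped-for energy saving in neuromorphic hardware comes from.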

It's a big responsibility to write a history of the field in a book. And of course, many different folks have wildly different histories of the field, like, for example, Jürgen Schmidhuber, although I do appreciate that you did get some input from Jürgen in writing this book. What are your reflections on that?

First off, I agree that we have to be responsible to the history of the field, and we have to do our best to capture it as accurately as possible. Having said that, my intent in this book was first and foremost to capture the mathematical ideas, and those are not that different across different ways of looking at the history.

So once I identified what the math was that I needed to explain, finding the stories to anchor those mathematical ideas was important, and I chose a certain set of people to interview to help underpin the narrative. But I do, for instance, agree that Jürgen Schmidhuber has contributed enormously to the field. It would be impossible to do an exhaustive narrative of all the different things that all the different people in machine learning have done over the past decades.

My book, for instance, is about four hundred and fifty pages, so the way I approached it was to tell the story of certain developments through the lens of a few people, but then try very hard to make sure that the others get a nod too. So, for instance, Schmidhuber is acknowledged in the book as someone who has contributed to LSTMs, these recurrent neural networks; it's just that I don't talk about recurrent neural networks in my book, so I don't delve into that deeply.

But I do mention his contribution. Even with convolutional neural networks, the use of GPUs is often attributed to Hinton and others as having made it the standard way to train these systems; AlexNet was the one that used GPUs and made it very popular, but Schmidhuber's group had done that earlier too. He may not have done it at that scale, but certainly the ideas were there in his paper, and I made sure I acknowledged that. Or take the backpropagation algorithm, and Jürgen Schmidhuber's pointing out that Seppo Linnainmaa had come up with the ideas for coding efficient backpropagation:

I tell the reader that, okay, there are these resources you should go look up. So that was my approach: try and make sure that any time there was an alternate viewpoint that warranted mention, I at least mentioned it. But then, in service of the book, which is about the conceptual aspects of the math, I still had to find a narrative that hewed to one way of telling the story.

What are your thoughts on scaling laws with respect to how we continue to improve AI systems? I mean, do you think that we will hit any theoretical or mathematical limitations as we continue to scale this technology?

So the scaling laws that we have right now about the behavior of deep neural networks, these are empirical scaling laws, in the sense that we have observed the behavior of these systems and we have figured out that their behavior kind of follows a particular set of laws. There is no underlying deep mathematical understanding of why these laws are what they are. Given that, it's really hard to say whether the scaling laws will keep holding as we make these systems bigger and bigger.

If there was a real hardcore mathematical result that said yes, absolutely, then yes, we would expect things to continue. But right now these are empirical results, and it could very well be that we'll find out in a year or two that if we keep making the systems bigger, their performance may not scale the same way as it has so far; things might saturate. Oftentimes, when we have such scaling laws in other systems, we eventually notice saturation: things improve according to some power law to a point, and then at some point they stop, and there are a lot of diminishing returns. So given the lack of exact mathematical results, it's very hard to say, okay, the trend that's going on now is going to continue forever and ever.
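
For readers who want to see what an "empirical" scaling law means in practice, here is a minimal sketch of fitting a power law to loss-versus-parameter-count data. The numbers are made up purely for illustration, and the point is that the fit, and any extrapolation from it, rests on observation rather than on a theorem:

```python
import numpy as np

# Made-up (illustrative) losses for models of increasing parameter count N.
N = np.array([1e7, 1e8, 1e9, 1e10, 1e11])
loss = np.array([4.2, 3.3, 2.6, 2.1, 1.8])

# A power law, loss = a * N^(-b), is a straight line in log-log space,
# so an ordinary least-squares line fit on the logs recovers a and b.
slope, intercept = np.polyfit(np.log(N), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"empirical fit: loss ~ {a:.1f} * N^(-{b:.3f})")

# Extrapolation assumes the trend continues; nothing in the fit itself
# guarantees that it holds at larger N, which is exactly the caveat above.
print("extrapolated loss at N = 1e12:", a * 1e12 ** (-b))
```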

Are there other clear computational limitations to deep learning?

I think it depends, again, on what you want your deep learning system to do. If, for instance, we are asking whether a deep learning system is going to be capable of a certain kind of reasoning, let's say the kind of reasoning that humans can do, which is to take a complex task and break it up into small subtasks and then apply those subtasks in clever ways to achieve the desired result, that is compositionality. And will deep learning systems get there just by using the techniques we have so far for training them, using, say, self-supervised learning? Probably not, because there are already some mathematical results showing that there might be an inherent mathematical backstop to how much compositionality can be achieved by, for instance, these transformer architectures. So there might be mathematical limitations.

And again, without a complete understanding of why these neural networks are doing what they're doing, it's always hard to make unequivocal claims about what they might or might not be able to do. I think we have to remain a little bit open-minded about it. I mean, for me, the thing I keep coming back to in my mind is that nature has evolved biological neural networks: our brains.

And even if we have very, very sophisticated forms of reasoning, all that is an outcome of evolution. No one has sat around wiring our brains up in a certain way. Evolution has discovered it.

Evolution has discovered these solutions. Is the architecture of our biological neural networks the same as that in these artificial ones? Absolutely not. There are so many more complexities in biological systems, and we are nowhere close to approaching that complexity in artificial systems.

But our brains are a proof of principle. It's been done once. It's been done by nature, not by us. It's been done over evolutionary time, but yet it's been done.

So is there any reason in principle to expect that deep neural networks won't get there? Not an in-principle reason. Will it be possible as an engineering matter? Probably not, I don't know. It will require breakthroughs, and we don't know what those breakthroughs are yet.

You recently did a talk, "ChatGPT and its ilk", about the theory of mind experiment with Alice and Bob. What does it tell us about the capabilities of ChatGPT?

I played around with ChatGPT, asking it theory of mind questions, and even though I know that it's simply doing next-word prediction, some of these questions can be posed in very complex ways, and the output it generates seems to suggest it has the ability to model the minds of others. But because you know what it's doing behind the scenes, under the hood, you realize it couldn't possibly be doing anything more than sophisticated pattern matching. But if you just look at the output, there's no denying that if all you had was the output to go by, you would be hard pressed to say that it hasn't got the ability to reason, that it isn't showing glimmers of being able to reason. I think that's the problem.

If you only look at the behavior and you don't know anything about what goes on behind the curtain, or under the hood, I don't know how you're going to say it's not reasoning. But once you peek under the hood, once you know what it is doing, you become much more skeptical. And also, it is very easy to break these systems.

You can ask some very simple reasoning questions, and they fail miserably. So it's very clear that they don't have sophisticated reasoning ability. It's just that sometimes they seem to have it, and it takes us aback.

You spoke about the potential risks of AI, including job disruptions and the entrenchment of societal biases. What steps do you think need to be taken to mitigate these risks? And what are the societal effects of AI?

I think there are some near-term societal effects that we really need to be concerned about. Remember that machine learning systems are essentially learning about patterns that exist in the data that we provide. So the data that we provide can have biases built in. Let's say you're trying to build a system that analyzes resumes or CVs, and traditional hiring patterns in companies have been sexist, racist, and all of the other things that we traditionally have to fight in society. If we teach machine learning systems with data that is inherently biased, they will exemplify those biases.

There is no mystery there, right? And also, there's always an assumption in machine learning that the data that you have trained the system on is drawn from the same underlying distribution as the data you're going to test it on. And if those two distributions are different, let's say your training data was drawn from a certain data distribution, but your test data, the one that you're testing your system on in real life, in the wild, is being drawn from some other distribution, then all bets are off as to what that machine learning system will do.
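
As a small illustration of that train/test distribution assumption, here is a sketch in which a classifier is trained on one synthetic distribution and evaluated both on data from the same distribution and on shifted data. The Gaussians, the shift size, and the scikit-learn usage are my own illustrative choices, not an example from the book:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(mean_shift, n=2000):
    """Two Gaussian classes in 2D; mean_shift moves the whole input distribution."""
    X0 = rng.normal(loc=-1.0 + mean_shift, scale=1.0, size=(n // 2, 2))
    X1 = rng.normal(loc=+1.0 + mean_shift, scale=1.0, size=(n // 2, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * (n // 2) + [1] * (n // 2))
    return X, y

# Training and test data drawn from the same distribution: the usual assumption.
X_train, y_train = sample(mean_shift=0.0)
X_test_iid, y_test_iid = sample(mean_shift=0.0)

# Test data drawn from a different distribution: a covariate shift "in the wild".
X_test_shifted, y_test_shifted = sample(mean_shift=3.0)

clf = LogisticRegression().fit(X_train, y_train)
print("accuracy, same distribution:   ", clf.score(X_test_iid, y_test_iid))
print("accuracy, shifted distribution:", clf.score(X_test_shifted, y_test_shifted))
```

On the shifted data the learned decision boundary no longer separates the classes and accuracy collapses toward chance, which is the "all bets are off" situation being described.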

So there are a lot of assumptions that are baked in, and biases that are in the data might get baked into the machine learning systems. Then there's the problem that it's one thing for humans to make biased decisions; because we have the ability to question ourselves as humans, we hopefully have checks and balances in place, so that if a human being makes a decision that is seemingly sexist or racist or anything else like that, we hopefully have ways in which we can mitigate that.

The problem with machine learning systems, and it's not often obvious to the people using them, is that there is an implicit uncertainty in the way these algorithms are functioning, except that when they produce the output, the output is always seen as being certain, the right answer, as if there is only one answer to be had, and under the hood that's not what's happening. This lack of uncertainty, or rather, putting it differently, the seeming certainty of the answers that machine learning systems provide, can be a problem. For instance, take something like ChatGPT. There were a couple of researchers from UC Berkeley, Celeste Kidd, a psychologist, and her colleague.

They made the point that when humans interact with large language models, and when they are asking large language models questions, it is in the nature of human psychology that we are at our most vulnerable when we are asking questions and are receptive to answers. So if you have a large language model that gives you wrong answers, but does so with extreme confidence, which is the nature of its output, then because the humans who are asking the questions at that point in time are psychologically receptive to the answer, they will very likely get influenced by these confident-seeming answers. And once those answers are incorporated into our psychological makeup, we become less able to change our views.

It's almost like there was a window of opportunity where we were pliable and willing to take in different kinds of answers. And if you have a large language model that's giving an answer, and it's wrong, and we have no way of telling, we will get influenced, because we are receptive at that point in time. So these are all issues that we need to be worried about.

You have compared the number of connections in a neural network to the number of connections in a human brain. Do you think that this comparison is meaningful?

So the number of connections in the largest large language models today is probably about a trillion; I mean, anywhere from half a trillion to a trillion, or maybe even more. Now compare that to the human brain, which, on a very simplistic count of the number of synapses, we estimate at about one hundred trillion.

So a large language model, even the largest one, is about two orders of magnitude less in terms of the number of connections than we think there are in the human brain. That's a big number. But when we talk of the connections in the human brain, we don't take into account a whole bunch of other complexities that exist in the brain. For instance, we don't talk of neurotransmitters or neuromodulators.

We don't talk of the fact that there's a whole bunch of computation happening in the dendrites, which are feeding input to the neurons. We don't fully understand what kinds of computations are happening within a single neuron. So there's probably orders of magnitude more complexity in the human brain than we can infer from just looking at the number of connections.
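
Taking the round numbers above at face value, the back-of-the-envelope comparison is simply

$$\frac{10^{14}\ \text{synapses (brain)}}{10^{12}\ \text{connections (largest LLMs)}} = 10^{2},$$

that is, roughly two orders of magnitude, before any of the extra biological complexity he mentions is counted.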

So in that sense, large language models are far, far away from being able to capture the complexity of the human brain. But there's a reverse way to look at it, which is that even though large language models are orders of magnitude away from the complexity of the human brain, they're already able to do some pretty amazing things right now. Imagine a situation where we are able to scale up these artificial systems to the level of complexity of biological systems.

Not only do we scale them up, but we somehow make them energy efficient, which right now is proving really difficult. But let's say we are able to make them energy efficient, so that even at scale they are not consuming inordinate amounts of power. So we have artificial systems that are pushing the complexity of the human brain but are also getting more energy efficient. Then couple that with the fact that these artificial systems have access to almost any information that we can feed them; we humans are not capable of that, you and I have limited access to information, right?

So you take the power of silicon, you take the amount of memory that we can give to these machines, you scale them up to the complexity of human brains. That's what makes me think that we are only just beginning with AI.

Can you tell us about your work in the science of the self?

The second book that I wrote, The Man Who Wasn't There, was an exploration of the human sense of self. Essentially, in that book I look at eight different neuropsychological, neurological conditions. Each of these conditions disturbs our sense of self in a different way. And the entire thesis of the book is that by looking at the different ways in which the self comes apart, and by self I just mean the way we internally feel about ourselves, the way our body feels to us, the way our stories feel to us, the way we think of ourselves as being here now or existing over time, from our earliest memories to imagined futures.

All that goes into this idea of being an identity, of being a person, of being this thing that exists in space and time. So the thesis of the book was, okay, let's look at the ways in which the self comes apart, not entirely, but the ways in which parts of it come apart, and ask whether that can tell us something about how this complex thing we call the self is put together in the first place by the brain and body. So that was the impetus for writing that book. It was an exploration of the human self.

You discuss various neuropsychological conditions that provide insights into the nature of self. Which condition do you find the most intriguing, and why?

Well, I had eight different conditions in the book, and honestly, each one of them, because it affects a very different aspect of the self, is both important and intriguing in its own right.

So it's really hard to say that any one condition was the most intriguing, but maybe in terms of how otherworldly it was, it was probably Cotard's syndrome. Because, you know, René Descartes, the French philosopher, said, "I think, therefore I am", and in Cotard's syndrome you can almost legitimately make the claim that they can say, "I think, therefore I am not." And the reason for saying that is that people with Cotard's syndrome are actually convinced that they don't exist, and this is such a deeply felt delusion that it is completely immune to any kind of rationalization. You can't talk them out of it until it resolves.

So while it lasts, the delusion is almost unshakable, to the point that they will actually start planning their own funeral. And we know a little bit about why that might be the case now, not the funeral-planning part, but the fact that they actually think that they don't exist. There is some neurological evidence to suggest that there are certain key brain areas that are being affected, because of which they feel like that. But to me, the reason why it's intriguing is that you can be an "I", the subject of an experience, you can be a self that says "I exist", but you can also be a self that says "I don't exist".

And it raises the fundamental question: who or what is the "I" that is making that statement? In one case it's making the statement "I exist", as Descartes would have said, and in the other situation, with Cotard's, the same "I" is making the statement "I don't exist", and it is as convinced of not existing as the former is convinced of existing.

You spoke about Alzheimer's disease and its effect on the narrative self, which was the terminology you used. How does this inform our understanding of identity and personhood?

I think Alzheimer's disease is probably the most poignant and devastating of these conditions. Because, you know, if I were to ask you, who are you, you are very likely going to give me a story about yourself. You're going to tell me who you are in the form of a story.

And these are stories that we tell ourselves and others about who we are. And these stories change depending on the context. You might be a different story with your parents, and you might be a different story with a certain set of your friends.

But nonetheless, we are stories. And what Alzheimer's is telling us is what happens when these stories disappear, which is what happens in Alzheimer's, because in Alzheimer's you have short-term memory loss: you don't form short-term memories. So as a consequence, if you just had an experience and that experience never entered short-term memory, the consequence is that it doesn't enter long-term memory.

It doesn't become an episode in your story. So your story kind of stops forming as Alzheimer's sets in, and eventually Alzheimer's basically destroys your story. You're unable to be your story, whether that story is just cognitive or a story that's in your body. For instance, if you are the conductor of an orchestra, you may lose a certain amount of cognitive skill because of Alzheimer's.

But there is an aspect of yourself that is embodied, such that if you were standing in front of your orchestra, you could potentially just conduct the orchestra without being able to cognitively say anything about it. So there's a lot of selfhood that is embodied, but all of that goes away. And one of the important philosophical arguments for a long time was that the reason why we feel like we are an "I", like a capital "I", the reason why we feel like we are the subject of different experiences, is that that sense of the "I" comes about from these narratives.

It's almost like the brain is creating these swirling narratives and we're at the center, but the center is nebulous; it's not there. It only appears to be there because of the narratives. The late philosopher Daniel Dennett had a beautiful phrase to talk about this. He called the self, the experiencing self, the center of narrative gravity.

It's an allusion to the idea that physical systems have a center of gravity. Any physical object has a center of gravity, but if you go looking for the molecule or atom that represents the center of gravity, you won't find anything; it's just a property of the entire system. And so for Dennett, the self was also a property of all these narratives swirling around, created by the brain and body.

And if you take away the narratives, there would be no "I". It turns out Alzheimer's challenges that, because in Alzheimer's you do end up losing your entire narrative, but you would be hard pressed to say that even in end-stage Alzheimer's there isn't somebody still existing who is experiencing just being a body, bodily sensations, because the sensory and motor systems of the brain are still intact, the cerebellum is mostly intact. So even though they can't cognitively recall their stories, even though their bodily selfhood has kind of gotten damaged, it's very likely that there is still somebody there experiencing just being some minimal aspect of their body, and that "I" hasn't gone away. So just by looking at how the narrative self comes apart, we understand that the self is more than just the narrative self.

You discuss the concept of body ownership in respect of the condition xenomelia. How does that affect our understanding of our bodies and ourselves?

I mean, like all the other conditions in the book, xenomelia, or what it used to be called, body integrity identity disorder, is telling us that something we take as implicit is actually something that the brain has to construct moment by moment. If you were to just look at your arm, you would have no doubt in your mind that this is your arm; there's an implicit sense of ownership of your arm.

It's even silly to be asking, is this your arm? Of course it's my arm, right? I don't think anyone in their right mind would question that feeling. But it so happens that in xenomelia, or BIID, people feel like some part of their body is not theirs. And we now, again, have some neurological evidence as to why that might be the case. But the point is that in order for us to feel like "this arm is mine", the brain has to be constantly doing what it is supposed to be doing, which is imbuing our entire bodily self with a sense of "mineness", or ownership.

And sometimes it fails. Sometimes it fails to do that for the whole body, sometimes it fails to do that for parts of the body. And when that happens, it can become extremely debilitating, because it's almost like some foreign object is attached to your body and you can't bear to have it.

It's like, you know, if you were somebody who was afraid of spiders, and a spider was sitting on your arm, you would want to take it off, and your entire attention would be focused on that foreign thing sitting on your arm. Now imagine your arm itself feeling foreign, but there's nothing you can do, because it is your arm. It's functional.

Everything else about it is fine, except that it doesn't feel like your own. It's a very difficult condition to live with. But what it tells you about the self is that things we take for granted, like the sense of body ownership, are actually something that the brain has to construct; there is nothing fundamentally real about it. It's just a kind of information processing that's happening in the brain, and sometimes it goes wrong. So you can be someone, an "I", the subject of an experience, who experiences an arm as their own, or you can be an "I" who experiences an arm as not belonging to you. So again, it comes back to this idea that we still need to explain what the "I" is.

What is your definition of agency?

So in the context of the exploration of the sense of self, agency turns out to also be a construction. We talked about this earlier: if you pick up something, you have an implicit sense that you are the agent of that action, that you willed that action into existence, right? It's just a feeling we don't question. It turns out that there are brain mechanisms that make this feeling come about.

It's not something that can be taken for granted. If you are, for instance, performing some action, the brain is sending motor commands to your arm to perform the action. But at the same time, the brain is sending a copy of those commands to other parts of the brain that are predicting the sensory consequences of the action that you are about to take.

And if the sensory consequences that have been predicted match up with what you actually feel, then the whole action is implicitly tagged as being done by you. So the sense of agency is, in this way of thinking, a computation that matches the prediction against what actually happens. If those two match, you are the agent; if, for some reason, there is a mismatch, the action that you performed will not feel like you did it.

This might seem strange, but this is exactly what happens in people with schizophrenia. They might do the same action, but they won't necessarily feel like they are the agent of the action. There is a disruption in this mechanism, which is called the comparator mechanism.

The mechanism compares the prediction against what actually happens, and if the two match, the action is tagged as being yours, and hence you have the feeling of being the agent of the action. Schizophrenia shows that that doesn't have to be the case. You can be someone who feels like they are the agent of the action, or you can be someone who feels like they're not the agent of the action that they just performed.
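
Since the comparator mechanism is described as a computation, here is a minimal, purely illustrative sketch of that predict-and-compare structure; the stand-in forward model, the tolerance value, and the function names are my own assumptions, not anything from the book:

```python
import numpy as np

def forward_model(motor_command):
    """Stand-in forward model: the predicted sensation is just a copy of the
    command (an 'efference copy'); a real forward model would be learned."""
    return np.asarray(motor_command)

def feels_self_generated(motor_command, actual_feedback, tolerance=0.1):
    """Toy comparator: predict the sensory consequence of the command, compare
    it with the actual feedback, and tag the action as 'mine' if they match."""
    predicted = forward_model(motor_command)
    mismatch = np.linalg.norm(predicted - np.asarray(actual_feedback))
    return mismatch < tolerance

command = np.array([0.5, 0.2])                 # intended reach
well_predicted = command + 0.02                # feedback matches the prediction
surprising = command + np.array([0.5, -0.4])   # feedback does not match

print(feels_self_generated(command, well_predicted))  # True: tagged as my action
print(feels_self_generated(command, surprising))      # False: no sense of agency
```

Nothing about such a comparator implies subjective experience, which is the distinction drawn next: a system can implement the computation without there being any inner feeling that goes with it.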

So even the sense of agency is a construction, in this way of thinking. Can AI models have agency? If we computationally build this mechanism into AI agents, then we are essentially defining agency as this process, and if we put the necessary computational structure in place, then yes, we have endowed them with a sense of agency.

Although a sense of agency still involves this idea that we have a subjective experience of it, that there's inner conscious experience. And I don't think anyone at this point, even if you got the computational aspects of it sorted out, would claim that AI agents are at this point feeling like they have a sense of agency. I don't know where that's going to come from or how that is going to happen, because whether or not it happens really depends on your definition of the word consciousness, and that's a different rabbit hole, and a difficult one to get into.

Anil, it's been a pleasure having you on MLST. I'm very sorry that I wasn't there on the day, but I hope our paths will cross again and we can do the interview as a tête-à-tête in the same room. Anyway, I hope you enjoyed the show.

Folks, by the way, now is an amazing time to tell you that we have a Patreon, patreon.com/MLST. It's pretty cool.

Over there we have a private Discord, and we release early-access versions of the shows. Many of the best shows that you've just been watching on our channel were there on the Patreon months ago. We have biweekly meetings with myself and Keith, and we talk about all of this stuff that we're doing. Of course, you can influence us on interesting guests to invite, et cetera. So please give us some support over at patreon.com/MLST. Cheers.