
834: In Case You Missed It in October 2024

2024/11/8

Super Data Science: ML & AI Podcast with Jon Krohn

Chapters

Dr. Bradley Voytek discusses how data science and AI can accelerate our understanding of the brain, streamlining knowledge gathering and making neuroscience research more accessible.
  • AI and data science can accelerate discoveries in neuroscience.
  • There is a need for a brain discovery engine that integrates various data sets.
  • Technology can dissolve boundaries between neuroscience disciplines.

Shownotes Transcript


This is episode number 834, our In Case You Missed It in October episode.

Welcome back to the Super Data Science podcast. I'm your host, Jon Krohn. This is an In Case You Missed It episode that highlights the best parts of conversations we had on the show over the past month.

My first clip is from episode number 829 with Dr. Bradley Voytek. Brad is an associate professor of cognitive science at UC San Diego.

I asked him how data science facilitates breakthroughs in our understanding of the brain. Data science, and particularly things like LLMs, which we touched on a little bit, will, I think, be able to accelerate discoveries in a lot of different fields, including neuroscience. Do you think that there are emerging data science technologies or methodologies that could accelerate our understanding of the brain in the coming ten years?

For sure. I mean, it's mostly a given that AI has to and will, right? It's like saying, do you think calculators will accelerate science? Yes.

Do you think search engines are going to? I can't even imagine running a research lab without search engines, right? Just the rate at which I can quickly and easily discover information has a huge impact on the way everybody does science. So Google is probably one of the transformative aspects of science of the last hundred years; it has significantly shaped the way we find and retrieve information, which lets us continue to build on science and do research better, more accurately, and faster.

And so I think LLMs are going to be something similar, right? Yes, there are so many problems with the current iteration of LLMs, which hallucinate and things like this, but you can see the glimmer of where the future will be. To give a concrete example, when I was doing my PhD I was looking specifically at the effects of very focal brain lesions in the prefrontal cortex or the basal ganglia, two interconnected structures in the brain that are known to be involved in higher-level cognition.

If somebody has a stroke that damages one of these brain regions, what impact does that have on their memory functions? That's what I spent my PhD doing. At the start of my PhD, twenty years ago in 2004, in my naivety I believed that there must be some kind of website I could go to where I could click on the prefrontal cortex on an image of the brain and get a listing of all the inputs and outputs to that brain region. It didn't exist. It still doesn't exist. It's just very frustrating, and that frustration ultimately led to a project, years later at the end of my PhD, that my wife and I published together. The issue had been frustrating me all along because, instead of having a really easy-to-discover mapping of the inputs and outputs of these different brain regions, I had to go into the UC Berkeley library archives during my PhD and dig through peer-reviewed papers published in the 1970s, where they had done all these anatomical tracing studies, trying to figure out what the inputs and outputs to these brain regions were.

I was on a panel at a conference at Stanford in 2010 or so, with quite a number of names your listeners will probably be familiar with, senior, eminent people in AI and neuroscience. Somebody asked a question on the panel, and I answered by saying that the peer-reviewed neuroscience literature probably already knows the brain: there are something like three million peer-reviewed neuroscience papers that have been published and are indexed in PubMed, which is the National Library of Medicine's (NIH's) database of peer-reviewed biomedical research. If we could tap into all of that knowledge, we would probably be fifty percent further along in neuroscience. But we as humans are limited in how much we can read and synthesize.

And one of the panel members, who is a sort of giant in the field, basically said, that's really dumb. And I was like, I'm pretty sure I'm right about this. So back in 2010, my wife and I did a proto-NLP project where, well, I should say she wrote the Python code, to scrape the text out of the abstracts of all these papers and just look at co-occurrences of words and phrases, with the hypothesis being that the more frequently two ideas are discussed together in the peer-reviewed literature, the more likely they are to be related. So, very simplistically, if a paper is written about Alzheimer's disease, it tends to also talk about memory, because Alzheimer's disease has a significant impact on memory.

But it will also mention things like tauopathies, which is one of the mechanisms by which we think Alzheimer's disease manifests. Papers that talk about Alzheimer's disease, though, are less likely to talk about bradykinesia, which is slowed movement, which is much more commonly observed in Parkinson's disease, right? And so by looking at word frequencies and co-occurrences, very simplistic proto-NLP, we built a knowledge graph of neuroscience. This is a paper that we published in 2012.

We did the project around 2010, and it was a pain in the ass to publish because of the peer reviews. We built this knowledge graph and then we could find clusters in the graph, and we went to publish the paper saying, hey look, from natural-language, free-form, peer-reviewed text we can discover clusters of topics that are interrelated. Parkinson's disease, for instance, is highly clustered with dopamine, the neurotransmitter, and with neurons in the substantia nigra, which are the dopamine-manufacturing neurons that die in Parkinson's disease and give rise to motor tremors, and with bradykinesia. And the reviewers were like, yeah, we know these things. And I was like, yes, you, as an expert who has read the principles of neuroscience and has been studying neuroscience for twenty years, know this, but now the math knows it. Isn't that amazing? But back in 2010, people weren't really buying it.
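For readers who want to see the spirit of that proto-NLP project in code, here is a rough, illustrative sketch (not the actual code from Brad's project): it counts how often pairs of terms co-occur in abstracts, uses the counts as edge weights in a graph, and looks for clusters. The term list and toy abstracts are assumptions made for the example.

```python
# Toy sketch of the co-occurrence idea: naive substring matching over a small
# hand-picked vocabulary; the real project scraped millions of PubMed abstracts.
from collections import Counter
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

TERMS = ["alzheimer's disease", "memory", "tau",
         "parkinson's disease", "dopamine", "bradykinesia"]

abstracts = [
    "alzheimer's disease is associated with memory decline and tau pathology",
    "dopamine loss in parkinson's disease produces bradykinesia",
    # ... in reality, millions more abstracts
]

pair_counts = Counter()
for text in abstracts:
    present = {t for t in TERMS if t in text.lower()}
    # every pair of terms mentioned in the same abstract is one co-occurrence
    pair_counts.update(combinations(sorted(present), 2))

graph = nx.Graph()
for (a, b), weight in pair_counts.items():
    graph.add_edge(a, b, weight=weight)

# Clusters in the weighted graph should recover groupings an expert already
# knows, e.g. parkinson's disease / dopamine / bradykinesia.
for community in greedy_modularity_communities(graph, weight="weight"):
    print(sorted(community))
```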

And now I think we're in an era where we can do that same thing, and in my lab we're trying to build this right now, actually, that same thing but two orders of magnitude more sophisticated. We're building a site right now where you can click on the prefrontal cortex, or whatever brain region, and it is built on everything we know about the brain from publicly available data sets of human brain imaging. So the Allen Brain Institute has a database of gene expression in the human brain.

There are about twenty thousand or so different genes that are differentially expressed across the human brain, so we pull that data set in. Then there's another data set of neurotransmitter densities based on positron emission tomography, and we pull that data set in, and this data set in. And this has already been done by collaborators up at McGill University; Bratislav Misic is the lab head there, and they created an open-source Python package called neuromaps, where, I think it's Ross Markello and Justine Hansen who are the first authors on the neuromaps paper published a couple of years ago, they did all the legwork of actually going out and pulling in all these publicly available datasets.
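For anyone who wants to poke at those open data sets directly, a minimal sketch along the lines of the neuromaps documentation might look like this; the exact annotation tags are from my recollection of the package's examples, so treat them as assumptions to verify against the docs.

```python
from neuromaps import datasets

# Fetch the first principal component of Allen Human Brain Atlas gene
# expression, resampled to the fsaverage surface (tags as recalled from the
# neuromaps examples; verify against the package documentation).
gene_pc1 = datasets.fetch_annotation(source="abagen", desc="genepc1",
                                     space="fsaverage", den="10k")
print(gene_pc1)  # local paths to the downloaded left/right hemisphere files

# The same fetch_annotation call can pull PET-derived neurotransmitter maps,
# and neuromaps.transforms / neuromaps.stats can move maps between spaces and
# compute spatial correlations between them.
```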

And so what we're doing right now in the lab is building a brain viewer that collates all these different data sets in the browser, so you can click on an arbitrary brain region and get a listing of everything we know about that part of the brain. The next step we're trying to build on top of that, with an industry collaborator who can't be named yet because it's not formalized, is, alongside that browser, an LLM chat window where you can say, show me the hippocampus, and the LLM will pop up and illustrate on the screen, in this sort of dynamic brain viewer, where the hippocampus is. Or say, give me a listing of the top ten genes that are most strongly expressed in the hippocampus, uniquely compared to other brain regions.

And then you can ask, what are the primary inputs and outputs? And it'll show the primary inputs and outputs. We're trying to build a brain discovery engine that is LLM-powered and trained on these peer-reviewed papers and these open data sets so that you can do better neuroscience discovery, so that we're dissolving the boundaries between, as I said at the beginning of the podcast, the neurogeneticists who don't want to think about theoretical neuroscience, or those who don't want to think about neuroanatomy, and trying to dissolve all of those boundaries to bring all these different data sets together in one easy-to-digest platform. That's honestly still probably a couple of years away, but we're prototyping it right now.

It was so interesting to hear from Brad about how technology might help us gather and streamline knowledge about the brain. I'm excited to check out his lab's brain viewer once it's ready. Our next clip is from episode number 823 with the eloquent Natalie Monbiot, who is the head of strategy at the generative AI studio Hour One.

Natalie talked to me about building digital avatars of ourselves to help scale up our public-facing work online. You're the head of strategy at Hour One, which pioneered generating lifelike video avatars, or AI clones of humans. You call them virtual humans or virtual twins, and you promote a virtual human economy that we'll get to momentarily, or maybe I'll just leave that as a teaser for now. So tell us, what value does this technology provide? What are the great use cases for virtual humans, for these virtual presences, and how does that create a virtual human economy?

So first of all, virtual humans should not replace real humans. And I think the whole preamble to this question, which we just got passionate about, suggests that virtual humans, or let's call them AI avatars, have had success in content and should continue to be deployed in areas where humans don't have any business being. That is to say, let's start with where we found product-market fit as a category.

Nuclear cleanup sites.

Yes, or learning and development in enterprise organizations, where people are just so bored with the content. It's the kind of content that you have to consume; you have to hit these kinds of quotas. People need to learn everything from safety hazards to compliance and all of that kind of thing that needs to be done.

There is a lot of budget assigned to this type of content. It isn't profit-generating, and it's boring, and it usually exists as a PDF, right? So this has been a ripe place for AI avatars and generated video content to play.

So what do you do? You can literally take PDFs and transform them into engaging, presenter-led videos through an assortment of AI avatars that you can select from the platform, and through different templates that actually make it look like you've invested a lot in video editing, so you can instantly upgrade your content. So that's one very basic area. It's not necessarily the sexiest, but it's where there has been massive product-market fit over the last few years.

And then I think the next place is, as avatars have become more commonplace, or at least as AI has become more integrated and accepted in society and culturally since the ChatGPT moment, we've seen more outward-facing use cases of this technology. So for example, one of our customers, Reckitt, the consumer goods brand, uses AI avatars within their Amazon listings to explain baby formula products.

So again, this is a place where you wouldn't have a human being presenting the small print of these products. But the small print of these products is not easy to consume, and it's important; young parents need to know this information. So this small print has been transformed into friendly, engaging, AI-avatar-led videos that explain the products in a way that is a lot more digestible.

Can this be done today? I know it could be done, and it sounds like, based on the kind of format you were describing as being possible or in development, that this is possible. But is it actually done today? In that kind of example, where there is an Amazon shopping listing being explained, can the Amazon shopper ask questions and get a response at this time?

So at this time, within the Amazon use case, the way that you could do that is through the listing itself, which supports static images and video. So within that setting, you could have a series of videos that address different questions. That's within that particular use case.

Outside of that use case, yes, I think you can have conversations with AI avatars today, but a real-time live conversation is not going to have the same visual quality as a pre-rendered video avatar that takes a couple of minutes to render. So there are some trade-offs, but we are getting to a point where it will all come together, where you can have a realistic, real-time conversation that feels lifelike. So that's coming, but today I wouldn't say that all of those components come together yet to make a great experience.

We have talked about the L&D training, and you talked about explaining shopping listings. And so now I think you were about to go to kind of the broader...

The virtual human economy, yeah. So I think currently we're going through kind of a Reid AI moment, where we've seen a thought leader really use this technology in a way that fulfills his vision and what he's trying to do, which is to get his points of view out there in ways that resonate with people. And so we've been playing with the medium of having an AI twin to help him with his thought leadership.

He also translated a commencement speech into a dozen different languages so that he could reach people in different countries whom he could normally not communicate with. And so that moment taught us a lot of things, and it continues to in this partnership. What we've seen is that people who have IP in their image, their likeness, their ideas, who they are, have a lot to be gained through this technology.

And also, people are getting used to cloning themselves, right? So this is a bit of a pivot. It's always been the vision, I'd say, that everybody with a LinkedIn profile would have an AI avatar that could communicate for them on their behalf, help them be more productive, help them augment their skills, and all of that. And I think we're hitting that pivotal moment thanks to the capabilities.

So, more realistic AI clones that people respond really well to, the fact that people are more receptive to just having AI as a communications medium in general, and then also the fact that thought leaders are actually seeing the benefit of using this technology to fulfill their mission in terms of the brand they're trying to build, and that kind of thing. So this touches on the virtual human economy, in that you can start to use your virtual human, your virtual twin, to help advance whatever it is that you are trying to do. In some cases, when we're talking about people of note who are using their virtual twin to scale their content, that's one thing, but you could also start to create products with it. We've actually had thought leaders make money out of their AI avatar having a job with a different platform.

So for example, we had a futurist whose AI twin became the AI correspondent for a news platform that is one hundred percent digital, AI-first, and uses AI avatars for presenting. And so that was kind of a new type of deal.

And the ability to scale yourself and then literally make money out of your AI twin is something that is just really fascinating. So that's an example of what I call the virtual human economy, in which we can create AI clones of ourselves and put them to work on our behalf in myriad ways. When I think about it through the lens of Hour One, these are a physical kind of AI avatar, right, our digital representations, but equally it can be your body of work, your books, the way that you think, your expertise. You can see how this works for entertainers and A-listers, right, who already trade on this asset of who they are.

But then for white-collar workers, how does that work? Well, I think what we're going to see is platforms enabling people to clone their expertise and make it available at a lower cost than would ordinarily be possible if you needed their time, which also opens up that type of expertise to people who couldn't necessarily afford it. Or just imagine, and this is what I'm hoping for, you need a contract to just be reviewed.

You don't necessarily want to pay thousands of dollars and spend weeks trying to make that happen. The idea that you can license just the access to that expertise as you need it is interesting. And it can also become a lead generator for those experts: you like what you saw, you like that little taste of my expertise, so maybe there's a more involved project, and then we'll engage in person.

We continue the thread of generative AI's commercial applications with Dr. Luca Antiga in a clip taken from episode number 831. Luca, who is the chief technology officer at Lightning AI, explains where he sees generative AI being most useful in our professional work. Looking a bit towards the future, generative AI obviously is transforming how software developers work, how data scientists work. What do you think are the kinds of new skills and knowledge bases that developers and data scientists need to stay relevant in this generative AI world?

Yeah, that's interesting, because I hear of a lot of people who complement themselves with language models, and I do as well, in maybe not small doses, depending on what I'm doing, honestly. I don't think we've cracked the recipe for working alongside AI for coding yet. There are a few very notable examples, but at the same time, for the more mundane tasks it's great, and also for complicated things it can be great, but you need to find your dimension with it. You need to use it as a wall to bounce ideas off of; that's when I get the most out of it. It helps me when I have an idea, even a theoretical idea, maybe with some math behind it, where otherwise I would look for papers on Google.

Sometimes that process of getting into the vicinity of where you want to be is greatly facilitated by a language model, a powerful one, if you know how to bounce ideas off of it, and I think that is the skill you need to develop. Also, sometimes I develop on the backend and I see a lot of things that could be so much easier if only I could delegate them to someone else, and that someone can be a language model, because those tasks are extremely predictable and repetitive. And yes, you can use libraries that abstract things away, but then what happens when something goes wrong? You have many layers to peel off.

And maybe it's just easier to keep things simple from a library perspective and have AI fill the gap between you, your willingness to spend time at that level, and the task you need to do. I do think that in the future you will just write a function, or call a function, and that function doesn't exist until you call it, or it will be cached in some way. But for sure, you won't have to weave everything from top to bottom; there will be some layers that may be delegated, again in that realm of repetitiveness.
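As a purely illustrative sketch of that "the function doesn't exist until you call it" idea, the snippet below generates a function body lazily on first call and caches it. The generate_code helper is a stand-in for whatever LLM client you use, so it is an assumption, not a real API.

```python
import functools


def generate_code(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to an LLM and return Python source.

    Stubbed here; in practice this would wrap whatever model API you use.
    """
    raise NotImplementedError("wire this up to your LLM of choice")


def defined_on_first_call(signature: str, description: str):
    """Decorator: the wrapped function's body is generated (and cached)
    the first time it is actually called."""
    def decorator(placeholder):
        impl = None  # cache for the generated implementation

        @functools.wraps(placeholder)
        def wrapper(*args, **kwargs):
            nonlocal impl
            if impl is None:
                source = generate_code(
                    f"Write a Python function `{signature}` that {description}. "
                    "Return only the code."
                )
                namespace: dict = {}
                exec(source, namespace)  # define the generated function
                impl = namespace[placeholder.__name__]
            return impl(*args, **kwargs)

        return wrapper
    return decorator


@defined_on_first_call("slugify(title: str) -> str",
                       "lowercases the title and replaces runs of "
                       "non-alphanumeric characters with single hyphens")
def slugify(title: str) -> str:
    ...  # no body yet: it is generated on the first call
```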

And so on. I don't think we're at the point where we can say that software development will be rendered obsolete. I think it's more that we're nearing the point where personal automation is within reach, and then what software development becomes once you have personal automation will evolve naturally.

Of course, if your life's ambition is to be a super expert in one language and know all the nuances, that's great. It's probably not the way things will evolve in the future, although there will still be a need for that, just in smaller numbers. But it's something I've already brought up

at some other time, I think, in another video: when I first interacted with ChatGPT, it reminded me of something that crystallized in the depths of my old-person mind, which is HyperCard. You know HyperCard? You remember HyperCard?

Uh, no. HyperCard?

So HyperCard is something that, incidentally, was created by a person under the influence of psychedelics, but it was actually a product by Apple that shipped with classic Mac OS. I don't remember whether it was software installed on top or whether it shipped with the system, but it was an attempt to bring automation and programming and building systems to people who were not programmers. It had HyperTalk, which was a scripting language that resembled English from a syntax perspective. The problem with that is that the problem is not the syntax.

The problem is that if you need to write a for loop, your mind needs to be crafted in such a way that you understand what a for loop does and you can keep track of state. So yes, you can write it with semicolons and reserved words, or in English, but it doesn't matter in the end, right? So it was trying to solve that problem from an angle of syntax and accessibility, but it didn't really solve the problem throughout, which is: how can I express what I want in natural language?

Of course, the technology at the time didn't allow you to express what you want in natural language, because that was relegated to science fiction movies. And now we're there. But the whole purpose of that thing was: can I allow someone who doesn't have a background in computer science to write their own thing, to have their own thing materialize in front of them? Because they need to solve a very specific problem and they want to solve it in the way that fits them and their immediate need.

And so I think we're at the point where the technology is ready to get there. And I think that's what makes me the most excited, right? The ability to bring automation, and the ability to express algorithms with an intention rather than having to spell out every little step along the way.

So I think, yeah, that is what the future of development might be. In a way, it will get more accessible, which is already partially true, but it's just very early, you know, with the agents and so on. And then there will be another set of people who will just dive deep into whatever they know and use language models to empower them to think faster and get to results faster. And that's it, basically.

Yes. So basically, to kind of summarize back to you, what you're describing is that with generative AI, already today we have some of these kinds of personal automation, where you can delegate some software development or data science tasks. And over time, that will become more reliable and more extensive.

But for the foreseeable future at least, the role of software developer, the role of data scientist, won't disappear. It's just that there will be more and more automation that you can spin up easily. So it provides more accessibility. It means that you don't necessarily need to be an expert in all of the different programming languages that you are developing in.

And so maybe the kinds of skills that become more important in that kind of environment are the collaborative skills with the team, to understand what the product needs, what the business needs, creativity to come up with the solutions that will really move the needle for some product or organization, and simultaneously knowing principles around architecture and how systems work. So it kind of moves you up the stack: you don't need to be worried so much about the low-level coding as a software developer or data scientist. You're thinking at a higher level, and maybe sitting a little bit closer to product.

Yeah, that's true. I agree with that. Although there's that saying, right, that you need to go low level to understand exactly what you want from a system. We're not at the level where an automated system will be able to figure out the architecture of something for you, but it will help you iterate much faster in getting that architecture out the door. And I don't think there's any system right now that is ready to just replace the whole full stack. You would still need to be full stack, but it's actually easier to be full stack, because some of the things you used to have to know in order to work at an acceptable speed, you don't need to know them all anymore to work at an acceptable speed, right?

Accessibility also concerns Chad Sanderson's work on data contracts. As mentioned in episode number 825, when we work on projects that concern data as data science practitioners, we always need to think about how other users might come to interpret our data. This is why Chad finds data contracts so important, and why he's writing a book about them.

You are the CEO of Gable, which is a data contract platform, and you're writing the definitive guide to data contracts with O'Reilly, probably the most prestigious technical publisher that you can be writing with for our space. So tell us about data contracts.

Your book introduces them as a solution to the persistent data quality and data governance issues that organizations face. But candidly, it's not something that I had heard much about. When I first saw that that's what you're an expert in, I was thinking about Web3 or the blockchain; it somehow sounded like that kind of contract to me, but I'm guessing it has nothing to do with that.

That's right. So one of the big problems that has manifested itself in the last ten or fifteen years, really since the cloud took over as the primary place that companies store massive amounts of data, is this: back in the old days, you used to have a producer of data and a consumer of data that were very tightly connected to each other, and more of a centralized team that was thinking about the data architecture and which data is actually accessible and could be used by a data scientist or a data engineer or analyst.

And they put a lot of time and effort into constructing a highly usable, highly semantically represented data model. But now, thanks to the internet and thanks to the cloud, you've got so much data flowing in from everywhere, from tens or hundreds of different sources. And when things change, it causes lots of problems for anyone who is downstream of that data: for models, for reports, for dashboards, and things like that.

So the data contract is starting to adopt a lot of the same terminology and technology as software engineers who use APIs, which are effectively service contracts, right? It's an engineer saying, hey, this is what my application produces.

You can expect this not to change. Here are some SLAs around that service, and you can trust that there is always going to be a certain level of latency and uptime. And we're taking that approach and applying it to the data as well.

Right, so it is similar, for software engineers, to the idea of... what is the term software engineers use?

A service contract.

Yeah, yeah. And so you're taking those kinds of ideas from software engineering and applying them to data? Yeah, exactly.

Data is obviously very different from applications. You need to think about the number of records that are being emitted at any particular point in time. If a team always expects there to be a thousand events in an hour, and in one particular hour there are only one or two events, that's definitely a big problem.

The schema matters a lot: if you suddenly drop a column, or add a new column that's an incremental version of a previous column, that is a really big deal. If you change the semantic meaning of the data, that is obviously another really huge deal. If I've got a column called distance, and I, as the producer, have defined it to mean kilometers, but then I change it to miles, that's going to cause an issue. So the same sort of binding agreements that APIs have, and the same explicit definitions of expectations coming from a producer, we're going to apply to the data producers and not just the software engineers on the application side.
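A minimal sketch of the kinds of checks Chad is describing, using only the Python standard library; the field names, thresholds, and ranges are illustrative assumptions, not Gable's implementation.

```python
# Illustrative only: the kinds of checks a data contract might translate into.
EXPECTED_COLUMNS = {"order_id": int, "distance_km": float, "created_at": str}
MIN_EVENTS_PER_HOUR = 1000  # the consumer's stated volume expectation


def check_hourly_batch(records: list[dict]) -> list[str]:
    """Return human-readable violations of the contract for one hourly batch."""
    violations = []

    # Volume: far fewer events than usual is itself a breach of expectations.
    if len(records) < MIN_EVENTS_PER_HOUR:
        violations.append(f"expected >= {MIN_EVENTS_PER_HOUR} events, got {len(records)}")

    for i, rec in enumerate(records):
        # Schema: dropped, renamed, or wrongly typed columns.
        if set(rec) != set(EXPECTED_COLUMNS):
            violations.append(f"record {i}: columns {sorted(rec)} differ from contract")
            continue
        for col, typ in EXPECTED_COLUMNS.items():
            if not isinstance(rec[col], typ):
                violations.append(f"record {i}: {col} should be {typ.__name__}")

        # Semantics: distance_km is defined as kilometers; a silent switch to
        # miles keeps the type, so the best we can do here is a range guard.
        dist = rec["distance_km"]
        if isinstance(dist, (int, float)) and not 0 <= dist < 20_000:
            violations.append(f"record {i}: distance_km={dist} out of range")

    return violations
```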

Very cool. Sounds really valuable. In chapter two of your forthcoming book, you discuss how data quality isn't about having pristine data, but rather about understanding the trade-offs in operationalizing data at various levels of correctness. So how can organizations strike a balance between data quality and the speed of data delivery?

That's actually a great question. So my definition of data quality is a bit different, I think, from other people's. In the software world, folks think about quality as very deterministic: I am writing a feature, I am building an application, I have a set of requirements for that application.

And if the software no longer meets those requirements, that's what we call a bug, a quality issue. But in the data space, you might have a producer who is emitting data or collecting data in some way who makes a change that is totally sensible for their use case. So as an example, maybe I have a column called timestamp, and that's currently being recorded in local time.

And I, as the engineer, decide to change that to UTC format. Totally fine. Makes complete sense. It's probably exactly what you should do. But if there's someone downstream of me expecting local time, they're going to experience a data quality issue. So my perspective is that data quality is actually a result of mismanaged expectations between data producers and data consumers. And that's sort of the function of a data contract: to help these two sides actually collaborate better with each other, work better with each other, not so much to prevent changes from happening.
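Here's a tiny, hypothetical illustration of that timestamp example: the producer's switch from local time to UTC is sensible in isolation, but it breaks a downstream consumer that was never told about the change.

```python
from datetime import datetime

# The producer's change from local time to UTC/ISO-8601 is sensible on its own,
# but it silently breaks a consumer that still assumes the old format.
row_before = {"order_id": 1, "timestamp": "2024-10-31 09:00:00"}        # local time
row_after = {"order_id": 2, "timestamp": "2024-10-31T13:00:00+00:00"}   # UTC, ISO-8601


def hour_of_day(row: dict) -> int:
    # Downstream report code written against the original expectation.
    return datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S").hour


print(hour_of_day(row_before))  # 9, as expected
print(hour_of_day(row_after))   # raises ValueError: unannounced change, broken dashboard
```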

So when you talk about data producers and data consumers like you just did there, are you typically referring to internal ones within an organization? Or, I guess, it could equally apply to an external-facing API?

Exactly. So a producer is really anyone who is making a unique transformation of the data in some way, which could mean the creation of the data itself. That might be an internal software engineer who is creating an event that's emitted from a front end, like a user clicking a button on a website. It could be a DBA who owns a database. It could be a data engineer who's aggregating all that data together and creating silver and bronze and gold data models. It could be a data scientist who aggregates all of this into a training set that ultimately another data scientist in the company is using. It could be a tool like Salesforce for CRM or SAP for ERP. Or it could be someone outside the company altogether, like a company providing an API or an FTP server to serve data up, or something like that. The problems are the same regardless.

Can you break it down for us? Now that we've been talking about data contracts, I get the utility, but can you break down for me what they look like? How are they formatted? How do you share one? And how does somebody receive it?

How do you read it? Yeah. So this is where data contracts are a little bit different from service contracts, where you have something like an OpenAPI standard. In the data contract world, it's more about having a consistent abstraction and then being able to enforce or monitor that abstraction in the different technologies where data is created or moved to.

So I prefer using something like YAML or JSON to describe my contracts, and a contract has various components within it. You might lay out the schema, the owner of the data, the SLA, the actual data asset that is being defined or referenced by the contract, any data quality rules, PII rules, and so on and so forth. And then the goal is to translate all those constraints into monitors and checks against the data itself as it's flowing between systems, or potentially even before that data has been produced or deployed in some way. But I've seen teams that have rolled out data contracts as Confluence pages, as Excel spreadsheets, really anything that allows a producer to take ownership of a data asset; that, I think, works as a first step towards data contracts.
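To make that concrete, here is a hedged sketch of what such a contract might contain, written as a Python dict (essentially the JSON form Chad mentions); the field names are illustrative, not a formal standard.

```python
# Illustrative only: a data contract expressed as a plain Python dict,
# mirroring the JSON/YAML form described above. Field names are assumptions,
# not a formal standard or Gable's schema.
order_events_contract = {
    "asset": "analytics.order_events",           # the data asset being described
    "owner": "checkout-team@example.com",         # who is accountable for changes
    "schema": {
        "order_id":    {"type": "int",    "required": True},
        "distance_km": {"type": "float",  "required": True,
                        "semantics": "distance in kilometers"},  # unit is part of the contract
        "created_at":  {"type": "string", "format": "ISO-8601, UTC"},
    },
    "sla": {
        "freshness": "data lands within 15 minutes of the event",
        "expected_volume": ">= 1000 events per hour",
    },
    "quality_rules": [
        "order_id is unique per batch",
        "distance_km is non-negative",
    ],
    "pii": {"contains_pii": False},
    "change_policy": "breaking changes require 30 days notice to consumers",
}
```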

Awesome. Yeah, crystal clear. Let's talk about trustworthiness around data. We've talked now about data correctness, which relates to trustworthiness, and you've argued that the value of data hinges on it being trustworthy. So how do data contracts help establish trust between data producers and consumers? And what role do data contracts play in rebuilding trust if it's been lost?

So I think trust comes down to a couple of components. One component of trust is understanding, and the second component of trust is meeting a consistent expectation. When I say understanding, what I'm referring to is that I am more willing to trust a data source or a data set if I understand what it actually represents.

When a table is called customer orders, does that mean customer orders that were placed through our website, or through the application, or through both, or through our customer service line? Does it just refer to a certain type of customer or a certain type of order? The more information I have about that data asset, the more I can actually trust it. And then the second part of trust is the expectation setting.

So what is going to happen to that data set over time? Is it going to be changing every month? Am I going to know when it changes? Will I know the context of the change so that I can adjust my training data or my queries? I think the same is actually true in real life, right? If someone says to you, hey Jon, I'm going to come over to your house later, but I might be thirty to forty-five minutes late because of traffic, you respond very differently than if someone is just forty-five minutes late and they don't tell you, they just show up.

So I think this is where trust comes from, and the data contract is really all about setting the expectation and also helping people understand what the data actually means and how they should use it.

My final and favorite snippet from the excellent month of interviews comes from episode number 827 with Ritchie Vink. Ritchie is CEO and cofounder of Polars Inc., and he made the perfect guest for answering all our listeners' burning questions about working with the popular Polars library for dataframe operations. We had a great chat about the open-source Python library's incredible specs and what users can expect from it. So Ritchie, what is the secret sauce? I guess it's not so secret, because you have blogged about it. What is the not-so-secret sauce behind Polars being so much faster and more memory-efficient relative to the incumbents out there for relational data processing, for dataframe-like data processing?

There are a few things you can do. It's actually pretty old, relational data processing; it's what databases have been doing for decades. There's SQLite, there's ClickHouse, there's Snowflake. All these databases exist and they have different performance, and the different performance comes down to various reasons.

For instance, if you look at SQLite, which is row-oriented, that is great for transaction processing. Transactional data processing is when you have a database and you use it for transactions: for instance, if you buy a product, you update a row and then you need to check whether that transaction succeeded or not, otherwise you have to roll back. That's one application of a database.

Another one is analytical data processing, and that's more where Polars and pandas, or Snowflake, come in. In that case, doing things columnar is way faster. Columnar means that you process data column by column.

This is something pandas does as well; it's based on NumPy. But there are other things you need to do, which pandas has ignored, and that's multiprocessing or multithreading, basically multithreaded parallel programming. My laptop has sixteen cores available; I want to use them. It's a waste of those resources if you only use one core for expensive operations like joins or group-bys.

The other one is that Polars is close to the metal. Pandas just takes NumPy, which was meant for numerical data analysis. It's great, but when you have string data, before NumPy 2.0 there was no really good solution for that. And if you talk about nested data, like lists and structs and arbitrary nesting, pandas actually sort of gave up, because it just used the Python object type, which means, hey, we don't know what to do with this.

We let the Python interpreter figure out what to do with this. So in that sense, pandas took NumPy and built on top of that, but NumPy was never really meant to have a data processing tool, like a database, built on top of it. Polars is written from scratch. It's written from scratch in Rust, and every performance-critical data structure, we control.

And with that control, we can have very effective caching behavior, very effective resource allocation, very effective control over memory. That's the most important part, because a lot of compute, a lot of resource usage, comes down to the control of memory.

And then the third point, the one I think is very important, is that we also have a query optimizer. If you look at a database, it can be really fast because of how you write the code, how you write the kernels that execute the compute. But there's also an optimizer.

And this optimizer will make sure you only do the compute that's needed. This is very similar to what a C compiler does. If you write your C, you can be sure that the compiler, that the computer, will never execute the code just as you've written it.

There will be a compiler in between that will try as hard as possible to prove that it doesn't have to do certain amounts, certain kinds, of work. And that's actually quite similar in data processing. If you don't need to load a column, that saves an I/O trip.

It saves resource allocation. So this can save, yeah, a huge amount of work.
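To make the optimizer point concrete, here is a minimal sketch using the Polars lazy API in a recent version of the library; the sales.csv file and its columns are assumptions for the example. Because the query is built lazily, the optimizer can push the filter and column selection down into the scan, and the group-by runs in parallel across cores.

```python
import polars as pl

# Lazy query: nothing is read or computed until .collect() is called, so the
# optimizer can push the filter and the column selection down into the CSV
# scan and only materialize the rows and columns the query actually needs.
lazy_query = (
    pl.scan_csv("sales.csv")                       # hypothetical file
      .filter(pl.col("region") == "EMEA")          # predicate pushdown
      .group_by("country")                         # multithreaded group-by
      .agg(
          pl.col("price").sum().alias("revenue"),
          pl.len().alias("n_orders"),
      )
      .sort("revenue", descending=True)
)

print(lazy_query.explain())    # inspect the optimized plan before running it
result = lazy_query.collect()  # executes in parallel across available cores
print(result)
```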

All right, that's it for today's In Case You Missed It episode. To be sure not to miss any of our exciting upcoming episodes, be sure to subscribe to this podcast if you haven't already. But most importantly, I just hope you'll keep on listening. Until next time, keep on rockin' it out there, and I'm looking forward to enjoying another round of the Super Data Science podcast with you very soon.