
Big data is dead, analytics is alive

2024/10/24

Practical AI: Machine Learning, Data Science, LLM

People
Adithya Krishnan
Chris Benson
Till Döhmen
Topics
Chris Benson: This episode discusses how DuckDB and MotherDuck are changing the data analytics and AI landscape, focusing on what makes DuckDB unique, such as its ability to run fast analytical queries over a wide variety of data sources, and how it combines with AI techniques like text-to-SQL, vector search, and AI-driven SQL query correction.

Till Döhmen: Shares his first encounter with DuckDB and its advantages in speed and efficiency, especially compared with traditional big data tools like Spark. He explains how DuckDB's in-process architecture improves performance and simplifies data preparation pipelines, discusses its versatility in handling many data formats and integrating with other tools, and looks ahead to possible future directions such as integration with local models and remote knowledge bases.

Adithya Krishnan: Recounts his early experience with DuckDB, in particular its ability to do geospatial analytics in the browser. He highlights DuckDB's developer-friendly design and how shared memory and a streamlined SQL dialect improve productivity. He also discusses how MotherDuck extends DuckDB for large-scale cloud analytics and collaboration, digs into DuckDB's vector search and hybrid search capabilities and their use in AI workflows, and closes with an outlook on integrating AI and machine learning features into the database itself.

Chris Benson: Expresses his appreciation for DuckDB, especially its ability to run complex analytical queries on a local machine. He shares his experience with traditional systems like Spark, how DuckDB addresses their performance problems, and how DuckDB integrates into AI workflows, such as generating SQL queries from natural language and pulling data from multiple sources.

Chapters
The episode explores the evolution of data analytics from the "big data" era to the present, discussing DuckDB's impact. It highlights DuckDB's speed, in-process nature, and ability to handle large datasets locally, contrasting it with traditional systems like Spark.
  • DuckDB is a fast, in-process SQL OLAP database management system.
  • DuckDB allows for efficient data analysis on local machines, challenging the need for large cloud servers.
  • The limitations of client-server protocols in traditional databases are a key driver for DuckDB's development.

Transcript


Welcome to Practical AI. If you work in artificial intelligence, aspire to, or are curious how AI-related tech is changing the world, this is the show for you.

Thank you to our partners at Fly.io. Fly transforms containers into micro-VMs that run on their hardware in thirty-plus regions on six continents, so you can launch your app near your users. Learn more at fly.io.

Friends, you know we're big fans of Fly.io, and I'm here with Kurt Mackey, co-founder and CEO of Fly. We have had some conversations, and I've heard you say that public clouds suck. What is your personal lens into public clouds sucking, and how does Fly not suck?

Alright, so public clouds suck. I actually think most ways of hosting stuff on the internet suck, and I have a lot of theories about why this is, but the why almost doesn't matter. The reality is, I've built a new app for generating sandwich recipes, because my family enjoys specific types of sandwiches that use brown sugar as a component, for example.

And then I want to put that somewhere. You go to AWS and it's harder than just going and getting a dedicated server somewhere. It's actually more complicated to figure out how to deploy my dumb sandwich app on top of AWS, because it's not built for me as a developer to be productive with.

It's built for other people. It's built for platform teams to build the infrastructure of their dreams and hopefully create a new UX for the developers that they work with. And again, I feel like every time I talk about this, it's like I'm just too impatient.

I don't particularly want to go figure so many things out purely to put my sandwich app in front of people, and I don't particularly want to have to go talk to a platform team once my sandwich app becomes a huge hit, and I have to, like, do a new deploy. I kind of feel like all that stuff should just work for me without me having to go ask permission or talk to anyone else.

And so this has informed a lot of how we built Fly. We're still a public cloud. We still have a lot of very similar low-level primitives as the bigger guys, but in general, they are designed to be used directly by developers.

They're not built for a platform team to cobble together; they're designed to be useful quickly for developers. One of the ways we've thought about this is that if you can turn a very difficult problem into a two-hour problem, people will build much more interesting types of apps.

And so this is why we've done things like making it easy to run an app multi-region. Companies don't run multi-region apps on public clouds, because it's functionally impossible to do without a huge amount of up-front effort. It's why we put things like the virtual machine primitives behind just a simple API. Most people don't run their own virtualization because it's not really easy; there's not another path to that on top of the clouds.

So in general, I feel like, and it's not really fair of me to say public clouds suck, because they were built for a different time. If you built one of these things starting in 2007, the world was very different than it is right now. So a lot of what I'm saying, I think, is that public clouds are kind of old, and there's a new version of public clouds that we should all be building on top of. Definitely, me as a developer, I'm much happier than I was five or six years ago when I was kind of stuck in this quagmire.

So it was built for a different era, a different cloud era, and Fly is a public cloud, yes, but a public cloud built for developers who ship. That's the difference. And we here at Changelog are developers who ship, so you should trust us: try Fly. Fly.io.

Over three million apps, and that includes us, have launched on Fly: the global load balancing, the zero-config private networking, hardware isolation, instant WireGuard VPN connections, with push-button deployment, scaling to thousands of instances. This is the cloud you want. Check it out: fly.io. Again, fly.io.

Welcome to another episode of the Practical AI podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, where we're building a private, secure AI platform.

And I'm joined as always by my co-host Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris? Doing very well, Daniel. How's it going? It is going great. I am super excited about this one. We schedule a lot of shows, and they're all interesting, of course, but occasionally there's a show on a topic that intersects with something that I'm working on at the moment, or something that I found to be really exciting and really useful. And so, selfishly, I'm extra excited about this episode this week, which is with Till and Adithya from MotherDuck.

How are you doing? Doing good, excited to be here.

Yes, and note: duck, as in the bird, so editors, you don't have to bleep us out. I'm sure that's an old, old joke for you all. I can pinpoint very easily how I ran across DuckDB and MotherDuck: there's a blog post, and the title was very simple. It said "Big data is dead," and immediately when I saw the title, I was like, thank goodness, finally.

But I'm wondering if you can maybe just step back. It doesn't necessarily have to be the points in that blog post, but how do you see the data analytics, big data, and AI intersection as of now? What are the sorts of concerns and issues that people are thinking about that are driving them to DuckDB? Then of course we'll obviously get into DuckDB and MotherDuck and all that you're doing, but setting that stage: what are people struggling with? What have they realized in the past about this sort of big data hype, in one way or the other, positive or negative? And how has that changed the way people are thinking about analytics and databases?

I can tell the story about how I got in touch with DuckDB. It started at the very beginning of the DuckDB project. I was actually doing my master's thesis at CWI, where DuckDB originated from, and after I graduated, Hannes Mühleisen, the founder of DuckDB Labs, reached out. We were talking, and they were saying: we are working on this new project, we're working on this database system.

Are you interested in maybe joining, maybe working on it? But I was very focused on machine learning and things like that, so I wanted to go into data analytics, data science, those kinds of things.

So a year later, I was working at a tech company, and we were analyzing customer data with Spark. One day, one of the first versions of DuckDB was released, so I pip installed it and ran a first simple aggregation query over maybe a hundred-megabyte dataset, something like that.

And I was surprised, because I thought something was going wrong. I thought, it's impossible, that was just the aggregation, right? Because from working with Spark, I was so used to, okay, now the spinner is starting, for ten seconds at least.

That was really eye-opening. And I've heard similar experiences from a lot of people, even to this day you hear very similar stories and experiences.
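That first encounter is easy to reproduce today. Here is a minimal sketch in DuckDB SQL; the file name and columns are hypothetical:

```sql
-- After `pip install duckdb`, this runs from the Python API or the CLI.
-- A simple aggregation over a ~100 MB CSV typically returns in well
-- under a second on a laptop, with no cluster to spin up:
SELECT customer_id,
       count(*)    AS orders,
       sum(amount) AS total_spent
FROM 'orders.csv'
GROUP BY customer_id
ORDER BY total_spent DESC;
```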

Yeah, and for me it started in a different way. I first found out that DuckDB-Wasm existed, that you could run an analytical engine in the browser, and to think about something like that was super crazy.

And the kind of stuff that you could do on top of it started to look super crazy. One of the things that I was super excited about when DuckDB-Wasm was released was the possibility to do geospatial analytics. So back then, my first encounter with DuckDB was doing geospatial analytics, and to think that it could actually be done in the browser was mind-blowing. That's when my journey with DuckDB started.

So let me ask you a follow-up question as you're diving into your passion, for those out there who may be listening who are not already familiar with it. They're hearing database, they're hearing big data is dead, they're hearing doing this in the browser. Give me a little bit of background on the ecosystem that you're coming from, and also what this idea was, so that people can follow you into that. What is it that caught your passion and attention and made you say, oh, this is the way? And assume somebody doesn't already have familiarity with it.

So I guess I was going into this coming from the machine learning side of things. I was used to working with scikit-learn and pandas, or the Spark equivalents of that, like building data pipelines with Spark and so on. And then I encountered this DuckDB thing that suddenly, apparently, was doing aggregations over the sizes of data I was working with much, much, much faster.

That sparked some fantasies: how much of the data preparation pipeline can we push into DuckDB, actually? And this idea, that fantasy, has been following me for the past years, and I think it's still an exciting topic.

To follow up a little bit on that: the way that large data, or big data, has been analyzed in recent years, the assumption was always that you require some server in the cloud, that you require resources that are not local, to be able to perform large analyses. But something that DuckDB opened up, that it made possible, was to use your local computer, your local MacBook, for example, and utilize that computer to its fullest to perform these kinds of huge analyses. And that, I guess, set the spark for a change in the ecosystem, I would say. And I guess that's where we are.

I resonate so much with this. Coming from a background also as a data scientist, living through the years of being told, hey, you know, use Spark for this: basically my experience in this sort of ecosystem was, I would try to write a query, and it would get the right result.

But to your point, Till, I would just be waiting forever to get a result. And so I'd have to send it to some other guy whose name was, like, Eugene. Eugene was really smart, and he could figure out a way to make it go fast. And I never became Eugene. So I resonate with this very much.

And this concept of: hey, there are these seemingly big datasets out there, and I want to do maybe even complicated analytics types of queries over them, or even, as you mentioned, Till, execute workflows of aggregation or other processes at query time, and I could do that with a system that I could just run on my laptop, or connect to in process, is really intriguing. So maybe now is a good time to introduce DuckDB formally.

On the DuckDB site it says DuckDB is a fast, in-process, analytical database. So maybe one of you could take a stab, thinking about those data scientists out there who are maybe at the point of not believing that what we just described is possible, or who are living in a world where that's not possible: describe what DuckDB is, and maybe why that becomes possible as a function of what it is.

I think I can talk a little bit about the motivation behind DuckDB, at least the way I perceived it at the time. It actually originated from the R ecosystem. Hannes was very involved in the R ecosystem.

People were using R to crunch relatively large data with relatively primitive methods. At the time, CWI had an analytical database system called MonetDB that had incorporated the idea of vectorized columnar execution, and it was a large system that was not really easy for the typical R user to adopt. So the first idea was to say, hey, let's maybe build a light version of MonetDB and integrate it with, I think it was dplyr, something like this, and we just let it run on the client. But eventually it turned out to be easier to just rebuild a database system from scratch that was actually designed to run in process, to be super lightweight, super easy to install and everything, essentially to give the power of this vectorized query execution into the hands of data analysts.

I'm wondering, when you talk about it being in process and lightweight, could you describe what that means for someone who may not be familiar with the term in-process? How is that different from other databases that are not in process, that have their own processes? Can you describe a little bit of what that means?

So classical database systems operate in a client-server architecture. Usually you have a database server running somewhere, and you have a client that sends queries to the database server, and then the result is transferred back to the client through some kind of transfer protocol.

And there was a paper by Hannes and Mark, Mark Raasveldt, who was also a co-founder of DuckDB Labs. They were working on a paper that basically benchmarked client protocols, and it turned out that that was actually a huge bottleneck. So even when you're running Postgres on your local machine, you still have this client-server protocol cost. The way to get around this is to have the database actually running within your process, that is, in that case maybe R or Python, which then has access to the result set just in memory, and no transfer has to happen.

And maybe I'd like to just add, for those in our audience who maybe haven't done programming and such, that it's expensive to go between processes. With a database server in a different process, it takes a lot of resources to go from the process you're in over to that one and back. So this puts it all into one, you might say, one little sandbox, where you're able to maximize that. Would that be a fair assessment?

Yeah. So I think one of the other advantages of having this type of model is that you can share memory within the process. Just to go a little bit into the technical aspects of this: the bottleneck that Till was explaining was more like the data transfer bottleneck.

But in this case, when it's running within the process, you can share the same memory. You can share the variables that you are instantiating inside, let's say, a Python script that you're running, and then you have access to that value inside your database as well, for example.

And this makes it super powerful for the developer, for the developer experience as well. I guess, apart from the database itself being super fast, the developer experience of using DuckDB is so awesome in that sense that I guess that has also contributed to its success.
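As a hedged sketch of that shared-memory experience (the DataFrame name `events` and the surrounding Python call are hypothetical; DuckDB's Python API can resolve an in-scope pandas DataFrame by name via its replacement-scan mechanism):

```sql
-- Run from the host process, e.g. duckdb.sql("...") in a Python script.
-- `events` is not a table on disk: it is a pandas DataFrame living in
-- the same process, which the in-process engine scans directly from
-- memory, with no client-server transfer in either direction.
SELECT user_id, count(*) AS n_events
FROM events
GROUP BY user_id;
```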

Okay, friends, I'm here with a new friend of ours over at Timescale, Avthar. So, Avthar, help us understand: what exactly is Timescale?

So Timescale is a Postgres company. We build tools in the cloud and in the open source ecosystem that allow developers to do more with Postgres, using it for things like time series and analytics, and more recently AI applications like RAG, search, and agents.

Okay, if our listeners want to get started with Postgres, Timescale, and AI application development, what would you tell them?

If you're a developer, you're either getting support to build an AI application, or you're interested, you're seeing all the innovation going on in the space and want to get involved yourself. And the good news is that any developer can become an AI engineer using tools that they already know and love.

And so the work that we've been doing at Timescale with the pgai project is allowing developers to build AI applications with the tools and with the database that they already know, that being Postgres. What this means is that you can actually level up your career, you can build new, interesting projects.

You can add more skills without learning a whole new set of technologies. And the best part is, it's all open source: pgai and pgvectorscale are open source. You can go and spin them up on your local machine via Docker, follow one of the tutorials, and build these cutting-edge applications like RAG and search without having to learn ten different new technologies, just using Postgres and the query language that you probably already know. That's it: it's the pgai project. Just go to any of the Timescale GitHub repos, either the pgai one or the pgvectorscale one, and follow one of the tutorials to get started.

Okay, just use Postgres. Just use Postgres to get started with AI development: build RAG and search, and it's all open source. Go to timescale.com/ai, play with pgai, play with pgvectorscale, all locally on your desktop. It's open source. Once again: timescale.com/ai.

So, Adithya, you were just describing the developer experience, which I would definitely say fits that magical experience that you alluded to with DuckDB. And maybe just to give people a sense: when I was initially exploring, similar to some of the experiences that you all talked about, I would encourage our listeners to go out and install DuckDB locally and try something, because it is a really interesting experience, especially for those who have worked with traditional database systems in the past. So you install DuckDB locally and import it as a library.

Then you can query, pointing to CSV files or JSON files or Parquet files, or even a database like a Postgres database, or data stored in an S3 bucket, and you have this consistent, familiar SQL interface that you can use to do queries over that data. So maybe one of you could describe, just to give people a sense of the use cases for DuckDB: on one side, the primary, or key, or most often occurring use cases you see people grabbing DuckDB for, and then on the other side, to help people understand where it fits, maybe where it wouldn't be as relevant, if you have any of those thoughts.
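A sketch of that consistent interface across sources (the paths, bucket, and connection string here are all hypothetical; `httpfs` and `postgres` are DuckDB's own extensions for remote storage and Postgres attachment):

```sql
-- Local files are queryable by path, with the format inferred:
SELECT count(*) FROM 'events.json';
SELECT avg(price) FROM 'sales/*.parquet';

-- The httpfs extension adds S3/HTTP access:
INSTALL httpfs; LOAD httpfs;
SELECT * FROM read_parquet('s3://my-bucket/sales/2024/*.parquet') LIMIT 10;

-- The postgres extension attaches a live Postgres database:
INSTALL postgres; LOAD postgres;
ATTACH 'dbname=shop host=localhost' AS pg (TYPE postgres);
SELECT * FROM pg.public.orders LIMIT 10;
```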

I can give a brief overview of this. Some of the biggest users of DuckDB come from the Python ecosystem, which means it's being used as a stand-in for a data frame, for example. And one of the advantages of using DuckDB is that it is really fast on aggregations.

For the Python ecosystem, it helps by standing in for a data frame to be used with other ML libraries, for example, right? So that's one part of the ecosystem.

The other part of the ecosystem is for a data engineer to be able to pull in data from different sources, like you said, from Postgres, from CSV, and to be able to join those different datasets. Joins are really good with DuckDB as well.

Creating transformed datasets is also pretty useful. And the third ecosystem is for a data analyst who is writing SQL. One of the really nice aspects of DuckDB is the SQL dialect itself; it's flavored so that you have a lot of DuckDB functions that make data cleaning easy, data transformation easy. For example, we also have a dialect that says FROM table, and that's just going to show you the table.

Instead of going SELECT * FROM table, you can go FROM table, and that will just fetch the data from that table. So there are flavors of the DuckDB dialect that make writing SQL nicer.
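A few of those dialect conveniences, sketched with a hypothetical `trips` table (these shorthands are DuckDB-specific "friendly SQL" extensions):

```sql
FROM trips;                                      -- same as SELECT * FROM trips
SELECT * EXCLUDE (raw_payload) FROM trips;       -- all columns except one
SELECT city, avg(fare) FROM trips GROUP BY ALL;  -- grouping keys inferred
```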

I was also looking through the DuckDB website and such, and I know it runs on all the major platforms and architectures, and you support a variety of languages on it. I'm curious, and I'm asking a question out of my own interest, selfishly as Daniel would say: do you support embedded environments, on the edge, that kind of stuff? Do you find it embedded and operating not necessarily on a cloud server on one of the major platforms? Is that a typical use case?

That is one of the good use cases for DuckDB. Since it's the in-process model that it has for running, DuckDB can run wherever you want, in Python or anywhere.

And they've also optimized it to run on different architectures as well, so this makes it possible. To go beyond that, you can also run it in the browser, so you can run it in any edge environment. Of course, there's still a lot of optimization to do for a lot of edge environments at the moment; not everything is optimized to be interactive, but I guess it's also moving towards being runnable in every edge environment as well.

Some of our listeners might be curious why, you know, a person like me, sort of living day to day in the AI world, is super excited to talk about DuckDB. I mean, certainly I have a past more broadly in data science, and this is pain I felt over time. But also there's a very relevant piece of this that intersects with the needs of the AI community more broadly and the workflows that they're executing.

One of those, where I kind of started getting into this, is in the sort of dashboard-killing AI apps that people are trying to build, in the sense that, hey, another pain of mine as a data scientist in my life is building dashboards, because you always build them and they never answer the questions that people actually have. So there's this real desire to have a natural language question input, and then you can compute the answer to that natural language question very quickly by using the LLM to generate a SQL query against a number of data sources.

But then when you start thinking about, oh, well, now I have these CSV files that people have uploaded into a chat interface, or I have these types of databases that I need to connect to, or I have this data in S3 buckets, and my answer could come from these different places, all of a sudden this kind of rich SQL dialect that you talked about, which is very quick and can run in Wasm, with a standardized API across those sources, becomes incredibly intriguing. For me, transparently, that's how I sort of got into this, thinking of all of these sources of data that I could answer questions out of using an LLM. But how do I standardize a fast interface to all of these diverse sets of data, and do it in a way that is easy to use from a developer perspective? But I also know that you will see much more than I do, and maybe that is an entry point that you're seeing. I'm wondering if you could talk a little more broadly about how the problems DuckDB is solving, and the problems your customers are looking at, are interacting with this rapidly developing world of AI workflows.

I mean, one way to describe DuckDB is as SQLite for analytics. It is basically a very easy, very developer-friendly way to achieve what you just described. If I want to create a demo for my new text-to-SQL model, if I use DuckDB for it, I can even make a completely Wasm-based demo out of it.

For example, I don't have any issues with a CSV upload. With other databases, you might have to specify the delimiter of the file that the user uploads.

So I would have to show a dialog to my user that says, oh, it's comma-separated, and it has a header, and so on. With DuckDB, it just works. It takes away some of the rough edges you might have with other databases. And on top of that, as you said, it integrates with different storage backends: it can read from S3, it can read from HTTP.

When I see an interesting file on, say, Hugging Face or GitHub, I just run read_csv from that URL, and I have the dataset locally in my CLI, in my Python. Furthermore, when I have, say, a Python environment, I start a Colab notebook, right? And I create some data frames; then with DuckDB, I can just read those data frames.

I've seen very cool demos of people basically using text-to-SQL for analytics on pandas data frames, and under the hood it's just DuckDB sitting there, basically reading straight from those pandas data frames, which, by the way, is one of the other benefits of shared memory and being in process: it's not only for fetching results, it's also for reading data straight from the process, in that case from pandas. That's very exciting. I'm happy to talk more about text-to-SQL; we have had a project about that at MotherDuck.
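That URL workflow is a one-liner. A sketch with a hypothetical URL (DuckDB's `read_csv` sniffs the delimiter, header, and column types automatically):

```sql
-- Works the same in the CLI, a notebook, or a script; the file is
-- fetched over HTTP(S) and queried in place, with no import step.
SELECT *
FROM read_csv('https://example.com/datasets/penguins.csv')
LIMIT 5;
```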

Yeah, and maybe before we get into some of those stories: I think that's one side of it, the integration of this analytics piece into AI workflows. But also, if I'm not mistaken, there are sort of vector search capabilities within DuckDB as well. I don't know if one of you could speak to that.

Yeah, that's one of the exciting aspects of DuckDB as well. If I could take a step back and think about other ecosystems where, let's say, Postgres has been shining a lot: Postgres has exploded in the kinds of possibilities of what you can do, because it has an amazing extension mechanism where you can add extensions and capabilities.

In a similar way, DuckDB has an extension mechanism where you have access to the internal workings of DuckDB, and you can add more workflows on top of what DuckDB can do, right? DuckDB has the capability of doing vector search, for example, and it also enables hybrid search, where you have full-text search and vector search that you can put together to create hybrid search. One of the ways it does this is that it has a really nice data type.

I won't go down the rabbit hole of the inner workings of how they make this happen, which is also pretty exciting, but one of the things that makes this possible is that they provide an array data type, where you can have an array of floating points and store it as a data type. That eventually becomes an embedding vector that you can do cosine similarity against.

So that lets you do an embedding-based search. Then you can also have full-text search, where you create an inverted index of keywords to your documents, and you can search across your keywords to find your ideal documents and rank them according to the score. And then you can fuse both of these scores, from the embedding search and from the full-text search, to have a hybrid search. So all of these are possible, and they're very accessible.
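A hedged sketch of those pieces (the table, columns, and toy vectors are hypothetical; `array_cosine_similarity` and the `fts` extension are the DuckDB features described above, and fusing the two scores, e.g. by a weighted sum, is left as the application's choice):

```sql
-- Fixed-size FLOAT arrays act as embedding vectors:
CREATE TABLE docs (id INTEGER, body TEXT, emb FLOAT[3]);
INSERT INTO docs VALUES
    (1, 'ducks swim on the lake', [0.9, 0.1, 0.0]),
    (2, 'databases store tables', [0.0, 0.8, 0.6]);

-- Embedding search: rank by cosine similarity to a query vector.
SELECT id, array_cosine_similarity(emb, [0.85, 0.2, 0.1]::FLOAT[3]) AS sim
FROM docs ORDER BY sim DESC;

-- Full-text search: build an inverted index, score with BM25.
INSTALL fts; LOAD fts;
PRAGMA create_fts_index('docs', 'id', 'body');
SELECT id, fts_main_docs.match_bm25(id, 'ducks') AS score FROM docs;
```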

Well, there's no shortage of helpful AI tools out there, but using these tools means you switch back and forth, back and forth, between yet one more tool. So instead of simplifying your workflow, it just gets more complicated. But that's not how it works when you're using Notion.

Notion is the perfect place to organize lots of stuff: tasks, tracking your habits, writing beautiful docs, collaborating with your team, knowledge bases. And the more content you add to Notion, the more this tool called Notion AI can personalize its responses for you. Unlike generic chatbots, Notion AI has the context of your work.

Plus, it has multiple knowledge sources. It uses knowledge from GPT-4 and Claude to help you chat about any topic. And here's the kicker: now in beta, Notion AI can search across Slack discussions, Google Docs, Sheets, and Slides, and even more tools like GitHub and Jira are coming soon.

And unlike specialized tools or legacy suites that have you bouncing between different applications, Notion is seamlessly integrated, infinitely flexible, and beautifully easy to use, so you are empowered to do your most meaningful work inside Notion. From small teams to massive Fortune 500 companies, teams both small and large use Notion to send less email, cancel more meetings, save time searching for their work, and reduce spending on tools, which helps everyone stay on the same page.

Try Notion for free today by going to notion.com/practicalai. That's all lowercase: notion.com/practicalai, to try the powerful, easy-to-use Notion AI today. And of course, when you use our link, you're supporting our show, and I know you love that. Again: notion.com/practicalai.

So, Till, you were starting to get into some of the things that you're now doing at MotherDuck on top of DuckDB. Hopefully we can get to some of those use cases, the things that you've been doing with customers or internally. But before we do that, I see also this sort of story about DuckDB efficiency, but with this kind of multiplayer aspect as part of what you're doing at MotherDuck. So maybe one of you could describe, now that I think we have a sense of what DuckDB is, this free thing that is open, that I can pull down, install, and run very quickly, run on my laptop, run in my browser, do these analytic queries: describe a little bit of how you're taking that further with MotherDuck, and how you're thinking about some of the enterprise use cases.

I like to describe MotherDuck as giving your DuckDB a cloud companion. It's easy to associate, okay, we bring DuckDB to the cloud, which is one way we describe ourselves as well, with: we provide infinite scale in the cloud.

You give us a workload, and we start however many hundred DuckDBs in the background that, in a Dask-like fashion, process your data concurrently. But actually, one of the hypotheses that MotherDuck is based on, that the company was founded on, is that single-node compute, which means one DuckDB database, with nowadays' cloud hardware, can actually get you very, very, very far. So when your local computer's resources reach their limit, you have single cloud instances with up to as much as twenty-four terabytes of memory; that's relatively big data. So that's one aspect, right?

So scaling up with one cloud companion is one aspect. Another aspect is collaboration. Once you're connected to a cloud instance, you can have shared context with other users in your organization. You can create shared data sets, you can have shared notebooks, and so on and so forth. And with that, of course, come all the enterprise SOC 2 kinds of things that some of the enterprise customers require in order to adopt it.

I'm curious, you really captured my imagination with that description, because, drawing a contrast with, for instance, the old-school Postgres things that people would do, you just talked about having many DuckDB instances operating concurrently.

Grounding it in a practical way, from a user's perspective: what kinds of problems do you see people solving with that kind of architecture and that new capability that they may not have historically had over the years with previous database capabilities on other platforms? What new sets of concerns can they address now with this?

I would come at this from the perspective that there are a lot of companies out there that, when they want to go to the cloud with their analytics workloads, have relatively limited choices. Among those choices are Snowflake or Databricks, and of course those systems are optimized for big data scale. But one of our observations is that a lot of companies actually don't have that amount of data when they run queries, or they might have big data, but the queries they are running only access a very small subset of the data. For example, when you run a monthly report, it doesn't touch your entire historic data set. So those companies might want something that is, first, easier to use and easier to set up, and that's also more consumerized than other existing solutions.

One of the things that we haven't touched on yet is how MotherDuck and DuckDB go hand in hand, with the remote and the local aspect, where you have the same client locally and remotely, so you're actually running the same thing. It's easy to go from one place to the other doing the same thing.

And what MotherDuck also provides is dual execution, where your local DuckDB, if you're running it locally, can communicate with your remote MotherDuck and execute seamlessly across both. For example, a query where you have a table in your local DuckDB and you want to join it with a remote DuckDB table: you can join both of these tables together in one query. And then there is query optimization that we run, where we transfer the data that was required from the remote to your local, or from your local to the remote, and execute it intelligently, if I could say that. And this opens up new opportunities in the dual-execution aspect of running local and remote with the same client.
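To make the dual-execution idea concrete, here is a minimal sketch of the kind of cost decision a query planner can make when joining a local table with a remote one. The function name and the simple byte-count cost model are illustrative assumptions, not MotherDuck's actual optimizer:

```python
def choose_join_site(local_rows: int, remote_rows: int, row_bytes: int = 100) -> str:
    """Illustrative cost model: run the join wherever it moves fewer bytes.

    Shipping the local table up costs roughly local_rows * row_bytes;
    pulling the remote table down costs roughly remote_rows * row_bytes.
    A real optimizer would also weigh filters, projections, network speed, etc.
    """
    upload_cost = local_rows * row_bytes
    download_cost = remote_rows * row_bytes
    return "cloud" if upload_cost <= download_cost else "local"


# A small local table joined against a huge remote one: ship the small side up.
print(choose_join_site(local_rows=1_000, remote_rows=1_000_000_000))
```

The point of the sketch is just that the planner picks the side that minimizes data transfer, which is what makes local-remote joins practical over ordinary network links.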

I'm curious, and this is a selfish question: as you're doing that, and you have the local version and the remote version, what does the connection between the two look like? If they're widely separated, if MotherDuck's in the cloud and I'm out on a device that's not cloud-based, is that efficient communication? How do you all handle those different types of use cases?

Yes, so one of the core principles of this dual execution is to reduce the amount of data that has to be transferred as much as possible. One of the use cases, for example: I have a really large data set on S3, and I want to join it with a small table that I have on my notebook.

So in that case, there's a query optimization where we make the decision, instead of downloading the one-terabyte data set to the local device and doing the join there, to instead upload your small local file to the cloud worker and do the processing there. That saves a lot of bandwidth in that case. Same with filter pushdown: I query a large data set on S3, and only the part of the data the query actually needs has to be transferred. You can get something similar with DuckDB alone if the data is partitioned; DuckDB has clever ways to optimize remote file accesses even without MotherDuck. But the thing you get with MotherDuck is that it works even if your data is not partitioned, because the cloud worker still takes care of doing the bulk of the work and only gives you the result you actually want or need.
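The partitioned case can be pictured with hive-style partition paths, where a filter on the partition column lets the engine skip whole files before any bytes move. A toy stdlib-only sketch, with hypothetical paths and a hypothetical helper name:

```python
def prune_files(paths: list[str], wanted_year: int) -> list[str]:
    """Partition pruning: keep only files whose 'year=NNNN' path segment can
    satisfy the filter, so unmatched partitions are never downloaded at all."""
    kept = []
    for path in paths:
        # Extract the partition value encoded in the path, e.g. year=2023.
        year = int(path.split("year=")[1].split("/")[0])
        if year == wanted_year:
            kept.append(path)
    return kept


files = [
    "s3://bucket/events/year=2022/part-0.parquet",
    "s3://bucket/events/year=2023/part-0.parquet",
    "s3://bucket/events/year=2023/part-1.parquet",
]
print(prune_files(files, 2023))
```

This is the intuition behind "the transfer only happens for the data you need" when the layout is partitioned; MotherDuck's cloud worker extends the benefit to unpartitioned data by doing the heavy scanning server-side.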

A lot of what we've talked about are the features of DuckDB and what MotherDuck adds on top of that, and also how that intersects with AI workflows, like the text-to-SQL case or the RAG case, where we're doing vector or semantic search, or hybrid search. All of those things are super relevant to people building their AI workloads.
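As one concrete illustration of what "hybrid search" means here, the sketch below merges a keyword ranking with a vector-similarity ranking using reciprocal rank fusion. RRF is a common fusion technique, assumed for illustration; it is not necessarily the exact method DuckDB or MotherDuck uses:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists: each document earns 1/(k + rank)
    per list it appears in, and documents are returned by fused score."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. full-text search order
vector_hits = ["doc_b", "doc_d", "doc_a"]   # e.g. embedding-similarity order
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Documents that rank well in both lists (like `doc_b` here) float to the top, which is exactly the appeal of combining keyword and semantic signals.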

But I also find it interesting, Till, that you wrote one of the blog posts I'm looking at now. You're also thinking as a company about how to use AI intelligently in your own product, for users of your product who are maybe technical users building their own workflows, but where you also have AI integrated into some of the features. I'm looking at this FixIt feature. So I wonder if you could talk a little bit about that: how you're both enabling AI developers and also definitely integrating this technology yourselves. At least, that's how it seems.

Yeah. One of the big appeals of DuckDB is the simplicity; that's what brings a lot of users to DuckDB. I think that simplicity can be extended toward the usage of AI to a certain extent, the usage of AI in the context of data analytics and data management. And there are multiple aspects to that; on one end is the user experience side of things.

So how can we make it easier for people to write SQL? And I think the answer to that is not only text-to-SQL; part of the story is FixIt. One of our main aims with FixIt was to keep it basically non-intrusive, not interrupting your flow of writing SQL.

While still being helpful when it triggers. And I think Cursor, for example, is an excellent example of integrating AI into IDEs, into the workflow of software developers. In our case, we have to think more about data engineers and data analysts. And I think it's a super exciting time for those kinds of things.

I think MotherDuck is a particularly interesting place to work on those kinds of things, because one of the unique advantages that we have is an actual database running on the client side, in the browser of the user. When someone is using our web UI, there is actually a DuckDB running in the browser that can do parsing and binding. And that gives us so much information about the current state of the query that the user is writing, and FixIt only scratches the surface of what is possible in terms of SQL-writing assistance in that sense.

And so I'm curious, as we start winding up: you really got me thinking about use cases that I had not thought about before, and all the things I might be able to do here. So, a little bit like a kid in a candy store, I've got to ask you, and I'd like each of you to take a swing at it. It's pretty cool what you've talked about today in terms of what is possible for us.

How are you thinking about the future? What are the new cool things that you have in mind? I often say, when you're not actually working hard on your problem, when you're kind of chilling out at the end of the day and your mind is just wandering in free form, you're thinking, boy, what if we could do this?

I can imagine it, and I can kind of see a path forward to get there. How is each of you thinking about MotherDuck and DuckDB in terms of what the future might offer, if you want to get out there and wax poetic a little bit? It doesn't have to be grounded in current work, but more in imagination and aspiration.

One of the things that I do like about the current state of AI is how good the local models are, the small models that you can run locally, and there's a great ecosystem building on top of that. One of the things that I see with local models is that they hallucinate, but to prevent hallucination you can use a retrieval, or RAG, mechanism to put context into those local models.

And these local models could be on the edge as well; it could be on your local laptop, it could be on the edge. And knowledge bases are essentially created to prevent those kinds of hallucinations.

And one wasteful aspect of creating knowledge bases is that everybody is creating their own individual knowledge bases, right? What if there could be a mechanism where we could share these knowledge bases? A user could create a knowledge base, and they could share a knowledge base. One of the imaginative worlds that I dream of is how MotherDuck could be there to enable this kind of shareable knowledge base, where you essentially have a world of remote knowledge bases out there in your remote tables, and then you have a local DuckDB client that helps you pull the knowledge base that you want.

Use the local knowledge base to augment your local model with the relevant context for your current question, and then, when you don't want the knowledge base anymore, you could also drop it. It's like having a repository of remote knowledge bases and pulling whatever you want. This is just one of the dreams where I think about how MotherDuck and DuckDB could be useful.

Another aspect, talking about knowledge bases and RAG applications, is that not all applications and workflows require a real-time database to build agents on top of. Some of these agents could be running as background agents that do some workflow once every day. Instead of having a real-time database for that, what if you could provide a very lightweight analytical engine that's quite cheap to run locally as well, and that could also offload some work to a remote cloud? So this is another thing that keeps me excited at night, thinking about what these kinds of use cases could be. Those are two use cases that I'm quite excited about.

Yeah, maybe I can add two things. One thing actually connects to that, and that is bringing AI and machine learning capabilities more into the database. One of the things we've seen in the past is that the inference cost of language models has dropped quite significantly compared to two years ago.

It's now, I think, only two percent of the price for inference with GPT-4o mini compared to GPT-3. And that actually makes it possible to run language model inference on your tables, and also to do things like embedding computation on your tables. And SQL is just a really, really convenient user interface for that.

So we added this embedding function some time ago that works really well together with vector search, so you can basically do embedding-based search in SQL. We're adding the prompting capabilities, so you can do language-model-based data wrangling in your database. And together with local models and this hybrid execution model, you can say, okay, we do part of the work locally; maybe you have a GPU and do part of the embedding inference locally.
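The embedding-plus-vector-search combination boils down to a nearest-neighbor query over a table of vectors. Here's a stdlib-only sketch of what such a SQL query computes under the hood, with toy two-dimensional vectors and cosine similarity assumed as the distance metric:

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def vector_search(query: list[float], table: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Rank rows of (id, embedding) by similarity to the query embedding,
    the same operation an embedding function plus vector search expresses in SQL."""
    ranked = sorted(table, key=lambda row: cosine_similarity(query, row[1]), reverse=True)
    return [row_id for row_id, _ in ranked[:k]]


docs = [("ducks", [1.0, 0.0]), ("geese", [0.8, 0.6]), ("trucks", [0.0, 1.0])]
print(vector_search([1.0, 0.1], docs))
```

Real systems index the vectors instead of scanning them, but the ranking semantics are the same, which is why expressing it in SQL alongside joins and filters is so convenient.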

If you want to do it faster, do it in the cloud. And again, everything is seamless, yeah.

Well, thank you both for taking time out of your analytics, AI, and database work to come talk to us. This has been super amazing. And I would definitely encourage people out there:

Please, please, please go try out some things; try out some examples with DuckDB, check out the MotherDuck website and some of the great blog post content that they have, their examples and the things that they're doing. Check it out, because it's definitely a really wonderful thing that you can add into your AI stack and think about and experiment with. So thank you so much, Till and Adithya, for joining.

It's been a pleasure.

Thank you, guys. It was pretty awesome to be here.

Alright, that is Practical AI for this week. Subscribe now if you haven't already, and head to practicalai.fm for all the ways to connect. Join our free Slack team, where you can hang out with Daniel, Chris, and the entire Changelog community. Sign up today at practicalai.fm/community.

Thanks again to our partners at Fly.io, to our beat freak in residence, Breakmaster Cylinder, and to you for listening. We appreciate you spending time with us. That's all for now. See you next time.