Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to all. If you like this show, you will love The Changelog: news on Mondays, deep technical interviews on Wednesdays, and on Fridays an awesome talk show for your weekend enjoyment. Find us by searching for The Changelog wherever you get your podcasts. Thanks to our partners at Fly.io. Launch your AI apps in five minutes or less. Learn how at fly.io.
What's up, friends? I'm here with Kurt Mackey, cofounder and CEO of Fly. As you know, we love Fly.
Fly is the home of changelog.com. Kurt, I want to know how you explain Fly to developers. So, your story first. How do you do that?
I change how I explain it based on, almost, the generation of developer I'm talking to. So for me, I built and shipped apps on Heroku, which you may have never used. Heroku is roughly like building and shipping an app on Vercel today.
It's 2024 instead of 2008 or whatever. And what frustrated me about doing that was that I got stuck. You can build and ship a Rails app with a Postgres on Heroku the same way you can build and ship a Next.js app on Vercel.
But as soon as you want to do something interesting... at the time, I think one of the things I ran into was that I wanted to add what used to be kind of the basis for Elasticsearch. I wanted to do full-text search in my applications. You kind of hit this wall where you can't really do that.
I think lately we've seen it with people wanting to add all of this kind of inference stuff to their applications on Vercel or Heroku. They have, however, these days started releasing abstractions that are sort of, "you do this." But I can't just run the model
I run locally on these black-box platforms that are very specialized. For people my age, it's always like, this is great, but I grew out of it. And one of the things that I felt like I should really be able to do when I was using Heroku was run my app close to people in Tokyo, for users that were in Tokyo.
And that was never possible. For the modern generation of devs, it's a lot more Vercel-based. It's like, Vercel is great, right? Up until you hit one of their hard, blunt boundaries, and then you're kind of stuck. There's one... we had someone at the company, I can't remember the name of the game, but the tagline was like "five minutes to start, forever to master." This is sort of how I pitch Fly: you can get an app going in five minutes, but there is so much depth to the platform that you're never going to
run out of things you can do with it. So unlike AWS or Heroku or Vercel, which are all great platforms, the cool thing we love here at Changelog most about Fly is that no matter what we want to do on the platform, we have primitives, we have abilities, and we as developers can chart our own path on Fly. It is a no-limits platform built for developers, and we think you should try it out. Go to fly.io to learn more. Launch an app in five minutes. Too easy. Once again, fly.io.
Welcome to another episode of the Practical AI podcast. In these Fully Connected episodes of the show, Chris and I keep you fully connected with everything that's happening in the world of AI. We'll discuss some of the latest trends and share some learning resources for you to level up your machine learning and AI game. I'm Daniel Whitenack.
I am CEO at Prediction Guard, where we're creating a private, secure AI platform. And I'm joined as always by my cohost Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing,
Chris? Doing very well, Daniel. I know you're out traveling, and ironically, I think I'll be where you are next week, but I think you'll be gone
by then. We'll be swapping places.
There you go. We're trading geographies here.
Yeah, yeah. I don't know why November always seems to be a heavy conference, event, and summit month for me. I don't know exactly why that is. It's sort of the last little bit before the end of the year.
Maybe it's to make you earn that vegan turkey that you're
going to enjoy. Exactly, exactly. Yeah, I've got it all picked out, so we're ready for tofurkey on Thanksgiving.
For sure.
Excellent. And speaking of other things to celebrate, I wanted to mention, before we hop into other discussions today, that our good friends over at the MLOps Community, so Demetrios and his crew, have run a series of virtual conferences, one about LLMs in production and another about data engineering and AI, and their latest in the series is called Agents in Production, which sounds very exciting as you are listening to this episode.
If you are listening right when it goes live, you can still probably catch the event live, but you can also catch the content afterwards, I'm sure, as it's recorded. It looks like an amazing conference talking about AI agents moving from R&D to reality. Are you ready? Question mark.
So go check that out. Their events are always great. I wanted to mention that up front at the beginning of the show. Because, Chris, do you have any active AI agents in your life? You know what? I actually
don't right now, but I probably should. I feel bad that I can't say yes to you on that. But I've also read recently that the uptake on agents has been a lot slower
than was expected. You know, it was kind of one of those hype things, and I think it's pretty hard. So maybe something to discuss.
Yeah, maybe along the same lines, I see a lot of news articles, especially related to... I think over the past couple of weeks, or whenever it was, people realized that OpenAI wasn't going to release GPT-5. I don't know if it was expected this year, but anyway, not release it on the timeline that people thought.
And also some indications that maybe that next jump in the functionality of these AI models is proving more difficult than was originally thought. One question I had, related to that, Chris, was: let's say we never get GPT-5. We're just stuck with all the models that we have now, so no more models are made in the world. What do you think the value of AI would be, and its integration across the enterprise and business and our personal lives? Do you think it would still have the highly transformative effect that people are talking about?
And we've talked a little bit about that in terms of hype cycles and stuff on previous episodes. If you posit that we're hitting a ceiling right there, I don't think it's that there are no more models. I think what happens is that there's more open source that comes along and, you know, kind of at least catches up to where some of the leading ones are out there.
And then the value of a commercial model is less, because you have more open source options that are out there. And we're seeing that in industry anyway. I know not everybody wants to pipe their data out to OpenAI or, you know, some of the other organizations doing the same thing. And so I think the availability of open models is going to happen regardless, in terms of the uptake on that, and I think that would just kind of force it to happen sooner rather than later. If the leading models are no longer... you know, if there are no new ones coming out that are better and better to chase, the others catch up, and that commoditizes the whole space even faster.
Yeah, I think one of the things I was wondering was, let's say that, regardless of whether it's an open model or closed model... so if I told you, Chris, you're an AI engineer actively integrating AI functionality, and I told you the best model you're ever going to get on the open side, if you're using open models, is maybe Llama 3.1 or whatever, and on the closed side maybe it's the latest Claude or GPT-4 or whatever that might be. If that were the case, would it be like, "I don't think we're going to be able to do all of what we had hoped to do with AI"? Or do you think it's more of a... yeah, what is the level of transformation that you think we could still get with the current generation of models?
So it's kind of funny. And I know we've talked about this, and some of our listeners will remember some of the previous conversations, but there's a lot more to AI than just the GenAI models. They have gotten all the spotlight the last couple of years, but there's a lot you can do.
And honestly, without going into detail, of the things I think about every day, GenAI is not the center of it. It's not the stuff in AI that I care the most about, and it's not what's making me most productive. Are there many things you can do with GenAI to be productive? Sure.
And we're still learning how to do that, and I think that's harder than people realized. And I think that's one of the reasons it's plunging down into the trough of disillusionment in the hype cycle, as people are frustrated.
But we will have lots of GenAI things. I think it reminds us, as we've said recently, to look at the larger landscape of AI capabilities out there. Other things that we used to be excited about are incredibly productive these days, and yet we're not talking a lot about them. Deep reinforcement learning remains amazing in what you can do, and if you combine that with robotics and other areas, there's lots of really productive work being done out there, but it's not getting much media attention.
Where my mind goes is that the current models that are available, if you think about a general-purpose reasoning-type AI model, are good enough, in my opinion, to do most tasks at the sort of orchestration layer. And what I mean by that is, let's say that you wanted to do time series forecasting. Whatever model you look at on the GenAI side, it's not the best time series forecaster.
However, there are really good tools for that which already exist, you know, Facebook Prophet or something like that, and you can use the GenAI model as almost the front end to tools like that, right? You can say, "I want to know what my revenue is going to be in six months," do a forecast or something like that, and use the model to extract the data that's needed to make that forecast, and maybe call a tool like Prophet or something to actually do the forecast and get something back.
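To make that orchestration pattern concrete, here is a minimal sketch in Python. It assumes an OpenAI-compatible chat client, a hypothetical monthly_revenue.csv file with Prophet's expected ds/y columns, and illustrative prompts and function names; none of these details come from the episode itself.

```python
import json

import pandas as pd
from openai import OpenAI
from prophet import Prophet

client = OpenAI()  # assumes an API key is configured; any OpenAI-compatible endpoint works

def extract_forecast_request(user_message: str) -> dict:
    """Use the LLM only to turn a natural-language ask into structured tool arguments."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Extract forecasting parameters as JSON with keys "
                        "'metric' (string) and 'horizon_months' (integer). "
                        "Respond with JSON only."},
            {"role": "user", "content": user_message},
        ],
    )
    return json.loads(resp.choices[0].message.content)

def run_forecast(history: pd.DataFrame, horizon_months: int) -> pd.DataFrame:
    """The purpose-built tool: Prophet expects columns 'ds' (dates) and 'y' (values)."""
    model = Prophet()
    model.fit(history)
    future = model.make_future_dataframe(periods=horizon_months, freq="MS")
    return model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]

# Orchestration: the LLM parses intent, classic tooling does the math.
args = extract_forecast_request("What will my revenue look like in six months?")
history = pd.read_csv("monthly_revenue.csv")  # hypothetical file with ds/y columns
forecast = run_forecast(history, args["horizon_months"])
print(forecast.tail(args["horizon_months"]))
```

The point of the sketch is the division of labor: the GenAI model only translates a request into structured arguments, and the purpose-built statistical tool does the actual forecasting.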
So I think even if we got stuck with the models as they are out there, to your point, there is a variety of purpose-built tools and non-GenAI tools out there, whether they be rules-based tools, or machine learning models, or statistical models, or whatever, that can do a variety of the really important tasks that we want to do. And the GenAI models that we have now could serve as a way to orchestrate between those tasks and to create some really appealing workflows and automation and flexible interfaces and all of this stuff. So where my mind was going with that is, I'm not that concerned if it takes a while for GPT-5, or whatever, in my opinion.
When I think about the rest of the time I'll have to be a developer and AI practitioner, for the rest of my career, I could keep myself busy, no problem, with all the things I have access to, and create some really interesting products and tools and features. I think
that's a great insight, right? If you look at all of the different jobs and workflows people have out there in their careers, and you think of all these tools that we currently have today, like, without having to go forward, I would argue that it would be a very practical next step, you know, for practical AI, to be able to start assessing your workflows, assessing where these different tools and the models we have can make a difference, and doing some process engineering to figure out how you can do that.
I think that the vast majority of organizations out there have not done that sufficiently. They might have done that with a workflow or two, but they haven't gone through them all, especially large corporations, which have thousands and thousands of them. There are so many places where productivity can be enhanced by finding the places where people are struggling through their own workflows and where those align well with the capabilities in these models. That might be a place they want to invest a little bit and get a great long-term benefit out of it. But the next model coming, whatever model we're talking about, whatever line of models, family of models, always gets the attention, rather than the kind of grunt work of going through your processes and finding where you can save a whole bunch of effort, a whole bunch of time, in a matter of moments, you know, to increase
productivity. Yeah, I think there is an interesting parallel there. I know that people have drawn comparisons between this wave of AI technology and the onset of the internet and the web, that sort of thing. I think you could see that with many, many of the most impactful web-based technologies that have shaped culture and shaped our lives.
The building-block components of those were around from the very early days of the web. And there were sort of generational jumps, right? Like, the advent of streaming, and all of what we now consume via streaming, would not have been possible over certain types of internet connection technology, right? So there are certainly generational shifts, but the building blocks
were there from the start. And probably some of the people in those early days working with those building blocks could not have imagined the transformative and kind of culture-defining effects of those basic building blocks. And so I think we're in a similar scenario, where the building blocks of what we have with AI, whether that be GenAI or non-GenAI, are enough to... I don't think it would be too far to say, transform certain elements of our culture.
That sounds sort of grandiose, but I think that's sort of what's coming. And there will likely be those kinds of generational jumps, whether that be GPT-5 or another model family or whatever. There will likely be generational jumps that we also don't anticipate yet. But the tooling we already have, the building blocks we already have, are enough to create transformative technologies and products and systems.
I would agree. I think maybe there's another show where we talk about what we think the transformation of society and culture is in the future. I don't think that's the show right now, but I would agree with you: with what we have today, we can go a long way, to your point. I was in college when the web came into being, so I do remember exactly, you know, those very early building blocks, and trying to imagine... and this is not the same world that we live in today that it was then.
Okay.
Friends, I'm here with a new friend of ours over at Timescale, Avthar. So Avthar, help me understand: what exactly is Timescale? So Timescale
is a Postgres company. We build tools in the cloud and in the open source ecosystem that allow developers to do more with Postgres, using it for things like time series and analytics, and more recently, AI applications like RAG and search and agents. Okay.
If our listeners were trying to get started with Postgres, Timescale, and AI application development, what would you tell them? What's a good
on-ramp? If you're a developer out there, you're either getting tapped for building an AI application, or you're interested because you're seeing all the innovation going on in the space and want to get involved yourself. And the good news is that any developer today can become an AI engineer using tools that they already know and love.
And so the work that we've been doing at Timescale with the pgai project is allowing developers to build AI applications with the tools and with the database that they already know, that being Postgres. What this means is that you can actually level up your career, you can build new, interesting projects, you can add more skills, without learning a whole new set of technologies.
And the best part is it's all open source. pgai and pgvectorscale are open source. You can go and spin it up on your local machine with Docker, follow one of the tutorials on the Timescale blog, and build these cutting-edge applications, like RAG and search, without having to learn ten different new technologies, just using Postgres and the single query language that you probably already know and are familiar with. So yeah, to get started today, it's the pgai project. Just go to any of the Timescale GitHub repos, either the pgai one or the pgvectorscale one, and follow one of the tutorials to get started with becoming an AI engineer just using Postgres.
Okay, just use Postgres. Just use Postgres to get started with AI development, build RAG, search, and AI agents, and it's all open source. Go to timescale.com/ai, play with pgai, play with pgvectorscale, all locally on your desktop. It's open source. Once again, timescale.com/ai.
I've been out teaching workshops again. I'll be at QCon SF next week, so those of you that are around QCon, I look forward to seeing you. It looks like a good event. But some of what's come up in the workshops for me recently is how to think about your AI workflows going from prototype to some level of production.
And I think these are things that we've talked about on the show before, prior to GenAI, in terms of how you ought to be testing and monitoring and thinking about deployment of AI-based workloads. But I'm guessing there are a lot of people maybe joining the show from different backgrounds after this GenAI phase. And I connect this to the sentiment that people often, with this technology, are able to get to a point really quickly where they see an amazing workflow kind of come into shape, right? And it works amazingly, like, some of the time, like half the time, and then the other half of the time it fails. But they get sort of a taste of the goodness, and they maybe don't know how to get the rest of the way.
And I thought of an interesting parallel, because some people are using these kind of low-code, no-code AI workflow builder tools, right? Whether that be something like Flowise or Gumloop or Dify, which are, you know, these little interfaces where you can wire things together, or maybe it's tools like Alteryx or something like that, that may be a little bit more enterprise-focused. But they build out this workflow, and it sort of does this thing, and it works like half of the time and not the other half of the time. And maybe they're using AI calls as part of that. And it struck me that this is sort of like... I don't know if you remember, Chris, back in the day we had a phase of our AI podcast life where we were really trying to convince people that they shouldn't run notebooks in production. Do you remember those days of data science?
Yeah, I do. That's a little ways back. Yes, I do indeed.
So for those that aren't familiar, when I say notebook, I'm referring to something like a Jupyter notebook. This is an interactive web-based code editor. And if you imagine, maybe some of you that have used Mathematica in the past, it's similar. You go to the screen, there is a cell there, you can put code in it and execute that cell, you can take notes, you can execute another cell of code, and all that state is saved.
So you can execute cell one, and then go down to cell five and execute cell five, and then go up to cell three and execute cell three, and then go down to cell seven and execute cell seven. And what happens is... so if we just rewind our minds back to the olden days, I'm a data scientist building a model or creating a workflow, and I'm doing this in a notebook. That sort of workflow generally is good for experimentation and produces really, really terrible code, just by its nature. So it's really good for experimenters.
But I find myself hopping around all the time. I don't really understand what the state in the background is. I'm hopping around between cells. I could give you the same notebook, and you could never reproduce what I did, even though it's the same exact code. You could never really reproduce my exact sort of steps.
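A toy illustration of that hidden-state problem (not anything from the episode) might look like this; the comments give the order the cells were actually run in, which is the order this file reads top to bottom.

```python
# Toy illustration of notebook hidden state. Cells shown in the order they were RUN.

rate = 0.10                  # Cell 1: run first
price = 100 * (1 + rate)     # Cell 3: run second, so price becomes 110.0
rate = 0.25                  # Cell 2: run last, silently changing the state

print(price)                 # still 110.0, because Cell 3 was never re-run

# Read as a notebook in cell order (1, 2, 3) and executed top to bottom,
# the "same exact code" gives price = 125.0 -- a different result than the
# author saw, which is exactly the reproducibility trap being described.
```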
And I was just struck by the fact that it's almost like we forgot that that doesn't work that well, and now we're just doing it, not in notebooks, but in these low-code, no-code tools, or with agents that jump around between various tasks, right? And part of this is the same reason why notebooks are really terrible at
producing good, reliable code. It is the same reason, I think, why people are taking AI workflows from tools that they're using and aren't able to make them robust and reliable. So that's my hot take for the day. What's your thought, Chris?
No, I think that's great. First of all, I've got to say, boy, it's already making me feel aged again, in a different way, the fact that it wasn't that long ago that it was all the hotness, Jupyter notebooks, you know, that we were talking about, and that was the cool thing.
Yeah, and even like products for managing all your notebooks and such.
It feels like you just gave a eulogy, you know, for Jupyter notebooks, to some degree. And so it's a reminder that things are changing constantly.
So you bring up a great point, that we're taking some of the same challenges that we had in that environment, and we're just recreating them in the newer tools that are out there. There was another company I worked at, I won't name the company, before I was at Lockheed. And at that company, I remember thinking, there were people at the time that knew the AI modeling bit, and there were people that knew the software bit, but they never seemed to cross over.
And when you raise the point about your kind of in-process development workflow, and then how do you actually get that to some level of production, I think there are a lot of people out there that are not going to know that, unless times have really changed in that area. And my gut says they probably haven't.
People tend to focus on the thing that they want to do. What is the right development workflow, and how do you start getting to that production environment? I know you've gotten tons and tons of experience at that in recent years. How do you think about it? Can you frame it a little bit before
you dive in and do it? Yeah. Well, I was thinking about it in light of this parallel to what we went through in the data science world with notebooks, and these kind of ad hoc workflows that execute some of the time and not other times, depending on how you execute them.
And in reality, the answer to running that code in production is not File, you know, Download as Python script, because that just will never work, because the state and the workflow are not preserved. How that actually gets productionized, or would be productionized in the past, is taking the logical steps that are being executed in that workflow, taking those out of the notebook and embedding them in actual code, in this case, Python code, in functions or classes, and attaching tests to those functions or classes, just like a software engineer would do. Because this is software engineering.
And then figuring out, again, doing the testing on the front end of that, whether that's a UI system or an API or whatever it is, to make sure that the behavior that you were testing in your notebook actually works. And that kind of sucks, because it's a reimplementation, right? To some degree. Maybe you don't have to throw everything out; you got something working in your notebook and you can bring that through and have it work. But it does take actual work to go from that notebook state to the production code.
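A minimal sketch of what that lift-and-test step might look like, with hypothetical function names standing in for whatever logic actually lived in the notebook cells:

```python
# summarizer.py -- hypothetical steps lifted out of a prototype notebook into
# plain, testable functions (names and behavior are illustrative only).

def normalize_whitespace(text: str) -> str:
    """One small, deterministic step that used to live in a notebook cell."""
    return " ".join(text.split())

def truncate_summary(text: str, max_words: int = 25) -> str:
    """Another step: enforce the rough one-sentence length limit explicitly."""
    return " ".join(text.split()[:max_words])

# test_summarizer.py -- the tests a software engineer would attach to each step
# (shown in the same file here for brevity; run with pytest).

def test_normalize_whitespace_collapses_runs():
    assert normalize_whitespace("too   many\n spaces") == "too many spaces"

def test_truncate_summary_respects_limit():
    assert truncate_summary("one two three four five", max_words=3) == "one two three"
```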
And so I think if you just look at one of these tools, and I do think some of these tools, the kind of low-code, no-code assistant builders, workflow builders with AI stuff, are useful, and can be useful if you're needing to build out nice workflows for your personal life, right? Like email assistant stuff, or automation to turn news articles into podcasts, or whatever the thing is that you want to do. But ultimately, these tools have their own opinionated way of tracing and testing and debugging, right? The same way that, like, debugging a Jupyter notebook... you have slightly different tooling, you have a slightly different workflow than if you are debugging regular code.
And so I think part of the answer, unfortunately, and I guess this is my hot take, if there is one, is that we're going to see a similar dynamic in the AI engineering world with, like, a business person, except I think the roles are different here. In a similar way to before, where the data scientist would build a workflow in a Jupyter notebook and maybe a software engineer would integrate that into actual code that's tested and has some form that resembles actual code, the data scientist was probably interacting with the business person. In this case it's slightly different role-wise, because the data scientist almost isn't there.
But the business person might go into a tool like Gumloop or Flowise or Dify or whatever and build out a tool that takes market analysis things and generates, I don't know, articles or summaries that go into some emails that are sent out to the company, or whatever workflow they created, right? And that has a series of steps. And they're like, yeah, this works... but it kind of does work and it kind of doesn't work, because they haven't thought about all the edge cases.
And it's hard to debug, and it's hard to know when it's down. And so I think now it's like that business person bringing that workflow... if it really, truly needs to be scaled across an organization or released as a product of its own, you just sort of have to take those steps out and actually put them in functions, put them in classes in your code, that can be tested. And we can talk about the technology of testing here in a second.
So I think the low-code, no-code things are cool and awesome and have their place, just like notebooks have their place. And I still use, primarily, Google Colab, not Jupyter locally, but Google Colab notebooks. I still use notebooks.
I just realize their limitations, right? Maybe sometimes better than other times, but I realize their limitations, and then I eventually write software. So I think it's a similar thing with these tools that are up and coming: they're great, and they allow for quick prototyping, and for business people to get their workflows and their ideas into a workflow that operates. But ultimately, this has to become software if the intention is to make it a feature that you release, or something that scales across your organization, or something like that.
Let me ask you, I'm just curious: when you're in Colab and you're doing that, and you decide it's time to write software, how do you make your own transition? What do you do, having done this for so long? What does your transition look like? Are you staying in Python? Are you converting some of that over into Go or something else? Or how do you think about it?
It depends, of course, case by case, but generally I would say, if I'm doing it in Colab, that probably means that I am doing something that requires Python, so that, you know, I'm doing something in LangChain or PyTorch, whatever the thing is. Then I think when I'm ready, I essentially have two ideas in my head of where that's going to live.
Either it's going to live in a REST API, because you're going to make this functionality available to the rest of your software, so it's going to live in a REST API; or it's going to be integrated into some software you're already supporting, right, so that already has a code base; or it's going to be run as a script, kind of an offline script that runs on a certain cadence or something like that.
And so if it's the API scenario, you know, I have a bunch of code that I've written over time with FastAPI. I can just copy one of those projects, rip out the stuff that is irrelevant, and put in the stuff that's relevant, kind of copying it over from the notebook. But if it's more of the native software integration, I think that really depends on what kind of application it is, what the architecture is, that sort of thing. And so that might involve a change of language, or a change of the type of infrastructure that you're using, or the type of database you're connecting to, or whatever that might be.
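As a rough sketch of that FastAPI route, with a placeholder summarize_text() standing in for whatever model or tool call actually came out of the notebook; the endpoint name and request shapes are illustrative assumptions:

```python
# main.py -- minimal sketch of exposing a notebook-proven step behind a REST API.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="summarizer-api")

class SummarizeRequest(BaseModel):
    text: str

class SummarizeResponse(BaseModel):
    summary: str

def summarize_text(text: str) -> str:
    # Placeholder logic; in practice this wraps the call validated in the
    # notebook, kept as a plain function so it can be unit-tested on its own.
    return " ".join(text.split()[:25])

@app.post("/summarize", response_model=SummarizeResponse)
def summarize(req: SummarizeRequest) -> SummarizeResponse:
    return SummarizeResponse(summary=summarize_text(req.text))

# Run locally with:  uvicorn main:app --reload
```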
And that's an area that I'm keenly interested in, because, you know, historically we've been developing these models in Python and then deploying them in Python, mainly because that's where the tools and stuff still are.
But there are a number of cases, especially as we go forward and we're looking at autonomy and we're looking at robotics and things like that, and in many other cases, where Python is not the best language for the platform that you're deploying to. And so you have this incongruity between the development environment and a distinctly different production or deployment environment that you're trying to target. And so, you know, in my world, there are many things that might start in Python that probably should end in something like Rust, given what we're trying to accomplish. So I think that still remains a very immature deployment arena to work in. And I'm rather hoping that in the years to come, maybe we see more tools from tool providers, open source, in that arena, that can actually cross over from one language to another, to make sure that it's always the right one for what you're dealing with.
Friends, I love my Eight Sleep. Check it out at eightsleep.com. I've never slept better. And you know, I love biohacking.
I love sleep science. And this is all about sleep science mixed with AI to keep you at your best while you sleep.
This technology is pushing the boundaries of what's possible in our bedrooms. Let me tell you about Eight Sleep and their cutting-edge Pod 4 Ultra. What exactly is the Pod? Mainly, a high-tech mattress cover that you can easily add to any bed.
But this isn't just any cover. It's packed with sensors, heating and cooling elements, and it's all controlled by sophisticated AI algorithms. It's like having a sleep lab, a smart thermostat, and a personal sleep coach all rolled into one single device.
And the Pod uses a network of sensors to track a wide array of biometrics while you sleep. It tracks sleep stages, heart rate variability, respiratory rate, temperature, and more. And the really cool part is it does all this without you having to wear any devices. The accuracy of this thing rivals what you would get in a professional sleep lab.
Now let me tell you about my personal favorite thing: Autopilot. Every day, my Eight Sleep tells me what Autopilot did for me to help me sleep better at night. It said, last night Autopilot adjusted to boost your REM sleep by sixty-two percent. Sixty-two percent! That means it updated and changed my temperature from cool to warm and helped me fine-tune exactly where I
want to be, with precision temperature control, to get maximum REM sleep. And sleep is the most important function we do every single day. As you can tell, I'm a massive fan of my Eight Sleep, and I think you should get one. Go to eightsleep.com/changelog, and right now they have an awesome deal for Black Friday going, from November 11th through December 14th. Use the code CHANGELOG and it will give you up to six hundred dollars off the Pod 4 Ultra when you bundle it. Again, the code to use is CHANGELOG during that window. Once again, that's eightsleep.com/changelog. I like it, you'll love it. I sleep on this thing every night and I absolutely love it. It's a game changer, and it's going to change your game. Once again, eightsleep.com/changelog.
Well, Chris, I kind of started talking about the testing and integration of some of these workflows and how I see that playing out. From my experience and talking with people, there is some general confusion around how to do it. So let's assume that you're convinced: "I want to rip out these various pieces of a workflow that have maybe been prototyped in a low-code, no-code tool, and I want to put them into some software."
It's an API or a UI or a script or a data pipeline, whatever that is. Let's assume that you're convinced of that. Then the question comes: well, okay, now I have this functionality, I have this class in code that executes some sort of AI call, a call to an AI model. How do I test that? And what sort of considerations might need to be in place around that? And I often find that this sort of breaks people's minds.
And this is also something that I dealt with for a long time in the data science world. I guess, overall, it's very interesting to me that the same types of things are popping up, but with a new audience. So I don't know if you remember, back in the days of data science, when there were data scientists, they would create a model, and that model has a certain level of performance, right? Like ninety percent accuracy or something. So it's going to be wrong some of the time.
So you put that model, let's say it's a fraud detection model, fraud or not fraud, right? You put that model into production, you integrate it into a software function. And now the question comes: well, how do you test that model? Because it's not always going to give the same response, and it's not always going to be right. I don't know if you remember these discussions happening a lot in the data science world.
Yeah, I think you just gave a eulogy for data science as well.
Oh my goodness. This one always really intrigued me, because my background is in physics. And so if we said, "oh, this model is not deterministic, so we can't test it..."
If we took that approach in physics, we basically wouldn't have any of the technology that we have today, because it's all based on quantum mechanics, and everything is a probability distribution. So there is a way to test things that behave non-deterministically, like AI models. And maybe people just need a bit of a reminder about that. I often kind of break this down into a few categories. But yeah, I don't know if you come across people with this sort of mindset, especially integrating LLM elements or something like this, all the time.
I think there are a lot of people out there... I think everyone is still kind of figuring that out, quite honestly. If they're not in the business that you're in, where you're dealing with that constantly, I think that's one of the big unknowns with this focus in general: how do I go about testing this? I want to get it in the workflow. I don't think people know.
At least... you know, you can only do so much in a forty-five-minute podcast, but at least to sketch out maybe a good framework for people to think about: number one, I think you should have tests in your code for each step of the process, right? So if you have an LLM-based workflow, and the first step of that is translating something into Spanish, and the next step is summarizing it in one sentence, and the next step is embedding that one sentence in a template, and then the next thing is generating an image, or whatever kind of string of things you have going on, you should have tests for each of those kinds of subtasks in the chain of processing. Which, in particular, also gets to why testing agents is hard, which I think is an interesting thing to maybe circle back to at the end.
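A minimal sketch of that chain, with each subtask as its own function so it can be tested in isolation; the llm() helper is a placeholder for whichever chat-completion call you actually use, and none of these function names come from a real library:

```python
# Hypothetical chain from the example above, one function per subtask.
def llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for whichever chat-completion call you use")

def translate_to_spanish(text: str) -> str:
    return llm(f"Translate the following text to Spanish:\n{text}")

def summarize_one_sentence(text: str) -> str:
    return llm(f"Summarize the following in exactly one sentence:\n{text}")

def fill_template(sentence: str) -> str:
    # A deterministic step: no model call, so it gets ordinary unit tests.
    return f"Resumen del día: {sentence}"

def build_image_prompt(sentence: str) -> str:
    return llm(f"Write a short image-generation prompt for:\n{sentence}")

def pipeline(text: str) -> dict:
    # The orchestration is plain code, so every intermediate value is observable
    # and each step can get its own test table.
    spanish = translate_to_spanish(text)
    sentence = summarize_one_sentence(spanish)
    return {"body": fill_template(sentence), "image_prompt": build_image_prompt(sentence)}
```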
I think that's just good software engineering, what you're describing. And if you took it out of the AI world and talked about software functions, each one is a discrete function that does something, and you want to test it on its own, even though they're all connected together to do something. I think that's just really sensible.
Yeah, and the agents stuff makes this maybe a little bit more difficult, which we can come back to. But let's assume that you have a workflow that you just want to execute over and over again, which is probably most enterprise use cases. So you split that up into subtasks, right? Subtasks that you can test.
The next thing I would recommend is to have people think about creating a set of tests in three categories. And this comes kind of from the ideas of behavioral testing. And, just to take the fraud detection piece for a second...
Say you're asking fraud or not fraud, right? So the first category of tests you want to think about is minimum functionality tests, which would be: this, you know, is the most fraud-looking thing I can think of. It should always be... like the most fraudulent transaction you can imagine,
uh, you know, a Nigerian prince, whatever, take your pick, right? It should always be labeled fraud, right? A hundred percent of the time. That is minimum functionality. These are not the most in-depth tests, but they should pass one hundred percent of the time, no matter what you do to the model, no matter what you do to your system. These are a one hundred percent pass.
And you can do the same thing with LLMs, you know. Even though it's not a classifier, you can say, I'm creating a bot that gives all the information about Prediction Guard. If I ask, who is the CEO of Prediction Guard, it should always return the same name, right?
That's minimum functionality. That's a pretty easy question; it should be embedded in the knowledge. These are things where one hundred percent of the time the answer should be returned, and you can test for that deterministically, right? Like, does that name appear in the response? That sort of thing. The second category would be invariance
tests. The terms might be called various things in various publications, but basically, invariance means changes in the input that should not produce changes in the output. So the classic example of this is, if I ask an LLM to do a sentiment analysis of a statement, and the statement is, "I love the United States, it is so amazing, it is so great," I get positive sentiment returned. If I change the United States to Turkey, and I say, "Turkey is so great,
it is amazing, it is wonderful," right? In theory, regardless of what you personally think about the United States or Turkey, that should always return positive sentiment, right? That is invariance. You can make changes in the formatting, you can make changes in the ordering of things, and all of these changes should leave the output invariant.
And then, of course, the final one would be the necessarily variant changes, meaning a change in the input should definitely produce a change in the output, right? Like, if I change "I love the United States" to "I do not love the United States,"
I should actually see a change in the output. And that's a very easy thing to see. And so what you do is you create a table of minimum functionality tests, a table of invariance tests, and a table of variant tests.
And if you have those full tables, you can basically probe the behavior of the model and the sensitivity of your model to changes. And the sensitivity is really the thing that people get hung up on with these workflows; they don't realize how sensitive the models are to small changes in the input. And so this allows you to gauge the sensitivity of your system, to put a real number on it from passing these tests, and then work systematically to improve that. Interesting.
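A condensed sketch of those three test tables in pytest style; this mirrors the behavioral-testing vocabulary popularized by the CheckList paper (Ribeiro et al.), and classify_sentiment() is a placeholder for the workflow step under test rather than any real API:

```python
# Sketch of the three behavioral test tables. Example rows echo the conversation.
def classify_sentiment(text: str) -> str:
    raise NotImplementedError("stand-in for the LLM or classifier call")

# 1. Minimum functionality: must pass 100% of the time, no matter what changes.
MFT_TABLE = [
    ("I love the United States. It is so amazing, it is so great.", "positive"),
    ("This is a scam. Send money to the prince immediately.", "negative"),
]

# 2. Invariance: a label-irrelevant edit should NOT change the output.
INVARIANCE_TABLE = [
    ("I love the United States. It is amazing. It is wonderful.",
     "I love Turkey. It is amazing. It is wonderful."),
]

# 3. Necessary variance: this edit SHOULD change the output.
VARIANCE_TABLE = [
    ("I love the United States.", "I do not love the United States."),
]

def test_minimum_functionality():
    for text, expected in MFT_TABLE:
        assert classify_sentiment(text) == expected

def test_invariance():
    for a, b in INVARIANCE_TABLE:
        assert classify_sentiment(a) == classify_sentiment(b)

def test_necessary_variance():
    for a, b in VARIANCE_TABLE:
        assert classify_sentiment(a) != classify_sentiment(b)
```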
So I'm trying to just kind of put all that together for my own learning purposes, and I'm trying to think how we can apply that to a workflow directly. Like, how do you actually fit that into the nuts and bolts of moving into production? Where do you do
that in your workflow? Yeah. So I would take the steps... let's say I have a workflow with five steps. I take each of those steps, and I produce, you know, five functions or five classes, or however that fits into your code. And for function one, corresponding to step one, I create each of those test tables, meaning just input and expected output, the same as if you were testing an API. Like, if I give this input to the API, I should definitely get this back.
And so you create that table and a set of unit tests, or whatever testing framework you use, to go over each one of those examples in your table and check the output, to make sure it corresponds with what you expect, with either a passing or not passing score. So the steps would be: I have my five steps of my workflow, I split each of those up into a function or class, or whatever the programming object is, and then I develop these sorts of tests for each of those functions or classes. Now, one question that might come up here is, well, should my model, or should that function, always pass all of those tests? And what I tell people is, it should always pass one hundred percent of the minimum functionality tests, because you've defined those from the start as minimum functionality.
So if you don't have minimum functionality, then your software shouldn't be released, right? And then for the other ones, basically, I would say you should never be regressing on those. They give you a sense of what the sensitivity of your model is to those variations.
And you would not want to regress. You would want to systematically make those better. So some people might treat those as a certain percentage threshold that you need to stay above, or something like that.
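A rough sketch of that pass/regression policy; the step under test, the tables, and the stored baseline score are all illustrative placeholders, not from the episode:

```python
# Sketch of the "100% on minimum functionality, never regress on the rest" policy.
def answer_question(question: str) -> str:
    raise NotImplementedError("stand-in for the step under test, e.g. the Q&A function")

MFT_TABLE = [
    # (input, substring that must appear in the output) -- checked deterministically
    ("Who is the CEO of Prediction Guard?", "Daniel"),
]

INVARIANCE_TABLE = [
    # (original input, reworded input) -- the answers should match
    ("Who is the CEO of Prediction Guard?", "Tell me who Prediction Guard's CEO is."),
]

INVARIANCE_BASELINE = 0.85  # last release's score, stored wherever you track metrics

def test_minimum_functionality_is_absolute():
    # Defined up front as *minimum* functionality, so anything under 100% blocks release.
    assert all(expected in answer_question(q) for q, expected in MFT_TABLE)

def test_invariance_never_regresses():
    score = sum(
        answer_question(a) == answer_question(b) for a, b in INVARIANCE_TABLE
    ) / len(INVARIANCE_TABLE)
    assert score >= INVARIANCE_BASELINE
```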
Gotcha. That takes us back to your point earlier. It just sounds like good data science right there.
Yeah, yeah. Well, it's interesting because the roles have shifted, right? I think we went through those phases of data science being very wild-west all the way up to good engineering practices and testing and all that. And we've kind of now thrown out the data scientist in the middle, and we have business people developing these workflows and trying to integrate them into software. And yeah, there's a lot of reminding and learning, I think, that needs to be done.
And that connects to where we started out the day, which is agents in production. Agents are harder, too, because you don't know the workflow up front, right? An agent determines what steps it's going to accomplish on the fly. And so if you don't know that workflow up front, then there are some interesting things that you might need to do to test those. But maybe we'll save that for another episode, and/or people could join the great learning opportunity that is the Agents in Production event from
the MLOps Community. Oh, absolutely, yeah. You know what? You started the show asking what agents I had, and I had to say no, I didn't have any going. It got me thinking. And now that we're talking about workflows and testing, I've got to get those agents working. We'll have to come back to this topic; I'll have to bring something to
discuss. Yeah, we'll come up with some agent ideas and maybe work through the testing of those.
Sounds good. Okay. Thanks a lot for
the insights today. Yeah, thanks. Have a great day, and we'll swap places geographically next week.
Perfect. Sounds like November.
All right, that is our show for this week. If you haven't checked out our Changelog newsletter, head to changelog.com/news. There you'll find twenty-nine reasons, yes, twenty-nine reasons, why you should subscribe. I'll tell you reason number seventeen: you might actually start looking forward to Mondays.
Sounds like somebody's got a case of the Mondays.
Twenty-eight more reasons are waiting for you at changelog.com/news. Thanks again to our partners at Fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.