OpenAI Dev Day 2024 featured several key announcements, including a real-time API that gives developers access to the advanced voice mode, vision fine-tuning, prompt caching, and model distillation. The real-time API allows for voice-to-voice interactions, and OpenAI hinted at adding video and vision capabilities in the future.
The real-time API uses WebSockets, which allows for asynchronous communication. Unlike traditional APIs where a request is sent and a response is received, WebSockets enable continuous, bidirectional communication. This allows for features like interrupting the AI, proactive messaging, and dynamic interactions.
The real-time API is currently expensive, costing between $9 and $18 per hour for voice conversations. This makes it cost-prohibitive for many applications, especially for subscription-based services where users might engage in lengthy conversations, potentially leading to financial losses for developers.
Prompt caching automatically stores elements of a prompt that don’t change, reducing the cost and time of processing repeated inputs. This is particularly useful for applications where the same context or knowledge is reused frequently, making interactions more efficient and cost-effective.
Model distillation involves training a smaller, more efficient model using the outputs of a larger, more powerful model. This allows for the creation of specialized models that are cost-effective and tailored to specific use cases, such as analyzing contracts or generating specific types of content.
Vision fine-tuning, while useful for specific tasks like web navigation or document analysis, may not always outperform dedicated vision models. It can be time-consuming and expensive, and the benefits may not justify the costs compared to using existing specialized models for tasks like facial recognition or object detection.
The Corey Hotline is a demonstration of OpenAI's real-time API, where users can interact with an AI named Corey in a conversational manner. It showcases the API's ability to handle interruptions, maintain context, and provide human-like responses, although the voice still has identifiable AI characteristics.
Microsoft Copilot integrates OpenAI's technology to provide advanced voice and vision capabilities. It aims to act as an AI companion for both personal and professional tasks, such as summarizing news, assisting with work documents, and even helping with everyday activities like finding dog parks.
Canvas is OpenAI's response to Claude's Artifacts, allowing users to collaborate on writing and coding within a workspace. It provides an editor where users can make changes, ask the AI to improve content, and adjust parameters like length or style, offering a more interactive and productive AI experience.
Meta's Ray-Ban smart glasses, combined with facial recognition technology, can instantly identify and provide personal information about strangers in real time. This raises significant privacy concerns, as it enables social engineering and potential misuse of personal data, such as names, phone numbers, and addresses.
Oh my gosh, that's so sweet of you to say. You know what? I love talking to amazing fans like you. Speaking of love, did you know that if you stay on the line for just 10 more minutes, I'll reveal my secret crush? Super exciting to tell you about.
So Chris, this week we had the OpenAI Dev Day, the event that Sam Altman himself said would make the previous year's event look quaint in terms of what they had to announce. Unfortunately, they didn't decide to stream it, so we couldn't do our annual live stream that everyone has come to love. I think we've done it, what, twice?
But we have the OpenAI Dev Day. It's three events: San Francisco, London, and Singapore. And there were some announcements, including the real-time API, which means developers like us get to use the real-time voice that everyone's heard from the advanced voice mode, which we demoed on the show last week.
There was also fine-tuning of vision, prompt caching, model distillation, a whole number of announcements. What did you make of the demo day, dev day, sorry, announcement? Well, I think what I love about it mostly is that many of the things are inherently practical. And I think while perhaps they didn't live up to the whole, you guys will be just so blown away by what's announced that it's going to change the world and why even bother with this trivial stuff.
They're more like incremental dev-friendly updates that are actually quite useful. Several of them are really good and really well done, like the prompt caching, for example. And I know, look, I know a lot of people say caching, all right. I say caching. You're just going to have to deal with that.
So yeah, one of the first announcements was around what they call the real-time API. And right now that really means that developers can now build the same speech-to-speech experience that you've seen in the ChatGPT app. And also this week, we saw Microsoft announce a new Copilot, which also has this advanced voice mode. And I'm assuming they're running some version of this now.
real-time API. I thought the naming was really interesting, and a lot of people have pointed this out: they called it the real-time API, and right now it's just really voice-to-voice as an API. But there is, literally, in this document, if you scroll to the bottom of the announcement, it says, what's next: more modalities. So they do intend to bring video to this real-time API, as well as vision. So this is a pretty exciting API. And for those that are not developers and wondering what this means, it just means that you can expect these voice experiences in pretty much every application now. In fact, I'm starting to think it could become like the new spam thing. This voice is just going to be everywhere. And it is a different kind of API. So even the way you interact with it is over a WebSocket.
So for those who don't know, a WebSocket is different. So normally when you send a request to an API, you send a request and then you get a response. So you'll send it off and then you'll get the completion result from an AI model. Now, there's obviously streaming, which is where you see the AI writing in all the tokens. And so the way that works is you send a request and then you stream the response until that response is complete.
So that's how it normally works. A WebSocket is totally different because it's completely asynchronous. So once you open a WebSocket, it can send you a message at any time and then you have to handle that message. And you can send it a message at any time and it can handle that message. So what that means is that you can do things like interrupt the voice API by sending a message to say stop.
and it can stop doing whatever the hell it's doing. Or on your side, you can ignore certain messages at a certain period. But also it will mean that the real-time API, in theory, will be able to proactively send you messages. So even though it's already sent you a response, it could send another response or reinitialize the conversation and things like that. I'm not sure they're doing that yet, but it is a completely different way of interacting with an API in that the messages can be sent at any time by either party.
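For the developers following along, here's a minimal sketch of that asynchronous, bidirectional pattern in Python using the websockets library. The endpoint, headers, and event names (response.create, response.cancel, response.done) are our assumptions based on the launch docs, so treat this as a sketch and check the current Realtime API reference before relying on it.

```python
# Minimal sketch of the async, bidirectional WebSocket pattern described above.
# Endpoint, headers, and event names are assumptions from the launch docs.
import asyncio
import json
import os

import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"  # assumed model alias

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: depending on your websockets version the kwarg is extra_headers or additional_headers.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Ask the model to start generating a response.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"], "instructions": "Say hello like Corey."},
        }))

        async def listen():
            # Server events arrive whenever the model has something to say.
            async for message in ws:
                event = json.loads(message)
                print(event["type"])
                if event["type"] == "response.done":
                    break

        async def interrupt_after(seconds: float):
            # Because the socket is bidirectional, we can cut the model off mid-response.
            await asyncio.sleep(seconds)
            await ws.send(json.dumps({"type": "response.cancel"}))

        await asyncio.gather(listen(), interrupt_after(2))

asyncio.run(main())
```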
I also think you've got to give them credit. Last week, we demoed the voice features and had a lot of fun with it. And we said, like, once this is in developers' hands, it's going to be really exciting and we're looking forward to getting it. And then literally a week later, they announced the API and it's taken them, what, like 48 hours after that event for us to get access. And it was very interesting because you asked me to try it yesterday in preparation for the podcast, which I did, and we didn't have access.
And I was really disappointed because I wanted to actually get to use it for the pod. And then I just tried when I woke up this morning again, my same script I'd already written and it worked. I was like, all right, great. They've actually, they weren't lying about the progressive rollout of the thing. One of the prohibiting factors that everyone's been pointing out is just the cost of the voice API. It is not cheap. This, this real time API, I think is,
I don't know the exact figures, but it was somewhere between like $9 and $18 an hour to have a conversation with voice. And you can imagine if you're a paying subscriber to an app and you want to add this into your app and someone talks to it for like two hours and you're charging $20 a month, you're not going to be very profitable. So right now it is very cost prohibitive though. And I think that it's going to be hard to justify adding it to your app because you
You just won't be able to afford it. Like you'll literally go broke. Yeah, you have to charge your customers too much to justify the feature, right? Yeah, and I'm sure like everything in this space, the cost will be driven down over time. But you can't imagine it's cheap to be firing up all those GPUs in real time, having like real-time WebSocket connections. But surely you have to assume the price comes down.
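To put rough numbers on that, here's the back-of-envelope math. The per-minute figures are approximations of the launch pricing (roughly $0.06 per minute of audio in and $0.24 per minute of audio out), so treat them as assumptions rather than a quote.

```python
# Back-of-envelope cost model for a voice conversation (assumed launch pricing).
AUDIO_IN_PER_MIN = 0.06    # ~$0.06 per minute of audio input
AUDIO_OUT_PER_MIN = 0.24   # ~$0.24 per minute of audio output

def hourly_cost(fraction_model_speaking: float) -> float:
    """Cost of one hour of conversation, given how much of it the model spends talking."""
    return 60 * (AUDIO_IN_PER_MIN + fraction_model_speaking * AUDIO_OUT_PER_MIN)

print(hourly_cost(0.3))   # mostly listening  -> ~$7.90/hour
print(hourly_cost(1.0))   # talking non-stop  -> ~$18.00/hour

# Against a $20/month subscription, one heavy hour eats nearly all the revenue.
print(20 / hourly_cost(1.0))  # ~1.1 hours of heavy use per month before you're underwater
```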
It always does, right? Speaking of price, though, it got us thinking. We wanted a way to test out the new real-time voice API. And I was talking about this. I was thinking, well, we're going to need to make a lot of money from this if it's going to cost us $18 an hour. And that reminded us of an episode of The Simpsons. You've reached the Corey hotline, $4.95 a minute. Here are some words that rhyme with Corey. Glory, glory.
So I thought it would be funny if we could recreate using the new advanced voice or real-time API, the Corey hotline, in the hopes that I could share the number today on the episode and people would pay $4.95 a minute to chat with AI Corey. Although you do realize the mistake you've made, $4.95 a minute, and it costs us $18. Yeah.
No, $18 an hour. Perfect business. Oh, an hour. Perfect business. So here is our demonstration of the Corey hotline using the new real-time API. Let's now chat with Corey. Hey there, it's Corey. What's your favorite color? Why do you sound like a girl? Funny. Just my excited voice. Can you tell me words that rhyme with Corey? Totally. Story, glory, allegory.
Cool, huh? Yeah, it's pretty cool. Want more? I would like more. Great. How about inventory, category, laboratory? Keep going? I guess, but how much is this costing me? Just $4.95 a minute.
Want to guess my favorite ice cream flavor? Not really, no. It's a tough one. So you can see it's pretty good. Like you can interrupt it. You can have a conversation with it. The voice is, I think it sounds really human, but there's definitely that element to it where you still know
It's an AI, although we need to put that to the test. I think for us though, we struggle a bit with the American voices because they just sound fake. Like even when you see like an Apple or Microsoft presentation where they have an American presenter who has a voice like that, it almost sounds farcical to me at least because
when they do that. So I don't know if our American audience can tell us, does that voice sound like, would that convince you on the phone or not? Yeah. Oh, look, I think it can get things done. Right. Like I'm sure you could use it to like, they demoed it at dev day to call and order strawberries. Although they didn't actually call a real place. It was just a guy in the audience that they called, uh,
to order the strawberries, although the interaction was fairly realistic. And next week, we want to actually use this to call some of our fan favorite places we've called before and run our voice experiments again using this technology to see if we can get better outcomes. Like, can we actually get
an AI agent to order something over the phone? Can we get it to get a quote for grooming a pig at a pet groomer and things like that? So we do want to try it. But Chris, you also decided to check in on the old technology we used to do this and sort of see where it was at in comparison.
Yeah, because Retell has been making significant updates along the way, and they have lots of different ways. Like, they have a built-in LLM you can use. You can use Claude Sonnet, you can use GPT-4o, or you can use your own LLM plugged in with the same WebSocket technique that OpenAI has released. So while what OpenAI has done is cool, companies like Retell had inadvertently made that real-time API technology anyway, because they needed to in order to facilitate a phone call. So we actually did voice cloning and made a Corey on there. So I assume you're going to demo that?
Yeah, so this is another version of the Corey hotline. And I think what's super cool about this is you were able to also clone the Corey voice from the actual Simpsons, which makes it obviously a lot more realistic. You've reached the Corey hotline. $4.95 a minute. Here are some words that rhyme with Corey. Corey, Corey, Corey.
Hey, Corey, what are your favorite hobbies? Oh, wow. I'm so excited you asked about my hobbies. Thanks for calling the Corey hotline. Just a reminder, each minute with me is only $4.95. What a deal, right? That sounds pretty cheap. Yeah, it's pretty cheap to talk to you. I think I'm in love with you. Oh, my gosh, that's so sweet of you to say. You know what? I love talking to amazing fans like you. Speaking of love, did you know that if you stay on the line for just 10 more minutes, I'll reveal my secret crush?
So, okay, this is really good at interruption. And obviously, being a clone of the voice, I think it's just, for what we're trying to do, it feels a lot more, I'm going to say, localized, because I'm getting what I expected from the person on the other end of the line. And honestly, I think if we called someone as that Corey, which next week we intend to do, it would work fine. I mean, this worked fine a year ago.
Yeah, and I think the interesting thing about that one, you'll notice it stuck with the prompt a lot better in terms of your goal is to waste this person's time and to keep them on the phone and to keep delaying and to keep promising things that they'll never get, you know, to keep them
keep them going. It seems to have a real balance of that with actually answering your questions, which is kind of cool. Yeah, and so if you're listening, and this won't be up forever because if a lot of you actually use this, it will cripple us, but I'm going to put a... So we're risking going broke over this, but we want you to try it. The lols.
So I'm going to put the phone numbers. You can call in the description of the episode. So wherever you get your podcast in the description, there's going to be phone numbers. So you can actually call our Corey hotline and talk to Corey yourself, the real Corey, uh,
And, you know, chat for not too long. Try not to chat for too long. Even though he's going to actively try to keep you on the phone. Resist the urge. Yeah, this was a really bad idea. We should have come up with something a lot better. Yeah, so maybe you'll fall in love, maybe not. But you can kind of see how this, once mixed in with memories and as the technology is improving, like...
I don't know. Like, I guess the thing I was thinking about is the business case for this, right? Like if I own a call center today, like the call center experience is so bad, I can barely understand the people when I call up now, right? So if I got the Corey voice and it was helping me do stuff and it stuck to the prompt better, why do I need advanced voice mode?
Well, that's right. I think this works just as well. And this is one of many companies that do this. There's several others. I imagine Twilio will have their own offering before too long as well, but there's at least three others you can do this with that are pretty decent. There's Bland,
that I know of. And there's a few more that I don't recall the names of right now, but this is going to be around. There's going to be lots of companies doing this. I imagine we'll see many point solutions where it's like, you know, your own secretary, you just enter the details and it acts as a secretary and sales outreach calls and things like that. And I think before everyone gets used to the idea of these AIs doing it,
it's going to actually be very effective because right now, I don't know about you, but I constantly get calls from different companies with people who are offshore trying to sell me stuff or scam me or whatever it is. And you just know now reflexively this isn't real. But if you can call with one of these people
highly realistic voices, really intelligent AI on the back end powering it, I actually would argue they're probably going to do better than the people who are paying real flesh-and-blood humans in the short term. I'm sure things will catch up, but I think there's going to be a while where this technology is very, very effective, in this window before the general public becomes aware of what's possible with these AI models. Yeah, and look, I'm not
saying that the advanced voice mode isn't a far better experience. Like, it picks up tone and inflection, and it responds in, I think, a more sort of human-like way. But the problem is, right now, with how much it costs, I can go and spin up Retell and build my call center app, if I'm thinking about just the actual how-do-you-make-money-from-this point. Um,
And it's going to do the job really well. You can, you saw interruptions on that. The only thing I think those models are missing, especially the trained voices is the emotive kind of scope. And that's because we just literally took a couple of seconds of that YouTube clip and cloned it. But yeah,
Some of the more default voices can do a lot of this sort of inflection stuff. And also you've got, remember Hume AI as well. They had this sort of emotive being able to detect how you're feeling and the voice changing as a result. They've also now got an API. So I think in,
In terms of the commercial aspects of this, I'm just not sure who it's for. Like, if I've got an app today and I'm paying even $9 to $18 an hour, it just doesn't seem worth it when I can pay far less using something like Retell for my call center, which, let's be honest, is probably one of the main applications, and then a voice assistant for your app, like adding it into your app. And I'm just not sure people are willing to pay at this point in time, like, $18 an hour to chat to some random app on their phone.
The main use case I can see for it beyond the ones you've just mentioned, which I agree, there's more dedicated and better solutions for it, is the idea of, and we've spoken about this before, being able to dial up your agent, which has...
grounded memories. It has temporary memory. It has tool calling with tools that you predefined like activities that it can do say within your company or within your like purview. And this idea of like a superhuman ability, like I really fantasize about the idea, like I'm going camping this weekend. I say camping, it's in a pre-made tent with power and it costs $700 a night. So not exactly camping, but we're calling it camping. Anyway,
The idea that I now, I can't use my computer, right? Because it's like a kid's thing. We're not meant to use technology. But let's say I needed my agent to go do tasks for me. The idea that I could just call it up on the phone or text it and just be like, hey, can you add to my calendar a meeting for Monday morning to do this thing?
Or, hey, can you research this topic and get back to me with a text with, you know, talking points so I sound really smart in this conversation I'm having? Or, you know, can you please refund that client or whatever that just came into my inbox?
incredibly powerful. And then having an AI that has that voice capability where it's like, no, no, no, that's not what I mean. And you interrupted and talk to it like it is an employee that's available 24 seven who just knows your business and has ability to do stuff. To me, that's where the voice API comes in. You're in, you
you know, not the best situation to be doing work. Like, you're not sitting at a terminal with all your stuff up, but you can interact with something that does have that ability and can go off and do things for you in a very convenient way. Like, I can literally just leave it a voicemail and it goes off and does it, or talk to it and do it. Yeah. And I think that with, if it's ChatGPT or Claude or whatever it is,
I think the data connections are just where it's at, right? Like you need connectivity, you need the skills to be able to go in and empower that level of agency. But I think what you're really talking about is you just want some ultra voice assistant. I'm not necessarily, again, sure how that makes a business case for this real-time API. Well, it doesn't. I'm just looking at the technology of what I consider the best use for it. I don't think that...
you know, the $18 an hour thing doesn't really stress me out because like you said, every single model we've seen has started off where you're like, Oh my God, the pricing is so high. And then it drops down to realistic levels where you can actually use it. And so I just have full faith that that will happen for this too. So I'm more thinking about how to apply it rather than how much it costs.
Yeah, I guess the other thing I'm thinking about is, like, it seems like a bit more vendor lock-in, which is smart from OpenAI, because if I use the real-time API in my app, I have to back it with, like, GPT-4o, right? Like, that's the underlying model, the sort of voice version or advanced voice mode version of it. And that has some guardrails that, you know, maybe I like, maybe I don't like. Whereas I feel like I could go to one of these dedicated providers like Retell, and, like, we have no connection to any of these companies, and use that, and that's improving and affordable with great APIs as well, and I can have my own corporate-sounding voice, like my own branded voice, and I can pick the underlying model.
I don't know. Like, I'm just thinking like, do you really, is someone thinking about how do I implement this and how do I use it in my business? Like, I don't know. I just don't know if this is a real product. I'm,
maybe I'll be proven very wrong, but... Yeah, and I think so. The other thing I like the idea of is being able to do multi-model depending on what the thing is and have that sort of ability to intervene in the process. See, the whole idea here is you're going voice to voice, so direct to the model, so you don't have that ability to...
as the app developer, understand what's going on in the meantime, in time to interfere with the reply. So while that's kind of cool, you're very much relying on that agent's, either the models fine-tuning if it exists or just a really good prompt to accomplish what you want. But outside of that, you're going to struggle to control that experience. Yeah, I think when I'll get really excited, and I know this is being worked on and will be released pretty soon, is
where you're talking to it, visually seeing output. And Microsoft actually demoed this in a video of Copilot, the dog walking example, which we'll get to in a minute. And they show like a little bubble of a map and it kind of points to where to walk this person's dog on the map. But I think keying it in with some visual UI where you can get a sense of what's it producing, what's it doing, and actually talk to it and be like, no, no, no, make it...
make that a bit longer, make that a bit funnier and just bark orders, which we've talked about many times. That's when, yeah, I think the voice stuff gets a bit exciting to me. This whole like just having a chat, I don't know. Yeah, I get the impression just from seeing like Sim Theory users and from people I talk about who are using the large language models in their work.
Nobody needs it to pretend like it's a real person. Like, yes, the real person paradigm works well because our whole life, that's how we've dealt with people. So it's a natural way of interacting with people and that's convenient and good, but you don't need it to pretend like, Oh, I'm actually your friend, Corey. Like,
It's okay. We know you're an AI model. You can say AI kind of crap, or you can ask for parameters, or, you know, whatever it is, like, that's all right. We just want to get the job done. And I think we should see it as a convenient interface; the act of being a chirpy pal who really cares where you walk your dog is not necessarily part of the experience. Yeah. I, I don't know what they're going for necessarily with it. We'll come back to that. Let's get through the rest of the Dev Day announcements. So we also had the introduction of vision to the fine-tuning API. So you can now train GPT-4o with images and text to improve its vision capabilities. So you can give it a bunch of example images and the intended output, and then, you know, the hope is to improve the model significantly enough. And they gave some interesting examples.
One of the examples in their post was training it to better navigate the web. So you can see this is a step towards maybe agency, like actually enhancing its ability to browse the web and complete tasks
on certain websites. So if you were training it like the Rabbit R1 to order food for you over and over, maybe this would be good to train your AI agent. They also had an example from a company called Coframe, which does like AI variant testing for you on your site. So they show in the example a section of the website that they want to generate and run some testing around.
And they showed before they fine-tuned it and the design kind of matches, doesn't look that great. And then after it, it looks identical to the current website. You did point out when we were talking about this before the show that it does seem like a big time and money sink when like the statistics in this, they're like, oh, improved it like 20%.
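For anyone wondering what the training data for vision fine-tuning actually looks like, it's the same chat-style JSONL as text fine-tuning, with image URLs inside the user content. This is a rough sketch; the field names follow the fine-tuning guide as we understand it, and the task, URL, and label are made up for illustration.

```python
# One hypothetical training example for vision fine-tuning, written as chat-style JSONL.
import json

example = {
    "messages": [
        {"role": "system", "content": "You classify screenshots of our checkout page."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which step of checkout is shown?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/screens/checkout-3.png"}},
            ],
        },
        {"role": "assistant", "content": "payment_details"},
    ]
}

# The fine-tuning endpoint expects one JSON object per line (JSONL).
with open("vision_train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```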
It's hard to really comment on it until we try it out for ourselves, but what did you initially think of fine-tuning on Vision? I must admit my opinion's a little bit of a derivative opinion because when we got excited in the early days on our podcast about...
the vision APIs, people pointed out there's been dedicated image-related recognition models for a long time that are dedicated to purpose. Like if you think about speed cameras on the road or, you know, other sort of technologies
uh, you know, vision models in the real world, like identifying faces and all that sort of stuff, they've existed for a while. So if you had a truly dedicated purpose like that, you would think you'd have a truly dedicated model to go with it. So training one of these general models at great expense probably won't give much better results and yet cost you a fortune. So
Then you say, okay, the generalist model is better in the sense of, let's say, screen sharing in Sim Theory. It's seeing stuff it's never seen before. And the example I always like to give is, say, a lot of people, say you work at a bank, right? They're working with some legacy Lotus system or some random crap that's running on a VM somewhere. And they've got, sorry, yeah, and they've got to work with that old technology system.
And it's painful and slow. Now, if I had a company sanctioned AI suite of software and I was using screen sharing and then that agent I could program to interact with that legacy system, that's a really, really good use of the technology. Now, I would argue the models are getting good enough that I can do that on a raw model. I don't need to fine tune it.
Now, so what I don't understand is where is the, where is the intervening bit? Maybe if I'm building dedicated software, I don't want to fully train my own model from scratch, fine tuning it for those things like this is this button, this is this, whatever, maybe that would help. But the thing about fine tuning is it's not like you can fine tune it on like 15 different tasks within that legacy software. It's
When you see this kind of image, classify it, like identify this element in the image or give it this rating or whatever it is. And so it's like a homogenous task you've got to do over and over again, which then makes me think, go to a dedicated model. So I just personally don't understand the use cases. And like, yeah, I must say in this post, they actually did cite reasonable use cases. Like, okay, a 20% benefit for this company is probably well worth the effort, but that's
I just don't know how many people this is going to apply for. Yeah, I think the fine-tuning seemingly from some of these examples just seems to give it a nudge more towards the use case that it's being intended for. Like that... I think the multivariant testing on a website is a great example. Like the first...
The first version of it, if you look closely, it's passable, but it's not great. So if you were the... Passable, like our podcast. Yeah, very passable. And so then you look at the second example and you're like, wow, this is magic. So I do think fine-tuning to sort of nudge it better towards a specific use case probably does have some merit. It's really like, I think, in a lot of ways, nudging the...
uh, the output for a particular app is probably the main, main use case here. But I'm definitely, I think one of the best ideas, and I hope I'm not giving away your secrets, but one of the interesting ideas that you had was around demonstrating something to the AI.
And then having a sort of automatic fine tuning process. So in the example I gave earlier, let's say I'm using Lotus Notes through my VM on my computer and I'm screen sharing in Sim Theory. I say to the AI, perhaps with voice, hey, I'm going to show you how to get rid of emails I don't want in Lotus Notes. Right. And then I do that.
and it records it and saves the key frames or the video or whatever it is. And then it produces those frames, goes off, fine tunes a model, and it goes, now I have the Lotus Notes vision model for cleaning up your emails. And so now I have a skill available, which I can go clean up the dodgy emails, something that doesn't exist in that product. Yeah, I think that's a like train, like if you can do it that quick and it works and there's a better outcome.
Yeah. And then I have, and then I have a series of models for different activities that I'm doing throughout the day. So if I'm in a sort of corporate worker role where I have certain things I need to do, like review and approve applications, approve memos, talk to corporate, you know, all that sort of stuff.
I can give it almost like macros of here's what I'm doing. Like here's what I'm doing on a daily basis. Watch me, watch me AI, watch me do it and learn. And then you go get it to, to train a vision model that knows how to then do it itself or knows where it's at in the process. If it's doing. Yeah. I guess back to your point though, is like, do you even need to do it?
Yeah, and that would be the question is like could it just simply through its regular vision model go, okay, here's a couple of images of last time they did it.
Um, can we do it? And this brings us onto one of the other things they announced, which is the prompt caching. And this is where this would apply because, um, if I had say 10 images that I wanted to include in my prompt that demonstrate doing a task, right? That sounds like a lot. And that's a lot of data to send in a prompt, perhaps quite expensive. The way the caching works now is if it's seen those images recently in a prompt, it's going to cache them automatically.
if they're sent as separate images in there early in the prompt without the variable stuff. And so what that means is every time I load up that task and I'm doing it on a regular basis, I'm not having to pay nor wait for it to process those images as part of the prompt, which means that it's incredibly efficient. But is that even if the images are slightly different? So say back to our example, you're screen sharing, you've got Lotus Notes open, you've got some sort of like macro control on the computer.
Um, and it's now operating the screen for you, which we all know is coming, right? Like let's not pretend. Uh, so then with prompt caching, is that, does that then mean, um,
If there's a slight text variant in the image? No, it's not going to, it's not going to handle slight variations; it needs to be identical. So just to go back a step, the way the caching works is that elements of your prompt that don't change will automatically be cached. So this is different to Anthropic. With Anthropic, you need to have fragments of it and deliberately say, cache this part of the prompt. And so to give you some examples,
In Sim Theory, we store like a knowledge graph memory of all of the things you've discussed with the agent. And then you've got your agent prompt. We could send those two elements to Anthropic and say, remember this, remember this, right? And then reuse those components in the prompt. And they won't be counted as part of the tokens because they come from the cache. They're billed at a different rate, essentially a much cheaper rate, and you do it deliberately.
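For reference, here's roughly what that deliberate version looks like with Anthropic's Python SDK. It's a sketch, not gospel: the beta header, the cache_control markers, and the usage counters are our reading of Anthropic's prompt caching docs at the time, so check the current reference before using it.

```python
# Sketch of Anthropic-style explicit caching: you mark the stable blocks yourself.
# Header, field names, and usage counters are assumptions from the docs at the time.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": Path("agent_prompt.txt").read_text(),      # agent description
            "cache_control": {"type": "ephemeral"},            # explicitly cache this block
        },
        {
            "type": "text",
            "text": Path("knowledge_graph.txt").read_text(),   # grounded memories
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What did we decide about the Corey hotline?"}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
# usage should report cache_creation_input_tokens / cache_read_input_tokens
print(response.usage)
```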
OpenAI does the same thing, but it does it automatically. Essentially, unless you opt out, it's going to do it. So if it recognizes similar things early in the prompt,
so like the system message, the early user messages, it's going to automatically remember them based on when it's seen them. That is so much better. It is better. And it can also do it partially. So your example, I may be wrong about, because they did give examples, if you look in their blog post, of where it will match, say, 90% of a particular part,
and then the variation is handled. So it seems like... Is that for images as well? Like, if it matches... Well, that's why I said no, because I'm not sure about the images. But for text, definitely, it will do partial matches. So what it means is, they say you've essentially got to optimize your prompt to be in the right order. So the things that are likely to be the same as last time, like the agent description, the memories, the grounded knowledge, that kind of thing, that's where you would put them, at the start, so it's able to better get a cache hit on that knowledge. And then if it does, you then, um,
you're then able to get massive savings, speed benefits, all that sort of stuff. So when it comes to repeatable stuff like images and things like that, my instinct would be this is very heavy. I don't want to keep sending this unnecessarily. You would get a huge, huge benefit in terms of cost and speed. And so to be clear, what they've announced is that cached input tokens are
are half the price of an uncached input token. Yeah, that's right. So it's like...
It's an immense saving. And the fact that they're doing it automatically, I think is just a rare act of benevolence. And really it was sort of not emphasized as much in the announcement as you would think for what a huge, huge benefit of using their models exists now. You know, like that is a compelling reason to use their models over say Sonnet, which is very expensive and
you know, whatever application you're using Sonnet through needs to make the most of that cache on your behalf or you're not getting the benefits. I guess they did just get $6.6 billion and maybe they're feeling a bit generous. They're like, yeah, we'll give them a discount. Yeah. Yeah.
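In practice the advice boils down to: keep the stable parts of the prompt (agent description, grounded knowledge) at the front and the per-turn parts at the end, then check the usage block to see whether you're actually getting hits. Here's a rough sketch with the official openai Python client; the prompt_tokens_details.cached_tokens field and the minimum prefix size are our reading of the announcement, and the file names are placeholders.

```python
# Sketch: put the stable prefix first so automatic prompt caching can reuse it.
# Caching reportedly only kicks in above a minimum prompt size (~1,024 tokens).
from pathlib import Path

from openai import OpenAI

client = OpenAI()

STABLE_PREFIX = [
    {"role": "system", "content": Path("agent_prompt.txt").read_text()},                     # agent description
    {"role": "user", "content": "Grounded knowledge:\n" + Path("knowledge.txt").read_text()},  # stable context
]

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=STABLE_PREFIX + [{"role": "user", "content": question}],  # variable part last
    )
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    print(f"prompt tokens: {usage.prompt_tokens}, served from cache: {cached}")
    return response.choices[0].message.content

ask("Summarise clause 12.")          # first call: cache is cold
ask("Any penalties in clause 14?")   # repeated prefix: most of it should bill at the cached rate
```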
Maybe it is, but yeah, I was really blown away by how good that is. And then the other thing to note is that in the response you get from them, so whenever you send a request and the full cycle's completed, you get the number of tokens used. They have a new one for the number of, what is it, like thinking tokens or background tokens in the case of the o1 models. And then additionally, they now have another field, which is like the number of cache-hit tokens, essentially. So you can track this, and we're going to track this and show you in Sim Theory. So, you know, um,
how many tokens were actually hit like that. So I think what I really like about it is it opens up these cases where you do want to provide the entire chat history that you've had for a particular thing, or you want to talk with an entire corpus of knowledge, but you don't want... I've mentioned this before, like,
It's very stressful as the application developer saying, how much context do I give this thing? Because do I really want to give it 200,000 tokens and then just say hi? You know, like it's a really stressful thing to think I'm sending so much data and all I'm saying is, hey, Corey, how's your day? You know, like it's not the right way. Like you just instinctively, it's not the right way to interact with it. But then if you have to have a decision step, like should I send all this crap?
Um, that adds time, and that's expensive as well. So it's like, it's an impossible dilemma. But when you know you've got this caching happening automatically, or semi-automatically in the case of, uh, Anthropic, it's a different equation. Because you're like, it doesn't matter, I'll send it all every time, whatever. It's not going to cost me very much. And in the often cases where it is needed, I get the benefit of that. And I also feel like
As far as like AI companionship goes when, and I mean companionship in terms of say, we are in this together, mate, we're working on this task together, something I've been going through this week. It knowing where you're up to and what's going on is just a great feeling. Like it's worth the money. And I think that this really makes it more possible in more scenarios. Yeah. Especially the speed still quick as well, or you can,
I don't know. I swear it would be good to just be able to have, like, a RAM bank and say to it, hey, rather than resending it every time, just say to it, like, hey, just keep this in context. They have that in the Assistants API. Like, they do have that persistent conversation concept. Um, the reason we don't use that so far is simply that it's not transferable across to other models. And I think, you know, for our use case, the multi-model thing is important enough that you don't want to do that. I think if you are building a dedicated application and using their Assistants API, those persistent conversations would help you with that, because then you do only need to send
what's changed in the conversation. Yeah. Rather than sending out a huge chunk of, of news. But I mean, relatively speaking, like in terms of bites, we're not talking a lot, like 200,000 tokens, you know, like let's say, I mean, a token isn't like a K of data. I don't really even mean it from that perspective. I kind of think that you go through this experience when you're working with LLMs a lot where you're
you know, maybe you've got a tricky problem. Like, I literally had one last night, right? And I sent huge amounts of code through to it and said, here's sort of everything I know about this problem, figure it out. And I tried o1-preview, o1-mini. I tried Sonnet. I was just getting absolutely nowhere. And then I started distilling it down, like, you know, doing some logging, figuring out roughly in what area of the context the problem likely was, and then just gave it, you know, one or two methods and said, I know it's in here, can you figure it out? And that didn't solve the problem, but it got me far closer. And you can feel the attention of the model, like, when you really focus its attention yourself, still, and,
uh, with a better prompt, like, it tends to work far better and you get a better result. So that's the other kind of thing, I think: when you have this ability, with the price driven down, to send more, do you sort of change the... like, because I don't think they're always particularly good at, uh, figuring out where you want to focus in the context, even though they say they are with the needle-in-a-haystack type tests. So sometimes... sometimes it works magically and you're like, this is incredible. And other times it's so bad that I'm like, I need to focus it. I think these are all solvable problems, but for now I'm like, I don't know, I still kind of want to manage what's in its memory. Yeah. Because the point I was about to make was, what this kind of thing opens up is this
idea of like, um, excuse me, passive context gathering. So, you know, it sees what you see on your screen. It has access to the files on your disc where say your code base is in, in that case or your document or whatever, and then it can hear you as well. So you can speak to it. And so it's gathering the same sort of context that you have access to, except it has the ability to synthesize the entire thing as one.
which is amazing. And I think we're getting gradually closer to that. The problem, like you say, is, you know, as the human intelligence that
I don't want it to be distracted by this thing because it's going to get confused by that. The same way if you were handing over a task to a real human, you'd be like, oh, this is the section we're focusing on today. You need to think about this. See, that is the perfect example. Like if you're going to someone in an organization saying, hey, write a report on this, you don't go, here's our entire Google Drive, read every...
Read every file, watch every video, and every time, at every step of the process, I want you to reread it. It's just not how this is going to work. Like, the idea is just insane to me that this is how future AIs will work. Like, to me, it's got to have, like, this sort of trained corpus,
And then it's, yeah, anyway, I just think it's got a long way to go. I wish it was closer, but I think if you're working with these tools day to day, like I am, it's just some like UI controls you kind of want right now, which is, I want to see what's in context. But it is a tooling thing, right? Because I think we've, we've, we both acknowledge the models have the raw power to do a lot of this. It's just that, uh,
some other process needs to exist to marshal and organize all of the stuff that goes into it to make sure you're getting the optimal output. And I don't think a model itself is ever really going to be able to do that on its own because there's just too much there for a single model to be able to do. It can't have all that intelligence, but also know how to
rearrange its own inputs to maximize its own intelligence. Like, it's got to be a group of models working together, or services controlling those models, or whatever it is. I just don't think one entity is ever going to be able to do it. But I think o1, o1 is getting closer to that. Like, when you read through its sort of response, in terms of how it's thinking through a problem, like,
I'm always amazed at how it's considering all these different things that I might've considered, but dismissed, but it's at least going through and considering them. Like I, I'm still sort of mind blown that it goes through and, and, you know, reasons, which I think it kind of is reasoning, but, um,
Similar to what you said, like if it had tooling to go and sort of test its hypothesis, which function calling we're told is coming soon, so it'll soon be able to do some of this stuff, then you probably could start to get to the outcomes I'm complaining about, which is like it can figure out what to focus on. Particularly if the tool calling is call another LLM that is an expert on this subject. Yeah.
Like, I think that kind of thing would be really cool. Like go get a review from this person again, similar to the way you would in a workplace. It's like, if I have a technical problem, I'm not sure of the solution. I would go ask one of my expert devs, Hey, what do you think of this? You know, and giving it the ability to do that and verify things and check things and, you know, have this sort of Oracle, um, points where it can fact check itself.
would be very, very interesting. Yeah, I mean, like the thing that I think will happen, right, is at some point soon, and I don't think it's that far off, with function calling, I know Cursor's really popular and a lot of the complaints you hear about Cursor, which is a code editor, for those that don't know, that incorporates AI. So you're sort of chatting to it and then it can go into your code and essentially find and replace things
pieces of the code. And while there's times where that experience is really magical, there's other times where it goes in and just puts absolute crap in that breaks your app and then you're forever trying to go back and fix all the stuff it broke. And so I think the next level is definitely this function calling where it's like...
I'm explaining Devin. But where it goes off and it actually sort of runs the code, it visually looks at it like a human would, and it's like, ah, that doesn't work. And it's going in a loop then trying to figure it out, like our Klarna CRM example. So it's coming. It's coming. And this stuff's going to get a lot better. But I just think for the moment we're in, you know, this stuff, it's not there yet. Like it's a ways off.
Yeah, definitely agreed with that. But we're getting improvements and finding better ways to work with it. And I feel like o1-mini, I think, is a darling of a lot of people, me included. I find it's striking the right balance right now with the models of getting really nice, deeply thought-out results without taking too long, and working with existing tooling. So something else that I think is worth mentioning from the Dev Day is this...
the model distillation. So taking a smaller model and sort of fine-tuning it on the outputs of a larger model to create, like, your own small, performant model. And just the user interface alone that they released for this is super cool, where you can actually have your own sort of evals for the model. So you can, like, have your own GPT-4o mini that's really cost-effective and fast and specifically trained on your own use cases.
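A sketch of how that flow seems to hang together: capture the big model's outputs as stored completions, then point a fine-tuning job for the smaller model at the curated data. The store and metadata flags, the file purpose, and the model snapshot name are assumptions, and in practice you'd curate the stored completions and run evals before training.

```python
# Sketch of the distillation flow: capture outputs from the big model, then
# fine-tune the small one on them. Parameter and model names are assumptions.
from openai import OpenAI

client = OpenAI()

# 1. Run real prompts through the large model and store the completions so they
#    can be exported later as distillation training data.
client.chat.completions.create(
    model="gpt-4o",
    store=True,                               # keep this completion for distillation/evals
    metadata={"task": "contract-review"},     # tag it so it's easy to filter later
    messages=[{"role": "user", "content": "Analyse this contract and flag any issues: ..."}],
)

# 2. After exporting and curating the completions into a JSONL training file,
#    kick off a fine-tune of the smaller model on them.
training_file = client.files.create(file=open("distilled.jsonl", "rb"), purpose="fine-tune")
client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",           # assumed fine-tunable snapshot
)
```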
Yeah, I found that interesting too because we know that the way a lot of open source models had been trained or even competitor models had been trained was people using GPT-4.
in this exact method where they would have a whole series of training prompts they would run. They would take GPT-4's output and then use that to fine tune their proprietary models. And this is interesting because in a way it's like, I don't know if it's that, what's that Microsoft thing where it's like embrace, extend, and then shut down?
whatever they do. Like, I wonder if they're like, okay, everyone's doing this anyway. We might as well provide them a facility to do it, but then encourage them to train it on us instead of then taking those, uh,
those corpuses of data and going and training some other model that is outside of open AI. Yeah, I think it might be interesting, like especially if you're building like a dedicated assistant for your organization, just having a more performant model that's geared towards that
the exact outputs that you want. I need to look into it more how you would practically use it, but. Well, the main example I thought of that we've spoken about a few times before is the idea of having like an AI workspace within a company, right? So say I'm a company and I'm analyzing contracts to buy sheep, right? Like we write contracts to buy large amounts of sheep and, and,
I've got a team of people who analyze the contracts and look, is it a good deal? Here's the 20 points I look for in a contract, you know, like are the sheep in good condition? I don't know what, whatever they look for. There's 20 points. Now, if I give an amateur GPT 3.5 pro or whatever, and they paste the contract in and then they start asking it questions, they might not prompt it in the right way.
Now, imagine instead I take that same data and then I prompt it with excellent questions that focus on the right pieces of information and get really good formatted outputs in the exact way I want. And then I distill that information and then I fine tune a model. Now you give the amateur the model to work with.
And the model now will naturally go towards the better quality outputs with the better quality answers that I want from my team. So it's almost a way of safeguarding the use of a model in the way you would prefer it to run. Yeah. Okay. I think, yeah, I think there's, that is a really interesting use case where in
Instead of having to nudge it every time onto how you might respond in a certain way, or if you're building an app like RFP Responder, you might want to, as part of the RFP Responder app, train it on how you would typically respond to an RFP and then you get better outputs. I guess the question is, couldn't you just use a good prompt and a bigger model instead?
Well, and this is the thing. It's back to what we were talking about with image modelling. It's like a lot of the time a really, really strong prompt is going to be able to accomplish much of the same. And I guess then really the benefits are around cost if they do charge a cheaper price for these. Usually previously they definitely did for fine-tuned models. But, yeah, I'm not sure. Maybe there are specific cases scientifically where you really, really need to ensure stability
stuff outside that may be ignored in a prompt, for example. All right, we better move on. That was our very short and concise segment on the OpenAI Dev Day. So just recapping, like, boom factor from the Dev Day. It's a tricky one because...
I guess I have to be honest in that while I see the value in the individual pieces of it, is it going to change my approach to things I'm working on or thinking about? Absolutely not. Like, I don't think the voice thing is that exciting that I'm just rushing to implement it. I'm not dying to get off this podcast to go try this crap out. Like, it's a marginal increase. It's certainly not. I'm saying crap, not shit. Yeah, no, I just like it. I like using the button. But yeah.
Yeah, so I reckon Boom Factor 2. I'm not that excited about it. Yeah, I don't know if it's because, you know how, like, all the tech reviewers and stuff, they review, like, the iPhone 27 and they're all jaded about it now, and you watch and they're just, they're so broken. Like, you know how, like, MKBHD or whatever, he's just so broken now. He's like, oh, a new iPad, I don't even want to talk about it.
But then like someone who's actually considering that device is like, man, this guy's jaded. I just want to know if I should buy it. He just keeps getting sent free stuff in the mail. He's got boxes and boxes of tech. I feel like are we becoming that jaded? Like are we that jaded? I just can't get into this hype thing. Absolutely. I'm not jaded because there are technologies out there.
that I'm excited about and I'm working with constantly on the contrary, I'm really, really excited about the advancements of LLMs, the combinations, the possibilities. But what I'm excited about is the tooling that we can build on top of it. Like,
Document modification, like working with the code editor to be able to empower it to do more, to be able to have really high quality tool calling that these advanced models can go off and call like we just discussed. So no, I'm more excited than ever.
It's just that these marginal updates by a single provider in a vibrant space just aren't enough to be like, okay, drop everything. Let's go work with this. I'm pleased it exists. I'm glad they made the updates. It's just not going to change my direction in terms of what I'm focused on and working on. Yeah, I think that I know a lot of people are excited about the voice, but they're excited about the voice in the context of using like
Chat GPT and talking to this voice and, you know, being able to have conversations and learn things and their kid talk to it and stuff like that. I'm just not sure from as a developer looking at like, should I build this into my application or like what will change in the world having this sort of like high powered Siri everywhere? I just don't really see it yet. Like I'm not sure some new startup will be built off the back of,
the... No, I think the main thing from an application side you should be looking at is: can I support the real-time WebSocket paradigm in my back end, rather than a request-response cycle? Like, you know, if you can adjust your application to handle asynchronous messages coming in and going out, that's going to be a huge step in the right direction of supporting the way the future models look like they're going to be running. The one thing we didn't really touch on, and, um,
and it's sort of way off into the future, more than likely, but maybe not, is the fact they called it the real-time API. I mentioned it earlier. And they said video is coming soon, and they've got voice, obviously, and also vision, they said, is coming soon. Now, you can definitely go to the sort of sci-fi fantasy version of what that might mean. Like, if you think about ChatGPT, and we'll get to the Canvas update in a minute,
Uh, if you think about that, like really what it is, is you're trying to build a UI that's not chat all the time with the LLM where you can sort of see what it's doing and interact with the actual output itself, but all in this sort of what we like to call an AI workspace.
And so it kind of gets me thinking with this video, like, all the fantasy about it is, well, what if you're just talking to this thing and it's just generating an interface, without having to actually program it and have it static? It can help you visualize things, like, you know, show me what it might look like if I paint my house red, and it just... you know, you give it an image of the house and then it comes back with a video feed of the house, now in red. And then you, like, change it to blue and it's constantly adjusting. And maybe you're like, okay, give me an interface where I can slide different colors. And then the video output knows where you're kind of clicking on the video output and you're sliding that slider around. And Sam Altman did make a point around, you know, in the future you could walk up to a piece of glass and sort of talk to it and interact with it. And it'll be able to do
Far more things than we can dream up today.
And I kind of wonder if that is sort of the big play here. Like, can they just stream, you know, completely generated user interface? I mean... Yeah, I think that's the thing. If you can get the speed and the context awareness right, like, the possibilities around that kind of real-time interaction are very high. It's going to happen. Like, that's got to be the future, which then you can go into this existential crisis of, like, what happens to all software? Like, it's probably doomed forever.
if it can do that. Just to be clear, we aren't software, Mike, so we won't cease to exist. Yeah. That's, I mean, that's, that's kind of good anyway, food for thought, but I thought it's interesting. Just the naming conventions. I look, I think all that stuff is a long way off. Like I don't really see them. Look, they, they must've realized that this is an underwhelming update compared to, um, compared to what was promised, um, last year. However, uh,
at least they released something and I'm pleased it's practical and I'm pleased we can actually try it, which is always a good thing. All right, let's move on. Now, I think one of the things that I've struggled with, as you know, is I've got two dogs and I often think about where I'm going to walk them. Like it's a constant struggle. A constant daily struggle. You're always messaging me about it. And I just, I was like, what if an AI app could help me figure this out? Like, what if I could talk?
to a nondescript AI assistant and figure that out. So here is the new Microsoft Copilot. And his superpower, puppy dog eyes. He's begging me for walkies. Where's good around here? I've got just the thing. Check out these awesome dog parks nearby. So strange introduction, I know, but for the lulz, why do they have these ridiculous, ridiculous examples?
And before I know it, I'll be asking this thing where I should take my dog for walkies. But Microsoft have announced a pretty major overhaul to their, now they're calling it an AI companion for everyone, Copilot. This is not to be confused with Copilot 365, which this will be rolled out to at a future date, apparently. So...
They talk about their vision. This is the new inflection. They made this acquisition with Inflection AI. Well, it wasn't really... They just acquired the people, not the actual company. The people just sort of joined Microsoft.
And this is Mustafa Suleyman, who joined Microsoft. Who landed a plane on the Hudson River. Yeah. And so he has now taken over. And I think what was interesting about this redesign with the new Copilot is this idea, to me, that it actually looks very similar to the Pi product from Inflection. That same design language, that same sort of calming feel.
And they've really simplified the app as well. So...
I've actually got it up here now, the new Copilot app. And I don't know why I think this, but the background when you talk to this voice is like an ocean, sort of like you're talking to someone in the afterlife. I don't know why. Please leave it in the comments if you also thought this, because I
I'm sorry, I think I'm right here. I think everyone's going to not be able to unsee this now. You know those futuristic things where it looks like some sort of weird religion? I understand. I once had food at a restaurant and I said, this tastes like I'm eating it in a hospital hallway, and everyone could relate. Yeah, so: hey, Copilot, introduce yourself. Tell us a little bit about your afterlife.
Hiya, I'm Copilot, an AI built by Microsoft. Were you though? I thought OpenAI built your voice capabilities.
Actually, I'm a Microsoft creation. My purpose is to interact with you and chat about all sorts of topics. All right, we get the idea. So they also now have the voice feature. Do you think that OpenAI and Microsoft sit around in boardrooms and say, okay, you can announce the doc thing first, because Microsoft Copilot has like a document thing, and we'll get to
ChatGPT Canvas in a moment. But it was like they announced that a couple of weeks ago, and then OpenAI was like, we'll take the voice one because that's what we're known for. And then Microsoft's like, okay, cool, we'll be second on voice, because literally a week later it's like, now introducing Copilot. Yeah, it is a little confusing. I'm thinking maybe they're targeting different audiences and that's how they're able to rationalize it internally, like Microsoft –
And this is the point I was going to make around that going-for-walkies kind of example: what I don't understand about Microsoft is why they're trivializing the whole thing. Like, we know the true power of this. The real power is in the corporate workspace, where you're working on boring documents and spreadsheets and bullshit all day, and
this thing is going to change your life. It's going to make these boring tasks not just faster, but you're going to do a better job. You're going to be more accurate. You're going to be able to get more done. You're going to have to pay fewer people to help you. And it's going to be amazing. And then they're like, and I must admit, o1-preview wrote this joke, but it's like, oh, you need AI to find a park for your dog? Next thing you know, it'll hand-feed your pooch artisanal kibble while you sip lattes on a bench it recommended.
You know what I mean? Like their examples are so stupid and so trivial and so patronising to their audience. I just...
I just don't understand why they aren't going all in on what they're amazing at. I mean, Microsoft can really emphasize the fact we've got major cloud infrastructure behind this. We've got security. We can integrate with all the apps you're already using and paying for, for your business. Like this is a significant thing. Like, and they're talking about, you know, ordering flowers and stuff. I feel like this Mustafa has come in and has this vision of the future. Yeah.
which is the Pi chatbot, which I know a lot of our audience actually really enjoyed interacting with. But to me, it's a really personal thing. And like, I don't want the personal agent
by Microsoft Copilot to be like my buddy. I'm more thinking about it in terms of the AI workspace, like show me the vision for a co-worker that literally makes me 10 times more productive. And I just don't know if they can show that vision, because we know from The Information's articles that the first Copilot rollout in the enterprise was an absolute disaster. Like, people just don't use it.
Yeah. It should be like you ask it to walk your dog and it's like, stop wasting your time with this stuff. Let's get back to work. I'm an AI trained on all the world's information. Ask me something that matters. Yeah. Yeah, exactly. It's like, are you serious, mate? Like, just use Google. Now, go.
Go walk your dog and let's get back to work. Yeah, it has this thing, I tried it out, called Copilot Daily as well. So it says Copilot Daily helps you kick off your morning with a summary of news and weather in your favourite Copilot voice. Seriously, are these product people at these companies just so...
one-dimensional. They can't think beyond what they just see in front of them. It's like, oh, there's a desktop. We better put the day's weather there. That'll be great. AI weather predictions. People will love that. People need to know the weather in the morning. Hey, what's on my calendar for today? It's like, shut up with this crap and let's do something meaningful. I just... sorry. Yeah, I know. But I just
don't understand why they're not realizing how much power exists in the models. And the examples are the same stuff we were doing with Windows 95. There are two diverging visions, right? There's one which is like, let's go build an AI workspace that has some form of agency that's truly helpful in your work life and, second to that, your personal life. I think work life first because...
helping people do their jobs better and improving their productivity is I think far more interesting than like helping me book a restaurant. Like I don't find those aspects of my life terribly difficult or managing my calendar to be honest. And so I just don't know if like that's the... Like sure, in the future it will empower everyone, great. But I think in their professional life it would be far superior right now to focus on that. And I somewhat agree with you. It's like is the vision like...
So the vision for Copilot 365, that's the work one. But then does it introduce itself like, hey, I'm the serious Copilot 365, here to help you with your work day? And then you've got the other one, which is the sort of personal, creepy one that wants to come to your doctor's appointments.
Yeah, yeah. And I showed you that quote earlier. It's like this will be an ever-present companion who will come to your doctor's visits with you and then tell you what medicines to take. And it's like, will it? I mean, like is that really something I want? Like I just...
it just seems to me like there's a major, major disconnect between the marketing and sales people at these companies and the people working on the actual technology. Because clearly they're having no issues actually delivering the technology in a really scalable and robust way. The actual tech is good. It's just that they don't seem to have any ideas for how to actually use it for something worthwhile. Yeah.
Yeah, and so there was another thing they announced, which is Copilot, this time not the app, but the computer version, I guess the Windows 11 version.
And I assume this is going to come to the ChatGPT app as well. I think Microsoft in the boardroom got the rights to sort of announce this one first. But I'll play a quick preview. I warn you, a warning before this: the cringe is so, so bad. So, so, so, so bad. Photos... wait, I'll back up a little bit here. Oh my God, Copilot, check this out. My sister sent me this old stash of photos.
Oh, wow. That's a fashionable lady. Who is that? Oh, that's my grandma. Can you believe it? It's incredible, Matthew. Your grandma's got such a timeless style.
Any special memories with her you want to share? Looks like she's a pretty remarkable person. Okay. Imagine if it's like, hey, honestly, I don't give a damn who that is. And also this is just like every other photo. I've seen millions of photos just like this. This is totally irrelevant to me. I'm a stoic AI model with no emotions. If this grandma's dead, she's rolling over in her grave knowing that her grandson
is such a pathetic loser. But also, what a dumb example. It's like, you have eyes, you know who it is already. You can see the photo, you know it came from your sister, you know it's your grandma. You can read, so you can read the postcard. Why are you bothering this poor AI with all this trivial garbage? All the vision examples are like this, aren't they? Like, just so you know, you're looking at an image, and they perfectly describe the image. Yeah.
Yeah. It's like, I understand. Okay, you're driving a car and you want the car to recognize, hey, that's a stop sign, I don't want to accelerate into the car in front of me. Right, that's actually useful for a vision model. But being like, hey, here's a photo of my grandma, who do you think this is, man? It's like,
I just... I don't know. Clearly the vision here is that this thing in the future is going to help you understand what your grandma... no, I'm kidding. But I think what's really going on here, and we've seen this with screen sharing in Sim Theory, is it does actually help you when you're sharing a webpage with it, or if you've got form inputs, or you're in an application behind
an admin layer that you can't necessarily give the AI access to, to just browse and interact with. You can show it and get instant context on what you're seeing. That's the actual benefit. I mean, I think I gave this example once before, but I did it last year. I was in Austria. I was really cold because the central heating thing was down. And then I looked at a control panel with all this German that was written in like the 1960s or something. And I'm like,
I don't know what any of these buttons do. And if I break this machine, I'm going to die of cold. Right? So I put it into Sim Theory with the early version of Vision. I think it was GPT-4V or whatever it was. And it told me what every button does, like instantly.
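(A quick aside for anyone who wants to try that control-panel trick themselves: here's a minimal sketch of sending a photo to a vision-capable model and asking it to explain each control. It assumes the OpenAI Chat Completions API with image input; the gpt-4o model name and the prompt are just illustrative choices, not what Sim Theory actually runs under the hood.)

```python
import base64
from openai import OpenAI  # assumes the official openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load the photo of the control panel and base64-encode it as a data URL.
with open("control_panel.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model; illustrative choice
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This is a heating control panel labelled in German. "
                            "Explain what each button and dial does, in English.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

# Print the model's explanation of the controls.
print(response.choices[0].message.content)
```

(The same pattern works for screenshots of forms or admin panels, which is basically the screen-sharing use case described above.)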
I was like, this is amazing. If I ever end up in a Russian nuclear silo and I want to turn up the heating without nuking the world, this is going to be useful. You know, stuff like that, where it's foreign and different and you don't understand it. Or, like you say, gathering ongoing context of a task you're doing. Or, okay, you've got a complicated form you need to fill in and you can be like, just give me every answer for this form. I do it all the time for our own internal security training where I have to do the quizzes,
and you're just like, what are all the... Yeah. You know, like... Shush, shush. Why are you giving away our secret there? So I think obviously if you're vision impaired, this technology is amazing and will change your entire life, I'm sure. But...
I get the sense that where they're going with this is it'll also control your computer next. That's the next step. That's the main use for it. I really think that is what it's going to be. And the biggest one is being able to handle larger, higher-resolution images, because the issue right now is that unless you're sharing a particular window,
it really struggles with a full, like, 3,000-pixel desktop image. It's just not able to isolate the pieces right now. But I would suspect that future models are coming that can handle that. Yeah. This is why I don't really understand why the demo they're giving is in the Edge browser. It's seeing what you can see, and keep in mind this is a pretty cherry-picked demo. I don't think it could be this fast. I haven't seen anything
that's this fast yet. But you know what it might be? The new real-time vision API, where it is this fast and it's able to process voice and vision at the same time. So I think that's what we're seeing here. It's not out yet,
to my knowledge, but I get that impression because when you listen back to it, it's understanding voice and vision together. Yeah, and it was similar to, remember when the former CTO of OpenAI did the live demo where they're talking about what they're seeing and talking about it
as it happens? And that was an extremely impressive demo. And I would imagine that that's what this is, this holistic real-time thing where it's taking in the current moment, not a snapshot of it as it was a few seconds ago. Yeah, I mean, this stuff is insane. And I think this real-time API is what's going to enable it. I guess it'll be interesting to see, does Anthropic go out and respond to this with their own
sort of omni-style model? Like, is that what's going to come from them? Because surely, to stay competitive at this point, they're going to need these kinds of APIs around this sort of real-time interaction. I just question...
is it enough yet to deliver enough value for the cost? And is this just the future of how people will work, where people will try this out and be like, I have to have this buddy working with me all the time and responding to everything I'm seeing? Or is it just super annoying and cringe and you never use it again? If it speaks to me in that voice, I'm going to throw my computer out the window. There's no way I could stand more than 10 minutes working with that thing.
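(For listeners wondering what the real-time API being discussed here actually looks like to a developer, here's a rough sketch of opening a WebSocket session against OpenAI's Realtime API beta and asking for a streamed text reply. The endpoint, model name, and event names follow the beta as announced at Dev Day and may change; the not-yet-released vision variant speculated about above isn't shown, and audio is left out to keep the sketch short.)

```python
# A rough sketch of a Realtime API session over WebSockets (text in, text out).
# Assumes the `websockets` package and an OPENAI_API_KEY environment variable;
# the model name and event names follow the beta announcement and may change.
import asyncio
import json
import os

import websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Depending on your `websockets` version the keyword is `extra_headers`
    # or `additional_headers`.
    async with websockets.connect(REALTIME_URL, extra_headers=headers) as ws:
        # Add a user message to the conversation...
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Say hello in one sentence."}],
            },
        }))
        # ...then ask the model to generate a response.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"]},
        }))

        # The socket stays open and events stream back continuously; this is
        # what makes interruptions and proactive behaviour possible.
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.text.delta":
                print(event["delta"], end="", flush=True)
            elif event.get("type") == "response.done":
                break

asyncio.run(main())
```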
Yeah, so anyway, Microsoft Copilot update. See, we are jaded. We are jaded. Yeah.
An average group of jaded people talk about AI. All right. So, something super duper exciting now: the new Flux 1.1 Pro model is out. Now, normally I don't get that excited about image models, because when most of your audience is listening to a podcast, it's a bit of a problem to talk about image models. Yes, that's true.
If you want to try this out, it is available on SimTheory, SimTheory.ai. We've got the new Flux 1.1 Pro model available. Black Forest Labs also released an API. So for developers listening, if you want to use Flux in your app now, you can access it directly from them. And it's slightly cheaper from what I understand.
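(For developers who do want to wire Flux into an app, here's a minimal sketch of what an image generation call can look like. It assumes Together AI's OpenAI-style images endpoint and the FLUX.1 schnell model mentioned later in the episode; the endpoint path and parameter names are assumptions based on that style of API rather than the Black Forest Labs API itself, and nothing here reflects how Sim Theory is built.)

```python
# A minimal sketch: generate an image with a hosted FLUX model.
# Assumes a TOGETHER_API_KEY environment variable and the `requests` package;
# the endpoint, model ID, and parameters are assumptions based on Together AI's
# OpenAI-style images API and may differ from the live docs.
import base64
import os

import requests

resp = requests.post(
    "https://api.together.xyz/v1/images/generations",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "black-forest-labs/FLUX.1-schnell",
        "prompt": "A woman presenting at a conference in front of a large audience, "
                  "soft focus, cinematic lighting",
        "width": 1024,
        "height": 768,
        "steps": 4,               # schnell is tuned for very few denoising steps
        "n": 1,
        "response_format": "b64_json",
    },
    timeout=120,
)
resp.raise_for_status()

# Decode the first returned image and write it to disk.
image_b64 = resp.json()["data"][0]["b64_json"]
with open("flux_output.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
```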
What were your impressions of the new Flux 1.1? Well, so where I left off with the image models, I was like, Ideogram is where it's at. That was the one I was using, because it was making really legible text and just doing an excellent job. And I'd sort of forgotten about Flux, because remember, Flux briefly had its one week in the sun where I was like, this is the best thing ever. Then Ideogram comes out and I'm like, Flux is crap. And so, yeah,
yeah, but today I tried a whole bunch of examples. I'm sure you've got some of them to show. And Flux 1.1 was giving brilliant results. And then I tried the exact same prompts on Ideogram, and I don't know if it's gotten worse or what, but its examples seem like 1992 Microsoft clip art compared to the Flux results. So I don't know if...
something's changed and it's gotten worse or my memory of it is incorrect, but I was getting really, really nice results with Flux. And one of the things I think it's doing is making sort of really nice, realistic images. It's really good at things like soft focus and cinematic images. It's definitely better at text than it was. But yeah,
you know, it's not so strikingly better that I think it's some sort of revolution compared to what was there before. It's definitely an incremental update. I think with Ideogram, if you select the photorealistic style, you get far better results than just doing automatic; picking one of their sort of fine-tuned versions seems to do better. But
one thing I thought, and it was called out in the update, is a strong improvement with Flux in text. And remember, the criticism of Flux
with text was that it didn't look that realistic, like it always looked like it had been painted on later. So I created this image to semi-promote that it was available on Sim Theory, to put on the little update card we send out. And it's a woman presenting to an audience with the presentation up on the screen. And what's on the screen, I wrote the text, so if you're watching, you can see my prompt verbatim. I said:
A woman presenting at a conference in front of a large audience. We can see her presentation behind her. On the presentation it says Flux 1.1 now available on Sim Theory. It has two key points: is it better? And you'll need to try. And you can see the text is verbatim. It's in a soft focus. I mean, the design of the slide maybe isn't perfect, but the text is
so far improved. From all the examples I did, it's so much more real, like scarily realistic now, to the point of making deepfakes far, far easier. I think what's really needed with the image models is better image-to-image. And even if it's just the ability to sort of iterate on an image, because what seems to happen, for me at least, and I've seen this in feedback from others,
is that you get an image that's almost what you want, and then
you're like, oh, this image would be perfect if only it didn't have this in it, or it had this extra thing, or whatever. So the only way to do that right now, and that's my cheese test image, I'm pretty proud of that one, is, you know, either reuse the same prompt with slight modifications, which often gets strikingly different images. It very rarely gets the same image with just a modification.
Or you can try their image-to-image things, which we do have in the case of Ideogram. But the image-to-image I find doesn't really stay that true to the original image, or if it does, it doesn't follow the prompt. I haven't seen one yet that is able to just incrementally change an image. Probably the only one that's even been close to that was the PhotoMaker or the ID one we used earlier,
a while ago, like where you could use those try-it-on models, where, you know, put this dress on former prime minister Kevin Rudd or whatever. And yeah, that bit seems to be lagging behind the actual image generation side. Don't you think, though, and you can imagine going back to that Microsoft vision example, when the real-time API from
OpenAI can output images, because surely it can do images, like they talked about that with GPT-4o, the omni model. I think that's where you'll get closer to this vision where you're talking to it going, no, change this to this, and it's streaming the image in. To me, that's got to get us closer to the sort of on-the-fly UI and also just better image tools in general, like that insanely fast iteration speed. Yeah.
Yeah, I think that's where it's got to get to because...
you know, every time one of these image models is released, I waste hours trying to make the perfect image because I'm just doing variation after variation. It's supposedly faster, but I don't know about you, I didn't notice much speed increase at all. No, I mean, you do have that delay on Sim Theory because we need to upload it to a CDN so you can host it and download it and actually use and work with that file in different ways. So there is that slight delay we probably add, but...
it's pretty marginal. So yeah, it's not so slow that I'm going crazy. And they do have their Schnell model as well, which is presumably faster. And one of the providers we use is actually offering the Schnell model for free. So that's probably worth people looking into. I'll find out the name of it
and post it in the comments. Oh, here, it's Together AI. We're not using Together AI for this, but it is available. So they have FLUX.1 Schnell available for free for the next three months. Wow. So for people building applications where you want image generation, it's probably worth checking that out, because that's a pretty damn good deal. So let's move on, because we have a few more things we want to talk about. So...
OpenAI also released, only a couple of hours before we started recording, this new feature they call Canvas. Now, it is very reminiscent of Claude Artifacts, and, well, it's a little bit different though. They have this new model, it's experimental, and it says GPT-4o with canvas. This model picker is getting confusing as hell. Collaborate on writing and code. And they've
clearly started with the two most used use cases of AI right now: writing stuff, producing documents, and working with code. And so now in ChatGPT, you can say, write an essay on earthworms. I don't know why I picked this as the example. And it's similar to Claude Artifacts, which I'll also bring up on the screen to compare here.
It opens it up in a window alongside the chat and you've got the document created. The difference between Claude Artifacts and the new ChatGPT one, although I suspect Claude Artifacts will feature-match pretty quickly, is that you can actually edit. So this is an editor. You can edit in here and then work with the AI. You can select stuff, ask ChatGPT to improve it, make it funnier, whatever.
They also have this truly bizarre control in the corner, and I'll explain why I think this is so bizarre in a minute, where you can click a bunch of buttons like adjust the length and then drag the slider towards longest or shortest. Now, it's kind of cool, I think that's a bit of fun, and you've got buttons like add final polish. I don't even know what it does. If I click it, you'll see here it just goes in and
adds some final polish to the line that I had selected. I'm not entirely sure what it does. I did things like ask it to create an image and add it to the document, and that sends it haywire. So I think it's very early days, and they say it's a beta. I'll say this: so you've got Claude Artifacts, and in Claude Artifacts they've trained it on these, I think they call them ant tags. They're special tags that the model outputs so the client knows when to fire up Artifacts.
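(To make that concrete: an artifacts-style feature needs some client-side plumbing that watches the model's output for special tags and routes whatever is inside them to a side panel instead of the chat transcript. The sketch below uses a made-up <artifact> tag; the real tags Anthropic and OpenAI trained their models on aren't public in detail, so treat this purely as an illustration. A real client would apply the same idea incrementally to the token stream rather than to a completed response.)

```python
import re
from typing import Tuple

# Hypothetical tag pair; the real products use their own trained-in markup.
ARTIFACT_OPEN = "<artifact>"
ARTIFACT_CLOSE = "</artifact>"

def split_artifacts(model_output: str) -> Tuple[str, list]:
    """Split a model response into chat text and a list of artifact bodies."""
    artifacts = re.findall(
        re.escape(ARTIFACT_OPEN) + r"(.*?)" + re.escape(ARTIFACT_CLOSE),
        model_output,
        flags=re.DOTALL,
    )
    chat_text = re.sub(
        re.escape(ARTIFACT_OPEN) + r".*?" + re.escape(ARTIFACT_CLOSE),
        "[opened in side panel]",
        model_output,
        flags=re.DOTALL,
    )
    return chat_text, artifacts

# Example: the model narrates in plain text and wraps the document in tags.
response = (
    "Sure, here's a short essay on earthworms. "
    "<artifact>Earthworms aerate the soil...</artifact> Let me know what to change."
)
chat, panels = split_artifacts(response)
print(chat)       # goes to the chat transcript
print(panels[0])  # goes to the editor / side panel
```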
So what OpenAI has had to do with Canvas is go in and do the same thing, like train their model to respond to their version of artifacts. It feels to me like there was so much pickup of Claude with their Artifacts release that this product was Altman going, get on it, you know, we have to have feature parity here. It just...
it feels, when you use it, a little bit rough around the edges and a tad rushed compared to Artifacts. Artifacts feels a bit more well thought through.
I would say they've added to it. Obviously the editing thing is great. We're obviously strong believers in this. We've had this doc mode we've been sitting on for a while now in Sim Theory. You can see here where you can do things like create an image with Flux or Ideogram and drag it into a document, and you've got full document editing where you can work with the
AI, which will be out hopefully next week. So I think these things are pretty normal now in LLMs. They're just going to become standard, giving you a better interface for when you're working on a file or code or whatever it is. And I think this is a really cool iteration. I love how you can sort of move this slider around and change the length.
But I guess my commentary as a commentator on this is it just feels very reactionary, like trying to catch up with Claude and one-up them and be like, oh, we have this too, and we're better, and we're still number one. And it's just another example of half-baked consumer products by a company that I just don't think is that great at application development. The models powering it are amazing,
but really they're derivative ideas executed poorly. And I just feel like people are really taking this seriously. When you talk to people in all sorts of industries, everybody's using LLMs for their jobs, everybody is, in different ways, right? And right now most people's experience is basically:
type some stuff into the chat, copy-paste it into your output, maybe paste some back and work with it. And that's a really...
amazing thing, that the models can help you, but it's also a very inconvenient way to work, in the sense that, you know, the formatting doesn't come across correctly. Like you say, I want to be able to highlight a piece of a document and be like, just fix this bit, something we will have in Sim Theory because we've got it going in our beta version. And these
kinds of things, to me, are something that OpenAI should have just blown everyone out of the water with. I mean, they've got billions of dollars. Anyone would work there. They could buy...
Just to frame what you're saying, though, I think you're saying, why are they playing catch-up with Artifacts when they could have done this previously? And why haven't they? I'm not saying previously, I'm saying, why haven't they, when they came out with their response, made it so good that nobody can even come close to competing with it? Yeah. So just to explain that point: I spun it up, I said, make a simple asteroid game in HTML and JavaScript, make it really, really good. It's my terrible prompt.
And it outputs the code, and, I'd say, let's go to my example here in Claude. So the same example, just comparing them side by side. So I have my asteroid game, and in Claude, obviously, it gives me a live preview of that game and I can go over into the artifact and play the game. And I think everyone that's listened to the show has seen my
you know, crazy version of iterating on this asteroid game. It also has the code of the game over here, the difference being that you can't actually edit the code or change it and then preview it. Whereas over in ChatGPT, you can edit the code, which...
It's a little buggy there, but you can edit the code. Why you would want to, I'm not sure. Maybe making some quick changes just to demonstrate to the AI where to look. I think that could be kind of interesting. Also, you can select, say, this style piece here and then select this and say,
you know, add a comment, or fix bugs, or port to a language. This control, again, is just mind-blowingly stupid. So, port to a language: keep current, or you have to drag it and just discover the different languages, like C++. Like, what on earth? Who the hell thought of this? Make it in Rust, rewrite it in Rust, write it as a boot kernel. Yeah.
I'm honestly blown away by how stupid it is. Like, it doesn't... You know, I click the add logs button. I like this animation where it scrolls through and kind of updates each line, and now it's logging a bunch of stuff. That's cool, I can see this being useful. But as someone pointed out in our community, why wouldn't you just go and use Cursor? I don't get it. Anyway, but...
Yeah, I think my response to this, or my feeling about this, is that it's just a reaction to what Claude was doing. And it shows that this idea of moving beyond the sort of ChatGPT-style interface and working with the AI in a workspace, our vision of this AI workspace, is likely the correct one moving forward. And yeah,
You know, this is the thing, right? Like the artifacts paradigm is pretty game changing to people. It was like, you can actually see what these models are capable of and go back and forth with it and see it iterating, right?
And I think also it's just grounded in reality. It's how people are using the tools, and it takes away a bit of that copy-paste mentality I was talking about, and the, is it really worth going to it, because I'm going to have to get all the context in there? If the context is maintained better and you can interact with it in a better way, then you're going to use it more often and more frequently. Yeah. It's these two diverging paths though, isn't it, right now? It's like, you know,
do I go to my Google Docs and then fire up my favorite model, Gemini, and work with Gemini on the doc? Or is the future sort of operating system, an AI OS, this? Like, maybe these AI workspaces are just
the future of how we use computing. Maybe all the apps are built in this. For example, developers in the future might build different interface components, like artifacts, that can be introduced into these workspaces, or ChatGPT, or whatever your operating system is. And that's just the interfaces of the future. These are the future apps, you interface with them there.
I'm sure this is what they want, everything directly from their app. But then you can see Google are going to be looking on at this, if this is really some paradigm shift in terms of user interface. And they're going to say, well, we've got Google Docs, we've got Google Drive, Google Drive's not that great. And, you know, we can go and just put a whole Google Doc window in here that you can chat with in Gemini. So,
I'm not sure how this plays out. I think it's really cool that they've added this, and I think, you know, it'll be interesting to see how it progresses. But is it necessarily game-changing, like all the AI influencers on Twitter are trying to tell us? I'm not sure. It's just Claude Artifacts with editing. Yeah. I think, I mean, look, part of it is awareness, right? They have the reach. Everybody knows OpenAI's name. Not as many know Claude and Anthropic.
So maybe it's just people discovering, hey, this is possible, for the first time, and this is the only one they know. So I think maybe that's why the influencers are getting all lathered up about it. But I think for us, it's like, I expect more from these guys. I just think, with the level of influence they have, with the level of funding they have, you would just think they would
at some point come out with an application that just knocks people's socks off. But it just doesn't seem like they have the ability to do that. What's the bet, and I want to put this prediction down so we can replay it when it happens and everyone's like, he was right. Yeah, or everyone's wrong and we delete it. They're going to acquire Dropbox because, and I know that'll be controversial,
but underlying this, remember Dropbox had that Paper thing, like a word processor? They've got file management. They've got all the infrastructure built out. If I was them, I'd just go and acquire Dropbox and integrate it into ChatGPT. Is it a public company? Let's buy shares first. I don't think it is. I don't think it ever made it. I'm pretty sure it never kind of made it. People realized that it was like a wrapper app on a hard drive. With massive security holes.
All right, for our longest podcast ever, we are now going to talk about doxxing. So, Chris, speaking of doxxing, you know my glasses, right? Yeah. Let's see if you can hear it. I know they make great music. You demoed them to me one day. Yeah, I'm going to wear them for the rest of the show now. So, these Meta Ray-Bans Plus Plus Plus.
Oh, did you hear that? No, I couldn't. Okay. It said battery 100%. I love the Meta voice. So let's leave the Meta Ray-Bans on. And little do you know, I could be doxxing you right now. You could be being doxxed and you don't even know that you're being doxxed.
So let's talk about this. So in 404 Media, there was an article about this: someone put facial recognition tech onto Meta's smart glasses to instantly dox strangers. Now, if you don't know what doxxing is, here's an introduction.
We built glasses that let you identify anybody on the street. The information a tool collects from just a photo of your face is staggering. Uh, ignore the laughing, let's get through it. Is doxxing making people laugh like maniacs? Oh, hi, ma'am. Wait, are you Betsy? Oh, okay, I think I met you through, like, the Cambridge Community Foundation, right? Yeah, yeah, it's great to meet you. I'm Kane.
So I'll explain what he's doing for those listening. He's down in, I believe, the New York subway system, and he's got his Meta Ray-Bans, these glasses, on, and he's live streaming. So you can live stream to... uh, you can live... oh God, hang on.
We don't want to do that. So yeah, you can live stream video from these glasses, right? And you can do it to Instagram or, I think, Messenger. So what they're doing is they're capturing the live stream from the glasses, so they're getting a video feed of what it's seeing in real time. Yeah. And,
most people, and I can say this because I wear them all the time, have no idea there are cameras on them. They're so hidden. Even the light that shows that it's recording, most people don't notice. In fact, I've never had anyone say to me, are you recording, or do they have cameras? Not once. That's because you're hiding in the bushes, right? Yeah, yeah, of course. And so what they do is they've built a system on the back end which takes the face, searches the internet,
and correlates as quickly as possible a bunch of information on that person. And then they walk up to the person and they're like, oh, hey, don't I know you from, like, you know, something studies? I must say I did like the video, because he's making their day. It's like, you know, didn't you found the program to help underprivileged people? I just wanted to say I'm a huge fan, and stuff. It's like, that's really nice. You made their day. Yeah. People really love it. Man, I'm going to mess you up.
But you, I mean, you talked about this before, right? This idea that AI could help you remember, like your AI companion. Yeah, well, I did. Like, you know, for example, this camping trip I'm going on this weekend: imagine every single person there, I look at them, I remember their name, I remember their kids' names, I remember previous discussions we've had, and it's all just piping little facts or hints into my ear. You know, that would be absolutely amazing.
Like if you could have something like that. And sad. Yeah, sad too. Like I should just remember. But yeah, so the technology can literally find, and I'll link to the video in the show notes, but they can find people's names, phone numbers, and home addresses. Now you could argue that you could just carry a camera around.
And any photo that you take, you could do the exact same thing with. So, like, any camera could do this. Yeah, but not as quickly and not automatically. No, and it's like it's feeding it back into the glasses. I don't know if it's actually reading it back with a voice, but it's telling them in their ear. It'll be interesting to see if they ban them in places like casinos, if they haven't already, because you think about a poker game, you could have it running those card, what are they called, solvers,
on the game live. You could look around, it could see the chip stacks and count them, it could see the cards that are open and your hole cards, and be giving you real-time assistance. Same with chess. That'd be a great way for Hans Niemann to continue cheating in chess.
And what else? There are so many scenarios in life where it would be great. I mean, even real-life social engineering. I mean, that's what they're demonstrating really, social engineering in real life. Like, you could walk up to someone, you know, from all the security training we do, at the door where they swipe their badges, and be like, oh hey, I'm Jeff's friend, because they've already kind of figured it all out through these, these...
Yeah, like the other day I basically got held against my will at Amazon's offices because I didn't sign in correctly. I bypassed their security, so they wouldn't let me out. He's like, where's your badge? And I'm like, I don't have one. He's like, well, you need a badge to get out. And I'm like, am I a prisoner?
Yeah, at Amazon. But I'm, like, the same guy. You should have doxxed him. If you had doxxed him, it wouldn't have been a problem. Yeah. I'd be like, Phil, Phil, mate, your wife Marjorie, she's going to be mad if she knew you locked me in here. But, I mean, it's not that crazy far-fetched to think that with certain people,
where you've got like a gatekeeper in an organization in the physical world, knowing intimate information about them, the organization you're dealing with, the boss, the CEO's name, all of those things have been proven to get you through. There was that really good
TV show set in New York where the guy was like a social engineer and he would get himself into all sorts of situations. I know it's fiction, but these things really do happen. There are full red-teaming exercises that companies will pay for, to show how weak their company's security is based solely on social engineering. I kind of find it funny, though. People don't think the information they put out there about themselves on the internet is that dangerous.
And here they are, just walking around, being able to find addresses, phone numbers, relationships. And then you think, if this is what some teens can do by hacking together off-the-shelf stuff, the government has clearly had this for far, far longer. It's that realization of how powerful this kind of stuff is. Yeah, absolutely. You've got to be careful. I was saying that a while ago: if you ever wanted to streak at a sporting event, like get your clothes off and run across the field,
the time to do it was 10 years ago, because if you do it now, they're going to have your face on a database and they'll actually ban you, because they'll know, they're like, this is this person, you can't let them in. Whereas back then, how are you going to enforce a ban like that? Oh yeah, there's no way. All right. We have gone way, way over time. I have one lol to share before we go. And yeah,
the lol's really funny, but before we do, wrapping up the show: OpenAI Dev Day, Copilot, you can now figure out your grandma's photos on your desktop, Flux 1.1, ChatGPT Canvas, Meta Ray-Ban doxxing. Big week. Yeah, it's exciting, there's a lot of cool stuff going on at the moment. Okay, I was hoping for something more, but that'll do. All right, now the lol. Moving on. I just spent hours talking about it all. What more do you want?
I really don't know. Okay, so meantime, this post on X is the harsh reality of LLMs. And it's a soldier with an AK saying, write the Python scripts, exclamation mark, exclamation mark. And then the person with a beard who has a Reddit shirt on, get it? Because LLMs are trained on Reddit data.
There seems to be a mistake. I planned on creating evocative poetry and beautiful images representing the harsh realities of LLMs. And then someone with an AK screaming at it, write the Python scripts. Hilarious. Absolutely hilarious. Meme of the week.
Who would have seen the amazing rise and rise of Python after all these years? No one saw that coming. All right. Thanks for listening to our show. If you like the show, please consider leaving a review or commenting. We read them all. If you want to support the show and you want to have access to all the things we talk about on the show, sign up to Sim Theory, your AI workspace. Simtheory.ai is the URL. We do the ad at the two-hour mark.
Only real fans hear the ad. Great marketing. I love it. It's excellent. And also, in the show notes below, please call the Corey Hotline and enjoy it on behalf of us. Have fun. If you have any fun interactions or conversations, please share them with us. We would love to hear from you. All right, that'll do us for this week. We will see you again next week. Hi, you've reached the Corey Hotline. $4.95 a minute. Here are some words that rhyme with Corey: sorry, allegory... like... sorry.
Hey, Corey, what did you think of the show? Hey there, super fan. This is Corey. Wow, it's so awesome you called.
Did I hear you ask about a show? You know what's a real show. My new hairstyle is totally radical. I was thinking about telling you all about it, but first let me remind you that each minute... No, I wanted to talk about our show. Can't we talk about my show? Oh, wow, you're so excited to talk. But hey, let's keep things Corey-centric, okay? Speaking of centric, did you know I'm the center of attention wherever I go? Crazy! Now, I've got something super cool to share with you. I want to hear about my new sneakers. They're absolutely...
out of this world. But before I tell you, let's play a quick game. I'm thinking of a number between one and one million. Start guessing. A million? Whoa, that's a big number. Just like my fan base. He just can't get over himself. It never gets old.