Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust

2024/10/11

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Shownotes Transcript

We are in 🗽 NYC this Monday! Join *the AI Eng NYC meetup)*, bring demos and vibes!

It is a bit of a meme that the first thing developer tooling founders think to build in AI is all the non-AI operational stuff outside the AI. There are well over 60 funded LLM Ops startups) all with hoping to solve the new observability, cost tracking, security, and reliability problems that come with putting LLMs in production, not to mention new LLM oriented products from incumbent, established ops/o11y players like Datadog and Weights & Biases.

2 years in to the current hype cycle, the early winners have tended to be people with practical/research AI backgrounds rather than MLOps heavyweights or SWE tourists:

**LangSmith: **We covered how Harrison Chase) worked on AI at Robust Intelligence and Kensho, the alma maters of many great AI founders
HumanLoop: We covered how Raza Habib) worked at Google AI during his PhD
BrainTrust: Today’s guest Ankur Goyal) founded Impira pre-Transformers and was acquihired to run Figma AI before realizing how to solve the Ops problem.

There have been many VC think pieces) and market maps) describing what people thought were the essential pieces of the AI Engineering stack, but what was true for 2022-2023 has aged poorly. The basic insight that Ankur had is the same thesis that Hamel Husain is pushing in his World’s Fair talk) and podcast with Raza and swyx):

Evals are the centerpiece of systematic AI Engineering.

REALLY believing in this is harder than it looks with the benefit of hindsight. It’s not like people didn’t know evals were important. Basically every LLM Ops feature list has them. It’s an obvious next step AFTER managing your prompts and logging your LLM calls. In fact, up til we met Braintrust, we were working on an expanded version of the Impossible Triangle Theory) of the LLM Ops War) that we first articulated in the Humanloop writeup):

The single biggest criticism of the Rise of the AI Engineer piece) is that we neglected to split out the role of product evals (as opposed to model evals) in the now infamous “API line” chart:

With hindsight, we were very focused on the differentiating 0 to 1 phase that AI Engineers can bring to an existing team of ML engineers. As swyx says on the Day 2 keynote of AI Engineer), 2024 added a whole new set of concerns as AI Engineering grew up:

A closer examination of Hamel’s product-oriented virtuous cycle and this infra-oriented SDLC would have eventually revealed that Evals, even more than logging, was the first point where teams start to get really serious about shipping to production, and therefore a great place to make an entry into the marketplace, which is exactly what Braintrust did.

Also notice what’s NOT on this chart: shifting to shadow open source models, and finetuning them… per Ankur, Fine-tuning is not a viable standalone product:

“The thing I would say is not debatable is whether or not fine-tuning is a business outcome or not. So let's think about the other components of your triangle. Ops/observability, that is a business… Frameworks, evals, databases [are a business, but] Fine-tuning is a very compelling method that achieves an outcome. The outcome is not fine-tuning, it is can I automatically optimize my use case to perform better if I throw data at the problem? And fine-tuning is one of multiple ways to achieve that.”

OpenAI vs Open AI Market Share

We last speculated about the market shifts in the End of OpenAI Hegemony) and the Winds of AI Winter), and Ankur’s perspective is super valuable given his customer list:

Some surprises based on what he is seeing:

Prior to Claude 3, OpenAI had near 100% market share. This tracks with what Harrison told us last year).
Claude 3.5 Sonnet and also notably Haiku have made serious dents
**Open source model adoption is Contra to Eugene Cheah’s ideal marketing pitch), virtually none of Braintrust’s customers are really finetuning open source models for cost, control, or privacy. This is partially caused by…
Open source model hosts, aka Inference providers, aren’t as mature as OpenAI’s API platform. Kudos to Michelle’s team) as if they needed any more praise!
**Adoption of Big Lab models via their Big Cloud Partners, aka Claude through AWS, or OpenAI through Azure, is low. **Surprising! It seems that there are issues with accessing the latest models via the Cloud partners.

swyx [01:36:51]: What % of your workload is open source?

*Ankur Goyal [01:36:55]: Because of how we're deployed, I don't have like an exact number for you. *Among customers running in production, it's less than 5%.

Full Video Episode

Check out the Braintrust demo on YouTube)! (and like and subscribe etc)

Show Notes

Ankur’s companies
MemSQL/SingleStore) → now Nikita Shamgunov) of Neon)
Impira)
Braintrust)
Papers mentioned
AlexNet)
BERT Paper)
Layout LM Paper)
GPT-3 Paper)
Voyager Paper)
AI Engineer World's Fair)
Ankur and Olmo’s talk at AIEWF)
Together.ai)
Fireworks)
People
Nikita Shamgunov)
Alana Goyal)
Elad Gil)
Clem Delangue)
Guillermo Rauch)
Prior episodes
HumanLoop) episode
Michelle Pokrass episode)
Dylan Patel episode)

Timestamps

[00:00:00] Introduction and background on Ankur career
[00:00:49] SingleStore and HTAP databases
[00:08:19] Founding Impira and lessons learned
[00:13:33] Unstructured vs Structured Data
[00:25:41] Overview of Braintrust and its features
[00:40:42] Industry observations and trends in AI tooling
[00:58:37] Workload types and AI use cases in production
[01:06:37] World's Fair AI conference discussion
[01:11:09] AI infrastructure market landscape
[01:24:59] OpenAI vs Anthropic vs other model providers
[01:38:11] GPU inference market discussion
[01:45:39] Hypothetical AI projects outside of Braintrust
[01:50:25] Potentially joining OpenAI
[01:52:37] Insights on effective networking and relationships in tech

Transcript

swyx [00:00:00]: Ankur Goyal, welcome to Latent Space.

Ankur Goyal [00:00:06]: Thanks for having me.

swyx [00:00:07]: Thanks for coming all the way over to our studio.

Ankur Goyal [00:00:10]: It was a long hike.

swyx [00:00:11]: A long trek. Yeah. You got T-boned by traffic. Yeah. You were the first VP of Eng at Signal Store. Yeah. Then you started Impira. You ran it for six years, got acquired into Figma, where you were at for eight months, and you just celebrated your one-year anniversary at Braintrust. I did, yeah. What a journey. I kind of want to go through each in turn because I have a personal relationship with Signal Store just because I have been a follower and fan of databases for a while. HTAP is always a dream of every database guy. It's still the dream. When HTAP, and Signal Store I think is the leading HTAP. Yeah. What's that journey like? And then maybe we'll cover the rest later.

Ankur Goyal [00:00:49]: Sounds good.

swyx [00:00:50]: We can start Signal Store first. Yeah, yeah.

Ankur Goyal [00:00:52]: In college, as a first-generation Indian kid, I basically had two options. I had already told my parents I wasn't going to be a doctor. They're both doctors, so only two options left. Do a PhD or work at a big company. After my sophomore year, I worked at Microsoft, and it just wasn't for me. I realized that the work I was doing was impactful. I was working on Bing and the distributed compute infrastructure at Bing, which is actually now part of Azure. There were hundreds of engineers using the infrastructure that we were working on, but the level of intensity was too low. It felt like you got work-life balance and impact, but very little creativity, very little room to do interesting things. I was like, okay, let me cross that off the list. The only option left is to do research. I did research the next summer, and I realized, again, no one's working that hard. Maybe the times have changed, but at that point, there's a lot of creativity. You're just bouncing around fun ideas and working on stuff and really great work-life balance, but no one would actually use the stuff that we built, and that was not super energizing for me. I had this existential crisis, and I moved out to San Francisco because I had a friend who was here and crashed on his couch and was talking to him and just very, very confused. He said, you should talk to a recruiter, which felt like really weird advice. I'm not even sure I would give that advice to someone nowadays, but I met this really great guy named John, and he introduced me to like 30 different companies. I realized that there's actually a lot of interesting stuff happening in startups, and maybe I could find this kind of company that let me be very creative and work really hard and have a lot of impact, and I don't give a s**t about work-life balance. I talked to all these companies, and I remember I met MemSQL when it was three people and interviewed, and I thought I just totally failed the interview, but I had never had so much fun in my life. I remember I was at 10th and Harrison, and I stood at the bus station, and I called my parents and said, I'm sorry, I'm dropping out of school. I thought I wouldn't get the offer, but I just realized that if there's something like this company, then this is where I need to be. Luckily, things worked out, and I got an offer, and I joined as employee number two, and I worked there for almost six years, and it was an incredible experience. Learned a lot about systems, got to work with amazing customers. There are a lot of things that I took for granted that I later learned at Impira that I had taken for granted, and the most exciting thing is I got to run the engineering team, which was a great opportunity to learn about tech on a larger stage, recruit a lot of great people, and I think, for me personally, set me up to do a lot of interesting things after.

swyx [00:03:41]: Yeah, there's so many ways I can take that. The most curious, I think, for general audiences is, is the dream real of SingleStore? Should, obviously, more people be using it? I think there's a lot of marketing from SingleStore that makes sense, but there's a lot of doubt in people's minds. What do you think you've seen that is the most convincing as to when is it suitable for people to adopt SingleStore and when is it not?

Ankur Goyal [00:04:06]: Bear in mind that I'm now eight years removed from SingleStore, so they've done a lot of stuff since I left, but maybe the meta thing, I would say, or the meta learning for me is that, even if you build the most sophisticated or advanced technology in a particular space, it doesn't mean that it's something that everyone can use. I think one of the trade-offs with SingleStore, specifically, is that you have to be willing to invest in hardware and software cost that achieves the dream. At least, when we were doing it, it was way cheaper than Oracle Exadata or SAP HANA, which were kind of the prevailing alternatives. So, not ultra-expensive, but SingleStore is not the kind of thing that, when you're building a weekend project that will scale to millions, you would just spin up SingleStore and start using. I think it's just expensive. It's packaged in a way that is expensive because the size of the market and the type of customer that's able to drive value almost requires the price to work that way. You can actually see Nikita almost overcompensating for it now with Neon and attacking the market from a different angle.

swyx [00:05:11]: This is Nikita Shamgunov, the actual original founder. Yes. Yeah, yeah, yeah.

Ankur Goyal [00:05:15]: So, now he's doing the opposite. He's built the world's best free tier and is building hyper-inexpensive Postgres. But because the number of people that can use SingleStore is smaller than the number of people that can use free Postgres, yet the amount that they're willing to pay for that use case is higher, SingleStore is packaged in a way that just makes it harder to use. I know I'm not directly answering your question, but for me, that was one of those sort of utopian things. It's the technology analog to, if two people love each other, why can't they be together? SingleStore, in many ways, is the best database technology, and it's the best in a number of ways. But it's just really hard to use. I think Snowflake is going through that right now as well. As someone who works in observability, I dearly miss the variant type that I used to use in Snowflake. It is, without any question, at least in my experience, the best implementation of semi-structured data and sort of solves the problem of storing it very, very efficiently and querying it efficiently, almost as efficiently as if you specified the schema exactly, but giving you total flexibility. So it's just a marvel of engineering, but it's packaged behind Snowflake, which means that the minimum query time is quite high. I have to have a Snowflake enterprise license, right? I can't deploy it on a laptop, I can't deploy it in a customer's premises, or whatever. So you're sort of constrained to the packaging by which one can interface with Snowflake in the first place. And I think every observability product in some sort of platonic ideal would be built on top of Snowflake's variant implementation and have better performance, it would be cheaper, the customer experience would be better. But alas, it's just not economically feasible right now for that to be the case.

swyx [00:07:03]: Do you buy what Honeycomb says about needing to build their own super wide column store?

Ankur Goyal [00:07:09]: I do, given that they can't use Snowflake. If the variant type were exposed in a way that allowed more people to use it, and by the way, I'm just sort of zeroing in on Snowflake in this case. Redshift has something called Super, which is fairly similar. Clickhouse is also working on something similar, and that might actually be the thing that lets more people use it. DuckDB does not. It has a struct type, which is dynamically constructed, but it has all the downsides of traditional structured data types. For example, if you infer a bunch of rows with the struct type, and then you present the n plus first row, and it doesn't have the same schema as the first n rows, then you need to change the schema for all the preceding rows, which is the main problem that the variant type solves. It's possible that on the extreme end, there's something specific to what Honeycomb does that wouldn't directly map to the variant type. And I don't know enough about Honeycomb, and I think they're a fantastic company, so I don't mean to pick on them or anything, but I would just imagine that if one were starting the next Honeycomb, and the variant type were available in a way that they could consume, it might accelerate them dramatically or even be the terminal solution.

swyx [00:08:19]: I think being so early in single store also taught you, among all these engineering lessons, you also learned a lot of business lessons that you took with you into Impira. And Impira, that was your first, maybe, I don't know if it's your exact first experience, but your first AI company.

Ankur Goyal [00:08:35]: Yeah, it was. Tell the story. There's a bunch of things I learned and a bunch of things I didn't learn. The idea behind Impira originally was I saw when AlexNet came out that you were suddenly able to do things with data that you could never do before. And I think I was way too early into this observation. When I started Impira, the idea was what if we make using unstructured data as easy as it is to use structured data? And maybe ML models are the glue that enables that. And I think deep learning presented the opportunity to do that because you could just kind of throw data at the problem. Now in practice, it turns out that pre-LLMs, I think the models were not powerful enough. And more importantly, people didn't have the ability to capture enough data to make them work well enough for a lot of use cases. So it was tough. However, that was the original idea. And I think some of the things I learned were how to work with really great companies. We worked with a number of top financial services companies. We worked with public enterprises. And there's a lot of nuance and sophistication that goes into making that successful. I'll tell you the things I didn't learn though, which I learned the hard way. So one of them is when I was the VP of engineering, I would go into sales meetings and the customer would be super excited to talk to me. And I was like, oh my god, I must be the best salesperson ever. And after I finished the meeting, the sales people would just be like, yeah, okay, you know what, it looks like the technical POC succeeded and we're going to deal with some stuff. It might take some time, but they'll probably be a customer. And then I didn't do anything. And a few weeks later or a few months later, they were a customer.

swyx [00:10:09]: Money shows up. Exactly. And like,

Ankur Goyal [00:10:11]: oh my god, I must have the Midas touch, right? I go into the meeting. I've been that guy. I sort of speak a little bit and they become a customer. I had no idea how hard it was to get people to take meetings with you in the first place. And then once you actually sort of figure that out, the actual mechanics of closing customers at scale, dealing with revenue retention, all this other stuff, it's so freaking hard. I learned a lot about that. I thought it was just an invaluable experience at Empira to sort of experience

swyx [00:10:41]: that myself firsthand. Did you have a main salesperson or a sales advisor?

Ankur Goyal [00:10:45]: Yes, a few different things. One, I lucked into, it turns out, my wife, Alana, who I started dating right as I was starting Empira. Her father, who is just super close now, is a seasoned, very, very seasoned and successful sales leader. So he's currently the president of CloudFlare. At the time, he was the president of Palo Alto Networks, and he joined just right before the IPO and was managing a few billion dollars of revenue at the time. And so I would say I learned a lot from him. I also hired someone named Jason, who I worked with at MemSQL, and he's just an exceptional account executive. So he closed probably like 90 or 95% of our business over our years at Empira. And he's just exceptionally good. I think one of the really fun lessons, we were trying to close a deal with Stitch Fix at Empira early on. It was right around my birthday, and so I was hanging out with my father-in-law and talking to him about it. And he was like, look, you're super smart. Empira sounds really exciting. Everything you're talking about, a mediocre account executive can just do and do much better than what you're saying. If you're dealing with these kinds of problems, you should just find someone who can do this a lot better than you can. And that was one of those, again, very humbling things that you sort of...

swyx [00:11:57]: Like he's telling you to delegate? I think in this case, he's actually saying,

Ankur Goyal [00:12:01]: yeah, you're making a bunch of rookie errors in trying to close a contract that any mediocre or better salesperson will be able to do for you or in partnership with you. That was really interesting to learn. But the biggest thing that I learned, which was, I'd say, very humbling, is that at MemSQL, I worked with customers that were very technical. And I always got along with the customers. I always found myself motivated when they complained about something to solve the problems. And then most importantly, when they complained about something, I could relate to it personally. At Empira, I took kind of the popular advice, which is that developers are in a terrible market. So we sold to line of business. And there are a number of benefits to that. We were able to sell six- or seven-figure deals much more easily than we could at SingleStore or now we can at Braintrust. However, I learned firsthand that if you don't have a very deep, intuitive understanding of your customer, everything becomes harder. You need to throw product managers at the problem. Your own ability to see your customers is much weaker. And depending on who you are, it might actually be very difficult. And for me, it was so difficult that I think it made it challenging for us to one, stay focused on a particular segment, and then two, out-compete or do better than people that maybe had inferior technology that we did, but really deeply understood what the customer needed. I would say if you just asked me what was the main humbling lesson that I faced

swyx [00:13:33]: with it, it was that. I have a question on this market because I think after Impera, there's a cohort of new Imperas coming out. Datalab, I don't

Ankur Goyal [00:13:41]: know if you saw that. I get a phone call about one every week.

swyx [00:13:45]: What have you learned about this unstructured data to structured data market? Everyone thinks now you can just throw an LLM at it. Obviously, it's going to be better than what you had.

Ankur Goyal [00:13:53]: I think the fundamental challenge is not a technology problem. It is the fact that if you're a business, let's say you're the CEO of a company that is in the insurance space and you have a number of inefficient processes that would benefit from unstructured to structured data. You have the opportunity to create a new consumer user experience that totally circumvents the unstructured data and is a much better user experience for the end customer. Maybe it's an iPhone app that does the insurance underwriting survey by having a phone conversation with the user and filling out the form or something instead. The second option potentially unlocked a totally new segment of users and maybe cost you like 10 times as much money. The first segment is this pain. It affects your cogs. It's annoying. There's a solution that works which is throwing people at the problem but it could be a lot better. Which one are you going to prioritize? I think as a technologist, maybe this is the third lesson, you tend to think that if a problem is technically solvable and you can justify the ROI or whatever, then it's worth solving. You also tend to not think about how things are outside of your control. If you empathize with a CEO or a CTO who's sort of considering these two projects, I can tell you straight up, they're going to pick the second project. They're going to prioritize the future. They don't want the unstructured data to exist in the first place. That is the hardest part. It is very hard to motivate an organization to prioritize the problem. You're always going to be a second or third tier priority. There's revenue in that because it does affect people's day-to-day lives. There are some people who care enough to try to solve it. I would say this in very stark contrast to Braintrust where if you look at the logos on our website, almost all of the CEOs or CTOs or founders are daily active users of the product themselves. Every company that has a software product is trying to incorporate AI in a meaningful way. It's so meaningful that literally the exec team is

swyx [00:16:03]: using the product every day. Just to not bury the lead, the logos are Instacart, Stripe, Zapier, Airtable, Notion, Replit, Brex, Versa, Alcota, and the browser company of New York. I don't want to jump the gun to Braintrust. I don't think you've actually told the Impira acquisition story publicly that I can tell. It's on the surface. I think I first met you slightly before the acquisition. I was like, what the hell is Figma acquiring this kind of company? You're not a design tool. Any details you can

Ankur Goyal [00:16:33]: share? I would say the super candid thing that we realized, just for timing context, I probably personally realized this during the summer of 2022 and then the acquisition happened in December of 2022. Just for temporal context, NTT came out in November of 2022. At Impira, I think our primary technical advantage was the fact that if you were extracting data from PDF documents, which ended up being the flavor of unstructured data that we focused on, back then you had to assemble thousands of examples of a particular type of document to get a deep neural network to learn how to extract data from it accurately. We had figured out how to make that really small, maybe two or three examples through a variety of old-school ML techniques and maybe some fancy deep learning stuff. But we had this really cool technology that we were proud of. It was actually primarily computer vision-based because at that time, computer vision was a more mature field. If you think of a document as one-part visual signals and one-part text signals, the visual signals were more readily available to extract information from. What happened is text starting with BERT and then accelerating through and including chat GPT just totally cannibalized that. I remember I was in New York and I was playing with BERT on HuggingFace, which had made it really easy at that point to actually do that. They had this little square in the right-hand panel of a model. I just started copy-pasting documents into a question-answering fine-tune using BERT and seeing whether it could extract the invoice number and this other stuff. I was somewhat mind-boggled by how often it would get it right.

swyx [00:18:25]: That was really scary. Hang on, this is a vision-based BERT? Nope. So this was raw PDF

Ankur Goyal [00:18:31]: parsing? Yep. No, no PDF parsing.

swyx [00:18:33]: Just taking the PDF, command-A,

Ankur Goyal [00:18:35]: copy-paste. So there's no visual signal. By the way, I know we don't want to talk about brain trust yet, but this is also how these technologies were formed because I had a lot of trouble convincing our team that this was real. Part of that naturally, not to anyone's fault, is just the pride that you have in what you've done so far. There's no way something that's not trained or whatever for our use case is going to be as good, which is in many ways true. But part of it is just I had no simple way of proving that it was going to be better. There's no tooling. I could just run something and show I remember on the flight, before the flight, I downloaded the weights and then on the flight when I didn't have internet, I was playing around with a bunch of documents and anecdotally it was like, oh my god, this is amazing. And then that summer we went deep into Layout LM, Microsoft. I personally got super into Hugging Face and I think for two or three months was the top non-employee contributor to Hugging Face, which was a lot of fun. We created the document QA model type and a bunch of stuff. And then we fine-tuned a bunch of stuff and contributed it as well. I love that team. Clem is now an investor in Braintrust, so it started forming that relationship. And I realized, and again, this is all pre-Chat GPT, I realized like, oh my god, this stuff is clearly going to cannibalize all the stuff that we've built. And we quickly retooled Impira’s product to use Layout LM as kind of the base model and in almost all cases we didn't have to use our new but somewhat more complex technology to extract stuff. And then I started playing with GPT-3 and that just totally blew my mind. Again, Layout LM is visual, right? So almost the same exact exercise, like I took the PDF contents, pasted it into Chat GPT, no visual structure, and it just destroyed Layout LM. And I was like, oh my god, what is stable here? And I even remember going through the psychological justification of like, oh, but GPT-3 is expensive and blah, blah, blah, blah, blah.

swyx [00:20:37]: So nobody would call it in quantity, right?

Ankur Goyal [00:20:41]: Yeah, exactly. But as I was doing that, because I had literally just gone through that, I was able to kind of zoom out

swyx [00:20:47]: and be like, you're an idiot.

Ankur Goyal [00:20:49]: And so I realized, wow, okay, this stuff is going to change very, very dramatically. And I looked at our commercial traction, I looked at our exhaustion level, I looked at the team and I thought a lot about what would be best and I thought about all the stuff I'd been talking about, like how much did I personally enjoy working on this problem? Is this the problem that I want to raise more capital and work on with a high degree of integrity for the next 5, 10, 15 years? And I realized the answer was no. And so we started pursuing, we had some inbound interest already, given now Chat GPT, this stuff was starting to pick up. I guess Chat GPT still hadn't come out, but GPT-3 was gaining and there weren't that many AI teams or ML teams at the time. So we also started to get some inbound and I kind of realized like, okay, this is probably a better path. And so we talked to a bunch of companies and ran a process. Ilad was insanely

swyx [00:21:47]: helpful.

Ankur Goyal [00:21:49]: He was an investor in Empira. Yeah, I met him at a pizza shop in 2016 or 2017 and then we went on one of those famous really long walks the next day. We started near Salesforce Tower and we ended in Noe Valley. And Ilad walks at the speed of light. I think it was like 30 or 40, it was crazy. And then he invested. And then I guess we'll talk more about him in a little bit. I was talking to him on the phone pretty much every day through that process. And Figma had a number of positive qualities to it. One is that there was a sense of stability because of the acquisition, Figma's another is the problem... By Adobe?

swyx [00:22:31]: Yeah. Oh, oops.

Ankur Goyal [00:22:33]: The problem domain was not exactly the same as what we were solving, but was actually quite similar in that it is a combination of textual language signal, but it's multimodal. So our team was pretty excited about that problem and had some experience. And then we met the whole team and we just thought these people are great. And that's true, they're great people. And so we felt really excited about working there.

swyx [00:22:57]: But is there a question of, because the company was shut down effectively after, you're basically letting down your customers? Yeah. How does that... I mean, obviously you don't have to cover this, so we can cut this out if it's too comfortable. But I think that's a question that people have when they go through acquisition offers.

Ankur Goyal [00:23:15]: Yeah, yeah. No, I mean, it was hard. It was really hard. I would say that there's two scenarios. There's one where it doesn't seem hard for a founder, and I think in those scenarios, it ends up being much harder for everyone else. And then in the other scenario, it is devastating for the founder. In that scenario, I think it works out to be less devastating for everyone else. And I can tell you, it was extremely devastating. I was very, very sad for

swyx [00:23:45]: three, four months. To be acquired, but also to be shutting down.

Ankur Goyal [00:23:49]: Yeah, I mean, just winding a lot of things down. Winding a lot of things down. I think our customers were very understanding, and we worked with them. To be honest, if we had more traction than we did, then it would have been harder. But there were a lot of document processing solutions. The space is very competitive. And so I think I'm hoping, although I'm not 100% sure about this, but I'm hoping we didn't leave anyone totally out to pasture. And we did very, very generous refunds and worked quite closely with people and wrote code

swyx [00:24:23]: to help them where we could.

Ankur Goyal [00:24:25]: But it's not easy. It's one of those things where I think as an entrepreneur, you sometimes resist making what is clearly the right decision because it feels very uncomfortable. And you have to accept that it's your job to make the right decision. And I would say for me, this is one of N formative experiences where viscerally the gap between what feels like the right decision and what is clearly the right decision, and you have to embrace what is clearly the right decision, and then map back and fix the feelings along the way. And this was definitely one of those cases.

swyx [00:25:03]: Thank you for sharing that. That's something that not many people get to hear. And I'm sure a lot of people are going through that right now, bringing up Clem. He mentions very publicly that he gets so many inbounds acquisition offers. I don't know what you call it. Please buy me offers. And I think people are kind of doing that math in this AI winter that we're somewhat going through. Maybe we'll spend a little bit on Figma. Figma AI. I've watched closely the past two configs. A lot going on. You were only there for eight months. What would you say is interesting going on at Figma, at least from the time that you were there and whatever you see now as an outsider?

Ankur Goyal [00:25:41]: Last year was an interesting time for Figma. One, Figma was going through an acquisition. Two, Figma was trying to think about what is Figma beyond being a design tool. And three, Figma is kind of like Apple, a company that is really optimized around a periodic, annual release cycle rather than something that's continuous. If you look at some of the really early AI adopters, like Notion for example, Notion is shipping stuff constantly. It's a new thing.

swyx [00:26:13]: We were consulted on that. Because Ivan liked World's Fair.

Ankur Goyal [00:26:17]: I'll be there if anyone is there. Hit me up. Very iterative company. Ivan and Simon and a couple others hacked the first versions of Notion AI

swyx [00:26:27]: at a retreat.

Ankur Goyal [00:26:29]: In a hotel room. I think with those three pieces of context in mind, it's a little bit challenging for Figma. Very high product bar. Of the software products that are out there right now, one of, if not the best, just quality product. It's not janky, you sort of rely on it to work type of products. It's quite hard to introduce AI into that. And then the other thing I would just add to that is that visual AI is very new and it's very amorphous. Vectors are very difficult because they're a data inefficient representation. The vector format in something like Figma chews up many, many, many, many, many more tokens than HTML and JSX. So it's a very difficult medium to just sort of throw into an LLM compared to writing problems or coding problems. And so it's not trivial for Figma to release like, oh, this company has blah-blah AI and Acme AI and whatever. It's not super trivial for Figma to do that. I think for me personally, I really enjoyed everyone that I worked with and everyone that I met. I am a creature of shipping. I wake up every morning nowadays to several complaints or questions from people and I just like pounding through stuff and shipping stuff and making people happy and iterating with them. And it was just literally challenging for me to do that in that environment. That's why it ended up not being the best fit for me personally. But I think it's going to be interesting what they do. Within the framework that they're designed to, as a company, to ship stuff when they do sort of make that big leap, I think it could be very compelling.

swyx [00:28:11]: I think there's a lot of value in being the chosen tool for an industry because then you just get a lot of community patience for figuring stuff out. The unique problem that Figma has is it caters to designers who hate AI right now. When you mention AI, they're like, oh, I'm going to...

Ankur Goyal [00:28:27]: The thing is, in my limited experience and working with designers myself, I think designers do not want AI to design things for them. But there's a lot of things that aren't in the traditional designer toolkit that AI can solve. And I think the biggest one is generating code. So in my mind, there's this very interesting convergence happening between UI engineering and design. And I think Figma can play an incredibly important part in that transformation which, rather than being threatening, is empowering to designers and probably helps designers contribute and collaborate with engineers more effectively, which is a little bit different than the focus around actually designing things in the editor.

swyx [00:29:09]: Yeah, I think everyone's keen on that. Dev mode was, I think, the first segue into that. So we're going to go into Braintrust now, about 20-something minutes into the podcast. So what was your idea for Braintrust? Tell the full origin story.

Ankur Goyal [00:29:23]: At Impira, while we were having an existential revelation, if you will, we realized that the debates we were having about what model and this and that were really hard to actually prove anything with. So we argued for like two or three months and then prototyped an eval system on top of Snowflake and some scripts, and then shipped the new model like two weeks later. And it wasn't perfect. There were a bunch of things that were less good than what we had before, but in aggregate, it was just way better. And that was a holy s**t moment for me. I kind of realized there's this, sometimes in engineering organizations or maybe organizations more generally, there are what feel like irrational bottlenecks. It's like, why are we doing this? Why are we talking about this? Whatever. This was one of those obvious irrational bottlenecks.

swyx [00:30:13]: Can you articulate the bottleneck again? Was it simply

Ankur Goyal [00:30:17]: evals? Yeah, the bottleneck is there's approach A, and it has these trade-offs. And approach B has these other trade-offs. Which approach should we use? And if people don't very clearly align on one of the two approaches, then you end up going in circles. This approach, hey, check out this example. It's better at this example. Or, I was able to achieve it with this document, but it doesn't work with all of our customer cases, right? And so you end up going in circles. If you introduce evals into the mix, then you sort of change the discussion from being hypothetical or one example and another example into being something that's extremely straightforward and almost scientific. Like, okay, great. Let's get an initial estimate of how good LayoutLM is compared to our hand-built computer vision model. Oh, it looks like there are these 10 cases, invoices that we've never been able to process that now we can suddenly process, but we regress ourselves on these three. Let's think about how to engineer a solution to actually improve these three, and then measure it for you. And so it gives you a framework to have that. And I think, aside from the fact that it literally lets you run the sort of scientific process of improving an AI application, organizationally, it gives you a clear set of tools, I think, to get people to agree. And I think in the absence of evals, what I saw at Empira, and I see with almost all of our customers before they start using BrainTrust, is this kind of stalemate between people on which prompt to use or which model to use or which technique to use, that once you embrace engineering around evals, it just

swyx [00:31:51]: goes away. Yeah. We just did an episode with Hamil Hussain here, and the cynic in that statement would be like, this is not new. All ML engineering deploying models to production always involves evals. Yeah. You discovered it, and you built your own solution, but everyone in the industry has their own solution. Why the conviction that there's a company here?

Ankur Goyal [00:32:13]: I think the fundamental thing is, prior to BERT, I was, as a traditional software engineer, incapable of participating in what happens behind the scenes in ML development. Ignore the CEO or founder title, just imagine I'm a software engineer who's very empathetic about the product. All of my information about what's going to work and what's not going to work is communicated through the black box of interpretation by ML people. So I'm told that this thing is better than that thing, or it'll take us three months to improve this other thing. What is incredibly empowering about these, I would just maybe say that the quality that transformers bring to the table, and even BERT does this, but GPT 3 and then 4 very emphatically do it, is that software engineers can now participate in this discussion. But all the tools that ML people have built over the years to help them navigate evals and data generally are very hard to use for software engineers. I remember when I was first acclimating to this problem, I had to learn how to use Hugging Face and Weights and Biases. And my friend Yanda was at Weights and Biases at the time, and I was talking to him about this, and he was like, yeah, well, prior to Weights and Biases, all data scientists had was software engineering tools, and it felt really uncomfortable to them. And Weights and Biases brought software engineering to them. And then I think the opposite happened. For software engineers, it's just really hard to use these tools. And so I was having this really difficult time wrapping my head around what seemingly simple stuff is. And last summer, I was talking to a lot about this, and I think primarily just venting about it. And he was like, well, you're not the only software engineer who's starting to work on AI now. And that is when we realized that the real gap is that software engineers who have a particular way of thinking, a particular set of biases, a particular type of workflow that they run, are going to be the ones who are doing AI engineering, and that the tools that were built for ML are fantastic in terms of the scientific inspiration, the metrics they track, the level of quality that they inspire, but they're just not usable for software engineers. And that's really where the opportunity is.

swyx [00:34:35]: I was talking with Sarah Guo at the same time, and that led to the rise of the AI engineer and everything that I've done. So very much similar philosophy there. I think it's just interesting that software engineering and ML engineering should not be that different. It's still engineering at the same... You're still making computers boop. Why?

Ankur Goyal [00:34:53]: Well, I mean, there's a bunch of dualities to this. There's the world of continuous mathematics and discrete mathematics. I think ML people think like continuous mathematicians and software engineers, like myself, who are obsessed with algebra. We like to think in terms of discrete math. What I often talk to people about is, I feel like there are people for whom NumPy is incredibly intuitive, and there are people for whom it is incredibly non-intuitive. For me, it is incredibly non-intuitive. I was actually talking to Hamel the other day. He was talking about how there's an eval tool that he likes, and I should check it out. And I was like, this thing, what? Are you freaking kidding me? It's terrible. Yeah, but it has data frames. I was like, yes, exactly. You don't like data frames? I don't like data frames. It's super hard for me to think about manipulating data frames and extracting a column or a row out of data frames. And by the way, this is someone who's worked on databases for more than a decade. It's just very, very programmer-wise. It's very non-ergonomic for me to manipulate a data frame.

swyx [00:35:55]: And what's your preference then?

Ankur Goyal [00:35:57]: For loops.

swyx [00:35:59]: Okay. Well, maybe you should capture a statement of what is BrainTrust today because there's a little bit of the origin story. And you've had a journey over the past year, and obviously now with Series A, which will, like, woohoo, congrats. Put a little intro for the Series A stuff. What is BrainTrust today?

Ankur Goyal [00:36:15]: BrainTrust is an end-to-end developer platform for building AI products. And I would say our core belief is that if you embrace evaluation as the sort of core workflow in AI engineering, meaning every time you make a change, you evaluate it, and you use that to drive the next set of changes that you make, then you're able to build much, much better AI software. That's kind of our core thesis. And we started probably as no surprise by building, I would say, by far the world's best evaluation product, especially for software engineers and now for product managers and others. I think there's a lot of data scientists now who like BrainTrust, but I would say early on, a lot of, like, ML and data science people hated BrainTrust. It felt, like, really weird to them. Things have changed a little bit, but really, like, making evals something that software engineers, product managers can immediately do, I think that's where we started. And now people have pulled us into doing more. So the first thing that people said is, like, okay, great, I can do evals. How do I get the data to do evals? And so what we realized, anyone who's spent some time in evals knows that one of the biggest pain points is ETLing data from your logs into a dataset format that you can use to do evals. And so what we realized is, okay, great, when you're doing evals, you have to instrument your code to capture information about what's happening and then render the eval. What if we just capture that information while you're actually running your application? There's a few benefits to that. One, it's in the same familiar trace and span format that you use for evals. But the other thing is that you've almost accidentally solved the ETL problem. And so if you structure your code so that the same function abstraction that you define to evaluate on equals the abstraction that you actually use to run your application, then when you log your application itself, you actually log it in exactly the right format to do evals. And that turned out to be a killer feature in Braintrust. You can just turn on logging, and now you have an instant flywheel of data that you can collect in datasets and use for evals. And what's cool is that customers, they might start using us for evals, and then they just reuse all the work that they did, and they flip a switch, and boom, they have logs. Or they start using us for logging, and then they flip a switch, and boom, they have data that they can use and the code already written to do evals. The other thing that we realized is that Braintrust went from being kind of a dashboard into being more of a debugger, and now it's turning into kind of an IDE. And by that I mean, at first you ran an eval, and you'd look at our web UI and sort of see a chart or something that tells you how your eval did. But then you wanted to interrogate that and say, okay, great, 8% better. Is that 8% better on everything, or is that 15% better and 7% worse? And where it's 7% worse, what are the cases that regressed? How do I look at the individual cases? They might be worse on this metric. Are they better on that metric? What are the cases that differ? Let me dig in detail. And that sort of turned us into a debugger. And then people said, okay, great, now I want to take action on that. I want to save the prompt or change the model and then click a button and try it again. And that's kind of pulled us into building this very, very souped-up playground. And we started by calling it the Playground, and it started as my wish list of things that annoyed me about the OpenAI Playground. First and foremost, it's durable. So every time you type something, it just immediately saves it. If you lose the browser or whatever, it's all saved. You can share it, and it's collaborative, kind of like Google Docs, Notion, Figma, etc. And so you can work on it with colleagues in real time, and that's a lot of fun. It lets you compare multiple prompts and models side-by-side with data. And now you can actually run evals in the Playground. You can save the prompts that you create in the Playground and deploy them into your code base. And so it's become very, very advanced. And I remember actually, we had an intro call with Brex last year, who's now a customer. And one of the engineers on the call said, he saw the Playground and he said, I want this to be my IDE. It's not there yet. Here's a list of 20 complaints, but I want this to be my IDE. I remember when he told me that, I had this very strong reaction, like, what the F? We're building an eval observability thing, we're not building an IDE. But I think he turned out to be right, and that's a lot of what we've done over the past few months and what we're looking to in the future.

swyx [00:40:42]: How literally can you take it? Can you fork VS Code? It's not off the table.

Ankur Goyal [00:40:48]: We're friends with the cursor people and now part of the same portfolio. And sometimes people say, AI and engineering, are you cursor? Are you competitive? And what I think is like, cursor is taking AI and making traditional software engineering insanely good with AI. And we're taking some of the best things about traditional software engineering and bringing them to building AI software. And so, we're almost like yin and yang in some ways with development. But forking VS Code and doing crazy stuff is not off the table. It's all ideas that we're cooking at this point.

swyx [00:41:27]: Interesting. I think that when people say analogies, they should often take it to the extreme and see what that generates in terms of ideas. And when people say IDE, literally go there. Because I think a lot of people treat their playground and they say figuratively IDE, they don't mean it. And they should. They should mean it.

Ankur Goyal [00:41:45]: So, we've had this playground in the product for a while. And the TLDR of it is that it lets you test prompts. They could be prompts that you save in Braintrust or prompts that you just type on the fly against a bunch of different models or your own fine-tuned models. And you can hook them into the datasets that you create in Braintrust to do your evals. So, I've just pulled this press-release dataset. And this is actually one of the first features we built. It's really easy to run stuff. And by the way, we're trying to see if we can build a prompt that summarizes the document well. But what's kind of happened over time is that people have pulled us to make this prompt playground more and more powerful. So, I kind of like to think of Braintrust as two ends of the spectrum. If you're writing code, you can create evals with infinite complexity. You don't even have to use large language models. You can use any models you want. You can write any scoring functions you want. And you can do that in the most complicated code bases in the world. And then we have this playground that dramatically simplifies things. It's so easy to use that non-technical people love to use it. Technical people enjoy using it as well. And we're sort of converging these things over time. So, one of the first things people asked about is if they could run evals in the playground. And we've supported running pre-built evals for a while. But we actually just added support for creating your own evals in the playground. And I'm going to show you some cool stuff. So, we'll start by adding this summary quality thing. And if we look at the definition of it, it's just a prompt that maps to a few different choices. And each one has a score. We can try it out and make sure that it works. And then, let's run it. So, now you can run not just the model itself, but also the summary quality score and see that it's not great. So, we have some room to improve it. The next thing you can do is let's try to tweak this prompt. So, let's say like in one to two lines. And let's run it again.

swyx [00:43:49]: One thing I noticed about the... you're using an LLM as a judge here. That prompt about one to two lines should actually go into the judge input. It is. Oh, okay. Was that it? Oh, this was generated?

Ankur Goyal [00:44:07]: No, no, no. This is how...

swyx [00:44:09]: I pre-wrote this ahead of time. So, you're matching up the prompt to the eval that you already knew.

Ankur Goyal [00:44:15]: Exactly. So, the idea is like it's useful to write the eval before you actually tweak the prompt so that you can measure the impact of the tweak. So, you can see that the impact is pretty clear, right? It goes from 54% to 100% now. This is a little bit of a toy example, but you kind of get the point. Now, here's an interesting case. If you look at this one, there's something that's obviously wrong with this. What is wrong with this new summary?

swyx [00:44:41]: It has an intro.

Ankur Goyal [00:44:43]: Yeah, exactly. So, let's actually add another evaluator. And this one is Python code. It's not a prompt. And it's very simple. It's just checking if the word sentence is here. And this is a really unique thing. As far as I know, we're the only product that does this. But this Python code is running in a sandbox. It's totally dynamic. So, for example, if we change this, it'll put the Boolean. Obviously, we don't want to save that. We can also try running it here. And so, it's really easy for you to... It's really easy for you to actually go and tweak stuff and play with it and create more interesting scores. So, let's save this. And then we'll run with this one as well. Awesome. And then let's try again. So, now let's say, just include summary. Anything else?

Ankur Goyal [00:45:47]: Amazing. So, the last thing I'll show you, and this is a little bit of an allude to what's next, is that the Playground experience is really powerful for doing this interactive editing. But we're already running at the limits of how much information we can see about the scores themselves and how much information is fitting here. And we actually have a great user experience that, until recently, you could only access by writing an eval in your code. But now you can actually go in here and kick off full brain trust experiments from the Playground. So, in addition to this, we'll actually add one more. We'll add the embedding similarity score. And we'll say, original summarizer,

swyx [00:46:31]: short

Ankur Goyal [00:46:33]: summary, and no sentence

swyx [00:46:37]: wording.

Ankur Goyal [00:46:39]: And then to create... And this is actually going to kick off full experiments.

swyx [00:46:43]: So,

Ankur Goyal [00:46:45]: if we go into one of these things,

Ankur Goyal [00:46:51]: now we're in the full brain trust UI. And one of the really cool things is that you can actually now not just compare one experiment, but compare multiple experiments. And so you can actually look at all of these experiments together and understand, like, okay, good. I did this thing which said, like, please keep it to one to two sentences. Looks like it improved the summary quality and sentence checker, of course, but it looks like it actually also did better on the similarity score, which is my main score to track how well the summary compares to a reference summary. And you can go in here and then very granularly look at the diff between two different versions of the summary and do this whole experience. So, this is something that we actually just shipped a couple weeks ago. And it's already really powerful. But what I wanted to show you is kind of what, like, even the next version or next iteration of this is. And by the time the podcast airs, what I'm about to show you will be live. So, we're almost done shipping it. But before I do that, any questions on this stuff? No, this is

swyx [00:47:53]: a really good demo. Okay, cool. So,

Ankur Goyal [00:47:55]: as soon as we showed people this kind of stuff, they said, well, you know, this is great, and I wish I could do everything with this experience, right? Like, imagine you could, like, create an agent or do rag, like, more interesting stuff with this kind of interactivity. And so, we were like, huh, it looks like we built support for you to do, you know, to run code. And it looks like we know how to actually run your prompts. I wonder if we can do something more interesting. So, we just added support for you to actually define your own tools. I'll sort of shell two different tool options for you. So, one is Browserbase, and the other is Exa. I think these are both really cool companies. And here, we're just writing, like, really simple TypeScript code that wraps the Browserbase API and then, similarly, really simple TypeScript code that wraps the Exa API. And then we give it a type definition. This will get used as a, um, as the schema for a tool call. And then we give it a little bit of metadata so Braintrust knows, you know, where to store it and what to name it and stuff. And then you just run a really simple command, npx braintrust push, and then you give it these files, and it will bundle up all the dependencies and push it into Braintrust. And now you can actually access these things from Braintrust. So, if we go to the search tool, we could say, you know, what is the tallest mountain...

swyx [00:49:19]: Oops. ... ... ...

Ankur Goyal [00:49:27]: And it'll actually run search by Exa. So, what I'm very excited to show you is that now you can actually do this stuff in the Playground, too. So, if we go to the Playground, um, let's try playing with this. So, uh, we'll create a new session.

swyx [00:49:45]: ... ... ... ...

Ankur Goyal [00:49:53]: And let's create a dataset.

swyx [00:49:57]: ... ...

Ankur Goyal [00:50:01]: Let's put one row in here, and we'll say,

swyx [00:50:03]: um,

Ankur Goyal [00:50:05]: what is the premier conference for AI engineers?

swyx [00:50:11]: Ooh, I wonder what we'll find.

Ankur Goyal [00:50:15]: Um, following question, feel free to search the internet. Okay, so, let's plug this in, and let's start without using any tools.

swyx [00:50:27]: ... ...

Ankur Goyal [00:50:31]: Uh, I'm not sure I agree with this statement.

swyx [00:50:33]: That was correct as of his training data. ...

Ankur Goyal [00:50:37]: Okay, so, let's add this Exa tool in, and let's try running it again. Watch closely over here. So, you see it's actually running.

swyx [00:50:45]: Yeah. There we go. ... Not exactly accurate, but good enough. Yeah, yeah.

Ankur Goyal [00:50:55]: So, I think that this is really cool, because for probably 80 or 90% of the use cases that we see with people doing this, like, very, very simple, I create a prompt, it calls some tools, I can, like, very ergonomically write the tools, plug into popular services, et cetera, and then just call them, kind of like, assistance API-style stuff. It covers so many use cases, and it's honestly so hard to do. Like, if you try to do this by yourself, you have to write a for loop,

swyx [00:51:25]: you have to

Ankur Goyal [00:51:27]: host it somewhere. You know, with this thing, you can actually just access it through our REST API, so every prompt gets a REST API endpoint that you can invoke. And so, we're very, very excited about this, and I think it kind of represents the future of AI engineering, one where you can spend a lot of time writing English, and sort of crafting the use case itself. You can reuse tools across different use cases, and then, most importantly, the development process is very nicely and kind of tightly integrated with evaluation, and so you have the ability to create your own scores and sort of do all of this very interactively as you actually build stuff.

swyx [00:52:05]: I thought about a business in this area, and I'll tell you why I didn't do it. And I think that might be generative for insights onto this industry that you would have that I don't. When I interviewed for Anthropic, they gave me Cloud and Sheets, and with Cloud and Sheets, I was able to build my own evals. Because I can use Sheets formulas, I can use LLM, I can use Cloud to evaluate Cloud, whatever. And I was like, okay, there will be AI spreadsheets, there will all be plugins, spreadsheets is like the universal business tool of whatever. You can API spreadsheets. I'm sure Airtable, you know, Howie's an investor in you now, but I'm sure Airtable has some kind of LLM integration. The second thing was that HumanLoop also existed, HumanLoop being like one of the very, very first movers in this field where same thing, durable playground, you can share them, you can save the prompts and call them as APIs. You can also do evals and all the other stuff. So there's a lot of tooling, and I think you saw something or you just had the self-belief where I didn't, or you saw something that was missing still, even in that space from DIY no-code Google Sheets to custom tool, they were first movers.

Ankur Goyal [00:53:11]: Yeah, I mean, I think evals, it's not hard to do an initial eval script. Not to be too cheeky about it, I would say almost all of the products in the space are spreadsheet plus plus. Like, here's a script generates an eval, I look at the cells, whatever, side by side

swyx [00:53:33]: and compare it. The main thing I was impressed by was that you can run all these things in parallel so quickly. Yeah, exactly.

Ankur Goyal [00:53:41]: So I had built spreadsheet plus plus a few times. And there were a couple nuggets that I realized early on. One is that it's very important to have a history of the evals that you've run and make it easy to share them and publish in Slack channel, stuff like that, because that becomes a reference point for you to have discussions among a team. So at Impira, when we were first ironing out our layout LM usage, we would publish screenshots of the evals in a Slack channel and go back to those screenshots and riff on ideas from a week ago that maybe we abandoned. And having the history is just really important for collaboration. And then the other thing is that writing for loops is quite hard. Like, writing the right for loop that parallelizes things is durable, someone doesn't screw up the next time they write it, you know, all this other stuff. It sounds really simple, but it's actually not. And we sort of pioneered this syntax where instead of writing a for loop to do an eval, you just create something called eval, and you give it an argument which has some data, then you give it a task function, which is some function that takes some input and returns some output. Presumably it calls an LLM, nowadays it might be an agent, you know, it does whatever you want, and then one or more scoring functions. And then Braintrust basically takes that specification of an eval and then runs it as efficiently and seamlessly as possible. And there's a number of benefits to that. The first is that we can make things really fast, and I think speed is a superpower. Early on we did stuff like cache things really well, parallelize things, async Python is really hard to use, so we made it easy to use. We made exactly the same interface in TypeScript and Python, so teams that were sort of navigating the two realities could easily move back and forth between them. And now what's become possible, because this data structure is totally declarative, an eval is actually not just a code construct, but it's actually a piece of data. So when you run an eval in Braintrust now, you can actually optionally bundle the eval and then send it. And as you saw in the demo, you can run code functions and stuff. Well, you can actually do that with the evals that you write in your code. So all the scoring functions become functions in Braintrust. The task function becomes something you can actually interactively play with and debug in the UI. So turning it into this data structure actually makes it a much more powerful thing. And by the way, you can run an eval in your code base, save it to Braintrust, and then hit it with an API and just try out a new model, for example. That's more recent stuff nowadays, but early on just having the very simple declarative data structure that was just much easier to write than a for loop that you sort of had to cobble together yourself, and making it really fast, and then having a UI that just very quickly gives you the number of improvements or regressions and filter them, that was kind of the key thing that worked. I give a lot of credit to Brian from Zapier, who was our first user, and super harsh. I mean, he told me straight up, I know this is a problem, you seem smart, but I'm not convinced of the solution. And almost like Mr. Miyagi or something, I'd produce a demo and then he'd send me back and be like, eh, it's not good enough for me to show the team. And so we sort of iterated several times until he was pretty excited by the developer experience. That core developer experience was just more helpful enough and comforting enough for people that were new to evals that they were willing to try it out. And then we were just very aggressive about iterating with them. So people said, you know, I ran this eval, I'd like to be able to rerun the prompt. So we made that possible. Or I ran this eval, it's really hard for me to group by model and actually see which model did better and why. I ran these evals, one thing is slower than the other. How do I correlate that with token counts? That's actually really hard to do. It's annoying because you're often doing LLM as a judge and generating tokens by doing that too. And so you need to instrument the code to distinguish the tokens that are used for scoring from the tokens that are used for actually computing the thing. Now we're way out of the realm of what you can do with clod and sheets, right? In our case at least, once we got some very sophisticated early adopters of AI using the product, it was a no-brainer to just keep making the product better and better and better. I could just see that from the first week that people were using the product,

swyx [00:58:11]: that there was just a ton of depth here. There is a ton of depth. Sometimes it's not even just that the ideas are not worth anything. It's almost just the persistence and execution that I think you do very well. So whatever, kudos. We're about to zoom out a little bit to industry observations, but I want to spend time on Braintrust. Any other area of Braintrust or part of the Braintrust story that you think people should appreciate or which is personally insightful to you that you want to

Ankur Goyal [00:58:37]: discuss it? There's probably two things I would point to. The first thing, actually there's one silly thing and then two maybe less silly things. So when we started, there were a bunch of things that people thought were stupid about Braintrust. One of them was this hybrid on-prem model that we have. And it's funny because Databricks has a really famous hybrid on-prem model and the CEO and others sort of have a mixed perspective on it. And sometimes you talk to Databricks people and they're like, this is the worst thing ever. But I think Databricks is doing pretty well and it's hard to know how successful they would have been without doing that. But because of that and Snowflake was doing really well at the time, everyone thought this hybrid thing was stupid. But I was talking to customers and Zapier was our first user and then Coda and Airtable quickly followed. And there was just no chance they would be able to use the product unless the data stayed in their cloud. Maybe they could a year from when we started or whatever, but I wanted to work with them now. And so it never felt like a question to me. I remember there's so many VCs

swyx [00:59:41]: that I talked to.

Ankur Goyal [00:59:43]: Yeah, exactly. Like, oh my god, look, here's a quote from the Databricks CEO Here's a quote from this person. You're just clearly wrong. I was like, okay, great. See ya. Luckily, you know, Elad, Alanna, Sam, and now Martin were just like, that's stupid. Don't worry about that.

swyx [00:59:58]: Martin is king of not being religious in cloud stuff.

Ankur Goyal [01:00:02]: But yeah, I think that was just funny because it was something that just felt super obvious to me and everyone thought I was pretty stupid about it. And maybe I am, but I think it's helped us quite a bit.

swyx [01:00:15]: We had this issue at Temporal and the solution was like cloud VPC peering. And what I'm hearing from you is you went further than that. You're bundling up your package software and you're shipping it over and you're charging by seat.

Ankur Goyal [01:00:27]: You asked about single store and lessons from single store. I have been through the ringer with on-prem software and I've learned a lot of lessons. So we know how to do it really well. I think the tricks with brain trust are, one, that the cloud has changed a lot even since Databricks came out and there's a number of things that are easy that used to be very hard. I think serverless is probably one of the most important unlocks for us because it sort of allows us to bound failure into something that doesn't require restarting servers or restarting Linux processes. So even though it has a number of problems, it's made it much easier for us to have this model. And then the other thing is we literally engineered brain trust from day zero to have this model. If you treat it as an opportunity and then engineer a very, very good solution around it, just like DX or something, you can build a really good system, you can test it well, etc. So we viewed it as an opportunity rather than a challenge. The second thing is the space was really crowded. You and I even talked about this and it doesn't feel very crowded now. Sometimes people literally ask me if we have any

swyx [01:01:35]: competitors. We'll go into that industry stuff later.

Ankur Goyal [01:01:39]: I think what I realized then, my wife, Alana, actually told me this when we were working on Impira. She said, based on your personality, I want you to work on something next that is super competitive. And I kind of realized there's only one of two types of markets in startups. Either it's not crowded or it is crowded. Each of those things has a different set of trade-offs and I think there are founders that thrive in either environment. As someone who enjoys competition, I find it very motivating. Personally, it's better for me to work in a crowded market than it is to work in an empty market. Again, people are like, blah, blah, blah, stupid, blah, blah, blah. And I was like, actually, this is what I want to be doing. There were a few strategic bets that we made early on at Braintrust that I think helped us a lot. So one of them I mentioned is the hybrid on-prem thing. Another thing is we were the original folks who really prioritized TypeScript. Now, I would say every customer and probably north of 75% of the users that are running evals in Braintrust are using the TypeScript SDK. It's an overwhelming majority. And again, at the time, and still, AI is at least nominally dominated by Python, but product building is dominated by TypeScript. And the real opportunity to our discussion earlier is for product builders to use AI. And so, even if it's not the majority of typists using AI stuff, writing TypeScript, it worked out to be this magical niche for us that's led to a lot of, I would say, strong product market fit among product builders. And then the third thing that we did is, look, we knew that this LLM ops or whatever you want to call it space is going to be more than just evals. But again, early on, evals, I mean, there's one VC, I won't call them out. You know who you are because I assume you're going to be listening to this. But there's one VC who insisted on meeting us. And I've known them for a long time, blah, blah, blah. And they're like, you know what, actually, after thinking about it, we don't want to invest in Braintrust because it reminds me of CICD and that's a crappy market. And if you were going after logging and observability, that was your main thing, then that's a great market. But of all the things in LLM ops or whatever, if you draw a parallel to the previous world of software development, this is like CICD and CICD is not a great market. And I was like, okay, it's sort of like the hybrid on-prem thing. Go talk to a customer and you'll realize that this is the, I mean, I was at Figma when we used Datadog and we built our own prompt playground. It's not super hard to write some code that, you know, Vercel has a template that you can use to create your own prompt playground now. But evals were just really hard. And so I knew that the pain around evals was just significantly greater than anything else. And so if we built an insanely good solution around it, the other things would follow. And lo and behold, of course, that VC came back a few months later and said, oh my God, you guys are doing observability now. Now we're interested. And that was another kind of interesting thing.

swyx [01:04:47]: We're going to tie this off a little bit with some customer motivations and quotes. We already talked about the logos that you have, which are all really very impressive. I've seen what Stripe can do. I don't know if it's quotable, but you said you had something from Vercel, from Malte.

Ankur Goyal [01:05:01]: Yeah, yeah. Actually, I'll let you read it. It's on our website. I don't want to butcher

swyx [01:05:07]: his language. So Malta says, we deeply appreciate the collaboration. I've never seen a workflow transformation like the one that incorporates evals into mainstream engineering processes before. It's astonishing.

Ankur Goyal [01:05:19]: Yeah. I mean, I think that is a perfect encapsulation of

swyx [01:05:23]: our goal. For those who don't know, Malte used to work on Google Search.

Ankur Goyal [01:05:29]: He's super legit. Kind of scary, as are all of the Vercel people.

swyx [01:05:35]: My funniest quote of Malte is a recent incident of Malte. He published this very, very long guide to SEO, like how SEO works. And people are like, this is not to be trusted. This is not how it works. And literally, the guy worked on the search algorithm. Yeah.

Ankur Goyal [01:05:51]: That's really funny.

swyx [01:05:53]: People don't believe when you are representing a company. I think everyone has an angle. In Silicon Valley, it's this whole thing where if you don't have skin in the game, you're not really in the know, because why would you? You're not an insider. But then once you have skin in the game, you do have a perspective. You have a point of view. And maybe that segues into a little bit of industry talk. Sounds good. Unless you want to bring up your World's Fair, we can also riff on just what you saw at the World's Fair. You were the first speaker, and you were one of the few who brought a customer, which is something I think I want to encourage more. I think the DVT conference also does. Their conference is exclusively vendors and customers, and then sharing lessons learned and stuff like that. Maybe plug your talk a little bit and people can

Ankur Goyal [01:06:37]: go watch it. Yeah. First, Olmo is an insanely good engineer. He actually worked with Guillermo on

swyx [01:06:43]: Mutools back in the day.

Ankur Goyal [01:06:45]: This was mafia. I remember when I first met him, speaking of TypeScript, we only had a Python SDK. And he was like, where's the TypeScript SDK? And I was like, here's some curl commands you can use. This was on a Friday. And he was like, okay. And Zapier was not a customer yet, but they were interested in brain trust. And so I built the TypeScript SDK over the weekend, and then he was the first user of it. And what better than to have one of the core authors of Mutools bike-shedding SDK from the beginning. I would give him a lot of credit for how some of the ergonomics of our product have worked out. By the way, another benefit of structuring the talk this way is he actually worked out of our office earlier that week and built the talk and found a ton of bugs in the product or usability things. And it was so much fun. He sat next to me at the office. He'd find something or complain about something, and I'd point him to the engineer who works on it, and then he'd go and chat with them. And we recently had our first off-site, we were talking about some of people's favorite moments in the company, and multiple engineers were like, that was one of the best weeks to get to interact with a customer that way.

swyx [01:07:51]: You know, a lot of people have embedded engineer. This is embedded customer. Yeah.

Ankur Goyal [01:07:57]: I mean, we might do more of it. Sometimes, just like launches, sometimes these things are a forcing function for you to improve.

swyx [01:08:05]: Why did he discover it preparing for the talk and not as a user?

Ankur Goyal [01:08:09]: Because when he was preparing for the talk, he was trying to tell a narrative about how they use brain trust. And when you tell a narrative, you tend to look over a longer period of time. And at that point, although I would say we've improved a lot since, that part of our experience was very, very rough. For example, now, if you are working in our experiments page, which shows you all of your experiments over time, you can dynamically filter things, you can group things, you can create like a scatter plot, actually, which Hamel sort of helping me work out when we're working on a blog post together. But there's all this analysis you can do. At that time, it was just a line. And so he just ran into all these problems and complained. But the conference was incredible. It is the conference that gets people who are working in this field together. And I won't say which one, but there was a POC, for example, that we had been working on for a while, and it was kind of stuck. And I was the guy at the conference, and we chatted, and then a few weeks later, things worked out. There's almost nothing better I could ask for or say in a conference than it leading to commercial activity and success for a company like us. And it's just true.

swyx [01:09:23]: Yeah, it's marketing, it's sales, it's hiring. And then it's also, honestly, for me as a curator, I'm trying to get together the state of the art and make a statement on, here's where the industry is at this time. And 10 years from now, we'll be able to look back at all the videos and go like, you know, how cute, how young, how naive we were. One thing I fear is getting it wrong. And there's many, many ways for you to get it wrong. I think people give me feedback and keep

Ankur Goyal [01:09:51]: me honest. Yeah, I mean, the whole team is super receptive to feedback. But I think, honestly, just having the opportunity and space for people to organically connect with each other, that's the most important

swyx [01:10:01]: thing. And you asked for dinners and stuff. We'll do that next year. Excellent. Actually, we're doing a whole syndicated track thing. So, you know, Brain Trust Con or whatever might happen. One thing I think about when organizing, like literally when I organize a thing like that, or I do my content or whatever, I have to have a map of the world. And something I came to your office to do was this, I call this the three ring circus or the impossible triangle. And I think what ties into what that VC that rejected you did not see, which is that eventually everyone starts somewhere and they grow into each other's circles. So this is ostensibly, it started off as the sort of AI LM ops market. And then I think we agreed to call it like the AI infra map, which is ops, frameworks and databases. Databases are sort of a general thing and gateways and serving. And Brain Trust has beds and all these things, but started with evals. It's kind of like an evals framework and then obviously extended into observability, of course. And now it's doing more and more things. How do you see the market? Does that jive with your view of the world?

Ankur Goyal [01:11:09]: I think the market is very dynamic and it's interesting because almost every company cares. It is an existential question and how software is built is totally changing. And honestly, the last time I saw this happen, it felt less intense, but it was cloud. I still remember I was talking to I think it was 2012 or something. I was hanging out with one of our engineers at MemSQL or SingleStore, MemSQL at the time, and I was like, is cloud really going to be a thing? It seems like for some use cases it's economic, but for the oil company or whatever that's running all these analytics and they have this hardware and it's very predictable, is cloud actually going to be worth it? Like security? He was right, but he was like, yeah, if you assume that the benefits of elasticity and whatnot are actually there, then the cost is going to go down, the security is going to go up, all these things will get solved. But for my naive brain at that point, it was just so hard to see. I think the same thing, to a more intense degree, is happening in AI. When I talk to AI skeptics, I often rewind myself into the mental state I was in when I was somewhat of a cloud skeptic early on. But it's a very dynamic marketplace and I think there's benefit to separating these things and having best-of-breed tools do different things for you, and there's also benefits to some level of vertical integration across the stack. As a product-driven company that's navigating this, I think we are constantly thinking about how do we make bets that allow us to provide more value to customers and solve more use cases while doing so durably. We had Guillermo from Vercel, who is also an investor and a very sprightly character.

swyx [01:12:59]: I don't know.

Ankur Goyal [01:13:01]: But anyway, he gave me this really good advice, which was, as a startup, you only get to make a few technology bets and you should be really careful about those bets. Actually, at the time, I was asking him for advice about how to make arbitrary code execution work, because obviously they've solved and in JavaScript, arbitrary code execution is itself such a dynamic thing. There's so many different ways of, there's workers and Deno and Node and Firecracker, there's all this stuff. Ultimately, we built it in a way that just supports Node, which I think Vercel has sort of embraced as well. But where I'm kind of trying to go with this is, in AI, there are many things that are changing, and there are many things that you've got to predict whether or not they're going to be durable. If something's durable, then you can build depth around it. But if you make the wrong predictions about durability and you build depth, then you're very, very vulnerable. Because a customer's priorities might change tomorrow, and you've built depth around something that is no longer relevant. And I think what's happening with frameworks right now is a really, really good example of that playing out. We are not in the app framework universe, so we have the luxury of sort of observing it, as intended, from the side.

swyx [01:14:17]: You are a little bit... I captured when you said if you structure your code with the same function extraction, triple equals to run evals. Sure, yeah.

Ankur Goyal [01:14:27]: But I would argue that it's kind of like a clever insight. And we, in the kindest way, almost trick you into writing code that doesn't require ETL.

swyx [01:14:37]: It's good for you.

Ankur Goyal [01:14:39]: Yeah, exactly. But you don't have to use... It's kind of like a lesson that is designed to brain trust itself.

swyx [01:14:45]: Sure. I buy that. There was an obvious part of this market for you to start in, which is maybe... Curious, we're spending two seconds on it. You could have been the VectorDB CEO. Right? Yeah, I got a lot of calls about that. You're a database guy. Why no vector database?

Ankur Goyal [01:15:01]: Oh, man. I was drooling over that problem. It just checks everything. It's performance and potentially serverless. It's just everything I love to type. The problem is that... I had a fantastic opportunity to see these things play out at Figma. The problem is that the challenge in deploying vector search has very little to do with vector search itself and much more to do with the data adjacent to vector search. So, for example, if you are at Figma, the vector search is not actually the hard problem. It is the permissions and who has access to what design files or design system components blah, blah, blah. All of this stuff that has been beautifully engineered into a variety of systems that serve the product. You think about something like vector search and you really have two options. One is, there's all this complexity around my application and then there's this new little idea of technology, sort of a pattern or paradigm of technology which is vector search. Should I cram vector search into this existing ecosystem? And then the other is, okay, vector search is this new, exciting thing. Do I kind of rebuild around this new paradigm? And it's just super clear that it's the former. In almost all cases, vector search is not a storage or performance bottleneck. And in almost all cases, vector search involves exactly one query which is nearest neighbors.

swyx [01:16:29]: The hard part... Yeah, I mean, that's the implementation of it.

Ankur Goyal [01:16:33]: But the hard part is how do I join that with the other data? How do I implement RBAC and all this other stuff? And there's a lot of technology that does that. In my observation, database companies tend to succeed when the storage paradigm is closely tied to the execution paradigm. And both of those things need to be rewired to work. Remember that databases are not just storage, but they're also compilers. It's the fact that you need to build a compiler that understands how to utilize a particular storage mechanism that makes the nplusfirst database something that is unique. If you think about Snowflake, it is separating storage from compute and the entire compiler pipeline around query execution hides the fact that separating storage from compute is incredibly inefficient, but gives you this really fast query experience. The arbitrary code is a first-class citizen, which is a very powerful idea, and it's not possible in other database technologies. Arbitrary code is a first-class citizen in my database system. How do I make that work incredibly well? And again, that's a problem which spans storage and compute. Today, the query pattern for vector search is so constrained that it just doesn't have that property.

swyx [01:17:59]: I think I fully understand and mostly agree. I want to hear the opposite view. I think yours is not the consensus view, and I want to hear the other side. I mean, there's super smart people working on this, right? We'll be having Chroma and I think Qtrends on maybe Vespa, actually. One other part of the triangle that I drew that you disagree with, and I thought that was very insightful, was fine-tuning. So I had all these overlapping circles, and I think you agreed with most of them, and I was like, at the center of it all, because you need logging from Ops, and then you need a gateway, and then you need a database with a framework, or whatever, was fine-tuning. And you were like, fine-tuning is not a thing. It's not a business.

Ankur Goyal [01:18:39]: So there's two things with fine-tuning. One is the technical merits, or whether fine-tuning is a relevant component of a lot of workloads. And I think that's actually quite debatable. The thing I would say is not debatable is whether or not fine-tuning is a business outcome or not. So let's think about the other components of your triangle. Ops slash observability, that is a business thing. Do I know how much money my app costs? Am I enforcing, or sorry, do I know if it's up or down? Do I know if someone complains? Can I retrieve the information about that? Frameworks, evals, databases, do I know if I changed my code? Did it break anything? Gateway, can I access this other model? Can I enforce some cost parameter on it? Whatever. Fine-tuning is a very compelling method that achieves an outcome. The outcome is not fine-tuning, it is can I automatically optimize my use case to perform better if I throw data at the problem? And fine-tuning is one of multiple ways to achieve that. I think the DSPY-style prompt optimization is another one. Turpentine, you know, just like tweaking prompts with wording and hand-crafting few-shot examples and running evals, that's another... Is Turpentine a framework? No, sorry, it's just a metaphor. But maybe it should be a framework.

swyx [01:20:03]: Right now it's a podcast network by Eric Tornberg.

Ankur Goyal [01:20:05]: Yes, that's actually why I thought of that word. Old-school elbow grease is what I'm saying, of hand-tuning prompts, that's another way of achieving that business goal. And there's actually a lot of cases where hand-tuning a prompt performs better than fine-tuning because you don't accidentally destroy the generality that is built into the world-class models. So in some ways it's safer, right? But really the goal is automatic optimization. And I think automatic optimization is a really valid goal, but I don't think fine-tuning is the only way to achieve it. And so, in my mind, for it to be a business, you need to align with the problem, not the technology. And I think that automatic optimization is a really great business problem to solve. And I think if you're too fixated on fine-tuning as the solution to that problem, then you're very vulnerable to technological shifts. There's a lot of cases now, especially with large context models, where in-context learning just beats fine-tuning. And the argument is sometimes, well, yes, you can get as good a performance as in-context learning, but it's faster or cheaper or whatever. That's a much weaker argument than, oh my god, I can really improve the quality of this use case with fine-tuning. It's somewhat tumultuous. A new model might come out, it might be good enough that you don't need to use, or it might not have fine-tuning, or it might be good enough that you don't need to use fine-tuning as the mechanism to achieve automatic optimization with the model. But automatic optimization is a thing. And so that's kind of the semantic thing, which I would say is maybe, at least to me, it feels like more of an absolute. I just don't think fine-tuning is a business outcome. There are several means to an end, and the end is valuable. Now, is fine-tuning a technically valid way of doing automatic optimization? I think it's very context-dependent. I will say, in my own experience with customers, as of the recording date today, which is September or something, very few of our customers are currently fine-tuning models. And I think a very, very small fraction of them are running fine-tuned models in production. More of them were running fine-tuned models six months ago than they are right now. And that may change. I think what OpenAI is doing with basically making it free and how powerful Llama 3 AB is and some other stuff, that may change. Maybe by the time this airs, more of our customers are fine-tuning stuff. But it's changing all the time. But all of them want to do automatic optimization.

swyx [01:22:35]: Yeah, it's worth asking a follow-up question on that. Who's doing that today well that you would call out?

Ankur Goyal [01:22:41]: Automatic optimization? No one.

swyx [01:22:43]: Wow. DSPy is a step in that direction. Omar has decided to join Databricks and be an academic. And I have actually asked who's making the DSPy startup. Somebody should.

Ankur Goyal [01:22:57]: There's a few. My personal perspective on this, which almost everyone, at least hardcore engineers, disagree with me about, but I'm okay with that, I think DSPy, I think there's two elements to it. One is automatic optimization. And the other is achieving automatic optimization by writing code. In particular, in DSPy's case, code that looks a lot like PyTorch code. And I totally recognize that if you were writing only TensorFlow before, then you started writing PyTorch. It's a huge improvement. And, oh my god, it feels like so much nicer to write code. If you are a TypeScript engineer and you're writing Next.js, writing PyTorch sucks. Why would I ever want to write PyTorch? And so I actually think the most empowering thing that I've seen is engineers and non-engineers alike writing really simple code. And whether it's simple TypeScript code that's auto-completed with cursor, or it's English, I think that the direction of programming itself is moving towards simplicity. And I haven't seen something yet that really moves programming towards simplicity. And maybe I'm a romantic at heart, but I think there is a way of doing automatic optimization that still allows us to write simpler code.

swyx [01:24:21]: Yeah, I think that people are working on it, and I think it's a valuable thing to explore. I'll keep a lookout for it and try to report on it through Latentspace.

Ankur Goyal [01:24:29]: And we'll integrate with everything. I don't know if you're working on this. We'd love to collaborate

swyx [01:24:33]: with you. For Ops people in particular, you have a view of the world that a lot of people don't get to see. You get to see workloads and report aggregates, which is insightful to other people. Obviously, you don't have them in front of you, but I just want to give rough estimates. You already said one which is kind of juicy, which is open-source models are a very, very small percentage. Do you have a sense of OpenAI versus Anthropic versus Cohere, MarketShare, at least through the segment that

Ankur Goyal [01:24:59]: you're in? So pre-Cloud 3, it was close to 100% OpenAI. Post-Cloud 3, and I actually think Haiku has slept on a little bit, because before 4.0 MIDI came out, Haiku was a very interesting reprieve for people to have very, very

swyx [01:25:15]: ...

Ankur Goyal [01:25:17]: Everyone knows Sonnet, right? But when Cloud 3 came out, Sonnet was like the middle child. Who gives a s**t about Sonnet? It's neither the super-fast thing Really, I think it was Haiku that was the most interesting foothold, because Anthropic is talented at figuring out either deliberately or not deliberately a value proposition to developers that is not already taken by OpenAI and providing it. I think now Sonnet is both cheap and smart, and it's quite pleasant to communicate with. But when Haiku came out, it was the smartest, cheapest, fastest model that was refreshing, and I think the fact that it supported tool calling was incredibly important. An overwhelming majority of the use cases that we see in production involve tool calling, because it allows you to write code that reliably ... Sorry, it allows you to write prompts that reliably plug in and out of code. And so, without tool calling, it was a very steep hill to use a non-OpenAI model with tool calling, especially because Anthropic embraced JSON schema

swyx [01:26:23]: and also did OpenAI. I mean, they did it first.

Ankur Goyal [01:26:27]: Outside of OpenAI. Yeah, OpenAI had already done it, and Anthropic was smart, I think, to piggyback on that versus trying to say, hey, do it our way instead. Because they did that, now you're in business, right? The switching cost is much lower because you don't need to unwind all the tool calls that you're doing, and you have this value proposition which is cheaper, faster, especially now, every new project that people think about, they do evaluate OpenAI and Anthropic. We still see an overwhelming majority of customers using OpenAI, but almost everyone is using Anthropic and Sonnet specifically for their side projects, whether it's via cursor or prototypes

swyx [01:27:09]: or whatever that they're doing. Yeah, it's such a meme.

Ankur Goyal [01:27:13]: It's actually kind of funny. I made fun of it. Yeah, I mean, I think one of the things that OpenAI does, an extremely exceptional job of this, is availability, rate limits, and reliability. It's just not practical outside of OpenAI to run use cases at scale in a lot of cases. You can do it, but it requires quite a bit of work, and because OpenAI is so good at making their models so available, I think they get a lot of credit for the science behind O1 and wow, it's like an amazing new model. In my opinion, they don't deserve credit for showing up every day and keeping the servers running behind one endpoint. You don't need to provision an OpenAI endpoint or whatever. It's just one endpoint. It's there. You need higher rate limits. It's there. It's reliable. That's a huge part

swyx [01:28:03]: of what they do well. We interviewed Michelle from that team. They do a ton of work, and it's a surprisingly small team. It's really amazing. That actually opens the way to a little bit of something I assume that you would know, which is, I would assume that small developers like us use those model lab endpoints directly, but the big boys, they all use Amazon for Anthropic because they have the special relationship. They all use Azure for OpenAI because they have that special relationship, and then Google has Google. Is that not true? It's not true. Isn't that weird? You wouldn't have all this committed spend on AWS that you're like, okay, fine, I'll use Cloud because I already have that.

Ankur Goyal [01:28:41]: In some cases, it's yes and. It hasn't been a smooth journey for people to get the capacity on public clouds that they're able to get through OpenAI directly. I mean, I think a lot of this is changing, catching up, etc., but it hasn't been perfectly smooth. I think there are a lot of caveats, especially around access to the newest models. With Azure early on, there's a lot of engineering that you need to do to actually get the equivalent of a single endpoint that you have with OpenAI. Most people built around assuming there's a single endpoint, so it's a non-trivial engineering effort to load balance across endpoints and deal with the credentials. Every endpoint has a slightly different set of credentials, has a different set of models that are available on it. There are all these problems that you just don't think about when you're using OpenAI, etc., that you have to suddenly think about. Now, for us, that turned into some opportunity. A lot of people use our proxy as a

swyx [01:29:35]: ... This is the gateway.

Ankur Goyal [01:29:37]: Exactly, as a load balancing mechanism to have that same user experience with more complicated deployments. But I think that in some ways, maybe a small fish in that pond, but I think that the ease of actually a single endpoint is, it sounds obvious or whatever, but it's not. And for people that are constantly, a lot of AI energy is spent on, and inference is spent on R&D, not just stuff that's running in production. And when you're doing R&D, you don't want to spend a lot of time on maybe accessing a slightly older version of a model or dealing with all these endpoints or whatever. And so I think the time to value and ease of use of what the model labs themselves have been able to provide, it's actually quite compelling.

swyx [01:30:23]: That's good for them. Less good for the public cloud partners to them.

Ankur Goyal [01:30:27]: I actually think it's good for both. It's not a perfect ecosystem, but it is a healthy ecosystem now with a lot of trade-offs and a lot of options. And as we're not a model lab, as someone who participates in the ecosystem, I'm happy. OpenAI released O1. I don't think Anthropic and Meta are sleeping on that. I think they're probably invigorated by it, and I think we're going to see exciting stuff happen. And I think everyone has a lot of GPUs now. There's a lot of ways of running LLAMA. There's a lot of people outside of Meta who are economically incentivized for LLAMA to succeed. And I think all of that contributes to more reliable points, lower costs, faster speed, and more options for you and me who are just using these

swyx [01:31:09]: models and benefiting from them. It's really funny. We actually interviewed Thomas from the LLAMA 3 post-training team. He actually talks a little bit about LLAMA 4, and he was already down that path even before O1 came out. I guess it was obvious to anyone in that circle, but for the broader worlds, last week was the first time they heard about it. I mean, speaking of O1, let's go there. How has O1 changed anything that you perceive? You're in enough circles that you already knew what was coming. Did it surprise you in any way? Does it change your roadmap in any way? It is long inference, so maybe it changes some assumptions?

Ankur Goyal [01:31:45]: I talked about how way back, rewinding to Impira, if you make assumptions about the capabilities of models and you engineer around them, you're almost guaranteed to be

swyx [01:31:57]: screwed. And I got screwed, not

Ankur Goyal [01:31:59]: necessarily a bad way, but I sort of felt that twice in a short period of time. I think that shook out of me, that temptation as an engineer that you have to say, GPT-4.0 is good at this, but models will never be good at that. So let me try to build software that works around that. I think probably you might actually disagree with this, and I wouldn't say that I have a perfectly strong structural argument about this. I'm open to debate, and I might be totally wrong, but I think one of the things that felt obvious to me and somewhat vindicated by O1 is that there's a lot of code and paths that people went down with GPT-4.0 to achieve this idea of more complex reasoning, and I think agentic frameworks are kind of like a little Cambrian explosion of people trying to work around the fact that GPT-4.0 or related models have somewhat limited reasoning capabilities. I look at that stuff and writing graph code that returns edge indirections and all this, it's like, oh my god, this is so complicated. It feels very clear to me that this type of logic is going to be built into the model. Anytime there is control flow complexity or uncertainty complexity, I think the history of AI has been to push more and more into the model. In fact, no one knows whether this is true or whatever, but GPT-4.0 was famously a mixture of experts.

swyx [01:33:31]: You mentioned it on our podcast.

Ankur Goyal [01:33:33]: Exactly. Yeah, I guess you broke the news, right?

swyx [01:33:35]: There were two breakers, Dylan and us. George was the first loud enough person to make noise about it. Prior to that,

Ankur Goyal [01:33:43]: a lot of people were building these round-robin routers that were like, you know, and you look at that and you're like, okay, I'm pretty sure if you train a model to do this problem and you vertically integrate that into the LLM itself, it's going to be better. And that happened with GPT-4. And I think O1 is going to do that to agentic frameworks as well. I think, to me, it seems very unlikely that you and me sort of like sipping an espresso and thinking about how different personified roles of people should interact with each other and stuff. It seems like that is just going to get pushed into the model. That was the main takeaway for me.

swyx [01:34:23]: I think that you are very perceptive in your mental modeling of me, because I do disagree 15-25%. Obviously, they can do things that we cannot, but you as a business always want more control than OpenAI will ever give you. They're charging you for thousands of reasoning tokens and you can't see it. That's ridiculous. Come on.

Ankur Goyal [01:34:45]: Well, it's ridiculous until it's not, right? I mean, it was ridiculous to GPT-3 too.

swyx [01:34:49]: Well, GPT-3, I mean, all the models had total transparency until now where you're paying for tokens you can't see.

Ankur Goyal [01:34:55]: What I'm trying to say is that I agree that this particular flavor of transparency is novel. Where I disagree is that something that feels like an overpriced toy, I mean, I viscerally remember playing with GPT-3 and it was very silly at the time, which is kind of annoying if you're doing document extraction. But I remember playing with GPT-3 and being like, okay, yeah, this is great, but I can't deploy it on my own computer and blah, blah, blah, blah. So it's never going to actually work for the real use cases that we're doing. And then that technology became cheap, available, hosted, now I can run it on my hardware or whatever. So I agree with you if that is a permanent problem. I'm relatively optimistic that, I don't know if Llama4 is going to do this, but imagine that Llama4 figures out a way of open sourcing some similar thing and you actually do

swyx [01:35:47]: have that kind of control on it. Yeah, it remains to be seen. But I do think that people want more control and this part of the reasoning step is something where if the model just goes off to do the wrong thing, you probably don't want to iterate in the prompt space, you probably just want to chain together a bunch of model calls to do what you're trying to do.

Ankur Goyal [01:36:07]: Perhaps, yeah. It's one of those things where I think the answer is very gray, like the real answer is very gray. And I think for the purposes of thinking about our product and the future of the space and just for fun debates with people I enjoy talking to like you, it's useful to pick one extreme of the perspective and just sort of latch onto it. But yeah, it's a fun debate to have and maybe I would say more than anything, I'm just grateful to participate in an ecosystem where we can have these debates.

swyx [01:36:39]: Very, very helpful. Your data point on the decline of open source in production is actually very...

Ankur Goyal [01:36:47]: Decline of fine-tuning in production.

swyx [01:36:51]: Can you put a number? Like 5%, 10% of your workload?

Ankur Goyal [01:36:55]: Is open source? Yeah. Because of how we're deployed, I don't have like an exact number for you. Among customers running in production, it's less than 5%.

swyx [01:37:03]: That's so small. The counters are the thesis that people want more control, that people want to create IP around their models and all that stuff.

Ankur Goyal [01:37:15]: I think people want availability.

swyx [01:37:17]: You can engineer availability with OpenWeights. Good luck. Really? Yeah. You can use Together, Fireworks, all these guys. They are nowhere

Ankur Goyal [01:37:25]: near as reliable as... I mean, every single time I use any of those products and run a benchmark, I find a bug, text the CEO, and they fix something. It's nowhere near where OpenAI is. It feels like using Joyent instead of using AWS or something. Yeah, great. Joyent can build single-click provisioning of instances and whatever. I remember one time I was using... I don't remember if it was Joyent or something else. I tried to provision an instance and the person was like, BRB, I need to run to Best Buy to go buy the hardware. Yes, anyone can theoretically do what OpenAI has done, but they just haven't.

swyx [01:38:01]: I will mention one thing that I'm trying to figure out. We obliquely mentioned the GPU inference market. Is anyone making money? Will anyone make money? In the GPU inference market,

Ankur Goyal [01:38:11]: people are making money today, and they're making money with really high margins.

swyx [01:38:15]: Really? Yeah. Because I calculated the grok numbers. Dylan Patel thinks they're burning cash. I think they're about break-even.

Ankur Goyal [01:38:23]: It depends on the company. So there are some companies that are software companies, and there are some companies that are hardware bets, right? I don't have any insider information, so I don't know about the hardware companies, but I do know for some of the software companies, they have high margins and they're making money. I think no one knows how durable that revenue is, but all else equal, if a company has some traction and they have the opportunity to build relationships with customers, I think independent of whether their margins erode for one particular product offering, they have the opportunity to build higher margin products. And so inference is a real problem, and it is something that companies are willing to pay a lot of money to solve. To me, it feels like there's opportunity. Is the shape of the opportunity inference API? Maybe not, but we'll see.

swyx [01:39:11]: We'll see. Those guys are definitely reporting very high ARR numbers.

Ankur Goyal [01:39:17]: From all the knowledge I have, the ARR is real. Again, I don't have any insider

swyx [01:39:21]: information. Together's numbers were leaked or something on the Kleiner Perkins podcast. And I was like, I don't think that was public, but now it is. So that's kind of interesting. Any other industry trends you want to discuss? Nothing else that I can think of. I want to hear yours. Just generally workload market share. You serve superhuman. They have superhuman AI, they do title summaries and all that. I just would really like type of workloads, type of evals. What is AI being used in production today to do?

Ankur Goyal [01:39:55]: I think 50% of the use cases that we see are what I would call single prompt manipulations. Summaries are often but not always a good example of that. And I think they're really valuable. One of my favorite gen AI features is we use linear at Braintrust. And if a customer finds a bug on Slack, we'll click a button and then file a linear ticket. And it auto generates a title for that ticket. No idea how it's implemented. I don't care. Loom has some really similar features which I just find amazing.

swyx [01:40:27]: So delightful. You record the thing,

Ankur Goyal [01:40:29]: it titles it properly. And even if it doesn't get it all the way properly, it sort of inspires me to maybe tweak it a little bit. It's so nice. And so I think there is an unbelievable amount of untapped value in single prompt stuff. And the thought exercise I run is anytime I use a piece of software, if I think about building that software as if it were rebuilt today, which parts of it would involve AI? Almost every part of it would involve running a little prompt here or there to have a little bit of delight.

swyx [01:41:01]: By the way, before you continue, I have a rule for building Smalltalk which we can talk about separately. It should be easy to do those AI calls. Because if it's a big lift, if you have to edit five files, you're not going to do it. But if you can just sprinkle intelligence everywhere, then you're going to do it more.

Ankur Goyal [01:41:17]: I totally agree. And I would say, that probably brings me to the next part of it. I'd say probably 25% of the remaining usage is what you could call a simple agent. Which is probably a prompt plus some tools. At least one, or perhaps the only tool is a rag type of tool. And it is kind of like an enhanced chatbot or whatever that interacts with someone. Then I'd say probably the remaining 25% are what I would say are advanced agents, which are things that you can maybe run for a long period of time or have a loop or do something more than that simple but effective paradigm. And I've seen a huge change in how people write code over the past six months. When this stuff first started being technically feasible, people created very complex programs that almost reminded me of studying math again in college. It's like, you compute the shortest path from this knowledge center to that knowledge center, and then blah, blah, blah. It's like, oh my god. You write this crazy continuation passing code. In theory, it's amazing. It's just very, very hard to actually debug this stuff and run it. Almost everyone that we work with has gone into this model that actually exactly what you said, which is sprinkle intelligence everywhere and make it easy to write dumb code. It's a prevailing model that is quite exciting for people on the frontier today. I dearly hope as a programmer succeeds, is one where what is AI code? It's not a thing, right? It's just, I'm creating an app, NPX, create next app, or whatever, like FastAPI, whatever you're doing, and you just start building your app, and some parts of it involve some intelligence, some parts don't. Maybe you do some prompt engineering, maybe you do some automatic optimization, you do evals as part of your CI workflow. I'm just building software, and it happens to be quite intelligent as I do it because I happen to have these things available to me. That's what I see more people doing. The sexiest intellectual way of thinking about it is that you design an agent around the user experience that the user actually works with rather than the technical implementation of how the components of an agent interact with each other. When you do that, you almost necessarily need to write a lot of little bits of code, especially UI code, between the LLM calls. The code ends up looking kind of dumber along the way because you almost have to write code that engages the user and crafts the user experience as the LLM

swyx [01:44:03]: is doing its thing. Guy Podjarny So here are a couple things that you did not bring up. One is doing the Code Interpreter agent, the Voyager agent where the agent writes code, and then it persists that code and reuses that code in the future.

Ankur Goyal [01:44:17]: I don't know anyone who's doing that.

swyx [01:44:19]: When Code Interpreter was introduced last year, I was like,

Ankur Goyal [01:44:21]: this is AGI. There's a lot of people, it should be fairly obvious if you look at our customer list, who they are, but I won't call them out specifically, that are doing CodeGen and running the code that's generated in arbitrary environments, but they have also morphed their code into this dumb pattern that I'm talking about, which is like, I'm going to write some code that calls an LLM, it's going to write some code, I might show it to a user or whatever, and then I might just run it. I like the word Voyager that you use.

swyx [01:44:53]: I don't know anyone who's doing that. Guy Podjarny Voyager is in the paper. My term for this, if you want to use the term, you can use mine, is core versus LLM core. This is a direct parallel from systems engineering, where you have functional core imperative shell. This is a term that people use. You want your core system to be very well defined and imperative outside to be easy to work with. The AI engineering equivalent is that you want the core of your system to not be this Shoggoth, where you just chuck it into a very complex agent. You want to sprinkle LLMs into a database. Because we know how to scale systems, we don't know how to scale agents that are quite hard to be reliable.

Ankur Goyal [01:45:39]: I was saying, I think while in the short term there may be opportunities to scale agents by doing silly things, it feels super clear to me that in the long term, anything you might do to work around that limitation of an LLM will be pushed into the LLM. If you build your system in a way that assumes LLMs will get better at reasoning and get better at sort of agentic tasks in the LLM itself, then I think you will build a more durable system.

swyx [01:46:05]: What is one thing you would build if you're not working on

Ankur Goyal [01:46:07]: Brain Trust? A vector database. My heart is still with databases

swyx [01:46:13]: a lot. I mean, sometimes I... Non-ironically.

Ankur Goyal [01:46:17]: Not a vector database. I'll talk about this in a second, but I think I love the Odyssey. I'm not Odysseus, I don't think I'm cool enough, but I sort of romanticize going back to the farm. Maybe just like, Alanna and I move to the woods someday and I just sit in a cabin and write C++ or Rust code on my MacBook Pro and build a database or whatever. So that's sort of what I drool and dream about. I think practically speaking, I am very passionate about this variant-type issue that we've talked about because I now work in observability, where that is a cornerstone to the problem. And I mean, I've been ranting to Nikita and other people that I enjoy interacting with in the database universe about this, and my conclusion is that this is a very real problem for a very small number of companies. And that is why Datadog, Splunk, Honeycomb, et cetera, et cetera, built their own database technology, which is in some ways, it's sad because all of the technology is a remix of pieces of Snowflake and Redshift and Postgres and other things, Redis, whatever, that solve all of the technical problems. And I feel like if you gave me access to all the codebases and locked me in a room for a week or something, I feel like I could remix it into any database technology that would solve any problem. Back to our HTAP thing, it's kind of the same idea. But because of how databases are packaged, which is for a specific set of customers that have a particular set of use cases and a particular flavor of wallet, the technology ends up being inaccessible for these use cases like observability that don't fit a template that you can just sell and resell. I think there are a lot of these little opportunities, and maybe some of them will be big opportunities, maybe they'll all be little opportunities forever, but there's probably a set of such things, the variant type being the most extreme right now, that are high frustration for me and low value for database companies that are all interesting things for me to work on.

swyx [01:48:23]: Well, maybe someone listening is also excited and maybe they can come to you for advice and funding. Maybe I need to refine my question. What AI company or product would you work on if you're not working on

Ankur Goyal [01:48:37]: Braintrust? Honestly, I think if I weren't working on Braintrust, I would want to be working either independently or as part of a lab and training models. I think I with databases and just in general, I've always taken pride in being able to work on the most leading version of things and maybe it's a little bit too personal, but one of the things I struggled with post-single store is there are a lot of data tooling companies that have been very successful that I looked at and was like, oh my god, this is stupid. You can solve this inside of a database much better. I don't want to call out any examples because I'm friends with a lot of these people. Yeah, maybe. But what was a really sort of humbling thing for me and I wouldn't even say I fully accepted it is that people that maybe don't have the ivory tower experience of someone who worked inside of a relational database but are very close to the problem, their perspective is at least as valuable in company building and product building as someone who has the ivory tower of like, oh my god, I know how to make in-memory skip list that's durable and lock-free. And I feel like with AI stuff, I'm in the opposite scenario. I had the opportunity to be in the ivory tower and at open air, train a large language model, but I've been using them for a while now and I felt like an idiot. I kind of feel like I'm one of those people that I never really understood in databases who really understands the problem but is not all the way in the technology and so that's probably what I'd work on.

swyx [01:50:13]: This might be a controversial question, but whatever. If OpenAI came to you with an offer today, would you take it? Competitive fair market value, whatever that means for your investors.

Ankur Goyal [01:50:25]: Fair market value, no. But I think that I would never say never, but I really...

swyx [01:50:33]: Because then you'd be able to work on their platform, bring your tools to them, and then also talk to the researchers.

Ankur Goyal [01:50:39]: Yeah, I mean, we are very friendly collaborators with OpenAI and I have never had more fun day-to-day than I do right now. One of the things I've learned is that many of us take that for granted. Now having been through a few things, it's not something I feel comfortable taking for

swyx [01:50:59]: granted again.

Ankur Goyal [01:51:01]: I wouldn't even call it independence. I think it's being in an environment that I really enjoy. I think independence is a part of it, but I wouldn't say it's the high-order bit. I think it's working on a problem that I really care about for customers that I really care about with people that I really enjoy working with. Among other things, I'll give a few shout-outs. I work with my brother. Did I see him? No.

swyx [01:51:25]: He was sitting right behind us.

Ankur Goyal [01:51:27]: And he's my best friend, right? I love working with him. Our head of product, Eden, he's a designer at Airtable and Cruise. He is an unbelievably good designer. If you use the product, you should thank him. He's just so good, and he's such a good engineer as well. He destroyed our programming interviews, which we gave him for fun. But it's just such a joy to work with someone who's just so good, and so good at something that I'm not good at. Albert joined really early on, and he used to work at ABC, and he does all the business stuff for us. He has negotiated giant contracts, and I just enjoy working with these people. I feel like our whole team is just so good.

swyx [01:52:15]: Yeah, you worked really hard to get here.

Ankur Goyal [01:52:17]: I'm just loving the moment. That's something that would be very hard for me to give up.

swyx [01:52:21]: Understood. While we're in the name-dropping and doing shout-outs, I think a lot of people in the San Francisco startup scene know Alana, and most people won't. Is there one thing that you think makes her so effective that other people can learn from, or that you learn from?

Ankur Goyal [01:52:37]: Yeah, I mean, she genuinely cares about people. When I joined Figma, if you just look at my profile, I really don't mean this to sound arrogant, but if you look at my profile, it seems kind of obvious that if I were to start another company, there would be some VC interest. And literally there was. Again, I'm not that special, but...

swyx [01:52:57]: No, but you had two great runs.

Ankur Goyal [01:52:59]: It just seems kind of obvious. I mean, I'm married to Alana, so of course we're going to talk, but the only people that really talked to me during that period were Elad

swyx [01:53:09]: and Alana. Why?

Ankur Goyal [01:53:11]: It's a good question. You didn't try

swyx [01:53:13]: hard enough.

Ankur Goyal [01:53:15]: It's not like I was trying to talk to VCs.

swyx [01:53:19]: So in some sense, while talking to Elad is enough, and then Alana can fill in the rest,

Ankur Goyal [01:53:25]: that's it? Yeah, so I'm just saying that these are people that genuinely care about another human. There are a lot of things over that period of getting acquired, being at Figma, starting a company, that they're just really hard. And what Alana does really, really well is she really, really cares about people. And people are always like, oh my god, how come she's in this company before I am or whatever? It's like, who actually gives a s**t about this person and was getting to know them before they ever sent an email? You know what I mean? Before they started this company and 10 other VCs were interested and now you're interested. Who is actually talking to this person?

swyx [01:54:05]: She does that consistently. Exactly. The question is obviously how do you scale that? How do you scale caring about people? Do they have a personal CRM?

Ankur Goyal [01:54:15]: Alana has actually built her entire software stack herself. She studied computer science and was a product manager for a few years, but she's super technical and really, really good at writing code.

swyx [01:54:27]: For those who don't know, every YC batch, she makes the best of the batch and she puts it all into one product. Yeah, she's just an amazing

Ankur Goyal [01:54:35]: hybrid between a product manager, designer, and engineer. Every time she runs into an inefficiency, she solves

swyx [01:54:41]: it. Cool. Well, there's more to dig there, but I can talk to her directly. Thank you for all this. This was a solid two hours of stuff. Any calls

Ankur Goyal [01:54:49]: to action? Yes. One, we are hiring software engineers, we are hiring salespeople, we are hiring a dev rel, and we are hiring one more designer. We are in San Francisco, so ideally, if you're interested, we'd like you to be in San Francisco. There are some exceptions, so we're not totally close-minded to that, but San Francisco is significantly preferred. We'd love to work with you. If you're building AI software, if you haven't heard of Braintrust, please check us out. If you have heard of Braintrust and maybe tried us out a while ago or something and want to check back in, let us know or try out the product, we'd love to talk to you. I think, more than anything, we're very passionate about the problem that we're solving and working with the best people on the problem. We love working with great customers and have some good things in place that have helped us scale a little bit, so we have a lot of capacity

swyx [01:55:49]: for more. Well, I'm sure there will be a lot of interest, especially when you announce your Series A. I've had the joy of watching you build this company a little bit, and I think you're one of the top founders I've ever met, so it's just great to sit down with you and learn a little bit. It's very kind. Thank you. Thanks. That's it.

Ankur Goyal [01:56:05]: Awesome. Get full access to Latent Space at www.latent.space/subscribe)

Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust 01:56:40 Share

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Shownotes Transcript

Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust