
Launching AI products with Braintrust’s CEO Ankur Goyal

2024/10/8

No Priors: Artificial Intelligence | Technology | Startups

People

Ankur Goyal, Elad Gil
Topics
Elad Gil introduces Braintrust and its product, noting that Braintrust is an end-to-end enterprise platform for building AI applications, helping companies efficiently evaluate and manage complex, non-deterministic AI applications. Braintrust helps companies like Notion, Airtable, Instacart, Zapier, and Vercel with evals, observability, and prompt development. Ankur Goyal shares his insights on emerging trends in AI tooling and programming languages, the rise of open source, and the future of data infrastructure. He also discusses building resilient AI products, his philosophy on coding as a CEO, and the importance of a startup's initial customer base.

Ankur Goyal elaborates on Braintrust's product capabilities and history, along with the challenges encountered in AI product development and their solutions. He notes that the difficulty of evaluation in AI product development existed both before and after the advent of LLMs, and persists today. Braintrust's initial prototype was rough, but because it met a real market need, users adopted it, and it has improved through continuous iteration. He explains the difference between instruction tuning and fine-tuning, and why most customers have moved to instruction tuning. He also discusses the current state of open-source model adoption, the future of data infrastructure, and how to build AI teams. He shares his experience writing code as a CEO and how he works with customers, and offers an outlook on Braintrust's future direction.

Chapters
Ankur Goyal, CEO of Braintrust, discusses the company's origins, its mission to help companies build AI applications at scale, and the challenges of evaluating and managing complex AI applications. He highlights the consistent need for robust AI evaluation tools across various stages of AI development.
  • Braintrust helps companies like Notion, Airtable, Instacart, and Zapier deploy AI solutions.
  • The company's initial prototype was quickly adopted by users.
  • The problem of AI evaluation is harder than it seems, requiring consistent and standardized methods.

Transcript



So today on No Priors, we have Ankur Goyal, the co-founder and CEO of Braintrust. Ankur was previously vice president of engineering at SingleStore and was the founder and CEO of Impira, an AI company acquired by Figma. Braintrust is an end-to-end enterprise platform for building AI applications. They help companies like Notion, Airtable, Instacart, Zapier, Vercel, and many more with evals, observability, and prompt development for their AI products. And Braintrust just raised $36 million from Andreessen Horowitz and others.

Ankur, thank you so much for joining us today on No Priors. Very excited to be here. Can you tell us a little bit more about Braintrust, what the product does, and how you got started in this area and in AI more generally? Yeah, for sure. So I have been working on AI since what one might now think of as ancient history. Back in 2017, when we started working on Impira, things were totally different. But

Still, it was really hard to ship products that work. And so we built tooling internally as we developed our AI products to help us evaluate things, collect real user data, use it to do better evals and so on. Fast forward a few years, Figma acquired us and we actually ended up having exactly the same problems and building pretty much the same tooling system.

And I thought that was interesting for a few reasons, some of which you pointed out, by the way, when we were hanging out and chatting about stuff. But one, Impira was kind of pre-LLM. My time at Figma was post-LLM. But these problems were the same. And I think there's some longevity that's implied by that. Problems that existed pre-LLM are probably going to exist in LLM land for a while.

And the second thing is that, having built the same tooling essentially twice, it was clear that there was a pretty consistent need. And so, you know, I have very fond memories of the two of us hanging out and talking to a bunch of folks, like Brian and Mike at Zapier and Simon at Notion, and many others.

I've been in a lot of user interviews over time, and I've never seen anything resonate like the early ideas around Braintrust, and really everyone's desire to have a good solution to the eval problem. So we got to work and built, honestly, a pretty crappy initial prototype, but people started using it. And Braintrust, just

over a year later, has now iterated from people's feedback and complaints and ideas into something I think is really powerful. And yeah, that's how we got started. Yeah, I remember in the early conversations we had around the company, or the idea, I should say, it was meant to even potentially be open source. And it was the first time that I was involved with some sort of customer call where people would say, we don't want you to open source it, which I found really surprising. People really pushed on, we want this to exist for a long time. We want to be able to pay for it.

And so there was that kind of really interesting market pull. Why do you think there was so much interest or need or demand for this? Or, you know, what does Braintrust do and how does that really impact your customers? You know, many of our early customers had actually built internal versions of Braintrust before we engaged with them. And

There's a couple of things that sort of came out of that. One is it helped them gain an appreciation for how hard the problem is. Evals sound really easy: oh, it's just a for loop, and then I console.log inside the for loop as I go and look at the results.
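
To make that concrete, here's roughly what that naive for-loop eval looks like. This is a minimal illustrative sketch in TypeScript, not Braintrust's SDK; `cases`, `runTask`, and `exactMatch` are hypothetical stand-ins:

```typescript
// The "it's just a for loop" eval: run each case, score it, print it.
type EvalCase = { input: string; expected: string };

const cases: EvalCase[] = [
  { input: "2 + 2", expected: "4" },
  { input: "capital of France", expected: "Paris" },
];

// Stand-in for a real model call (e.g. a chat-completion request).
async function runTask(input: string): Promise<string> {
  return input === "2 + 2" ? "4" : "Paris";
}

// Simplest possible scorer: exact string match.
function exactMatch(output: string, expected: string): number {
  return output === expected ? 1 : 0;
}

async function main() {
  let total = 0;
  for (const c of cases) {
    const output = await runTask(c.input);
    const score = exactMatch(output, c.expected);
    total += score;
    console.log({ input: c.input, output, score });
  }
  console.log(`accuracy: ${total / cases.length}`);
}

main();
```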

But the reality is, the faster you can eval, and the faster you can look at eval results (which start to get really complicated as you start doing things with agents and so on), the faster you can actually iterate and build stuff. It is actually a pretty hard problem to do evals well. And many of our early customers,

who were kind of like the pioneers in AI engineering, had learned that the hard way. And I think the other problem is that folks, especially folks like Brian, for example, they saw that AI would be a pervasive technology throughout the whole org, not just a project that Brian might babysit and work on with one team.

And having a really consistent and standardized way of doing things was really important. I remember early on, Brian pointed me to the Vercel docs and he said, one of the things I love about this is that when new engineers are building UI now, they read these docs and they kind of learn the right way to build web applications. And you have that opportunity with AI. And I found that actually really motivating and really influenced how we think about things.

It makes a lot of sense. I guess if you're swapping out GPT-4 for Claude, or you're making a change in model, or you're changing a prompt, it just helps you really understand how that propagates and which sets of outcomes for users are better, which are worse, and kind of troubleshoot them. And then it feels like you've built a whole other series of products around that that really help support that. One of the biggest things when you're building AI products is this uncertainty about quality. So you might, for example,

get really excited about a feature, build a prototype. It works on a few examples. You ship it to some users and you realize it actually doesn't work very well. And it's just really hard to go from that prototype into something that systematically works in an excellent way. And I think

what we have helped companies do is basically demystify that process. So instead of having a bunch of anxiety about, hey, I shipped something, I don't know if I'm ever going to get it to work well, you can implement some evals in Braintrust and then sort of turn the crank and get really, really good outputs. You know, you work with a lot of the companies that I feel are the earliest adopters of AI into their own products. In other words, they've actually shipped products with AI in them, and they're sort of that first wave. It's

like Notion, Airtable, Zapier, Vercel, people like that. What proportion of your customers do you think are adopting some of the things that people are talking about a lot? And so that would be things like fine-tuning or RAG or building agents. Do you think that's a very common set of things? Or do you think that's just kind of hype? Because I think you have a very clear picture of at least one segment of the enterprise market in terms of what people are actually doing. Unambiguously, people are doing RAG. So that one is simple and obvious.

Probably around 50% of the use cases that we see in production involve RAG of some sort. Fine-tuning is interesting. I think a lot of people think of fine-tuning as an outcome, but it's actually really a technique. And the outcome that people are looking for is automatic optimization of their workloads. Fine-tuning is one way of doing that, and it is a very, very difficult way

of automatically optimizing your use case. I think we, with our customers, have re-benchmarked fine-tuning on their workloads, I would say, every two to three months. And there was a period of time when GPT-3.5 fine-tuning

came out before GPT-4 access was easy to get. Now it's extremely cheap to run GPT-4o, but there was this period where it was really hard to have GPT-4 access, and GPT-3.5 fine-tuning was, like, the only lever for some use cases to improve quality. But since then, honestly, I think

almost all, if not all, of our customers have moved off of fine-tuned models onto instruction-tuned models and are seeing really good performance. We even talked about that early on. I remember when we were thinking about Braintrust, we thought, oh boy, everyone's going to need to use this to fine-tune models. And that was one of the first features we were thinking about building.

And, you know, no one's really doing it. Could you explain, just for the listeners, the difference between instruction tuning and fine-tuning? Yeah, I mean, I think it's kind of like the difference between writing Python code and creating an FPGA or something. So with instruction tuning, all you do is modify the prompt to include examples of how the model should behave.

You know, in some ways it's actually very similar to fine tuning. You're collecting data that guides how the model should behave and then you're feeding it into a process that kind of nudges the model towards behaving that way.

Fine-tuning is a much lower-level thing where you're actually modifying or supplementing the weights in a model so that it learns from those examples. And because it's so much lower level, it tends to be a lot slower and more expensive.
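
A rough sketch of the contrast he's drawing, with illustrative data. The few-shot prompt is the "instruction tuning" case, where the examples live in the prompt itself; the JSONL records show the chat-style shape that fine-tuning pipelines typically consume to nudge the weights offline:

```typescript
// Illustrative data; the task and labels are made up for this sketch.
const examples = [
  { input: "refund request", label: "billing" },
  { input: "app crashes on login", label: "bug" },
];

// Instruction tuning: the examples are embedded directly in the prompt.
const fewShotPrompt = [
  "Classify the support ticket into one of: billing, bug, other.",
  ...examples.map((e) => `Ticket: ${e.input}\nLabel: ${e.label}`),
  "Ticket: can't update my credit card\nLabel:",
].join("\n\n");

// Fine-tuning: the same examples instead become offline training records
// (shown in the chat-style JSONL shape many providers accept).
const fineTuneRecords = examples.map((e) =>
  JSON.stringify({
    messages: [
      { role: "user", content: `Classify: ${e.input}` },
      { role: "assistant", content: e.label },
    ],
  })
);

console.log(fewShotPrompt);
console.log(fineTuneRecords.join("\n"));
```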

There's a lot of ways you can injure the model while you're fine-tuning and actually make it worse on real-world use cases. So it's just a lot tougher to get right. And then do you see a lot of open source adoption, or mainly people using proprietary models? And are there other early technologies that you see people adopting right now? We are very close to a watershed moment for open source models, like the watershed moment we saw for Anthropic when Claude 3 came out, and especially Claude 3.5

Sonnet has really taken off. We are very close to that, I think, with Llama 3.1, but we're not there yet. So we see very limited practical adoption of open source models, but I think more interest than ever. And I think a lot of what you're seeing is also just things that are in production, right? And so to some extent, there's a lot of discussion in the developer communities around what people are using and adopting and playing with.

And then I think you're really focused on the market of enterprises that are shipping AI products. And obviously it can be used by hackers and developers as well, but a lot of your usage is from people who have things in production. And so it kind of reflects the state of the world for live systems at scale. I am a developer and I love open source software. And I have a very...

difficult time with the fact that every time I use an OpenAI model, I'm paying a fee per token. But then I actually look at the numbers, and of course I've looked at them with our customers too, and in some cases it's just negligibly cheap. And in the cases where it's pretty expensive, the ROI is actually really high. And so most of our customers are really, really focused on providing the best possible user experience for their customers and the fastest

iteration speed for their developers, and everything else is secondary. So I think until open source can really move the needle on one of those two axes, it's going to be tough for it to be adopted broadly. The other place you've spent a lot of your career is on databases and data infrastructure and things like that. You were the VP of Engineering at SingleStore, which I think

was renowned for having a really exceptional database team. How do you think about the data infrastructure that exists for the AI world today? What's needed? What's lacking? What works well? What doesn't? The shift is that people have hoarded lots and lots of semi-useful data in data warehouses. Prior to LLMs,

there was actually this whole industry around AI where companies like DataRobot, for example, would come in and help you train models based on the proprietary structured data that you've collected in your super-proprietary data warehouse. And I think the big insight, or the crazy non-intuitive thing, about LLMs is that something trained on the internet

outperforms what an enterprise can produce by training on their own data in a data warehouse. And I think not only is the nature of the data processing problem different,

but the value of data, and how we think about the value of data, is very, very different. Just hoarding data about your claims history or transaction history might not actually be that useful. The real question is, how do you construct a model which is really good at reasoning about the problems that you're working on? And I think the way that enterprises will

collect data and leverage it into these AI processes does not look like doing ETL on a data warehouse that's running in Amazon or something like that. I think it's going to totally change. And I've seen that a lot of the data that gets

stored in Braintrust through people's logs actually never makes it to a data warehouse. And people just don't really care, because if they put it in a data warehouse, what are they going to do with it? What do you think is missing from a data infrastructure perspective? So I think to your point, there's a couple of different steps. There's some sort of data cleaning step. There's some storage layer. There's different forms of labeling, etc.

How do you think all these pieces kind of evolve over the next couple of years? And then I guess related to that, the other topic people have been talking a lot about is synthetic data and how important that will be in the future. I'm sort of curious your views on these different areas. Purely from a data standpoint, it's important to think about what you're going to do with the data and then how the infrastructure enables that. So...

A data warehouse is really designed for ad hoc exploration of structured data. Neither of those two things is relevant in AI land. You're dealing with lots and lots of text, and you're not exploring it ad hoc using SQL queries. What we actually see the most advanced companies doing is using embeddings and models themselves to help them

sift through tons and tons of data and find, for example, customer support tickets which are not well represented in the data that they're using for their evals, or not well represented in their fine-tuning datasets, and

trying to find those examples and use them. So I think the workload is going to shift. And I actually think LLMs, and specifically embeddings, are going to be core to how people actually query data, not traditional algebraic relational indexes. That's going to be a huge shift.
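
A minimal sketch of that sifting pattern, using toy two-dimensional vectors. In practice the vectors would come from a real embedding model, and the similarity threshold would be tuned:

```typescript
// Cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Pretend embeddings; real ones come from an embedding model API.
const evalSet: number[][] = [[1, 0], [0.9, 0.1]];
const productionTickets = [
  { text: "password reset loop", vec: [0.95, 0.05] },
  { text: "GDPR data export request", vec: [0, 1] },
];

// Keep tickets whose nearest eval-set neighbor is below a similarity
// threshold; these are the underrepresented cases worth adding to evals.
const gaps = productionTickets.filter(
  (t) => Math.max(...evalSet.map((e) => cosine(t.vec, e))) < 0.8
);
console.log(gaps.map((t) => t.text)); // ["GDPR data export request"]
```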

There's this huge debate about vector databases and whether traditional databases will do vector database things. I think that debate's kind of silly. Relational databases are perfectly capable of adding HNSW indices to them. What will really be disrupted is the OLAP workload. You can't just slap

semantic search and stuff into the architecture of a traditional data warehouse. I think that's actually a much deeper set of things that will need to change than for the OLTP workload. This is, in some sense, your third startup experience, right? You joined MemSQL, now SingleStore, quite early. You started Impira, which Figma acquired. You're now doing Braintrust. What are the common things that you've

taken with you as you've done this new startup? What are the things that you've implemented early? What are the things that you've avoided? You know, one of the things that I honestly took for granted at MemSQL, but that we've re-implemented at Braintrust, is having a really hard technical interview. At MemSQL, maybe we pushed it a little too far, but it was really known for strong technical excellence, and I think our interview reflected that. So that was actually one of the first things that we did.

Manu and I spent probably two or three days working through a bunch of really, really hard interview questions. And I think it's just important that you hold the technical bar really high and try to find people that are attracted to it.

For example, if you do a front-end interview at Braintrust, one of the questions involves writing some C++. And we lose a lot of candidates because of that question. But it's a good signal that maybe Braintrust isn't the right place for you to work, because we like to hire people who are willing to jump around in areas of the stack that they're unfamiliar with. So I think that's one of the biggest things that

we've carried over. Another thing that I think we did really well at both Impira and MemSQL is have an obsessive

relationship with our customers and just really, really focus on making them successful. It's sometimes really hard to prioritize customer feedback and think about, ten customers are asking for ten different things, what do I do? So what we've done at Braintrust is be very deliberate about which customers we prioritize, especially early on, and sort of hypothesize that the Zapiers and Notions of the world would have pretty similar use cases.

And so if you focus on these kinds of customers, then when they ask for stuff, you can pretty readily assume that other similar customers are going to have the same problem. And that's allowed us to be very, very customer-centric while building a product that repeats itself for more customers. And now what we're seeing is that the next wave of companies that are building with AI, both startups and more traditional enterprises, actually want to be

engineering things like the products that they admire, most of which use Braintrust. And so a lot of those best practices are now built into the product, and the next batch of companies is able to consume them right out of the box. Yeah, it's kind of interesting. I feel like even early on, as companies were first adopting LLMs for actual live products, they would all follow kind of the same startup journey, or I should say technical journey.

Initially, they'd look into, at least back then, they'd look into fine-tuning or some open source model or something else.

They'd eventually realize they should just be using GPT-4, which was the primary model at the time. And then they'd go through this big loop of starting to build internal tools and then realizing that really their focus should be on product. And it was the exact same journey. And I remember in the early Braintrust customer conversations, you'd talk to them and they'd say, oh, we don't need this. And then three months later, they'd call and say, okay, we really need this. And it was always roughly the same timeframe.

Are you seeing any common patterns today in terms of, okay, companies that are now a year or 18 months into their journey using LLMs, like they always have the same thing come up? There's a couple things. So one is companies that are fairly deep into their journey, they have like one or two North Star products that are pretty mature and they're trying to figure out how to get those products to the next stage. Probably the most consistent thing I've seen is companies kind of walking back from the

illusion that totally free form agents will solve all of their problems. So I think maybe like two or three months ago, many of the pioneering companies went way down the agent rabbit hole. And they kind of realized like, wow, this is actually not...

This is not the right approach. It's so hard to control performance. The error rates are really high and they compound really quickly. And so most of those companies have kind of walked back and tried to build a different architecture where the control flow is actually managed deterministically by their code, but they make LLM calls kind of like throughout the entire architecture of the product.
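
A sketch of what that kind of architecture might look like, with a hypothetical `llm` helper standing in for a real model call. The branching and fallbacks live in ordinary TypeScript; the model never decides what runs next:

```typescript
// Stand-in for a real chat-completion call.
async function llm(prompt: string): Promise<string> {
  return "...";
}

async function handleSupportTicket(ticket: string) {
  // Step 1: classify — one bounded LLM call, with its output checked by code.
  const category = (await llm(`Classify as billing|bug|other: ${ticket}`)).trim();
  if (!["billing", "bug", "other"].includes(category)) {
    return { status: "needs_human", ticket }; // deterministic fallback
  }

  // Step 2: branch deterministically; the control flow is owned by code,
  // with LLM calls sprinkled throughout rather than inside an agent loop.
  if (category === "billing") {
    const draft = await llm(`Draft a billing support reply for: ${ticket}`);
    return { status: "drafted", category, draft };
  }
  return { status: "routed", category };
}

handleSupportTicket("I was charged twice this month").then(console.log);
```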

And so that's probably the biggest thing that we're seeing now is, I don't know if there's a good term for it yet, but maybe this kind of pervasive AI engineering throughout a product rather than trying to shove everything into the, you know, while loop of an agent. Yeah, the other thing that I've heard you talk about in the past is the evolving role of what an AI team does at a company.

And so I think if you go back a couple of years, people were doing machine learning and they'd hire a big MLOps team. And then the types of things that they'd be doing day to day were very different from what they do today in the context of adopting AI. And even how you think about the role and who to hire maybe has shifted a bit. Could you talk a little bit about

what you view as the evolution of the role of the data science team, the data team, the ML or AI team, et cetera? Yeah, I think what's really interesting is that many of the early adopters of LLMs didn't have any ML teams

when ChatGPT came out, what is it now, almost two years ago. And those companies were able to move really quickly because they started with a fresh slate. Many of the smart folks that I know who are classical machine learning people or data scientists have now come around.

But actually, there was this big sort of resistance among them early on that LLMs are not good at the things that we're trying to solve, or maybe it's a scam or something like that. Do you think that was just like a different problem set in terms of traditional ML and the applications of it are different from what Gen AI can do? Or do you think it was something else? Well, I went through this myself, watching the technology that we built to do document extraction at Impira become totally irrelevant. And

Personally, I think it's an emotional thing. You try GPT-3 for the first time. And first of all, back then at least, it was kind of snarky. And so that was a little bit irritating. And it was also just way better at everything than anything you could possibly train. And

I think that is so fundamentally disruptive to a lot of companies and to a lot of people's individual identity. It just is not easy to wrap your head around if you've been doing AI and ML for a while. So I think it was largely an emotional thing. You could argue that there is a cost, security, privacy, whatever element of it, but the companies that were on the leading edge were able to figure that out pretty quickly.

Now I think more companies have come along the journey and I've seen a lot of really smart ML and data science people embrace LLMs and bring a lot of the sort of rigor that is still relevant around evals and measurement and prototyping and so on.

and become these AI platform teams. Usually it's a combination of people with product engineering backgrounds and a few folks with statistics or data science backgrounds. And they start by building kind of a marquee product for the company, and then they evolve into a platform team that enables the n-plus-first project to be really successful.

We see a lot of these teams forming as AI becomes more pervasive. So if you were at an enterprise company right now and you were trying to adopt AI or LLMs, who would you have to hire, or what sort of capabilities would you move over into this platform team? I would start with a group of really smart product engineers, because the first thing you need to ask yourself is what

parts of my product or whatever I'm offering can be cannibalized or completely changed by modern AI. Product engineers are generally the best people to think about that. You can get really far with a really good UI and very basic AI engineering that sort of proves out a concept. I think

We've seen a number of good examples of that. I know, for example, v0 is a truly incredible piece of engineering at this point, both from an AI standpoint and also from a UI standpoint. But early on, it was pretty simple. And that's the right way to start. And then I think as you find product-market fit, it's sort of the right time to think about

more rigor, think about fine-tuning, or maybe we should use open source models for cost, or whatever. Although I think not many people are that far along the journey. I think you said something like TypeScript is the language of AI and Python is the language of machine learning. Could you elaborate on that? First of all, a vast majority of our customers use TypeScript.

And early on, some of our customers were dealing with, should we use TypeScript or Python? Some teams were using TypeScript, some teams were using Python. Now almost everyone, including people that used to write Python primarily, is using TypeScript.

And I think that's going to continue forward. There's a few reasons for that. One is TypeScript is the language of product engineering, and product engineers are the ones who are driving most of the AI innovation, at least in the world that we participate in. And so they're just literally pulling the AI ecosystem into their world, and that is driving a lot of TypeScript stuff forward.

Another thing is that TypeScript as a language is inherently better suited for AI workloads because of the type system. So the type system basically allows you to launder the crazy stuff that comes out of an AI model into a well-defined structure that the rest of your software system can use. Python has a pretty immature type system. They're improving and I always get

trolled on Twitter when I post about this by people who make somewhat valid arguments. But TypeScript is just a much, much better language for writing software that deals with uncertain shapes of data. I think that's actually kind of its whole point. So I think it is literally a better suited language for working with AI.
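
As one illustration of that "laundering" step, here's a sketch using zod, a popular TypeScript validation library; the schema and data are made up for this example:

```typescript
import { z } from "zod";

// The well-defined structure the rest of the system expects.
const Ticket = z.object({
  category: z.enum(["billing", "bug", "other"]),
  urgency: z.number().min(1).max(5),
});
type Ticket = z.infer<typeof Ticket>;

// Whatever JSON the model returned — untrusted until parsed.
const raw = '{"category":"bug","urgency":4}';
const result = Ticket.safeParse(JSON.parse(raw));

if (result.success) {
  const ticket: Ticket = result.data; // fully typed from here on
  console.log(ticket.category, ticket.urgency);
} else {
  console.error(result.error.issues); // reject or retry the model call
}
```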

Have you seen any other shifts in terms of usage of specific languages or tooling or other things that's happened with this wave of AI? Yeah, I think the biggest thing I've seen over the past six months is people dropping the use of frameworks. So early on, I think people thought that AI is this really unique thing, and

Just like Ruby on Rails or whatever, we're going to need to build new kinds of applications with new kinds of frameworks to be able to build AI software. And really, I think people have walked back from that and they now think of AI as kind of like a core part of their software engineering as a whole. And so AI is now kind of like pervasively spreading throughout people's code base.

And it's not constrained to what you can create with a single framework. Aside from the areas that Braintrust touches from a tooling perspective, what do you think are other interesting emerging platforms or approaches or products or infrastructure that people are starting to use? I think what we've seen from a lot of our customers is a consolidation of vendors. And this is

very much driven by AWS. So AWS has its mojo back now that they have Anthropic on Bedrock, and Anthropic, especially Claude 3 and 3.5, is really, really good. And so...

Many companies were consolidating their vendors prior to AI anyway, and AWS is so dominant, so now you can consolidate a lot of your AI stuff on AWS as well. We're seeing pretty dramatic vendor consolidation. There are some companies that we talked to whose AI vendors are literally OpenAI, AWS, and Braintrust, and pretty much everything else has consolidated away. So...

You know, it'll be interesting to see what happens. I certainly wouldn't underestimate AWS and the hyperscalers, especially on the infrastructure side. One of the things that I think is striking is how much time you still spend coding as a CEO. And there's a number of CEOs of different companies who continue to write code over the course of their careers to varying degrees. Tobias at Shopify would be an interesting example of that.

How do you think about time spent coding versus marketing versus doing other things for the company, and why focus there? My perspective on this has changed a lot over time. When I was much younger, I started leading the engineering team at SingleStore and then became a CEO. And

people give you the conventional advice about what you should do with your time and who you should hire and stuff like that. First, I think the profile of CEOs is changing. And second, I think the market is changing. So in the world that we are in, which is enterprise software,

People really, really care about the polish of the UI that they're using. I think companies like Notion, for example, have really driven people's taste on those products. But when...

many VCs were having their formative experiences and observing the patterns that they would eventually mandate among their portfolio companies, things were very different. IT bought enterprise software, and they bought it based on checklists that product managers came up with. So I think a lot of this has changed. And for me, it just feels very natural to participate in that change by being very, very deep in the product. And

as hard as I've tried over the past decade plus, I just can't stop. I think I'm literally addicted to writing code. It is the fastest, most efficient, and most pleasurable way for me to participate in what we're doing as a company. And so instead of trying to change that, which I've tried, at Braintrust we've

engineered the company to support me spending a lot of time writing code. For example, one of the first people we hired was Albert, who was formerly an investor and, before that, an investment banker. He's incredibly good at everything from selling, marketing, and dealing with ops to helping with recruiting, and

working with him has freed me up to spend a lot more time writing code, whereas at Impira I spent probably half or more of my day doing those things. Yeah, we had Jensen Huang from NVIDIA on No Priors previously. And I thought one perspective that he shared that you don't hear very much, which you're now echoing, is you should really architect the company around the CEO first,

versus just following the same pattern every time of what the right thing for the company is. And obviously there are areas where you just have to do the same thing every time, like sales comp. It really doesn't make sense to try and reinvent that. And everybody always tries for their first startup, and by the second startup they're like, why did I even try that? It just kind of works. But the flip side of it is there are certain things to delegate or not, certain things to micromanage versus not. And it really varies by the person and what they love doing and

all the rest of it. Are there other big differences between how you've approached Braintrust and Impira, for example, your prior startup? Another thing that we're really bullish on at Braintrust is people being in the office and being really comfortable being interrupt-driven. These are two battles that were very difficult for us at Impira, because we weren't very firm about them. I think the second one is actually a little bit more interesting.

At Braintrust, if a customer complains about something, or they find something about our UI annoying, or they have an idea, we almost always fix it immediately.

And that is something that for a lot of engineers is very uncomfortable. But the right engineers have been craving that experience their entire career. And so we handpick those people that want to be in that environment. And then, again, we engineer our roadmap and think about how we allocate our time and so on to actually be able to support that. And I think it's one of the key

things that has made the product really good and also creates a lot of love with our customers. Not everyone has to have the same edge, but I think you have to have some edge. And so we identified that as something we really cared about early on and, again, recruited a team of people who really want to do that. Yeah, and I guess that's translated into customer adoption and some of the logos you've landed. Are there other things that have helped drive

customer acquisition? And have there been unique ways that you've approached go-to-market? Yeah, I mean, I think I went to the Elad School of Hard Knocks and learned a bunch of stuff early on from you. But really, the thing that we did was we made that list of like 50 people who we thought were leading the way in AI

and said, you know, let's try to figure out a way to get to these people and either recruit them as investors or as customers. And I think that was probably one of the most important, if not the most important things that we did. Some people, for example, were excited about Braintrust. We had known them for a while. They invested and they said, you know what, we've already built our own version of this internally or we don't care about this, but we think

other people will need it, so we'd love to invest. And actually, many of those people have now come around and started using Braintrust too. So just being very deliberate about

who our target market was. I mean, 50 companies is not a huge TAM in some ways, but those companies are very influential, and they've led to many, many more customers now. So I think that was the most important thing. Yeah, it feels like people often misdefine their initial customer envelope, or the people that they want to target. They either go too broad and do everything from the Fortune 500 to small startups, and then they're not really building for any specific user, or they go

way too specific, maybe even in a segment that just isn't worth pursuing. And so it's really interesting to see how people think about that. Could you tell me a little bit more about what you view as the future of Braintrust? How does it evolve as a product and platform? And then how does it change as AI changes?

Is all eval eventually done by machines, or what does the future hold for us? Yeah, I ask myself that question every month or so, and surprisingly little changes. At Braintrust, we started out by solving the eval problem, and I think we did that really well. And what we realized is that there's actually this whole platform that people want. One of our customers,

actually Airtable early on, used our evals product to do observability. They would literally create experiments every day as if they were evals and just dump their logs into those experiments. But

it's pretty obvious when someone starts doing that that they're trying to do observability in your product. And we dug into why. It turns out that in AI, the whole point of observability is to collect data into datasets that you can use to do evals, and then eventually fine-tune models or do more advanced things. But still, evals is the most important element there. And the next thing that happened is that

some of our customers said, hey, I'm already doing observability and evals and stuff in Braintrust. I'm spending so much time in this product,

why do I have to go back to my IDE, which, by the way, knows nothing about my evals and nothing about my logs? Can I work on prompts in Braintrust? Can I repro what I'm seeing live? Can I save the prompts and then auto-deploy them to my production environment? That actually scared the crap out of me, thinking from my traditional, now old-school, engineering perspective. But it's what people wanted. And

I was talking to Martin, who just became a Braintrust daily active user quite recently, and he spends like half his day now tinkering with prompts in AI Town in Braintrust. And so even for old-school engineers like us, it's definitely the right way to do things. And I sort of see Braintrust evolving into this kind of hybrid environment.

In some ways it's kind of like GitHub: you create prompts, and now you can create more advanced functionality with Python code and TypeScript code and stitch it together with your prompts in the product, all the way through to evals and observability. And I think we're really excited about building a universal developer platform for AI.

In terms of quality, having lived through the pre-LLM era, I actually think a lot of the anxieties and predictions about quality are exactly the same as they were pre-LLM. Even when we were doing document processing stuff at Impira, people were like, oh hey, all documents will be perfectly extracted within six months from now.

And LLMs, by the way, are amazing, but document processing is still not a totally solved problem. And I think it's because people will take whatever technology they have and push it to its extreme. There are things that people are trying to do today that are past that extreme. AutoGPT is a great example of something that is, I think, a really productive experiment in pushing AI past the extreme of

what it can reasonably do. But people are always going to push things to their extreme. AI is an inherently non-deterministic thing. And so I think evals are still going to be there. We might just be evaluating more and more complex and interesting problems. And then what role do you think AI

will play in evaling itself? I mean, AI already evals itself. It's very similar to traditional math: if you're doing a math homework assignment, it's way easier if someone gives you a proof to validate than it is to actually generate a proof in the first place. And sort of the same principle works for LLMs. It's way easier for an LLM, especially a frontier model, to look at the work of an LLM,

itself or another LLM, and accurately assess it. And so that's already the case. I think probably more than half of the evals that people do in Braintrust are LLM-based. And

I think some of the interesting things that are happening, as LLMs are getting better and GPT-4-level quality is getting cheaper, is that people are actually starting to do LLM-based evals on their logs. So one of the really cool things that you can now do in Braintrust is write LLM-based and code-based evaluators and then run them automatically on some fraction of your logs.
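
A sketch of what sampled, LLM-based scoring over logs might look like; `judgeLLM`, the prompt, and the sample rate are all illustrative, not Braintrust's actual API:

```typescript
// Stand-in for a frontier model acting as a judge, returning a 0-1 score.
async function judgeLLM(prompt: string): Promise<number> {
  return 1;
}

const SAMPLE_RATE = 0.1; // score roughly 10% of logs to control cost

async function maybeScoreLog(log: { input: string; output: string }) {
  // Skip most logs; only a sampled fraction gets the (paid) judge call.
  if (Math.random() >= SAMPLE_RATE) return null;
  return judgeLLM(
    `Rate 0-1 how well this answer addresses the question.\n` +
      `Question: ${log.input}\nAnswer: ${log.output}`
  );
}

maybeScoreLog({ input: "reset my password", output: "Click 'Forgot password'..." })
  .then((score) => console.log({ score }));
```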

Sometimes that even allows you to evaluate things that you're not allowed to look at. The LLM is allowed to read PII and crunch through something and tell you whether

your use case is working or not, but maybe no developer or person at the company is. And so I think that is a really interesting unlock and probably represents what people will be doing over at least the next year. Super interesting. Hey, Ankur, thank you so much for joining us today. Thanks for having me.

Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.