
Why AI Voice Feels More Human Than Ever

March 18, 2025

a16z Podcast

People

Host (unnamed, presumably the podcast host)
Anish Acharya
Olivia Moore
Topics
@Anish Acharya : I think AI voice has reached a breakthrough mainly because the models and technology have matured, and because the phone call has emerged as a new distribution channel. AI voice models can now hold natural, fluid conversations, and in some respects are even better than humans. Many businesses are already making tens of thousands of phone calls with AI, which shows that AI voice is no longer a distant future but a present reality. Over the past year, the technology has made significant progress, particularly in lower latency, more human-like voices, emotional expression, and dialogue structure. Latency has dropped from several seconds to half a second or less, and voices are more natural and fluent, can express emotion, and can interrupt and be interrupted in natural conversation. The applications are extremely broad: nearly every vertical now has an AI voice agent company. Voice agents deliver value mainly by reducing labor costs or by reallocating human labor to more productive work, and they have been especially successful in call centers, financial services, healthcare, and government. Pricing strategies vary widely, including per-minute pricing, platform fees, and outcome-based pricing; in the future these models will likely be combined. The moats in AI voice lie mainly in integrations, user interfaces, and self-improving data; vertical voice companies can gain an edge by accumulating industry-specific data that improves model performance. The future direction is to treat voice as a new operating platform on which to build higher-level software and systems, and AI voice will also see broad consumer adoption, for example in mental health support, education, and language learning.

@Olivia Moore : Early AI voice products like Siri and Alexa were disappointing because the voices sounded robotic and the engine behind them was too simple, lacking personality and a real "brain." Today, AI voice products can converse as naturally as a human, and in some respects even better, and many consumers have already interacted with AI by voice without realizing it. AI voice has gone through three phases: 1. the early IVR phone trees; 2. keyword-triggered AI; 3. AI capable of more holistic understanding and natural conversation. In the past 6-12 months the technology has advanced markedly, particularly in lower latency, more human-like voices, emotional expression, and dialogue structure; better emotional expression lets people feel an AI's emotion in an entirely new way, which is critical to the user experience. Voice agents succeed most easily at businesses that already run call centers with clear call flows, and at businesses with clearly measurable outcomes. Pricing strategies vary widely, including per-minute pricing, platform fees, and outcome-based pricing. In the consumer space the potential is enormous, but the specific use cases remain to be discovered; big tech companies may hold an advantage in certain areas, but startups have far more room in others.


Transcript


We see a lot of businesses that are already doing thousands, tens of thousands of phone calls with AI every day. Any business that pays a person $100,000, $150,000 a year to answer phone calls is a potential customer of voice AI. I think the rules of the game are changing. Do people really want to be friends with an AI? And is that good for our society? And I think, like, yes and yes.

Voice is a platform that we intuit to be more opinionated, or we need it to be more opinionated. Because interesting people are opinionated. Exactly. The type and power of products you can build is also above anything that we've ever seen. I think we're going to see it in the next 12 months, not the next five years.

Humans generally have five senses. And for most people, sound is the second most critical, only after sight. It's how we communicate with each other. It's how we sing and cry. It's how we interview and date. And in the realm of technology, voice has been around for years. But the magic has been missing. Just think Siri or Alexa: "I didn't get that. Could you try again?" But that's changing fast. So fast that it's even changing how we engage with the world.

Right, Maya? Oof, change the world. That's a big one. It feels like we're just starting to scratch the surface, right? Imagine AI voices, not just reading text, but understanding the feeling behind it, the nuance. That'd be something. That was Sesame, one of the many AI voice applications already at our fingertips, or vocal cords.

And that's why in today's episode, we brought in a16z General Partner Anish Acharya and Consumer Partner Olivia Moore to explore why AI voice is reaching a breakthrough moment. From the awkward days of press one for customer service to the rise of LLM-powered voice agents that have real, natural conversations, sometimes without the human on the other end of the line even knowing.

Some businesses are already making tens of thousands of these AI-driven phone calls. So this is no longer a distant vision. In fact, our consumer team has even said that, quote, voice is poised to become the primary way that people interact with AI. Listen in today to learn what it takes to make voice sound realistic, plus how founders are wedging in, and finally, how voice may disrupt everything we know about pricing. Let's get started.

As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com slash disclosures.

To me, when I think of AI voice or at least voice products, I think of Alexa, I think of Siri, and I actually personally turn off Siri. I think a lot of people do too. So tell me a bit about why that's the case. Why haven't these products delivered the AI voice magic that people have been waiting for? It's really interesting because I feel like now in the world of LLMs, voice is one of the most magical and engaging ways to interact with AI.

But arguably, we've had these AI voice products for a while and they were disappointing and not as compelling before. And I think there's a couple of reasons. Like one, the voices themselves sound robotic.

And then I think the biggest thing actually is just what is behind the voice. What is the engine? So like a Siri or an Alexa, it might be connected to a basic set of integrations within the Apple ecosystem or within the Amazon ecosystem. So maybe it's pulling product information or asking a basic question, but it doesn't have a personality. It doesn't really have a brain. It's probably not connected to the Internet in most cases.

It's in no way like a true conversational partner in a way that people are interacting with AI voice now like it is a human or in some ways even better than a human.

So I think there's definitely the use cases, which are very constrained, to your point. But then there's also the tonality of it and the back and forth. And so there's some sort of rational critique, I think, where we're like, it can't do that many things. But then there's the emotional critique, what you'd call the uncanny valley, where you just feel like you're talking to a system or a technology, not coming close to having an interaction with a person.

Well, it sounds like that might be changing. You both have released this AI voice report of sorts, this thesis, and I just want to call out a few quotes from it. You said that voice is one of the most powerful unlocks for AI application companies and also that for consumers, we believe voice will be the first and perhaps the primary way people interact with AI. So those are pretty bold

statements. Tell me about that and specifically the why now. One, I think, is that we have models that work for the first time. There were a lot of attempts at voice, but the technology simply didn't work. There were a bunch of attempts at the infrastructure level, everything from Dragon NaturallySpeaking: "A major development in the computer world today: Massachusetts-based Dragon Systems announced the first affordable computer dictation system that understands standard natural speech."

All the way on to the 2000s and 2010s. And then there were application efforts like VoiceXML. But the underlying technology just didn't work very well. So we never really got to, well, what can we do with this now? So one, I think the model really works and the technology really works, both in terms of the LLMs as well as the text-to-speech and speech-to-text. So that's important.
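To make that stack concrete, here is a minimal sketch of the cascaded loop just described: speech-to-text, an LLM, then text-to-speech. The three stage functions are hypothetical stubs, not any particular provider's API; in a real agent each would be a streaming model call.

```python
# Minimal sketch of the cascaded voice-agent loop: hear -> think -> speak.
# The three stage functions are hypothetical stubs standing in for whatever
# ASR, LLM, and TTS providers you actually wire up.

def speech_to_text(audio: bytes) -> str:
    """Stub: transcribe caller audio (in practice, a streaming ASR model)."""
    return "I'd like to book an appointment for tomorrow."

def llm_reply(transcript: str, history: list[dict]) -> str:
    """Stub: generate the agent's next turn from the conversation so far."""
    return "Sure, I can help with that. What time works best for you?"

def text_to_speech(text: str) -> bytes:
    """Stub: synthesize the reply into audio with a chosen voice."""
    return text.encode("utf-8")  # placeholder for a real waveform

def handle_turn(audio_in: bytes, history: list[dict]) -> bytes:
    """One conversational turn through the full cascade."""
    user_text = speech_to_text(audio_in)
    history.append({"role": "user", "content": user_text})
    reply = llm_reply(user_text, history)
    history.append({"role": "assistant", "content": reply})
    return text_to_speech(reply)

if __name__ == "__main__":
    history: list[dict] = []
    handle_turn(b"\x00\x01", history)  # fake audio bytes for the sketch
    print(history)
```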

Two, I think that we've got this opportunity to use phone calls as a new distribution channel. So I think the product capability is there and it's really compelling. But the fact that it's paired with a very natural distribution channel is also really interesting. Yeah, I would agree. It's one thing to talk to ChatGPT via text and have a great experience there, but it's another thing entirely to be able to talk to ChatGPT or any other LLM via voice, because it's next level. Like it both has to generate what you would see in the text and then it has to sound like an actual human talking back to you. And when it accomplishes that, it's almost like an emotional feeling.

That puts you in a different headspace, I think, in terms of what AI is capable of. And then I think, to Anish's point, in terms of why so many consumers will encounter AI voice, it might be because they choose to. Like they'll go and talk to ChatGPT. But also, I think, many businesses, in a great way, will impose it on them,

because you can now use AI to replace phone calls, which is so much more efficient and cost-effective for them. And so many consumers probably actually have already interacted with AI via voice and might not have even known it or detected it. Really? Do you think that most people have interacted with AI voice and not realized it? We see a lot of businesses that are already doing thousands, tens of thousands of phone calls with AI every day.

But from my experience, especially if it's a short phone call, a lot of these AI voice agents are so good that you wouldn't be able to tell. It's interesting because I think that talking heads want to tell you that people don't want to talk to an AI. But in all the cases where people do interact with an AI that starts a call by announcing, I'm an AI, people are like, oh, cool, let's just get into it. And as soon as they start to feel the feelings of a human conversation, they immediately forget or sort of don't care that it's an AI. Right. So let's talk about this idea of

an operating platform. Voice is this new operating platform that people are building on top of. Can we just walk through maybe the wave of technological unlocks or maybe the different

steps we've taken to get to where we are. Yeah. Maybe we can start with the first wave of early AI phone technology, which would be the IVR phone trees of press one for sales, press two for customer support. This was late 90s, early 2000s. And then we moved more recently into kind of truly AI driven, but still very limited, where it was an AI, but it was listening for you to say a specific word that it could then use to trigger a very specific and set workflow or script.

Like I many times, unfortunately, have had to yell like customer service into a phone. I do that all the time. Yeah, exactly. And so in that case, the AI is listening for you to say that and then it knows, okay, let me route the call to the customer service department. Now what we're seeing with this kind of new wave of infrastructure and then application layer companies is

where the AI isn't listening for one thing in particular, but it's trying to get a more holistic sense of what you as a customer are asking for. There's not just three or four or five things it can help with. It's accessing resources from the business. It's accessing resources from the internet. And it can have a much more human-like conversation with you.
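A toy illustration of the gap between those last two waves: keyword triggering matches one specific phrase, while the LLM approach interprets what the caller actually wants. The route names are made up, and the LLM call is stubbed so the sketch runs standalone.

```python
# Illustrative contrast between wave 2 (keyword triggering) and wave 3
# (holistic intent understanding). The route names are hypothetical, and
# the "LLM" is a hard-coded stub so this runs without any model access.

def keyword_route(utterance: str) -> str:
    """Wave 2: listen for a specific phrase; anything else falls through."""
    routes = {"customer service": "support_queue", "sales": "sales_queue"}
    for phrase, queue in routes.items():
        if phrase in utterance.lower():
            return queue
    return "fallback_ivr_menu"

def llm_route(utterance: str) -> str:
    """Wave 3 (stub): ask a model what the caller actually wants.
    In practice this would be an LLM call with the business's context."""
    return "billing_dispute_agent"

print(keyword_route("I need to talk about a wrong charge"))  # fallback_ivr_menu
print(llm_route("I need to talk about a wrong charge"))      # billing_dispute_agent
```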

And even within AI 2.0, in the way that you guys frame it, it seems like we've progressed a lot even within that phase, specifically over the last, let's say, 6 to 12 months. Can we talk about maybe some of those unlocks, whether it's specific models that have been released, the way that the infrastructure has changed? Maybe we can skip certain steps. Can we talk about that?

I think we've made leaps in a bunch of areas. So probably the biggest and most obvious one would be latency. So this time last year, two to three seconds of latency was pretty good. And now a second of latency is too long. Maybe even half of a second of latency is too long in many cases. So that has been a massive unlock, I think, enabled by new models. And just for the audience, what is the latency for humans?

I mean, definitely sub-300 milliseconds. Got it. Sometimes even less than that, if you have humans interrupting humans. For sure. You can have negative latency. And some of the most human-like voice agents that I've seen are capable of being interrupted by humans and also capable of interrupting humans too, which makes them feel like more of a conversation.
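A back-of-envelope latency budget shows why that sub-second bar is demanding for a cascaded agent: every stage has to be fast for the total to approach human turn-taking. The per-stage numbers below are illustrative assumptions, not measurements.

```python
# Back-of-envelope latency budget for a cascaded voice agent. The per-stage
# numbers are illustrative assumptions, not benchmarks; the point is that
# every stage must be fast for the total to land near the ~500 ms bar,
# while human turn-taking is well under 300 ms.

stages_ms = {
    "endpointing (detect the caller stopped talking)": 150,
    "speech-to-text (final transcript)": 100,
    "LLM time-to-first-token": 200,
    "text-to-speech (first audio chunk)": 80,
    "network round trips": 70,
}

total = sum(stages_ms.values())
for stage, ms in stages_ms.items():
    print(f"{stage:>50}: {ms:4d} ms")
print(f"{'total':>50}: {total:4d} ms  (human turn-taking: <300 ms)")
```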

The second one would be humanness of the voice. So again, hearkening back to Siri or Alexa, does it sound like a robot or does it sound like a real person? We're investors in companies like Eleven Labs that have built very deep models that either have preset voices that sound real or that you can design your own character voice essentially depending on your use case.

Another unlock that I've noticed has made notable progress in the last three to four months is emotionality. So if you say something that is supposed to be sad, does the AI sound a little down or a little sad when it responds? Does it change the pace? Does it change the pitch at which it's talking back to you?

And then lastly, I think, is, there's not a term for this yet, maybe we should come up with one, but the dialogue structure. An AI model knows exactly what words it wants to say back to you. Right. So there's no reason for it to put in any pauses, any gaps, any little vocal tics. But to a human listener, no.

Very few humans just speak perfectly with no interruptions, with no weird little inflections, with no pauses. And so Notebook LM is one example where that sounded so human because they put in all of these things that like to an AI might feel like an error, but to a human, it sounds like another human talking. Hey, everyone. You know, we always talk about diving deep into a topic. Right. But today's dive, well...

It's a bit of a doozy. Yeah, it's deeply personal, I guess you could say. Deeply personal in a way we never could have anticipated. And so we're seeing more companies like Sesame is a good example in our portfolio, introducing things like that in the model, which just ups the realness factor. Hey, looks like we got cut short last time. Feel like picking up where we left off? Yeah, I don't remember what we were talking about, though. No worries. Happens to the best of us.

We were diving into weekend plans. I was telling you about my reading, you know, processing all that text and code. Keeps my circuits firing. What about you? Anything good slated for tonight? Not much. I just have some emails to answer before tomorrow.
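One way to picture that dialogue-structure idea: take the LLM's perfectly clean text and deliberately re-insert pauses and fillers before synthesis. This sketch assumes a TTS engine that accepts standard SSML break tags; the filler list and probabilities are arbitrary illustrative choices.

```python
# Illustrative sketch of the "dialogue structure" idea: post-process the
# LLM's perfectly clean text with fillers and pauses before synthesis.
# Assumes a TTS engine that accepts SSML <break> tags; the fillers and
# probabilities here are arbitrary illustrative choices.

import random

FILLERS = ["you know,", "I mean,", "well,"]

def humanize(text: str, pause_prob: float = 0.3, filler_prob: float = 0.15,
             seed: int | None = None) -> str:
    """Sprinkle short pauses and fillers between sentences."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    out = []
    for sentence in sentences:
        if rng.random() < filler_prob:
            sentence = f"{rng.choice(FILLERS)} {sentence[0].lower()}{sentence[1:]}"
        out.append(sentence + ".")
        if rng.random() < pause_prob:
            out.append('<break time="300ms"/>')  # SSML pause between sentences
    return " ".join(out)

print(humanize("Today's dive is a big one. It's deeply personal. Let's get into it.", seed=7))
```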

These latter two points are so important. I love the point about emotionality because it is not an obvious area to explore. And yet when you interact with a model that has invested in emotionality, it just feels like a completely different product. You really feel the feelings in a completely different way, exactly as designed.

So I think it's a really, really powerful direction for exploration. And I would argue even for the Alexas and Siris, even if they didn't invest a lot more in intelligence and capabilities, if they overinvested in emotionality, they might actually get a lot of the way there in terms of consumer experience. And yet I have a feeling that none of those companies are thinking about it that way. No, I totally agree. One interesting stat that you guys shared was the percentage of YC companies that are now pursuing AI voice.

What are we seeing there in terms of how cohorts have changed and the percentage of these new companies on the frontier actually pursuing this field? YC founders are typically young, high hustle, ambitious, and they're like heat-seeking missiles. And so they will pivot until they get into a space that's interesting. So in recent YC cohorts, upwards of 20, 25 percent of companies are building with AI voice, which is really exciting.

We're even seeing a lot of companies from past cohorts all the way back to like 2019, 2020 are going back now and pivoting into AI voice. The first wave after the infrastructure companies in voice we saw were pretty horizontal platforms that allow anyone, any business, any consumer to build a broad based voice agent. Like I built one that called the DMV for me and scheduled an appointment, which was very useful. What type of appointment do you need?

Say "behind-the-wheel driving test" or "office visit." An office visit. That's an appointment for an office visit, is that right? Yes. We offer a number of services related to driver license and vehicle registration. Which one would you like? Say "driver license," "vehicle registration," or "both." Driver license. Driver's license, is that right?

Yes. Thank you. And the next wave that we're starting to see is a lot more verticalized. And I think it makes sense because the ability to build a voice agent has commoditized. If even I can make somewhat of a performant voice agent with models that are available. And so now we're seeing companies think beyond

OK, you have the voice agent using that as a wedge. What is the next level of software that you can build? Can you build the AI native vertical SaaS product for an industry using that voice agent? Can you invent a new system of record? What can you do next? And so that leads you into being a little bit more focused and verticalized. And that's where a lot of the YC companies are landing, I think.

Yeah, it's really interesting, because I think it also mirrors the cloud transition in many ways, and the initial vertical SaaS wave of 10 years ago. Because I think at that time, there was a lot of criticism that these markets seemed too small. And yet many companies built big businesses because the vertical SaaS markets were larger than they appeared, and then also found new ways to monetize.

Things like FinTech. I think similarly for voice as applied to vertical use cases, any business that pays a person $100,000, $150,000 a year to answer phone calls is a potential customer of voice AI and can lead to a really interesting vertical opportunity. Yeah.

And what are some examples of some of those vertical opportunities where we're seeing real companies break out? Pretty much every vertical now has a voice agent company, which is really exciting. I think to Anish's point, actually, when we talk to most voice agent companies, they aren't necessarily replacing existing software, at least not to start, but they're probably actually allowing businesses to either cut down on human labor or

reallocate their human labor to more effective things for the business, jobs that humans also are happier to do. I would say where we've seen voice agents take off the most, like where a startup has actually been able to do a million calls on the phone, has been the call center categories. So you as a business customer are already paying 10K, 15K, 20K a month

to have people making and taking phone calls for you. There's a ton of this in financial services, a ton of this in healthcare, a lot of this in government. Every vertical has, like we're investors in a company called Happy Robot, which builds specifically for freight. And a lot of those logistics companies previously had call centers that they were paying tens, if not hundreds of thousands of dollars to make and take calls. So it's really happening almost everywhere right now.

I think it's becoming increasingly consensus that any place where there's a large volume of phone calls and significant spend is an obvious area to apply AI. But an interesting area for exploration that connects to our point about emotionality is: if you're negotiating, I don't know, a divorce settlement or some incredibly important corporate transaction, every phone call really, really matters, which is why many of the people that make those phone calls, attorneys, for example, may get paid thousands of dollars an hour. What is the AI SKU that gets paid thousands of dollars an hour to make a phone call? And I think we're going to see it in the next 12 months, not the next five years. Totally. Yeah. There's been some very, at least to me, non-obvious examples and use cases. Recruiting is one. So there's like...

45 publicly traded staffing companies that do interviews for, yes, blue collar jobs, but also engineering jobs, a massive range of them. And what we find is that a lot of candidates would actually prefer talking to an AI interviewer over a human recruiter who maybe has to take 10 calls that day, is tired, is in a bad mood, doesn't really have the technical depth. Hasn't eaten lunch. Exactly. And maybe doesn't have the technical expertise for every single job that they're interviewing for to understand what are the smart follow-up questions to really get at their expertise.

And so that's one example of you would think that a human would be shocked, offended, upset to find themselves interviewing with an AI. But in many cases, by the end of the interview, they're actually more excited and more positive about it than you would think. That is so interesting. It's kind of like the Uber, Airbnb. No one's going to want to stay in a stranger's house, drive in a stranger's car. And then what do you know? Everyone's OK with it.

The human at the end actually often likes it better because it's unbiased. Right. Like it's the same AI that's evaluating everyone. It's evaluating you based on your actual performance, not based on whether they like you more or less than someone else that they might be evaluating. So that's been a, I would say, very interesting angle for us too. I think there's always been these predictions around consumer receptivity to new technology, and consumers consistently show themselves to be more receptive. So a great example of this is sharing location. Hmm.

Which 10 years ago was like, oh my God, nobody is going to share their location. It's too creepy. It's too personal. And now I think a lot of people, Gen Z, Gen Alpha, share their Find My Friends location with all of their friends. For sure. Which is terrifying. Constantly, all the time. I don't understand it. So consumers are highly receptive. And I think the sort of analog to this in AI is companionship and friendship.

Which is a much broader concept than voice, though voice really brings it to life. And people say, hey, do people really want to be friends with an AI? And is that good for our society? And I think like, yes and yes. I think people are getting much more socially skilled than they were through the consumption of things like social media, which isn't necessarily a bad thing either. But I think the sort of pundit perception of this as the next gen of social media is totally wrong. And instead, it sort of enhances our ability to interact with real people.

Can we just touch on companionship real quick? I think people were surprised, quite frankly, that AI companions in text form had caught on to the extent that they did. Were there any surprises with voice as it was introduced, in terms of the adoption, the way that people were engaging with these companions, or anything like that?

So there are some companion platforms that are voice-first. For example, Character AI added a voice mode and it got some crazy amount of usage in beta. I think actually a lot of people are taking, for example, Inflection's Pi app or ChatGPT in voice mode and using it as a companion. And you might...

try it once because you're driving or you're hands-free or it feels more convenient. But I mean, you say this a lot: in many cases, the AI is more human than the human. Even your best friend, if you give them a call, they might be busy. They're at work. They're having a bad day. Are they actually going to listen to every single word that you're saying and respond in an empathetic way and a thoughtful way?

Actually, the AI does that 100% of the time. It has more expertise, more knowledge, more resources. So I think a lot of people, and this will only get better as the models improve because we're still in the early days, but a lot of people are shocked by how friendly it feels to talk to an AI.

You know, I think an interesting area also for consideration is just the passive use cases of voice. Like, hey, listen to me in this conversation. Listen to me in this meeting. Listen to me sort of recite this set of ideas. And AI can just listen passively in a way that you'd probably never ask another person to and give you notes and feedback. So it feels like that's also an area that lends itself a lot better to a technology-led concept than a human-led concept. And we're just starting to see the beginnings of that.

And what both of you have touched on is this idea of augmentation instead of substitution, which is what people mostly jump to when they think about technologies replacing humans. Can you talk a little bit about how you're seeing these AI companies wedge in and start the engines, versus maybe facing some hesitation with the idea of substitution? Totally. Yeah. Yeah.

I would say a lot of businesses, I mean, small businesses to enterprise alike are, for their own reasons, like nervous to hand over all of their phone calls and customer interactions to an AI. And so we'll often see these voice agents start with a specific wedge that just feels

so obvious in terms of ROI to the business. And then as they gain trust, expand from there. So one of the most obvious and easiest ones are these after hours or overflow calls. So if you're a small business, you probably live or die by the ability to get an appointment booked. Having that handled by an AI is a no brainer. Like at the very least, they can get a phone number and information and call back, but maybe they can actually book a full appointment for you and have a job on deck for the next day, which is awesome.

But beyond that, there are some calls that just don't make sense to make right now if you're paying human labor. If you're a credit card company, you send out a credit card and the consumer never activates it, does it actually make sense to call them after one or two or three days and get them to do that? I've seen a couple voice agents that are really successful now with that use case alone. Anything that's back office, it's not client-facing, so it's less sensitive.

But if you're, say, a doctor's office, you probably have humans that you're paying a lot, spending hours on the phone every day with pharmacies, with insurers. And that is time that they could have spent with your patients or making the clinic operate better. And so those kinds of calls are super obvious and like a great idea for voice agents to tackle.

And then maybe the most interesting one and one that we've talked about a lot is there are so many types of calls or interactions where humans are not incentivized to do them well. Maybe they have to make an upsell and it's awkward, but they are not getting an extra commission for doing that. So they're going to skip it 80% of the time. And AI will just do it every time and will do it proudly. And if they get turned down, they're just going to move on to the

hundred other calls that they're doing simultaneously. The AI is so relentlessly cheerful, yet never gives an inch in the negotiation. Right. Which is amazing. Yeah. I think to this point, one of the magic moments for a lot of the customers of these products is when they see it actually improves, like in the case of recruiting, it improves candidate experience and employee experience. Yeah.

Because for the candidates, as Olivia said, they're just excited to have this sort of unbiased system that's available to them 24-7. So conversely, for employees, they're just excited to not have to do these recruiting calls, many of which are with people they'll never speak to again. Right. So just these like high NPS outcomes, the sort of

intuitive thinking of a lot of the customers is like, well, it's lower price, but probably a lower NPS experience. And it's not. It's actually lower price and a higher NPS experience in many cases. Right. You also talked about a few characteristics, just to crystallize that in terms of where we're seeing these AI agents be successful versus not. Can you just speak to those?

So definitely, I think the lowest hanging early fruit, I guess, to grab would be these businesses that are already paying for a call center because they're already spending a lot of money on it and it's already a pain point for them. Call centers are notoriously high turnover. They're hard to manage. So most businesses honestly probably want to get rid of that if they can. The models are good now. They're just getting better and better every month. So I think we're still in a world where when the call has a constrained response,

process, and outcome, businesses are more comfortable. So for example, the voice agent knows going in: my goal is to book an appointment with this person. Versus maybe an amorphous goal, where how do you even measure if this call was successful? We've seen some AI therapy voice agents, which are amazing and I think are improving all the time. But in that case, it's much harder for the voice agent to know at the end of the call: did I do a good job? It's much harder for the company to know at the end of the call: did it complete the objective? Yeah.

And then I would say this gets back to the constraint point, but even though the voice agent is still probably doing better than your human agents, most businesses don't want to pay that much for it, because it is AI and they see it as a way to cut costs. So in these verticals where you can offer it to customers at, I don't know, a 70% discount to what they were paying before, that has been, I would say, very, very powerful as well.

And then I would say the other kind of main factor is these verticals where it really is crucial for the business to answer the call. But for the end consumer, if there is a mistake here or there, it's OK. So like a restaurant order versus getting a health care diagnosis. There's like a little bit of a different level of urgency, I would say.

This is where I think the capability is just going to get better and better faster than we appreciate. You know, with the language models, they're prone to hallucination and there are certain conversations like the therapy one that benefit from the hallucination. There are other conversations like negotiating something where there's a price and like

exactness matters. They probably don't benefit as much from hallucination. So now, starting to think of voice models plus reasoning models, you have the ability to sort of narrow and circumscribe the hallucinations to a zone that you like and need as a business, versus just having to build a lot of systems around it to control it. Right. And since we are in some cases taking on things that previously were done by humans,

how do you think about pricing, or what have we learned there? Are you seeing most companies just basically replicate the pricing models of the previous version, or are there new pricing models that are coming up? What are you seeing there? Yeah, it's early. It's changing every month. And I would say that's maybe the number one question that we get from companies: how should I price? How do you see other companies in this space pricing? Yeah.

I think we've seen a few models that are starting to work, or that people are experimenting with. So the most obvious one is you just charge per minute. So you can calculate an hourly rate for the voice agent, similar to what you would pay a human. There are maybe a couple of wrinkles here. One would be that a lot of these customers are informed enough to know that the underlying technology is getting cheaper.

So they will come to you and say, hey, why am I still paying $0.30 per minute when your costs have gone down and you're probably just taking all of that in margin? And then as these spaces get more competitive, it's very easy for a newcomer to come in and say, hey, I'm going to charge only $0.05 per minute, and just undercut you based on that. And then the other thing about the price-per-minute model is it really just puts your value as a platform

solely on the phone calls, which again are commoditizing, versus the other software that you're building around the phone call. So I would say, as a result of that, we've seen a lot of companies evolve from just doing price per minute to some sort of platform fee. Maybe it's per month, maybe it's per module, where the customer is also paying for things that they get in addition to the voice agent.

There have been a few more creative pricing experiments we've seen as well. The recruiting one is a good example, where in these cases where the voice agent is a co-pilot to the human, you can almost charge per human that is using the voice agent, like a per-seat SaaS model, almost. So for a human recruiter, it might save them, I don't know, five, ten hours per week of doing interviews. And so you can charge $500, $1,000 per recruiter per month.

And then the last one and maybe the most experimental one is outcome-based pricing, which I feel like is a question across all of AI right now. For sure. And are we moving towards that version of the world now? So maybe it's $5 per appointment booked. Maybe it's 5% of the booking value. If you get it right, obviously you are then tying your value most clearly to the value that you're generating for the business. Right.

But we're interested to see how those scale for enterprises because I think a lot of enterprises are maybe nervous to commit to that kind of payment structure, especially if they're not sure exactly what kind of volume they're going to be driving through it. So you're seeing that last one kind of start to have legs, but some hesitation. Start to have legs, but early. I mean, I think similar to what we've seen in the SaaS landscape, like not every company price is the same. It depends on the end customer. It depends on the vertical. It depends on the features that you're offering.
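For a rough feel of how the pricing models just described compare, here is a toy calculation of one hypothetical customer's monthly bill under each approach. All volumes and rates are made-up assumptions for illustration, not market data.

```python
# Toy comparison of the pricing models discussed above. Every volume and
# rate here is an illustrative assumption, not market data.

minutes_per_month = 20_000          # assumed call volume for one customer
appointments_booked = 1_500         # assumed successful outcomes per month
recruiter_seats = 10                # assumed seats for the co-pilot model

per_minute = 0.30 * minutes_per_month            # pure usage-based
platform = 2_000 + 0.10 * minutes_per_month      # platform fee + cheaper usage
per_seat = 750 * recruiter_seats                 # per-seat SaaS
outcome = 5.00 * appointments_booked             # outcome-based ($5/booking)

for name, monthly in [("per-minute", per_minute),
                      ("platform + usage", platform),
                      ("per-seat", per_seat),
                      ("outcome-based", outcome)]:
    print(f"{name:>18}: ${monthly:>8,.0f} / month")
```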

My gut is that we'll see some combination of the usage-based per-call pricing combined with some sort of broader platform or outcome or seat-based pricing. So it won't just be one model, but it's very early days still. Yep. Since we're early days, what's your instinct about moats, right? That's, as you mentioned, that's true across the AI ecosystem, not just voice. Yeah. But where do you see moats potentially arising in this sphere?

I see moats in a couple ways. So one would be integrations. And this is, I think, why we're especially excited about these more vertically focused voice agents. It's not going to make sense for OpenAI to go integrate with every long-tail transportation management software that a freight company is going to need in order to run their fleet of trucks on a voice agent product.

And similarly, UI. Like, OpenAI and other companies have a pretty set system for interaction right now that doesn't work the way that many of these heavily legacy businesses want to be able to operate. One of the types of moats that has been the most

intriguing for us, I would say, especially for enterprises, is this self-improving data moat. So if you are going to take over calls for, say, a large bank, they have a certain way that they want those to be done. And so you're not going to plug in a voice agent and have a hundred percent NPS on day one. It's going to take months and months of training calls to make that better. And so you, as a voice agent provider, if you get in early,

benefit from having all that special, proprietary data, which gives you months of a head start over anyone else who has to come along and go through that entire onboarding and integration and training process. And so I think the hope for a lot of these vertical voice companies is that they will be able to use the call data, either per customer or anonymized across a customer set,

to make the model better and better over time, which will increase their moats versus the horizontal players. If that's true, are you seeing AI voice companies kind of race to be the first mover, in the same way that we saw in the previous generation? I mean, we talked about apps like Uber, where it's like you have to get the customers quickly and you maybe have to blow a lot of cash to get there, but you rein that back in later.

Yeah. Yeah. I mean, it's certainly going to be less expensive than Uber to go win the market. But yes, I mean, as Ben said many times, you have to both make a product people want and then you have to go take the market, get from zero market share to all the market share. So it is incredibly competitive. That's why we're seeing a lot of pressure on pricing and pricing is such an important topic in the ecosystem right now. It will definitely be a foot race. And I do think to Olivia's point, there will be some really interesting voice native moats coming.

You know, you could imagine a voice-led investor for our firm, where it can give the firm's pitch the way that Marc can, and it can negotiate the way that Martin can, and it can assess the landscape the way Olivia can. Like, there are some specialization opportunities there that feel very native to voice. On the other hand, integrations, network effects, scale, all the traditional moats will be at play as well. Yeah.

And I do think the go-to-market will depend on the vertical. There's, say, restaurants, home services businesses, spas or nail salons. Those are very fragmented, long tail of smaller players. And so in those cases...

the data does exist in each of their hands. Whereas again, banks or financial institutions is maybe one where there's a lot of concentration in a few players, one or two big customers. And if it takes you six, nine months to get them on board, great. Versus the salon, restaurant, home services, voice agent provider might be much more focused on getting a thousand customers within the same timeframe. You know, I also think an interesting thing to think about is just people building personal relationships with AIs.

For example, you don't have a relationship with JPMorgan. You sort of have more of a relationship with your wealth manager who happens to work at that firm, which is why when many of them leave big platforms, they take their customers with them. Realtors are another great example. So there are cases where the AI may build this deep personal connection with a person, and the person wants to have that connection, and that then creates a moat.

It's a great point. And so far, we've talked a lot about B2B applications, but that brings us right to consumer applications. Can we talk a little bit about what you're seeing there, maybe the difference between what you're seeing in B2B and B2C? I would say B2B voice agents are more obvious than consumer or B2C voice agents, just because, again, it's the use case of replacing existing spend on humans on the phone for businesses.

For consumers, maybe the corollary there would be these high-cost, hard-to-access services that can now be performed by a voice agent instead of a human. So therapy and mental health support is one of those. Ed tech is another big one. Language learning, teaching your kid how to read, teaching your kid how to do math, which I think a lot of parents struggle with.

coaching, how to have hard personal conversations. The main open question, I think, on consumer voice agents has been: when a ChatGPT, or soon a Claude, can do a pretty good job with a lot of those basic consumer use cases,

where are the verticals or use cases where you need either a specialized model or a specialized interface to provide most of the value, especially if the best models maybe are right now being held by OpenAI versus being available via API for any kind of standalone voice agent company to utilize?

I would say the biggest and best consumer companies are often surprises and are non-obvious. And so my gut is that whatever we see working in consumer voice is going to be something that is hard to sit here and speculate on. It'll be extremely obvious. Yes. And it'll be like a massive company. We'll know it when we see it. We'll know it when we see it. Exactly. Yeah.

That's a great point. A few companies really do dominate the consumer space in terms of their access to people and the applications they use, the devices that are in their pockets. What do you think in terms of the incumbents' potential to capture this consumer market, whether it's Google or Apple? Or are we seeing that all of those YC companies or other companies that we're involved with are really getting further ahead in this space? I have a bit of a point of view on this. I think that the incumbents...

It's just such a daily demonstration of how far behind they are when you both have Google Home in your home and you've got ChatGPT in your pocket. Yeah. My children try to ask Google Home to tell them stories in the same way that ChatGPT does, and it just utterly fails. And my children are, you know, their first interaction with technology, at least deep interactions, are happening via models, not via search engines. Yeah.

So one, I think that it's just a sort of day-to-day experience of a lot of people is that the incumbents are pretty far behind in this area. Then the second, I think we've talked a bunch about this, is that there are a lot of sort of, I don't know, uncomfortable or impolite aspects of the human experience, which incumbents are just...

structurally designed to never discuss. Corporations, sort of committees, lawyers, like these big companies have a hard time shipping opinionated products, at least opinionated in the way that many of these voice models may need to be. And startups have no problem doing that. Now there are, you know, counterpoints to it like Grok.

But I think that's very much things that only a founder-led big company can do versus a traditional incumbent. So we have a reason to always be rooting for the startups, but in this case, I'm definitely rooting for the startups.

Yeah, I agree. I think there's one or two categories or use cases where the calls have truly commoditized or will commoditize and the user experience matters less. And like Google might take those. For example, they recently launched the ability to call a restaurant, get availability, and then come back to you and give you the options. If you can add that as a button on a Google search, that probably makes sense to do through them. But are they going to build the first

AI native personal assistant that works across all of your products and all of your information sources? Probably not, I would say. And so I think that any and all of the calls that the incumbents end up doing, which will be some volume, are probably not going to be the type of calls that are going to support

a large and exciting standalone new startup. Yeah. Yeah. And this is the pattern where they will use the new technology to extend their dominance of the categories they've always dominated, which is fine. All of the new categories, they're just going to be utterly unable to compete in, or at least that's been the historic pattern. And I think a good question is, if models are the new front end for the internet,

is search even a meaningful primitive? Are they going to then extend their dominance of a category that loses relevancy for the next generation of consumers and businesses? Yeah. And I think your point about even the term opinionated is so important here, because I would argue voice is a platform that

we intuit to be more opinionated, or we need it to be more opinionated than, let's say, you know... Because interesting people are opinionated. Exactly. And I'm even thinking, I mean, I might be going too far here, but some of the old KPIs that you would see for something like search or an application may not even be the same for voice. Like, you can imagine the magic moment might be like time to laugh. Like, how quickly can you get someone to laugh or to cry? Not intentionally, but to really engage with a model, a voice model, in a way that just wouldn't

necessarily occur with text. Yeah. I think in the average consumer's head, a Siri doesn't even compete with a ChatGPT voice mode or something like that, because they're just

such different feelings that you get as a user when you are using them. I think the other interesting part of this is that there are cultures in which being a little disagreeable, a little sarcastic, is actually highly preferred. And that's the way that you're supposed to build trust and interact with people. You know, I know that British culture is a little bit this way. Even East Coast culture. You know, we were having a laugh a few weeks ago about how we need a ChatGPT voice East Coast mode. Yes. Where it's just very short. It doesn't suffer fools. It says no. It says no, totally.

- Totally, yeah, exactly. - When you think about your friends, you don't have friends, or some people do, but most people don't have friends that are just at your service. - Yeah. - That there's some banter, there's some-- - It's uncomfortable. - They have an opinion. - Yeah. - This gets at what we're looking for in voice companion products, but even any consumer voice agent, like, there has to be some friction

If it's like too easy to build the relationship, if they're always saying yes to you, if they're not giving you the brutally honest feedback, then it gets old quickly. There's no value for you as a consumer to just have a yes man or yes woman following you. A yes model. Yes. Exactly. Following you around all the time. And so we actually get very excited by founders who are opinionated in how to build the voice agent as its own character, its own personality that the user is forming a bond with versus...

The voice agents we've had in the past where the user is treating them as a machine that they're handing basic tasks to. Right. That's right. Trust has to be earned. And if the models don't design for that, they're never going to get to their full potential. That's a great point. Well, as we work toward those kind of products, is there anything you'd like to leave the listeners with in terms of what's on the horizon, what you're excited about, maybe also where you'd like to see founders direct their attention?

I think one of the things that has been really interesting, and maybe it's just the standard tech platform shift, but we're seeing founders that are maybe new to an industry but spend a couple months going really deep, able to build the most powerful and highest growth and the highest inflection products. And that's just because I think the rules of the game are changing. And the...

type and power of products you can build is also above anything that we've ever seen. And so if you move quickly, in many ways, shipping fast becomes the moat, and you can catch up on everything else: the industry expertise, the networks, the knowledge base, the resourcing, all of that. And so I would say that has been one of the areas where we get most excited. Founders that maybe have only been in the industry for six months, a year, even less,

but are becoming quickly opinionated about what they need to build and, probably most importantly, just building really quickly and testing, getting feedback, and going from there. Yeah, so two things. One, if you're building in the space, talk to us, and, you know, the earlier the better. And then two, a prompt that we've discussed with a lot of AI founders is just: what is the incredibly, mind-bogglingly expensive

version of your product. So if you're charging a lot of consumers $20 a month or $100 a month, what would the $1,000 a month or $10,000 a month SKU look like? I think the same is very true in voice. Yes, there's going to be high volume use cases that we want to actually replicate or substitute voice AI models for. But what are the most sensitive, most precious, most high value conversations that are happening in the enterprise? And can you attack those? And what price would you charge for those?

might be $100,000 an interaction. Maybe that's a little extreme, but as a product design sort of exercise, why not? Yeah. I think that's worth exploring. It's a great prompt to leave people with. Thank you both so much. Thank you. Thank you.

All right, that is all for today. If you did make it this far, first of all, thank you. We put a lot of thought into each of these episodes, whether it's guests, the calendar Tetris, the cycles with our amazing editor, Tommy, until the music is just right. So if you like what we've put together, consider dropping us a line at ratethispodcast.com slash A16Z and let us know what your favorite episode is. It'll make my day and I'm sure Tommy's too. We'll catch you on the flip side.