DOP 289: When to Build Your Own vs. Using Off-the-Shelf

2024/11/13

DevOps Paradox

People

Darin Pope
Hugo Santos
Viktor Farcic

Topics
As CEO of Namespace Labs, Hugo Santos shares why his company chose to deploy its own hardware rather than rely on the big cloud providers. By focusing on specific use cases and vertically integrating compute and networking, they achieve higher performance and lower startup latency than the cloud providers offer. This requires deep infrastructure expertise and has been a painful process, but for their particular needs the payoff is very high. He argues that the cloud remains the better choice for most companies, while self-hosting or using a specialized cloud provider is viable for certain special use cases. Startups, he believes, should focus on solving today's problems rather than over-optimizing for future ones, since most startups won't survive long enough to face them. He also advises startups to take full advantage of the public cloud's elasticity rather than treating it like an ordinary data center.

As hosts of DevOps Paradox, Darin Pope and Viktor Farcic explore cloud provider choices, Kubernetes adoption, and the future of AI with Hugo Santos. They agree that the cloud remains the better choice for most companies, with self-hosting or specialized providers viable for special use cases. They also discuss the strengths of Kubernetes and how it can reduce dependence on any single cloud provider, and they consider how AI will shape software development and operations over the next five years, concluding that AI will assist professions such as software engineering and raise productivity, but that human creativity cannot be replaced.

Darin Pope primarily steers the conversation as host, raising questions and summarizing the speakers' points. Together with Viktor Farcic, he expands on and interprets Hugo Santos's views and pushes the discussion deeper with new questions. Viktor Farcic does the same, with a particular focus on Kubernetes adoption and on using Kubernetes to reduce dependence on specific cloud providers.

Deep Dive

Shownotes Transcript

My belief is that there's creativity in the human mind that is not going to be replaced, so you always need some sort of conductor, and I expect that to still be the case in five years. This is DevOps Paradox, episode number 289: When to Build Your Own Versus Using Off the Shelf. Welcome to DevOps Paradox. This is a podcast about random stuff in which we, Darin and Viktor, pretend we know what we're talking about.

Most of the time, we mask our ignorance by putting the word DevOps everywhere we can and mix it with random buzzwords like Kubernetes, serverless, CICD, team productivity, islands of happiness, and other fancy expressions that make us sound like we know what we're doing. Occasionally, we invite guests who do know something, but we do not do that often since they might make us look incompetent.

The truth is out there, and there is no way we are going to find it. P.S. It's Darin reading this text and feeling embarrassed that Viktor made me do it. Here are your hosts, Darin Pope and Viktor Farcic. So here we are, Viktor. We're starting to ease into the holidays. It's the middle of November. In fact, if you're listening to this on release date, it's November 13th. And through the magic of recording early, right now Viktor is on his way to KubeCon.

U.S. or N.A., however you want to say it. So is our guest. On today's show, we have Hugo Santos on. Both of y'all are going to be hanging out at KubeCon. Let me ask this question to start out. I'll ask it to Hugo first. Are we expecting anything new out of KubeCon this year?

It's a great question. Well, first of all, thanks for having me. It's great to be chatting with you folks. There have been folks going to KubeCon for much longer than I have, but it has definitely been an interesting journey seeing KubeCon evolve from a small gang of folks that worked on Kubernetes to these massive halls of companies growing up around the industry.

That's what I expect. I'm actually going there to see what's actually buzzing, what's capturing people's attention, and how AI in particular is seeping into everything that we talk about. I'm mostly going for the people more than anything else. I really appreciate that connection and being able to go and chat with folks that I've known now for a while and seeing what they're up to. Now, Viktor, before you answer, Hugo is the CEO of his own company, so he can choose to go to KubeCon.

unlike us plebs who are employees of companies, right? So Hugo, you are in a vaunted position there. He's employed as well, just by investors. Well, that's definitely an interesting way to look at it. We're very friendly with our investors. And I think if they felt that we were not using our funds well, I'm sure that they would ring the bell. But in this case, I think it's important to be where the action is. There are actually people there who are thinking about the same kinds of problems. So it is a deliberate...

decision. But it's also, when we talk about it internally, where it's worth being throughout the year, which conferences, which meetups, we look at it very much from who are the people that are going to be there, who are the companies. And I think the decision process probably doesn't look very different from your own. Viktor, is there going to be action at KubeCon this year? No. For the first five years, six years, seven, maybe eight years,

we were in the discovery phase where it was, hey, I did not think about this. So companies or projects were popping up with new ideas that you hadn't heard of before, right? Oh, service mesh, right?

or whatever that something is. Now we're more mature, I feel. So nobody I know is doing something revolutionary that hasn't been seen before, that will be announced at KubeCon, and everybody will have their eyes open and say, oh, I cannot believe that you came up with this idea. I don't think that we are in that phase anymore, right? That will be happening in some other areas, but not in cloud native, CNCF, and whatnot.

That does not mean that there are no improvements. It's just that now they are more incremental improvements. Similar to Docker, right? Docker is not revolutionizing anything anymore, right? They're just incrementally improving what they have. Now, Hugo used to be at Google, let's say, years ago. Is that a fair statement? Yeah, it's been three years, actually. Time flies. So it's been three years since you've been gone. How long were you at Google?

I was there for nine years working on infrastructure, mostly focused on the internal developers, folks building Search and YouTube and Photos and other applications. Definitely a very interesting journey. Google is its own microcosm. You might have heard this from other folks, and that is true. Whenever we bring in people

to our company today, and they come from Google, we often talk about a period of de-Google-ification, where it's a lot of translating how we did things internally back at Google to how the industry is doing it. But for the most part, similar semantics, just different ways to tackle the same problems. So in those nine years, you saw, let's say if you were to do the math right, that was probably 2015 when you were there? 2013, when I started.

So, right, because you've been gone for three. So 2013 was the early stages of Borg, maybe? I mean, Borg had been around, but that's when it started leaking out? I think that's when Google started talking about Borg. When I joined, Borg was already very established. So there were kind of phases to it. There were some teams that still...

were not fully bought in because of the performance impact of running multiple replicas, running in containers, or kind of Google's version of containers. But when I left, Borg was pervasive. Apart from a couple of very specialized applications, almost everything was on Borg. Folks tend to think of Borg as kind of like Kubernetes, but it has its own twists. The way that

package distribution happens is slightly different. The way that networking happens is slightly different. But for the most part, it's kind of similar semantics, or at least similar goals. That was why Borg was put together. But the world of how different teams used Borg in 2013 versus how we ended up using Borg when I left in 2021 was very different, extremely different.

And then I think there's a nice parallel here with also what we're seeing with Kubernetes and all of the CNCF projects as well. Like things start much more bare bones and then they grow in complexity, but also in integration over time. And that's the journey that we saw as well.

Until they grow to sufficient size that nobody can replace them anymore, right? It's really interesting how the energy necessary to replace something sometimes is so enormous that it's just not worth it. I mean, I cannot speak for Google, but we often felt this pressure of, well, we're so invested into Borg, but the industry is picking up on Kubernetes. So why wouldn't Google just move over to Kubernetes as well internally?

And it's just extremely difficult. Like the scale at which the internal systems operate is just so high that a lot of work is required to bring the equivalent systems up to par. And it's hard, you know, for a running business, even one at Google scale to justify doing that. So it will be interesting to see for how long Borg will continue to be the foundation at Google. It still is to the best of my knowledge today.

I'm actually surprised, to be honest, that Kubernetes itself is so widely used right now. If you asked me five years ago, I would say, yeah, it's going to explode. People are going to love it. They're going to use it widely. But enterprises with their silliness created 20 years ago will not be able to move. And some did not. Most of them did not move 100%. But I'm still surprised that

companies were motivated enough, or saw enough value, to actually make that investment, you know, because it is a big investment. Yeah. I think there was something that the folks that put Kubernetes together did really well, which was the composable model. In some ways, there's no single Kubernetes. There's a control plane. But if you look inside different companies, they all put different pieces together to build their own infrastructure.

And there are companies, and even ourselves, that tried to tackle this problem space early on. But I think now, in hindsight, that's actually a strength, and it's inevitable that you need specialization. And the fact that Kubernetes is more of a building block, I think, enables different companies to use it. I think Kelsey at some point mentioned that

Kubernetes is a little bit more like Linux. It's more like a core operating system, but more of a distributed operating system. And I tend to subscribe to that. It establishes a common language where people that are building different parts of the system, they understand each other and they have APIs that they can build off. They can pick and choose which APIs they target, which APIs they compose.

Yes, that adds complexity, because there's all this diversity of solutions, but that's also, I think, the strength and why Kubernetes sees the adoption that it sees. And even beyond that, with container platforms like ECS and Google Cloud Run, I think that flexibility ends up being extremely powerful. I have mixed feelings about that, to be honest. I think that accessibility was very, very beneficial from the project's perspective,

that, hey, because it's so extensible, we could end up with hundreds or thousands of different projects competing with ideas and moving it forward, right? Because I feel it's just as much about Kubernetes as about everything else, the ecosystem itself, right? That's what makes Kubernetes, actually. But then normal people, users, they get constantly confused. You know, how many talks have you seen

at KubeCon or wherever, where it starts with a screenshot of the CNCF landscape. And you know, okay, this talk is going to talk about the madness that surrounds us, right? Now, I feel that we are better off than we were years ago because we have some winners. Okay, most likely you're going to use Argo CD for synchronization. You're most likely going to use Cilium. So it might be slightly better, but it's still horribly complicated.

Just to choose, you know, just to say, okay, how do I start? Yeah, it is. I think a lot of folks look for a solution to that problem. I don't know if there should be a solution. It's like everything in technology. There are trade-offs. You can narrow down the set of options and then you make it simpler. But with that simplicity, you lose a lot of flexibility in building more powerful solutions.

I appreciate Cilium. So Cilium tackles a lot of problems, but Cilium could only come into the picture because there was enough flexibility to be able to build something like Cilium. Because if there weren't APIs to change the CNI, then Cilium would probably not be a thing. It would just be way too difficult to integrate into the Kubernetes ecosystem. And if you don't have an incremental story, then you're dead in the water anyway.
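[Editor's note: a rough sketch of the extension point being described here, not Cilium's actual code. Kubernetes delegates pod networking to whichever CNI plugin a small config file names, which is why a project like Cilium can slot in without changes to Kubernetes itself. The config below is a made-up minimal example in the real conflist format.]

```go
// The kubelet doesn't implement pod networking itself; it reads a config
// from /etc/cni/net.d naming a plugin binary to invoke. This just parses
// such a config; the file contents here are an invented example.
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

type cniConfList struct {
	CNIVersion string `json:"cniVersion"`
	Name       string `json:"name"`
	Plugins    []struct {
		Type string `json:"type"` // the plugin binary the runtime executes
	} `json:"plugins"`
}

func main() {
	// Typically read from a file like /etc/cni/net.d/05-cilium.conflist.
	raw := []byte(`{
		"cniVersion": "0.4.0",
		"name": "cilium",
		"plugins": [{"type": "cilium-cni"}]
	}`)

	var conf cniConfList
	if err := json.Unmarshal(raw, &conf); err != nil {
		log.Fatal(err)
	}
	for _, p := range conf.Plugins {
		// Swapping the CNI means changing which binary is named here;
		// Kubernetes itself doesn't change.
		fmt.Printf("network %q delegates pod networking to plugin %q\n", conf.Name, p.Type)
	}
}
```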

So I think those abstractions are helpful. And yes, they come at a cost. The cost is the multicolored, highly pixelated CNCF landscape banner ever growing, but it is a part of it.

Will the landscape ever quit growing? No. I also don't think it will stop growing. I think Viktor is right that the curve changes. It starts out very exponential at the beginning, when there are a lot of problems to figure out, and then it stabilizes, and probably now it's sublinear, in that the growth actually slows down. But I also don't think that it will stop, because the world doesn't stop. The types of problems don't stop. Five years ago, we would talk about

GPU orchestration and scheduling, but not with the same emphasis with which we talk about it today. And there's a very real reason for that. And in a few years, there will be something else that will be important, that we'll need to find a way to represent, and have the right abstractions and the right support software to incorporate into what we already have. So I think it's never going to stop. What is happening in five years? Depends on what part of Twitter you read. Yeah.

Okay, well, let's read all the parts of X, Twitter, whatever. What do you think? I think in five years... but maybe let's preface by saying that the work I'm most familiar with is tech, right? Whether it's software engineering or systems engineering, that's what I think about the most. And I think in five years, it will be much more assisted. I think it will be easier to address some levels of complexity,

because you will have access to vast encyclopedias of knowledge that you can just ask questions of. And they may even be able to reason about the question that you're asking: is this the right question? How should we approach it? There are some folks that think that there's some class of work that will be replaced. My belief is that there's creativity in the human mind that is not going to be replaced. So you always need some sort of conductor. And I expect that to still be the case in five years.

So I think copilots helping software engineers, helping DevOps, helping data center ops, helping all kinds of professions and elevating their work. I think the downside is that there's going to be a lot more output, and it's unclear whether the quality of that output, whether the signal-to-noise ratio, will remain the same. Very likely it will go down. So there's going to be more software, but

I think a lot of that software will be of lower quality, because the effort necessary to actually develop new software will be lower. So I think that's the counterpart to it. I don't think we know how to tackle that. Right now, we're still fairly simplistic, because you interact with an LLM, you ask a question, and you still interpret that question, even though there's a lot of energy around self-driven agents that build applications as a whole. I haven't seen

any kind of meaningful large system built that way. If anything, my experience back at Google tells me that these types of problems won't be solved for a long time. Software is only a small part of the problem. It is an important part, but it's only a small part of the problem. So yeah, assistance, folks will be able to do much more with less.

Maybe smaller companies that can build bigger products, more robust products, because you won't need as many people. That's directionally where I expect this to be in five years. I see a big part of the history of the software industry as an attempt to remove or reduce the number of repetitive things we do so that we can focus more on creative things, right?

This conversation to me is very similar to the conversations we had about Java a long time ago. Oh, those guys, they don't know how to manage memory themselves, right? This is going to pass, right? This is going to be poor quality, and whatnot. And it turned out it's not. It freed people to focus on other things. Now, in the case of Java, it went wrong because memory management freed people to write getters and setters instead, but...

Nevertheless, those things are happening all the time. The current state of AI to me is very similar. It helps me by letting me not open a browser tab and go to Stack Overflow to copy and paste things. I can do it now directly from my IDE, right? And that will continue improving. What I'm more interested in is operations, right?

We will almost certainly get to a point where operations are done with less and less human touch, because that's where repetitiveness plays a large role. How many times has either of you solved a problem just by restarting an application, right? I mean, it's a silly example, but there must be a lot of space there

to truly help people by performing some operations. And just to be clear, Kubernetes already took away part of that workload, but we can probably go further with AI. I think it will enable a lot more people to automate systems that previously would require some degree of specialization. You would either have to know a particular set of tools, or you would have to go and code something,

whether it's a small script or a program to do that automation. And I think it will enable that. In some ways, it might enable the idea of the no-code movement in a more effective way. I think there's another side of automation. Part of what I did back at Google, I actually started as an SRE. And I've always been very close to the SRE teams, even when I was building some of the infrastructure later on.

If we look at what SREs are often doing, they're making sure that we have the right data to make decisions. And then, in the event of an incident or on a regular basis, they're looking at that data to make decisions. It was a very common problem that sometimes you either don't have the right data, or you have so much data that it's hard to really pinpoint the root cause. And I think that's a class of problem where it's more of a qualitative

decision rather than a quantitative one. It's not that exactly this data point is wrong; it's more trends and correlations, where systems like LLMs may end up being impactful. We chat a lot internally about whether it would make sense to already start, and there are companies doing this, training models that are specialized in finding correlations between failures across systems.

For example, the reason why this test failed is not because this error line has been emitted, but rather because this transaction in Postgres failed for this reason. And often, to connect those dots, you need to have both the data and the knowledge that they are related. And those are the types of things that LLMs tend to do well at. So I also expect some major improvements in that area in the next few years.
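[Editor's note: to make the correlation idea concrete, here is a minimal sketch of the deterministic part of what Hugo describes: joining failure events from two systems by time proximity to surface candidate root-cause pairs. The event data is invented, and the LLM's role, ranking and explaining such candidates, is not shown.]

```go
// Toy illustration: surface pairs of events from two systems (CI test
// failures, database errors) that happened close together in time. A real
// pipeline would feed these candidates, plus logs, to a model for ranking.
package main

import (
	"fmt"
	"time"
)

type Event struct {
	Source string
	What   string
	At     time.Time
}

// correlate returns pairs of events from a and b within the given window.
func correlate(a, b []Event, window time.Duration) [][2]Event {
	var pairs [][2]Event
	for _, ea := range a {
		for _, eb := range b {
			d := ea.At.Sub(eb.At)
			if d < 0 {
				d = -d
			}
			if d <= window {
				pairs = append(pairs, [2]Event{ea, eb})
			}
		}
	}
	return pairs
}

func main() {
	t0 := time.Now()
	tests := []Event{{"ci", "TestCheckout failed: timeout", t0}}
	db := []Event{{"postgres", "transaction aborted: serialization failure", t0.Add(-2 * time.Second)}}

	for _, p := range correlate(tests, db, 5*time.Second) {
		fmt.Printf("candidate root cause: %q may explain %q\n", p[1].What, p[0].What)
	}
}
```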

Will people really be hosting their own LLMs, or are they just going to use services, whether that's from a specific provider or the big three? There's an economic question, and there's kind of a philosophical question in there. I think there's a lot of questions around privacy that we don't quite have an answer for. So far, almost everything we did around data, it was data that was produced for some kind of work reason, right?

And now, as LLMs get more into our day-to-day, they don't just know the output, they actually know your train of thought as well. I hear some folks concerned about that. Like, I don't want to reveal that I'm looking into this, beyond just the query stream, because you often are having a conversation that

tells a lot about where you're headed. So I think from the privacy perspective, and perhaps because I'm European, you know, we think a lot about privacy, there will be some economic investment to support some sort of self-hosting or privacy-centric offering. On the large-scale side, I think it will be impossible to make the unit economics work unless you are a hyperscaler or hyperscaler-like. First of all, the number of machines...

that you need, whether it's GPUs, TPUs, other systems, specialized systems. And the sheer amount of energy that you need is just mind-boggling. At Google, I didn't have any insight into this. Nowadays, I do because we actually deploy our own hardware into data centers. So we're kind of a small cloud provider as well.

And I can only imagine the challenge of going from a couple of tens of megawatts to hundreds of megawatts to then a gigawatt data center. And you just cannot do that as a small operator. So your company is Namespace Labs. You're putting your own hardware into co-location facilities. I'm sort of filling in the blanks here. Is that correct? Yeah.

So why are you doing that instead of doing what people tell people to do and just say, go to AWS, go to Google, go to Azure? Why are you deciding to run your own hardware? That seems completely silly in 2024. And it is silly in some ways. The reason why we do it is because for us, for the types of problems that we're tackling,

we can provide a better service to users by doing so, because we design the compute and networking that goes into these racks in a way that maximizes those use cases. And actually, some of this goes back to what we were talking about before, Kubernetes and containers. In Kubernetes, you can start a pod, and it will know how to pull an image from a registry, and it just works.

But end-to-end, it might take tens of seconds until a pod actually starts. And why is that? Because a lot of these components are built in a way that doesn't take performance into account, they're not vertically integrated, and a few other reasons. And we look at this as, well, I want to start the pod in a second. So how can I start the pod in a second? Well, I need to have a certain class of CPUs. I need to have a certain amount of memory bandwidth.

I need to have my images set up in a particular way ahead of time. I need to have enough network bandwidth. And the only way that we found to have those guarantees is by deploying our own hardware. We also have use cases where we use AWS and the other hyperscalers, and they provide a fantastic service. And I think folks should rely on

the cloud. I think that makes sense for most companies. But we're building infrastructure, and this vertical integration, we're using it to really provide extremely high performance that we otherwise wouldn't be able to. This is the answer I actually like, right? You're setting up your own data centers, I mean, maybe co-located, doesn't matter, because cloud providers currently do not solve the problem that you're having, right? Yeah.
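[Editor's note: for a feel of the cold-start numbers being discussed, here is a small client-go sketch that times how long a pod takes to go from creation to Running on whatever cluster your kubeconfig points at. It assumes a reachable cluster and the default namespace; it's a measurement toy, not Namespace's tooling.]

```go
// Measure end-to-end pod startup latency: create a pod, then watch it
// until the API server reports it Running, and print the elapsed time.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("",
		filepath.Join(os.Getenv("HOME"), ".kube", "config"))
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "startup-probe-"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "probe",
				Image:   "alpine:3.20", // cold-pulled unless pre-fetched on the node
				Command: []string{"sleep", "30"},
			}},
		},
	}

	start := time.Now()
	created, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Watch just this pod until it reports Running.
	w, err := client.CoreV1().Pods("default").Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + created.Name,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if p, ok := ev.Object.(*corev1.Pod); ok && p.Status.Phase == corev1.PodRunning {
			fmt.Printf("create -> running: %v\n", time.Since(start))
			return
		}
	}
}
```

On a stock managed cluster this commonly lands in the tens of seconds Hugo mentions once image pulls are involved; pre-pulled images and faster nodes shrink it considerably.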

Which is not the answer I get from most other people. Quite a few companies don't want to use hyperscalers for other reasons: I don't trust them, or I can do better and they cannot. We do it, and believe me, it's really, really hard. And that's why I say that, even though we do it, it's still a little bit silly. And we do it because...

We have tried multiple times to build out the same services with what the hyperscalers provide, and we just couldn't. So that's why we turned to our own hardware. And it kind of makes sense. The hyperscalers are economically incentivized to offer hardware that is mostly general purpose, that fits most workloads well.

And our workloads are slightly different. The things that we care about are slightly different. Like most people, they only deploy pods maybe a couple times per day or maybe a few hundred times per day. And so if a pod takes 10 seconds to start or 20 seconds, it doesn't really change things too much. For us, the startup latency is extremely important because we're building ephemeral compute and

And you want to start the workload and you want that workload to start working immediately, as quickly as possible. So the startup latency matters a lot. Then you look at some of these hyperscaler SKUs, even their high-tier SKUs.

It's not uncommon to see something like, okay, the biggest VM that you can get, unless you get a network-optimized virtual machine, which would then compromise on something else, gives you something like 25 gigabits per second. And 25 sounds like a lot, but again, when you're pushing gigabytes of data very quickly to start the equivalent of pods in just a second, then you need a lot more bandwidth. And that's one of the things that we do differently. So it sounds like you're probably using InfiniBand

type speed for what you need. Is that true? We still do just regular Ethernet, but yeah, in a rack we have hundreds of gigabits per second of capacity available. But it's not just that. I mean, that is extremely important, but locality is very important as well. You have a certain amount of network capacity in a rack, and then if you have to go from one rack to somewhere else, you already tend to be constrained.

Our systems make decisions about where to start the workload based on where the data that you need is, so that we rely on as much network capacity as is available within the rack. That's one of the tricks that Google did as well. And there were many workloads that took a while to move over to Borg, because Borg didn't have rack-level scheduling affinity for a long time.
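[Editor's note: a toy sketch of rack-affinity placement as described here, not Borg's or Namespace's actual scheduler. Prefer nodes in the rack that already holds the workload's data, and only then consider free bandwidth. Node names, racks, and bandwidth figures are invented.]

```go
// Toy placement: in-rack locality first, then free bandwidth as tiebreaker.
package main

import (
	"fmt"
	"sort"
)

type Node struct {
	Name     string
	Rack     string
	FreeGbps int
}

func pick(nodes []Node, dataRack string) Node {
	sort.Slice(nodes, func(i, j int) bool {
		// In-rack placement wins: cross-rack hops burn scarce spine bandwidth.
		iLocal, jLocal := nodes[i].Rack == dataRack, nodes[j].Rack == dataRack
		if iLocal != jLocal {
			return iLocal
		}
		return nodes[i].FreeGbps > nodes[j].FreeGbps
	})
	return nodes[0]
}

func main() {
	nodes := []Node{
		{"n1", "rack-a", 40},
		{"n2", "rack-b", 100}, // more bandwidth, but the data isn't here
		{"n3", "rack-a", 80},
	}
	fmt.Println("placing on:", pick(nodes, "rack-a").Name) // n3
}
```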

When you're doing high performance, there are a lot more things that you need to take into account that go beyond, maybe I have 100 VMs and networking works between them, like my service mesh just works. In our case, network bandwidth, and in-rack network bandwidth in particular, are important factors that we take into account when scheduling. We actually place the equivalent of pods based on the capacity that we have. I think the key part to all of this is a phrase you were using earlier:

the use case. You've got a specific use case, and it sounds like you may have tried it on public cloud and quickly determined, eh, that's not going to work out for what we need, so we've got to suck it up. Not build your own data center, you didn't go that far, but you did go to a co-location facility and start building out your racks. As you went to your co-location facility, after having been at Google, what were the pain points in figuring that part out?

Actually, it was like 15 years working with some sort of managed cloud; Borg is some sort of managed cloud. And before that, I had another company before Google, where we built on AWS. One of the magical properties of cloud is that when you need capacity, for the most part, you have that capacity available immediately.

If you want to schedule a lot of capacity, you may hit some quotas. But if you can work through those quota issues, then you have that capacity available immediately. Hey, you hit quotas in your own data center too, just to be clear. That's right. So that's what I was getting to, that that's what's different when we deploy our own hardware. There's the whole side of procurement. So first of all, you need to decide how much hardware you need,

because it doesn't just show up. It can take six to eight weeks to be made available to you. You need to have the colo capacity available. So maybe you already have a few racks and then you want to deploy additional hardware and you need a few racks more. And you call the data center saying, in this data center, I don't have a full cage, so I just need a few more racks. And they might tell you, well, we don't have more racks for another six months. And then you need to start looking somewhere else.

Power budgets. It's another thing. We take power into account into our scheduling as well because different racks have different rated circuits. We deploy into different data centers. We have different types of contracts. In some contracts, we have the full circuit available. In other contracts, we pay per usage. So we actually decide where to deploy workloads that you're going to run right now

based on things like power: what is the power availability, and how much would it cost us? And those are the types of problems you never have to think about when you're just using cloud. And you shouldn't. I think these are very specialized problems. But we had to learn a lot. Everything from cooling to power usage to network capacity, who are your transit providers,

how do you build out your spines? All of those things you have to consider. If you're not an infrastructure geek, or you don't have a bunch of infrastructure geeks on the team, I would say don't even venture down this path, because it is painful, and you need to draw some enjoyment from it as well, because it is very painful. In your case, that's probably not the situation, but many other companies often don't realize

how much they're taking resources, I mean humans, away from things that actually might matter much more, right? Oh, we can do it. I know that you can do it. But what if those people were working on this instead, right? 100%. Even if you work with a fully managed provider. So I think over time, it has always existed, but I think there are now a few more.

There are a few providers out there that you can call and say, well, I want this amount of hardware in a rack. They'll actually order the hardware for you. They'll deploy that hardware. They'll deploy the network capacity. You have network engineers available to you. You pay a premium for these services. But even in those cases, you're still not removed from the fact that you have the

hardware that you need to manage. And there are other things that you need to take into account. But it's a world of trade-offs. In our case, it is a cost, but the return that we get on it is extremely high. We often tell folks that we build full-stack systems. When we're building a new product, we actually take into account our rack design, how much capacity is available in our current racks, and how we should evolve them.

And even certain products will only be available in some of our future racks, because they're kind of designed for that. So we have that luxury, in that now that we've made that investment, we can really go deep and change the economics of these features, so that we can offer very high performance, which for us is the driving factor, but still at a competitive price point, so that you don't just throw a bunch of money at it.

I think there is a space for co-location, but not for everyone. I think most folks are better off investing in cloud. And you have so many providers out there. If you don't want to spend a lot of money with a large hyperscaler, there are smaller providers that offer nearly as good a product at a very competitive price.

Where I feel there's somehow a missed opportunity in the cloud area is that you have huge hyperscalers, Google, Azure, AWS, Alibaba, whatnot. And then you have a bunch of smaller providers like DigitalOcean that basically say, we are doing a subset of the services of those hyperscalers at a lower cost and potentially easier. But I don't see many more specialized providers.

And I feel that that might be an opportunity. Hey, instead of trying to be the same as AWS, just cheaper, why not provide a service that AWS could probably offer, but that's not sufficiently big for them to tackle? You can be very specialized and say, hey, my customers are those 1%, whatever they are, instead of, I'm just cheaper.

One thing that is very real is the innovator's dilemma if you're a big hyperscaler, and I saw that in my past a couple of times, not necessarily in this area, but more on the consumer side. If you're making $1 billion in revenue, just to use a number, and you have an opportunity to start a new part of the business that does $10 million, it's really hard to go and invest in that, because it's just such a small relative number compared with your total revenue.

But if you're just starting a company today and your target market could actually yield 10 million in revenue, that's extremely attractive. So that's the opportunity, I think, for many startups out there. And in many ways, that's also what we try to do. We are a specialized cloud provider. There's one particular vertical that we go after, and we go out of our way to make the best product out there in that particular vertical. And I think there are still many other verticals,

as you were making reference to, that are out there, that are opportunities for folks to tackle. There are a couple of companies that I find inspiring. Modal is a good example on the serverless GPU orchestration side. Fly is another good example of keeping the API control plane super simple and just running your applications. So I think there is space to do something special that can still target a fairly big market.

I want to go back to your $10 million example. Let's say I come up with that idea. Let's say I'm going to run a Valkey-specific service. Valkey is the open source fork of Redis. And we've seen variations of this over the years, people running Elasticsearch back when Elasticsearch was originally open source. Different conversation. But I've got a $10 million idea, and I don't know how to properly use public cloud. So even though I have a $10 million idea, I'm spending $20 million to bring in that $10 million.

How do we not do that, so that I have a $10 million idea and it only costs me $2 million to do it? Yeah, that's where engineering ingenuity comes into play. Now I'm older, and with age comes a more acute sense of this-is-going-to-be-hard, so jumping into an opportunity is done much more carefully. I think the younger me would just be very bullish and say, why wouldn't I just buy four machines and

go to the local computer center and buy four machines? So I need to find 10K, 20K US dollars from somewhere. Maybe I get a loan from my friends or something. And I go buy those machines and place them somewhere where power is continuous and networking is good enough. So I need to find that. And I just deploy those services there. You don't start...

with being able to target 10 million. You start by being able to target 100K, and then from 100K, you make enough motion to be able to target 1 million, and so on. So I think you really need to think a little bit outside of the box and be comfortable doing things that other folks are not comfortable doing. Because if you just build on the public cloud, then you're not going to differentiate yourself. It will be very hard,

if you're just hosting Valkey, to actually build a differentiated service. So there's no formula that I know of; there's no, here's how you think outside of the box. But that's what I would say is important here. The one point I wanted to add to that is: if you're going to use public cloud, use public cloud the way it's meant to be used, elastically. If you're going to treat it like a dumb data center, you're going to be spending more money than you ever thought you had in your life.
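[Editor's note: a minimal example of what "elastic" use looks like in practice: a small client-go program that scales a dev environment to zero replicas outside working hours instead of paying for idle capacity around the clock. The deployment name and namespace are hypothetical.]

```go
// Scale a Deployment to zero replicas between 20:00 and 07:00, one otherwise.
// Run it on a schedule (cron, CI job); it is deliberately naive.
package main

import (
	"context"
	"log"
	"os"
	"path/filepath"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("",
		filepath.Join(os.Getenv("HOME"), ".kube", "config"))
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Off-hours: nobody is using the preview environment, so stop paying for it.
	replicas := int32(1)
	if h := time.Now().Hour(); h >= 20 || h < 7 {
		replicas = 0
	}

	ctx := context.Background()
	scale, err := client.AppsV1().Deployments("dev").
		GetScale(ctx, "preview-env", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	scale.Spec.Replicas = replicas
	if _, err := client.AppsV1().Deployments("dev").
		UpdateScale(ctx, "preview-env", scale, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("scaled dev/preview-env to %d replicas", replicas)
}
```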

You don't have to necessarily start designing it for Google scale and say, I need to go to AWS right away. There's absolutely nothing wrong with using DigitalOcean when you start your company. Maybe when you reach 100 mil in revenue, that will not be enough for you, right? Maybe then you will have to move to AWS or build your own data center or whatever. But I still don't understand why startups are still...

kind of afraid to use anything but the big hyperscaler? Yeah, that's a great question. First of all, I really like this idea that you shouldn't optimize for your future problems. You should optimize for the problem that you have today. And then most startups don't make it. So you just have to do well enough today so that you can live to have the problems of tomorrow.

The DigitalOcean question, and there are a couple of providers like DigitalOcean; Hetzner has been on the Twittersphere quite a bit, and OVH a bit as well. It's a great question. I think there's a certain amount of tribal confidence that folks lean on. Like, what is everyone else using? Well, they're using a hyperscaler, so probably I should also be using a hyperscaler. So how do you break away from that? I think it's a bit of that out-of-the-box thinking.

There's maybe more that those providers could be doing to build that confidence.

There's also an economic aspect here. I think the hyperscalers have been very smart at this, and, you know, we've benefited from that as well, so I'm not going to push back on those programs, but they incentivize startups economically to use their products. And that creates a moat, because then you're invested in a particular architecture. And to use your point, that you should use the cloud the way that it's meant to be used:

So for example, if you're using Lambda and now all of a sudden you run out of credits, pulling out is quite the project. And more likely than not, you're just going to continue using it. And now you're an AWS customer.

So I think finding the right economic incentives for startups, because startups are often just trying to scrape by, and you just want to lean on the ones that will become big over time. So playing around a little bit with that may be important for some of these other providers. That's where I feel that Kubernetes is underutilized, right? There is that real fear of, hey, if I do my stuff in AWS Lambda,

will I be forever and ever tied to it? Or if I go to DigitalOcean and start spinning up my VMs, will that ever work anywhere else? But Kubernetes is that normalizer, right? It takes a bit of time to get used to it. That's true. But if you're talking about a normal app, I'm not talking about requirements like the ones you were describing, but I have a killer idea for a killer app, I run it in DigitalOcean or wherever in Kubernetes.

That same application will work almost the same if you move it to EKS or GKE or whatever. Now, again, I'm talking about a normal app, no special requirement. You're not going crazy with custom things. It works the same. I mean, it's not the same, but close enough.
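[Editor's note: Viktor's "normalizer" point, sketched in code: the same client-go program runs unchanged against EKS, GKE, or a DigitalOcean cluster, because the API it speaks is the same everywhere; only the kubeconfig context differs. The context names below are hypothetical.]

```go
// Same binary, different providers: nothing in this code knows or cares
// which cloud hosts the cluster behind each kubeconfig context.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func listDeployments(kubeContext string) {
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{CurrentContext: kubeContext},
	).ClientConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Empty namespace means "all namespaces" for List.
	ds, err := client.AppsV1().Deployments("").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s: %d deployments\n", kubeContext, len(ds.Items))
}

func main() {
	for _, name := range []string{"do-cluster", "eks-cluster", "gke-cluster"} {
		listDeployments(name)
	}
}
```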

So this is an area I'm less familiar with, but I've seen some attempts at a vendor-neutral solution for more of the serverless space. Knative comes to mind. I think it's hard. Just having that, without a hyperscaler or without provider support to make it work really well, there's less of an incentive. And I think you started by saying that there are a lot of parallels in some of these things. And this brings me back to something

from a long time ago, and maybe some of the listeners won't even know about this, but the mobile operators struggled in the move to data, when data became primary, because they became more undifferentiated, right? The service that they provide is data, and so any sort of additional services are now provided over data. So you could just move across different

mobile providers with very little cost. And that's great for the consumer. That's fantastic. You can just go and pick the one that has the best reception, the best price. But for the operator, that's not so great, because there are fewer opportunities to monetize. And I think that's the dance that these providers do. They do want to lean on

having vendor-neutral solutions, because they also want to be able to attract customers in. We already support whatever you're doing out of the box. That's great. We run Kubernetes. You do Kubernetes, we do Kubernetes. We're talking the same language. But it's always a little bit difficult over time not to become a race to the bottom from a cost perspective.

Because that's great for the companies that build on the providers, that costs just go down. But that's not great for the providers being able to finance their own research and develop new features, because you're constantly under this pressure of lowering your prices. And I don't have a solution for that. The way that I look at it is, you should find some sort of unique value that is hard to replicate, but it's a real tension for these providers.

I have good friends that work at some of these mid-tier providers that wouldn't call themselves hyperscalers. I think they provide an extremely high quality product that is great for most startups. And 100%, if you have a startup, you should be using them, because you're probably going to have a closer connection with the engineering team as well if you do end up with problems, because they're smaller, and they'll be able to support you better, and you'll have a better price.

So we've talked about it. Hugo is the CEO and founder at Namespace Labs.

But we haven't told you what Namespace Labs is about. Hugo, why don't you give us the elevator pitch real quick? We're a developer infrastructure company. We work with developer teams to accelerate their developer workflow. What we focus on today is providing extremely high performance build and test infrastructure, using some of the things that we were talking about before. No one likes to wait for their builds. No one likes to wait for their tests.

So we provide off-the-shelf compute that is just faster than what you get whether you do it yourself or use some of the hyperscalers to run those workloads for you. And then we add in some of our unique special sauce around making workloads as incremental as possible, to make them extremely high performance. But again, it's about making developers' lives better by accelerating their workflow. That's what we're about.

So all of Hugo's contact information will be down in the episode description, and Namespace Labs can be found at namespace.so. Hugo, thanks for being with us today. Thank you so much for having me. We hope this episode was helpful to you. If you want to discuss it or ask a question, please reach out to us. Our contact information and a link to the Slack workspace are at devopsparadox.com/contact.

If you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover this podcast. Go sign up right now at devopsparadox.com to receive an email whenever we drop the latest episode. Thank you for listening to DevOps Paradox.