DOP 282: How To Measure Software Complexity

2024/9/25

DevOps Paradox

Darren Pope

Victor Farsen

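Darren and Victor discuss John Gall's view of complex systems: a complex system that works always evolves from a simple system, while a complex system designed from scratch usually fails. Using Kubernetes as an example, they describe how a complex system evolves from simpler ones and the challenges met along the way. They also discuss the three laws of software complexity proposed by Mahesh Balakrishnan: 1. a well-designed system degrades over time; 2. complexity is caused by leaky abstractions; 3. there is no upper limit on software complexity. They analyze the reasons behind each law and explore how to cope with growing software complexity, such as how to handle changing requirements, how to weigh the cost of maintaining an existing system against the cost of replacing it, and how to avoid over-complicated system design. They argue that a system's cost-effectiveness needs to be evaluated regularly to decide whether it should be replaced, and that experienced engineers should guide junior engineers so that software does not become overly complex. Darren and Victor analyze in detail why software complexity grows and propose strategies in response. They argue that the growth of software complexity is unavoidable because requirements change over time. To cope with that change, we need to evaluate a system's cost-effectiveness regularly and decide whether it should be replaced. At the same time, we need to avoid over-complication when designing systems and use good abstractions to hide implementation details. In addition, experienced engineers should guide junior engineers so that software does not become overly complex. They also use Kubernetes and Kafka as examples to show how removing redundant code and improving system design can reduce software complexity.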

Deep Dive

Chapters

This chapter explores John Gall's General Systemantics and its application to software systems, using Kubernetes as a prime example. It examines how complex systems evolve from simpler ones and the challenges of designing complex systems from scratch.

Complex systems that work evolve from simpler systems.
Designing complex systems from scratch rarely works.
Kubernetes' evolution from simpler systems like Docker illustrates this principle.

Shownotes Transcript

The needs for complex software are there, so it doesn't really matter that much whether we want it or not. We are forced into it in a way, right? This is DevOps Paradox, episode number 282. How to measure software complexity. Welcome to DevOps Paradox. This is a podcast about random stuff in which we, Darren and Victor, pretend we know what we're talking about.

Most of the time we mask our ignorance by putting the word DevOps everywhere we can and mixing it with random buzzwords like Kubernetes, serverless, CI/CD, team productivity, islands of happiness, and other fancy expressions that make us sound like we know what we're doing.

Occasionally, we invite guests who do know something, but we do not do that often since they might make us look incompetent. The truth is out there, and there is no way we are going to find it. Yes, it's Darren reading this text and feeling embarrassed that Victor made me do it. Here are your hosts, Darren Pope and Victor Farsen. Victor, have you ever heard of John Gall? No, I haven't.

No idea. I heard the Father John, but not the Gollum. Not the Gollum. Well, John Gall, looking at his Wikipedia page, passed away in 2014. He was born in 1925. So he was a little before both our times, but he hung out until 10 years ago. He was known for his 1975 book, General Systemantics: An Essay on How Systems Work, and Especially How They Fail. We don't know anything about failure of systems, do we?

No, my systems never failed. And I'm now planning to also start writing code in Rust because Rust has no bugs, I was told. Has no bugs? No bugs. It's a bug-free language. Yeah, when you write code in Rust, you're bug-free. That's the promise. Interesting. I think that's going to be a different episode. We'll see how that plays out. But Gall's Law states, and I'm going to read this, a complex system that works...

Okay, a complex system that works. Do we need any more than those five words? Because in order to get to a complex system that works, here's the definition, is invariably found to have evolved from a simple system that worked. Continuing, a complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working, simple system.

That sounds like the first thing that comes to my mind when you read that is cities, right? Cities would probably fall into that category. It's a complex system that works. It wasn't designed initially with the scope it is today. It just evolved there. It's sort of like the sidewalks on a college campus.

Yeah, that's designed. Well, except for the college students don't stay on the sidewalks. They just cut across to find the new sidewalks. So that's what I'm saying is you did this whole big grid of sidewalks that nobody uses. Instead of first waiting to see what the traffic patterns are and then laying down the concrete. Exactly. Yeah.

This inevitably will lead towards waterfall, this conversation. I think we're already there, but let's put it into something a little more concrete. So let's apply our tech space to the first five words, a complex system that works. Can we define a complex system right now, just for definition? Let's say it's Kubernetes, right? Is that fair enough? Yeah, that's a complex system, fairly complex system. Okay, right.

...is invariably found to have evolved from a simple system that worked. What would have been the simple system that Kubernetes would have evolved from, or maybe multiple systems? The first iteration of Kubernetes was only stateless applications and only Docker, without much more. Basically, it's how can we run containers and a bit of networking, a very primitive one. So it's very, very different from Kubernetes today.

Especially if you count into Kubernetes, not only Kubernetes itself, but the end result. The end result is almost never core Kubernetes only. So you're saying the simple system that got to initial Kubernetes, right? Let's make that part. Relatively speaking, simple. Yes. Was Docker.

A way how to schedule Docker containers across a fleet of nodes. I'm simplifying it now, but let's say for the sake of argument, that's what it was. So how did we get to that point? So let's, again, we're stopping at V1 of Kubernetes, not V1 technically, but, you know, the initial versions of Kubernetes. Docker led to that, but what led to Docker?

What led to Docker is a company, which name I forgot now, the one that Solomon had before founding Docker, trying to solve its internal problems for the SaaS services that they were offering. Okay. But what was the technology that got us to that? Technology. Technology is like ancient. We're talking about namespaces and stuff, the things that nobody knew how to use but existed in Linux forever. Right.

Is that what you're referring to? That's where LXC and the other types of containerized, even though it wasn't containers as we know it in 2024. By the way, containers as a term still doesn't exist in Linux to this day. I was just referencing it as a container. It's not a container. Yes, I know. Cgroups, everything else, right? Because Docker gave us the ability to manage cgroups easily. Exactly. It made it available to the masses, to everybody.

And when I say everybody, literally everybody can learn it in no time. Now, we're not going to continue on from prior to cgroups because that really predates. Well, it doesn't really predate me, but it's from before I got into it. Yeah, we're not going to go down the route of mainframes and whatever was before Linux, right? Unix. Let's talk about Unix. Now let's imagine a world

where Kubernetes was just being created by, let's say it's Google, because Google created Kubernetes, magic hand wave. And they just said, here's Kubernetes. Could they have done that without prior things? Going back to Gall's Law, a complex system can't just exist. There have to be simple systems first, prior to getting to the complex. So, I mean, in all fairness, they could have done it, whether it would be successful or not, without Docker, but...

But they couldn't have done it without cgroups and whatnot. Or at least nowhere near the form that we have it today, right? It's very hard to play the could have, should have type of game, right? So yeah, they couldn't have done it without prior work, of course. That applies to almost everything, right?

On the other hand, it was almost inevitable that some kind of system that would allow us to leverage compute at large scale would appear sooner or later. But yeah, they would need to come up with proprietary stuff if it didn't exist. And that would still be based on prior work. So we've talked about Gall's Law a little bit, but I want to focus on a blog post by, and please forgive me, I'm going to try my best, Mahesh Balakrishnan. Mahesh, if I got it even close...

It's a miracle. And if I butchered it, I apologize. But he recently had a post from May of 2024 titled The Three Laws of Software Complexity, subtitled, or Why Software Engineers Are Always Grumpy. So do we really want complex software, Victor? Do we want it? The needs for complex software are there. So it doesn't really matter that much whether we want it or not.

We are forced into it in a way, right? Because if you go back 20 years in time, then making something that can sustain any number of users would be very different, because any number of users would be... if you ever get 1,000 concurrent users, that would be unbelievable. We're never going to get to that number, right? And today that's the hobby site. That's kind of, okay, this is ridiculous. Of course, kind of, that's the requirement to run on your laptop, right?

Requirements are changing through forces around us. And with those changed requirements, complexity is increasing as well. Now, the question is, how much of that complexity do we see and how much don't we, right? Because you can say, hey, I set up and run Kubernetes myself, right?

That's much more complex than if I use GKE, which is more complex than if I use GKE Autopilot, which is more complex than if I use Google Cloud Run. Now, realistically, there is no difference in complexity. I mean, there is some difference, right? But for the sake of argument, all those systems I just named are equally complex, but we as users do not necessarily see that complexity. But it's there.

So let's get into the three laws that Mahesh wrote. The first law of software complexity. You may laugh, you may not laugh. We'll see. A well-designed system will degrade into a badly designed system over time. Is this foreshadowing for Kubernetes? Yeah. I mean, I'm not sure about foreshadowing Kubernetes, but what he's saying is inevitably true for two major reasons. First, because we mess it up over time, right?

You have a greatly designed system and then, you know, the more people work on it, the more of a mess accumulates. That's just how we humans operate, right? I'm yet to see many examples of companies who actually spend a significant amount of time cleaning up the mess. That's rarely happening because, hey, it's working.

And the other reason is what I mentioned before: the requirements are going to change. And when looking at the system a while later through the prism of new requirements, we say, hey, this is just silly. This doesn't make sense. This is so badly designed. But it's not necessarily, I'm excluding now the case of people messing it up over time, right? It's not necessarily badly designed according to the requirements when it was designed, but with the introduction of new requirements, that turns into a bad design. You know, if I need to have a million concurrent users, this is just horrible. I would never design something like that. But when it was designed, that was not a requirement.

I confused everybody now, right? I'm not sure even what I said. Well, what you're saying is we had the initial requirements. Everything was met. Then we start doing change requests against that system.

And we lose the plot of what that initial system was meant to do. We start turning it into Frankenstein's monster. Bingo. That's the point. And when I say change requirements, I don't mean only, hey, the management is willy-nilly every second day changing things. But simply what we need is different every month, every year, right? The needs are different. You cannot compare, I don't know, let's say Netflix today with Netflix 10 years ago, right? Their system changed.

The requirements are very, very different today from back then. So if you believe you're going to write a system and launch it and never touch it again, it's possible. But I think in this decade, that's not going to be probable over time. To be clear, that's happening with almost anything. It's not related only to software, right?

I can say the same thing for bridges, right? If you take a bridge that was designed 100 years ago, it was designed with the idea that, hey, you might have on average one car every five minutes. And now you have, I don't know, one car every five seconds, right? The load on that bridge must be different. And it's not necessarily going to collapse right away, right?

But it's going to wear out faster than originally thought. How can we handle these change orders that come in on a system? Should we reject change orders if it's going to change the system too much? Of course, you can reject some change orders. But more often than not, you cannot reject them, because what we do is for the benefit of the business that we work for, right?

So I'm now excluding silly requests from silly managers who don't know what they're doing. But in general, you cannot reject requests. "Hey, we have more customers. That's a new request. You need to handle a higher load." I cannot reject that. I cannot say, "Hey, you know what? We can go under. We can go bankrupt. Why do I care?" I mean, you can try to do that, but that's not how the world works. You have new requirements and you need to accept them.

There are actually two important questions there. Will you do what is necessary to prolong the life of something? And this can be equally applied to software engineering or anything else, right? So I can have a system that is designed like this and it will survive five years. We don't know that in advance, but history will tell us later that it's five years, right? Now, there are certain things I can do to that system, modernize it up to a point so that it survives 10 years.

As long as we understand it, that's an investment that needs to be made. And the second thing is to recognize when the price of maintaining that something and applying new requirements ends up being higher than replacing it altogether. And I could repeat exactly the same story for the bridge example before, right? You're maintaining the bridge and prolonging its life by painting it, increasing the structural, something, something. But then comes the time that you say, you know what? This is so expensive now for this bridge that I'm going to create a new one. It's going to be cheaper and more efficient for me.

I'm a simple person. I'm going to use a car example. At a certain point, a car becomes too expensive to maintain. Exactly. Then it's just time to go take another $5,000 and go buy another cheap car that's going to last: Toyota or Lexus or Nissan or Infiniti, by the way. I love my European cars, but they're too expensive to maintain. The others, not so much in comparison, at least. So at that point, you're maintaining that vehicle and now you're spending $600 a month, but would that $600 a month be better put towards a new car?

Yeah. And that new car might be cheaper than the original price of the old car that you bought, and you're still going to get better value. Because things change. I don't know. Now it's electric, or now it has auto brakes, or whatever. I don't drive cars much, but whatever new things are, right? You're going to get it. And most likely, yeah, you buy maybe not a 5,000 car, but if it's a new car, a 15K car today, that car is going to be cheaper to maintain than your old car and give you more benefits than that old car that you maybe paid 50K for.

The tech analogy: once it starts costing more, it's time to get rid of it. Yeah. But now we're coming to the problem. If I keep the analogy with the car, you said $600 a month to maintain, right? Yeah.

And let's say for the quarter that would be 1,800, right? And you say, okay, but you know what? If I invest 1,800, that's still much cheaper than 5,000. And I'm really – we just need to push through this quarter. We just need to survive it now, right? We'll think about the new car later.

And then it flows from one quarter to another to another. And your cost of maintenance is going to keep increasing. But surviving short-term is almost always cheaper than longer-term investment. And that cheaper option is more expensive in the long run, just to be clear. But right now, the amount of cash I need to put into fixing that old car is smaller than buying a new car.
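The repair-versus-replace arithmetic from the car analogy can be sketched in a few lines of Python. The $600 a month and the $5,000 replacement price come from the conversation; the $100 a month for maintaining the new car is an assumed figure added only for illustration, and the function names are hypothetical. The sketch also keeps the old car's monthly cost flat, which is kinder to the old car than the conversation suggests.

```python
from typing import Optional


def cumulative_cost(upfront: float, monthly: float, months: int) -> float:
    """Total spend after `months`, given an upfront payment and a flat monthly cost."""
    return upfront + monthly * months


def months_to_break_even(
    replace_cost: float, old_monthly: float, new_monthly: float
) -> Optional[float]:
    """Months until replacing becomes cheaper than repairing.

    Returns None when the replacement does not lower the monthly cost,
    in which case it never pays for itself on maintenance alone.
    """
    monthly_saving = old_monthly - new_monthly
    if monthly_saving <= 0:
        return None
    return replace_cost / monthly_saving


# Figures from the episode: $600/month repairs, $5,000 replacement.
# ASSUMPTION (not from the episode): the new car costs $100/month to maintain.
OLD_MONTHLY = 600.0
REPLACE_COST = 5000.0
NEW_MONTHLY_ASSUMED = 100.0

one_quarter_repair = cumulative_cost(0.0, OLD_MONTHLY, 3)
one_quarter_replace = cumulative_cost(REPLACE_COST, NEW_MONTHLY_ASSUMED, 3)
two_years_repair = cumulative_cost(0.0, OLD_MONTHLY, 24)
two_years_replace = cumulative_cost(REPLACE_COST, NEW_MONTHLY_ASSUMED, 24)
break_even = months_to_break_even(REPLACE_COST, OLD_MONTHLY, NEW_MONTHLY_ASSUMED)
```

Under these assumptions, one quarter of repairs costs 1,800 against 5,300 for replacing, so the quarterly view always favors repairing. Over two years the totals are 14,400 against 7,400, and the replacement pays for itself after 10 months.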

Because you're trying to buy the new car cash or at least near cash. Yeah, let's say like that, right? That's the investment I need to make. So then you say, so what should I do? If I'm focused on my well-being, my future, it's quite clear what I should do. I should find those 5K, 15K, whatever the price is and buy a car.

It's very clear. But if I'm thinking on the short-term benefit for me, if I'm only focused, if nothing exists beyond this quarter, the world will end. Then repairing the car makes more sense. And I feel that many companies are more focused on the latter case. Don't give me that five years, whatever. There is the quarter and we need to be successful in it.

And that's why we have old systems that are extremely expensive to maintain. And everybody, or most people, know that it's silly that it's not replaced by something. It already passed the threshold of "I can improve this", right? No, there is no improvement to this system. It costs a hell of a lot of money. And we all know that we should go with something else. Yet there is no will to do it.

Let's move on to the second point. The second law of software complexity is complexity is a moat filled by leaky abstractions. Let me read the first sentence. Designing a good abstraction is a delicate dance between providing utility to the application while hiding detail about the implementation. True, right? We need to do that. The problem is, I'm trying to find it here real fast. I'll just jump to the bottom of the paragraph.

Based on this law, most engineers will work on badly designed systems because most successful popular systems are badly designed systems. I don't agree with the last part about badly designed systems. You know, in retrospect, yes. Oh, if we would have the experience we have today with Kubernetes and applied it to whatever we were doing 10 years ago or 20 years ago, oh, it's so clear that what we had 20 years ago was a disastrous system. But you cannot apply that logic, because we did not know back then what we know now.

I must assume that at least some of the companies or projects are basing their APIs and systems and abstractions on prior experience. I'm not saying that everybody is doing that, but we are better at designing things today than we were in the past.

If you know that this is a badly designed system, then be my guest. Let's talk about whatever is the name of that person, what's bad about it, and we can improve it, right? Whichever system we're talking about. The example that he's giving here is, in fact, I'm just going to read the whole paragraph. When systems compete with each other for market share, delicacy goes out the window and designers often give the application everything at once. It goes back to that first point, right? We keep dumping things in.

This has a dual effect of increasing market share by attracting developers while simultaneously making it difficult for competing systems to substitute different implementations under the hood. The example he gives is the way ZooKeeper and Kafka do things. And it's like once you marry yourself into that system, it's not like you can just go swap out ZooKeeper for something else without a lot of work. Yeah, it's absolutely true. But then this is going back to the analogy of the car, right?
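The difference between an abstraction that hides its implementation and one that leaks it can be sketched as follows. This is a toy illustration, not the Kafka or ZooKeeper API: `LeaderElection`, both backends, and `start_broker` are hypothetical names made up for the example. The application code depends only on two methods, so the backend can be swapped without touching it. A leaky version would hand callers backend-specific details, and every caller that used them would then be married to that backend.

```python
from typing import Optional, Protocol, Set


class LeaderElection(Protocol):
    """Narrow abstraction: callers learn who leads, not how it is decided."""

    def campaign(self, candidate_id: str) -> bool: ...

    def leader(self) -> Optional[str]: ...


class FirstComeElection:
    """Toy backend: the first candidate to campaign stays leader."""

    def __init__(self) -> None:
        self._leader: Optional[str] = None

    def campaign(self, candidate_id: str) -> bool:
        if self._leader is None:
            self._leader = candidate_id
        return self._leader == candidate_id

    def leader(self) -> Optional[str]:
        return self._leader


class LowestIdElection:
    """Second toy backend with a different rule: the lowest id seen so far leads."""

    def __init__(self) -> None:
        self._candidates: Set[str] = set()

    def campaign(self, candidate_id: str) -> bool:
        self._candidates.add(candidate_id)
        return min(self._candidates) == candidate_id

    def leader(self) -> Optional[str]:
        return min(self._candidates) if self._candidates else None


def start_broker(broker_id: str, election: LeaderElection) -> str:
    """Application code: works unchanged with either backend."""
    role = "controller" if election.campaign(broker_id) else "follower"
    return f"{broker_id}:{role}"


first_come = FirstComeElection()
lowest_id = LowestIdElection()
roles_first_come = [start_broker(b, first_come) for b in ("b1", "b0")]
roles_lowest_id = [start_broker(b, lowest_id) for b in ("b1", "b0")]
```

The two backends disagree about who ends up leading, yet `start_broker` never needs to know. The cost of the narrow interface is that it offers the application less, which is the delicate trade-off between utility and hidden detail that the second law describes.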

There are many decisions in Kafka that do not make sense today. That is not to say that they didn't make sense back in the day when Kafka was designed. I mean, I remember those days. If you asked me back then, what would I use? ZooKeeper rules. Of course I'm going to use ZooKeeper. Now if you tell me ZooKeeper, I walk away. I don't want to be in that conversation anymore, right?

Now, the question for that person is: okay, so are you saying that the Kafka system that you're using right now is the equivalent of the old car, and you're not willing to buy a new one? It's going to cost, sure, but maybe the total cost is not going to be... I mean, the total cost might be paid off in a year or two, which is not such a long time. I'm not saying that everybody should replace Kafka, just to be 100% clear, right?

I'm just saying that you need to do that cost-benefit evaluation all the time. And not just once at budgeting time. Maybe it should be a quarterly exercise. Okay, we know that we have a ton of stuff that we shouldn't have. What's the pick for this quarter to replace? Make a pick.
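One possible shape for that quarterly exercise is to rank the replacement candidates by how quickly each would pay for itself and pick the fastest. This is a minimal sketch only: the candidate names and every number are invented for illustration, and payback period is one criterion among many a team might use.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """A system that might be replaced this quarter (all figures hypothetical)."""

    name: str
    monthly_cost_now: float    # current cost of keeping it alive
    replacement_cost: float    # one-off cost of replacing it
    monthly_cost_after: float  # expected cost once it is replaced

    def payback_months(self) -> Optional[float]:
        """Months until the replacement pays for itself, or None if it never does."""
        saving = self.monthly_cost_now - self.monthly_cost_after
        if saving <= 0:
            return None
        return self.replacement_cost / saving


def pick_for_quarter(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the shortest payback period, if any pays back."""
    viable: List[Tuple[float, Candidate]] = []
    for candidate in candidates:
        months = candidate.payback_months()
        if months is not None:
            viable.append((months, candidate))
    if not viable:
        return None
    return min(viable, key=lambda pair: pair[0])[1]


# Invented example data, not taken from the episode.
backlog = [
    Candidate("message-broker", 4000.0, 60000.0, 1500.0),
    Candidate("ci-server", 1200.0, 6000.0, 200.0),
    Candidate("report-cron", 100.0, 3000.0, 150.0),
]
pick = pick_for_quarter(backlog)
```

With these made-up numbers the pick is `ci-server`, which pays back in 6 months, ahead of `message-broker` at 24 months. `report-cron` is skipped because replacing it would not lower its monthly cost.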

There are plenty of candidates in every company. It's not only one. There are many. So kind of pick which one in this quarter should go out or, not out, be replaced. Well, I've been seeing that a lot in companies, where they are trying to get more efficient in managing cash. So that means getting rid of subscriptions. Heck, I can even apply that to myself. What subscriptions can I get rid of at the house? Netflix stays. Do the other ones go away? Because do I need them?

Yeah, exactly. They accumulated, at least in my house, they accumulated over time. And actually, it was maybe a month or two ago that I looked and kind of, okay, man, this makes no sense. I mean, I can afford it, but it makes zero sense that I have HBO, Prime, Disney. I have Netflix, I have Apple, and probably seven others that I haven't used in kind of two years. And yeah, I had to get rid of some of them.

Now, this is slightly easier than what we are talking about because, you know, getting rid of a subscription that you don't use or that doesn't provide much value is an easy exercise. Removing Kafka is not about removing Kafka, it's about finding a replacement and changing a bunch of things. It costs a lot, but the question is, should it be done? And the third and final law of software complexity: there is no fundamental upper limit on software complexity.

I'm going to read the first sentence. I don't think I'm going to have to read anything else beyond this. In real world systems that are built by large groups of people over time, complexity is limited only by human creativity. As long as that assumes that when we're talking about complexity, we're talking about something that accumulates over time, right? Not by designed complexity, right? Most of the time it's not designed complexity. It just becomes complex over time.

That's how anything works. You are getting a new house right now, right? And when you move into that house, it is going to be less complex than half a year later. And you put in all the furniture and forget to clean the rooms and whatnot, right? It's inevitable. That's how life works. I think you said it backwards. You said it was going to get less complex over time. Did I say less? No, I wanted to say more complex. Yeah, exactly. Yeah, it gets more complex.

Yeah. Is it a badly designed house? No. Was it a complex house? No. The act of moving in and living in it is going to increase its complexity. Guaranteed. The part I liked about his sentence, though, is complexity is limited only by human creativity. I think he was being polite. I'm thinking about new people coming to work on an application that they have never worked on before. Maybe this is the first application they have ever maintained, and they're only the 10th, 15th, 20th tranche of people to come through and maintain this application.

People will get creative over time. I mean, we'll use your 20 years. We'll go back to 2004. What were you writing in 2004 primarily? I'm ashamed of whatever I was doing 20 years ago. Okay. But there are applications that are still running from 20 years ago. Yeah, of course. And what are the chances that the people that were working on it in 2004 are the same people that are working on it in 2024?

Very few, right? We retire eventually. We move to other companies and stuff like that. But again, that shouldn't be a problem, right? The problem that we see in that example, at least from my perspective, is that it means that your senior people are not doing their job. I cannot blame juniors for being overly enthusiastic and wanting stuff. I mean, you want people like that.

You want people who are excited about the things they're working on, who are motivated, and they go there and they mess it up. But then it's you, kind of. You have a zillion years of experience, kind of. Like, it's your job to guide people, right? And, you know, it would be the same thing as going back to the bridge example. You don't put random people to do random stuff when constructing a bridge. You have somebody who is overseeing it, and that somebody is probably more senior and knows what he or she is doing and whatnot.

A good example is, I don't know the exact date, but it wasn't that long ago that there was a single pull request in Kubernetes that removed around one million lines of code. Do you understand what one million lines is? That number by itself is bigger than many people's systems, maybe not systems, but applications or platforms or whatever it is. The Kubernetes community removed around 1 million lines. That was code that was related to cloud providers, you know, the AWSs of the world. And it's all removed and the complexity was reduced. It's possible, but it's work that nobody wants to do. I mean, not nobody, but not many.

I wish I would have been an approver on that one. LGTM, million lines, gone. I would approve it without going through it. You wouldn't even have done that. You would have just said, yeah, just go for it. Go for it. I trust you. Once you have that level of delete, you had better trust whoever put that PR in.

It's been a long time coming, so that example is not something that happened overnight. Those same providers are working on how to provide the same functionality and services outside of Kubernetes as add-ons and whatnot. So it was a process, but it ended up... It's a process that took time, it took effort, money, everything, and it resulted in approximately one million lines of code removed. It's a conscious investment.

And I'm very impressed by that. Something like that. So how do you measure your software complexity? Head over to the Slack workspace, look for episode number 282, and leave your comments there. We hope this episode was helpful to you. If you want to discuss it or ask a question, please reach out to us. Our contact information and a link to the Slack workspace are at devopsparadox.com/contact.

If you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover this podcast. Go sign up right now at DevOpsParadox.com to receive an email whenever we drop the latest episode. Thank you for listening to DevOps Paradox.
