
2.5 Admins 235: XKCD221

2025/2/20

Chapters
Google found a way to run unofficial microcode on AMD CPUs, bypassing signature checks and potentially compromising security features like encrypted virtualization. This raises concerns about the security of AMD CPUs and the potential for unauthorized access to sensitive data. The discussion also touches upon the common practice among software developers of 'rolling their own' solutions instead of using established, well-tested libraries, which can lead to vulnerabilities.
  • Google bypassed signature checks on AMD CPUs using unofficial microcode.
  • This allowed them to introduce a bug that made RD-RAND always return 4.
  • This also compromises security features like encrypted virtualization and the root of trust.
  • Software developers often write their own cryptographic functions instead of using established libraries, leading to vulnerabilities.

Shownotes Transcript

Two and a half admins, episode 235. I'm Joe. I'm Jim. And I'm Allan. And here we are again. And before we get started, you've got a plug for us, Allan. ZFS orchestration tools, part one, snapshots. Yeah, so it turns out there's a lot of different tools for taking and managing snapshots on ZFS. And we had a little rundown through a whole bunch of them. As a prelude to a series where we'll also talk about the tools for managing replication. What, you mean there's other tools apart from Sanoid and Syncoid? Yeah.

There are a lot, and we'll tell you why you probably don't want to use most of them. To be clear, you know, as the Sanoid guy over here, yeah, there are a lot. I personally do genuinely believe Sanoid and Syncoid are the best tools out there. However, I'm not going to say anything bad about the rest of them because I think it's important to have diversity.

You know, I've been preaching that for decades and I'm not going to stop preaching it just because we're talking about competition to my own project. Diversity is good. Give everything a try. Yeah. So we compare them based on what language it's going to use, because maybe it depends what's available on your machines. Like if you don't want to install yet another scripting language to run it, you know, if you have Python and not Perl, then maybe you want to use one of these other tools. And then license, you know, do you want something BSD or Apache licensed or do you definitely want GPL? And you can do...

whatever picking you want there. And we talked a bit about use case. If your use case is backing up a laptop, some of these tools are geared more towards that than ones that are meant for managing VM infrastructure. Right, well, link in the show notes as usual. How to make any AMD Zen CPU always generate 4 from RD-RAND. This one hit kind of near and dear to me because when I set up my first Ryzen 7 2700 system,

Unbeknownst to me, at that time, there was a microcode bug that made the onboard RD-RAND instruction always return the most artisanal random number possible. I think it actually was four for that matter. No matter how many times you asked for a random number, you got the same one every single time.

Most folks didn't notice that because most applications will ask the kernel to give them a random number, and the Linux kernel will basically source its randomness from a whole bunch of different stuff and glom it all together. So losing some entropy by having, you know, one bogus source of randomness, in this case the CPU's RD-RAND instruction,

Not a big deal. It lowers the entropy a little bit in the pseudo-random data that you're getting back from the kernel, but you don't just get four, four, four. And don't worry, all these fours are random, but they're all fours.
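To make that concrete, here's a rough Python sketch of why pooling sources saves you. This is only an illustration of the idea, not the actual Linux kernel algorithm: if you hash several inputs together, the output stays unpredictable as long as at least one input is, even if another input only ever returns 4.

```python
import hashlib
import os
import time

def mix_sources(*sources: bytes) -> bytes:
    # Fold several candidate entropy sources into one output by hashing them
    # together; roughly the idea behind an entropy pool, heavily simplified.
    h = hashlib.sha256()
    for s in sources:
        h.update(s)
    return h.digest()

# One source is completely broken: it only ever "returns 4".
broken_rdrand = (4).to_bytes(8, "little")

# As long as at least one other input is unpredictable, the mixed output is too.
seed = mix_sources(
    broken_rdrand,
    os.urandom(32),                          # OS-provided randomness
    time.time_ns().to_bytes(8, "little"),    # timing jitter: low quality, but not constant
)
print(seed.hex())
```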

This bug, which hit at the launch of the Ryzen 7 2700 and in fact all of the 2000-series Ryzens, was eventually patched with microcode. The way that works is your motherboard vendor has to accept the microcode from AMD and then issue a BIOS update for that board, so that every time you boot, the board reads the update from its BIOS, sees that it should tell the CPU to patch itself with this stuff, and the CPU becomes ephemerally patched until the next time it's rebooted, at which point the microcode needs to be loaded again. It's never a permanent fix.
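If you're curious which microcode revision your CPU is actually running right now, on Linux one place it shows up is /proc/cpuinfo; a minimal sketch (x86-specific):

```python
# Print the microcode revision the kernel reports for the first logical CPU.
# On x86 Linux, /proc/cpuinfo carries a "microcode" field per CPU; whatever
# revision the BIOS or early loader applied this boot shows up here.
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("microcode"):
            print(line.strip())   # e.g. "microcode : 0x830107c" (value will vary)
            break                 # drop the break to see every logical CPU
```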

Now, what the Google folks are basically doing is the other way around. Rather than using microcode to fix a bug, they're using microcode to introduce one. You shouldn't be able to do that because the updates are supposed to be signed and nobody but the original vendor should have the proper keys to sign an update in a way that the CPU will accept it, but Google worked around that. Yeah, so basically this is breaking the

signature checking that the CPU is supposed to do to make sure the microcode it runs only comes from AMD, or Intel in the other case. But in particular, why this matters more is not just that Google found a way to make RD-RAND always return four. It's that this also means Google is able to smash all the protection provided by EPYC's

encrypted virtualization, where each VM uses different encryption keys so that one VM can't possibly read the RAM from another VM and so on, like we saw happening sometimes with other CPUs in the past. And its whole root-of-trust security feature, where you're trusting the CPU to enforce that only signed BIOSes are actually running on that CPU.

Yeah.

Yeah, that's true. My original issue with the real, from-AMD RD-RAND bug was basically that it broke WireGuard, because at the time, WireGuard used RD-RAND, not randomness sourced from the kernel and layered with all the other primitives, but RD-RAND directly, as a source of low-entropy pseudo-randomness for serial numbers for tokens that it used. And this wasn't considered a big deal, that it was lower-quality randomness, because it was very, very fast, and competing technologies

just used steadily incrementing serial numbers. So any amount of randomness was seen as a security win. But in this case, because it just came back four every time, well, you can't have all your tokens just have the same serial number. They actually need to change. So that would cause WireGuard to lock up. And that problem has since been resolved. And like Allan said,

The RD-RAND thing got my attention because I got hit with that bug when it was for real. But yeah, basically, if you can load microcode patches into a CPU, there's no end to the possibilities of the shenanigans you can get up to, because you can rewrite how the entire CPU functions, pretty much.

Doing away with the SEV stuff that Allan was talking about. Now, this particularly, this is one of the things where I'm glad that Google actually chose once more not to do evil, because as a large cloud provider themselves, Google could theoretically have sat on that, have broken into their own CPUs that they use for Google Cloud.

And just quietly had the option to look in on their customers', you know, supposedly encrypted data and operations within these encrypted enclaves. Because the way SEV is supposed to work is it allows you to run a trusted workload at an untrusted third-party provider, because although the untrusted provider owns the hardware and has direct physical access to it and all this stuff,

everything is running with a key that that host doesn't have access to. So they can't actually see what's going on inside your VM.

This is a very important safeguard for people who have massively valuable workloads that they don't want to run on their own hardware that they keep under their own physical lock and key, with their own, like, roaming physical security and the whole nine. It allows them to outsource this and trust that it will still be as safe as though they had built out all of this infrastructure, including real physical security and the whole nine, themselves. Right.

So thanks, Google, for reporting that one. I'm glad to see that you did that. Maybe do a lot more of this kind of thing and put do no evil back up on the walls at corporate headquarters. I'd like that. So in particular, this affects Zen 1 all the way up to Zen 4 CPUs. And apparently the problem is that the CPU uses an insecure hash function for signature validation. Why it's not using...

SHA-256 or something, especially one of the ones the CPU has offload support for? I'm not sure. I mean, I'm pretty sure: have you met developers? A software developer, generically speaking, is very knowledgeable in a particular area. And they tend to think that that knowledge in that area means they know how to do all the things and they can just roll their own everything from scratch.
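For contrast, the "use the battle-tested library" approach is only a few lines. This is just an illustration of leaning on vetted primitives rather than a homegrown hash; it's digest checking, not the public-key signature verification microcode actually needs, which you would also hand to an established crypto library rather than write yourself.

```python
import hashlib
import hmac

def digest_matches(blob: bytes, expected_sha256_hex: str) -> bool:
    # Hash with a vetted primitive from the standard library...
    actual = hashlib.sha256(blob).hexdigest()
    # ...and compare with a constant-time comparison instead of ==.
    return hmac.compare_digest(actual, expected_sha256_hex)

payload = b"some firmware-ish payload"
known_good = hashlib.sha256(payload).hexdigest()
print(digest_matches(payload, known_good))                 # True
print(digest_matches(payload + b"tampered", known_good))   # False
```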

To be fair to software developers, most of them come from computer science academic backgrounds where they are taught to do everything from scratch every time, because a computer science degree is not about making you good at computer. It's about making you a mathematician who hopefully will come up with a new algorithm that changes the world, which 99.99% of computer science graduates will never do, even the tiniest bit. But they do get that deeply ingrained habit of

You don't use the thing that's been in the public library for 30 years with thousands of hours of battle testing. No, you write your own quicksort algorithm from scratch, because you're a dev and that's what devs do. Well, when you approach system administration that way, you have terrible results. And it turns out when you approach cryptography that way, you have terrible results.

If you do not have serious domain-specific expertise in crypto security, you should not be trying to do that yourself. You should be talking very deeply with people who absolutely have that very specific domain expertise.

That was a very long way of saying, don't roll your own crypto. It was, but the short ways aren't getting people's attention, damn it. In particular, I'm sure there's a joke in there about how hardware people don't trust software people and software people don't trust hardware people, but both of them hate firmware people. CVEs for end of life?

This is a very colorfully written piece by Josh Bressers. I did not expect to be as entertained by an article about the idea of having CVEs for end-of-life software as I wound up being for this one.

The basic idea here is that, and you may not be aware of this, but, you know, CVEs, Common Vulnerabilities and Exposures, are not normally generated for software that is end-of-life. Once software is end-of-life, you just stop tracking anything about it, and you assume that nobody with any sense is using it. Unfortunately, that's a terrible assumption. Either it's a terrible assumption because lots of relatively sensible people are still using end-of-life software, or it's a terrible assumption because there are no sensible people. Everybody's using EOL software.

Either way, it makes a lot of sense to still keep some idea of, like, how vulnerable is this software that is end of life. The folks who run the actual CVE project have a very hardcore stance on this: no, they will not track vulnerabilities in EOL software.

Because their database of essentially all the vulnerabilities they know about for all the currently active software known to man is already so gargantuan, they have no interest in growing it any further than they have to. So they feel comfortable just saying, don't use EOL software if you're worried about vulnerabilities. This article obviously makes the case that, you know, we should have some tracking of that, while acknowledging the issue of, well...

Scope creep isn't the right word. It's closer to scope explosion. But despite that, you know, one way or another, we do need to think about this. But again, the real takeaway I want everybody to have is please hit the show notes and read the article because it's hilarious.

You get to find out which groups of people are sitting around with no metaphorical pants on at the convention, which of them are drinking coffee, which of them are burnouts with whiskey bottles. You don't know if they're wearing pants or not. And it all makes sense. Yeah, when I first just read the headline of this, I was like, I think it probably does make sense to just do a CVE at the end of life of a piece of software, saying every version of this software is now vulnerable by the fact that it's not maintained anymore.

I know so many companies that only care about fixing something because their CVE scanner says so. And so if the software goes end of life and it never gets a CVE again, that doesn't set off an alarm. But if there were just automatically a CVE saying that, you know, Samba 4.19 goes EOL on March 31st of 2025, meaning that every version of Samba 4.19 and older is end of life at that date: if you have that installed, that's a vulnerability.
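As a rough sketch of what that kind of automated "EOL equals vulnerable" check could look like; the advisory format here is invented for illustration, it's not a real CVE feed:

```python
from datetime import date

# Hypothetical EOL advisories, keyed by package; the schema is made up for this sketch.
EOL_ADVISORIES = {
    "samba": [
        {"branch": "4.19", "eol": date(2025, 3, 31), "fixed_in": "4.20"},
    ],
}

def parse_version(v):
    return tuple(int(part) for part in v.split("."))

def is_eol(package, version, today=None):
    # Flag an installed version as "vulnerable by virtue of being end of life".
    today = today or date.today()
    for adv in EOL_ADVISORIES.get(package, []):
        branch = parse_version(adv["branch"])
        # "4.19 and older": compare only as many components as the branch specifies.
        if parse_version(version)[: len(branch)] <= branch and today >= adv["eol"]:
            return True
    return False

print(is_eol("samba", "4.19.3", today=date(2025, 4, 1)))  # True: 4.19 and older are EOL
print(is_eol("samba", "4.20.1", today=date(2025, 4, 1)))  # False: still a supported branch
```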

The fix is to upgrade to a supported version like 4.20 or 4.22, et cetera. I do agree with this article that that could cause some problems, in just the number of CVEs that come out every day. There's a lot of software out there, but shouldn't we design the CVE system to scale to actually cover every time there's a risk? It'd be nice. You know, I've got to be honest here. I don't know that I have the best handle on what scale that would require, or

on how you would address this massively increased size of what's already a gargantuan database. Do you need to shard it? Do you need bigger clusters? You know, how does this work? I don't know. But I do think that it's just pants on head silly to act like nobody's out there running EOL software and we don't need to worry about what is or isn't vulnerable in EOL software. This also makes sense because I hate saying this because

People are going to take the wrong message from it. But software isn't necessarily packed full of vulnerabilities just because it's EOL or abandoned, either. The odds are real good that it is, but for example, if something like Sanoid were to become unsupported, there aren't a lot of avenues in that type of project that lead to critical vulnerabilities to begin with. There's not a whole lot of, you know, user input to be parsed or mishandled.

There's just not a lot of ways into that system to accomplish any kind of a goal that you'd want to in breaking into something.

Whereas on the other hand, something like, say, a user shell? Like, oh, my God, it's nothing but one gigantic vulnerability. Like, yeah, you absolutely want to know, how bad is it if I'm running an older version of Bash on an older distribution? What does this mean in terms of an attacker who gets a shell just being able to instantly become root because of a thousand unaddressed privilege escalation vulnerabilities or what have you? These are really good things to know.

There probably is, even there, some nuance between "we've released a newer version of this software and the old version has reached its EOL" and "there's nobody maintaining this software anymore." Like, to your point with Sanoid, it's one thing to say, you know, any version of Sanoid before 2023 has got a problem and you shouldn't use it, versus I'm quitting Sanoid, there are no new versions of Sanoid, you should just stop using it, period. But, yeah, it does feel like...

oftentimes just trying to tell when the EOL for a certain piece of software is or was is a challenge. And the CVE database was designed to solve a very related problem. And it seems like most places already have a way of consuming that data and comparing it against their catalog of what software they're using and finding anything that matches. They have this whole scheme so that

when a vulnerability comes out, it has this computer-encoded way of telling, oh, is this software with a version between this and that, and of understanding the complex version numbers to know what things are vulnerable and what aren't. And it seems like the best way to maintain a database of just what is EOL versus what is not.

Because if you've ever tried to go and figure out when a certain version of software will go EOL, it can be very difficult, especially with projects like, I think, Perl, where the answer is that a version will go EOL X amount of time after the next version comes out. So short of going back to that website and refreshing it every once in a while to find out when that happens, that, OK, it will go EOL four months from two weeks ago, you'd really not have a good way to tell.

And that's why it feels like the CVE database is a good way to handle this. Ultimately, however, this article really goes into more detail about the scope of this problem and how you might address it than we're going to have time to cover on the show. And it is written incredibly entertainingly. So to close out, and to motivate y'all to actually hit the show notes and go visit this article at opensourcesecurity.io, I'm going to read you one paragraph from it on vulnerability scanners.

The vulnerability scanner folks are near the door. Instead of their table being littered with old fast food wrappers, these folks have a disturbing number of empty whiskey bottles and energy drink cans strewn about. When you ask them about end-of-life CVEs, they just look at you and mumble something about Log4Shell and how nothing is real. Some of them seem to think this is a great idea. Some think it's a terrible idea. Either way, they have to deal with all this data, in CVE or not. They're all sitting at tables, and it's not obvious if they have pants on. You're too afraid to ask.

Okay, this episode is sponsored by people who support us with PayPal and Patreon. Go to 2.5admins.com slash support for details of how you can support us too. Patreon supporters have the option to listen to episodes without ads like this. And it's not just this show. There's Late Night Linux for news, discoveries, audience input, and misanthropy. Linux Matters for upbeat family-friendly adventures. Linux After Dark for silly challenges and philosophical debates.

Linux Dev Time about developing with and for Linux, Hybrid Cloud Show for everything public and private cloud, and Ask the Host for off-topic questions from you. You can even get some episodes a bit early. We've got a lot going on, and it's only possible because of the people who support us. So if you like what we do and can afford it, it would be great if you could support us too at 2.5admins.com slash support.

There's a couple of AI stories that you were keen to talk about, Jim. Yes, and that you basically thought were bollocks. That's not really fair. Joe definitely has an end to his tolerance for talking about AI, and I get that. I think a lot of us have that same limit for how much AI nonsense we want to hear about because it's so overhyped and it's everywhere. But I don't think it's something that we can safely ignore.

And there are two articles making the rounds right now that aren't related on the surface, but I think have some pretty disturbing connotations when you look at them both as part of a growing trend. And the first one: OpenAI says that its models are more persuasive than 82% of Reddit users. And the way that it backs this up is it basically turned a model loose on Reddit's subreddit, Change My Mind.

If you're not familiar, this is a subreddit where you go in and, you know, it's basically a devil's advocate type of thing. You go in and you say, you know, here is my belief on this controversial topic. Change my mind. And you just open the field to people trying to tell you that you're wrong about the thing. And if somebody can convince you that you were wrong, then that person who convinced you wins a point.

This is essentially just kind of an interesting way to hash out ideas, and if you are interested in arguing, it's a good way to hone your own debate skills. Not just in the sense of, can I win points, but can I actually influence people?

On the one hand, this seems like a non-story to a lot of folks, Joe included, because 82% of Redditors are, sorry guys, complete idiots. You know, being able to beat them doesn't seem to mean much. Okay, you're more convincing than the bottom 80% of, like, randos on Reddit? That's not real impressive. And it shouldn't be, in a human sense.

However, the idea that a model can get unleashed into a communication forum like this and perform as well as the top 20% of humans in that same, again, completely random, anybody can come in, but it's performing on a level with, you know, the upper 20% of humans at changing people's minds, however infrequently, what that tells you is that

Modern AI models are actually effective as compared to normal humans as propaganda tools. You can give an AI model a general mission on a particular point of data and just kind of unleash it to argue with people. And it's at least as good as an actual human that you paid to do this kind of thing.

Now, does that mean it's as good as the absolute best, like, you know, an expert in propaganda manipulation that might be employed at the top levels of some espionage agency or, you know, I don't know, Archer, if you like cartoons? No, absolutely not. But that's not really how modern propaganda works. It's not an Olympic .22-caliber pistol. It's a freaking burp gun, you know, sprayed randomly out of a helicopter. Right.

And when you say, OK, we can take an AI model and we can have an actual positive real world effect at changing people's minds at scale, even if it's only, well, you know, the duller people, the people who aren't thinking it all the way through. Well, those folks generally have a vote that counts just as much as anybody else's. So that's kind of scary. It's a trend that's worth keeping an eye on.

Now we tie it to the second thing. Those of you who are regular listeners may remember when a while back I opined that we weren't very far away from seeing AI malware that was capable of replicating itself and looking for, you know, exploits to allow it to carve out more space to infect other systems, you know, other cloud providers that were set up to run AI models to begin with. I opined that we probably weren't going to be too far away from seeing those things, even if they weren't very practical at first.

This other article that we're referencing today is the first step and arguably the tougher step in that process. AI models are now capable of replicating themselves. So you put these two stories together and we know that large language models are capable of influencing humans who don't particularly want to be influenced on a topic that they care about.

And they can self-replicate. I'm not trying to chicken little on everybody here. I'm not saying the sky is falling and society is doomed tomorrow. Please don't hear that. But this is something that we need to keep an eye on. Yeah, like my first thought, especially with the first story, was a while back when we were talking about some university where the researchers had gone and purposely introduced vulnerabilities into open source software to see if they would get caught.

And there was a big ethics discussion about this and their project got shut down because it turns out most universities don't like it when you do research on unsuspecting humans. Humans have to volunteer to be researched upon. In this case, OpenAI seems to be researching on people on Reddit without telling them that they're being researched on by AI. Well, you know, Spez isn't going to have a problem with that. Yeah, but there's a whole ethics conversation we should be having about that general idea. But

But yes, especially when you combine that with the replication, it's like, okay, so the AI is now going to spread itself and then convince people that the AI is okay and then spread itself some more. Then convince people that that's okay. It can be a whole problem. Or it can start deciding to convince people of terrible things. We saw how quickly when Microsoft introduced their

AI chatbot, the old one back in the day on Twitter. And within, you know, half a day, it was being a Nazi. Both of them. Microsoft did that twice. It turned into a Nazi within 24 hours, both times. Yeah. So do we want self-replicating AIs that can fall into that same trap? That doesn't seem like a good idea. Yeah. And, you know, when you consider the kind of unethical actor that might choose to use a tool like this, that might want to, that might say, hey,

I want to influence the most gullible out of a population of several hundred million people. And the cheapest and most effective way for me to do that is to build a model that will self-replicate and spread itself and talk to as many of these people as possible using resources I don't even own. Well, that means it's now out of that actor's control really quickly. And it means that not only is this actor capable of –

Potentially even a single person might be capable of putting on a propaganda operation that would put traditional, like, Kremlin-style operations to shame,

but they wouldn't be able to pull it back once it was done. Now, this is not an unknown problem. Agencies that use propaganda at massive scale have faced this problem with normal humans because once you get propaganda to spread in a target population, it's hard to undo it. You can stop adding the propaganda, but the people you already convinced are going to keep pushing that message in some cases even after you really, really wish they'd quit.

So this is already a problem with propaganda, but it gets that much worse when not only are you worried about random humans that you've converted into essentially a cult of propaganda continuing to spread a message that you really wish you could pull back. You've still got AIs out there that are pushing this message at massive scale.

And what happens when, you know, this propaganda campaign's remnant AIs are still spreading that message during the next propaganda campaign, and the next one? I mean, eventually we wind up with a sea of rogue AIs all trying to influence humans at cross purposes, and nobody's got control of any of the damn things. You're almost there, Jim. A sea of AIs trying to convince each other.

Dead internet theory. That's what you're talking about right there. Well, sure. They'll try to convince each other as well. Now, that also brings to another thing. People are already using AI models

as a force multiplier, basically, to make it easier to do prompt engineering for the large model that you want to do the actual generation that you're interested in, whether that's image-based or text. If you use a lot of the newer, even the free public tools out there, like there's one called Leonardo.ai. It's an image generation tool. There's nothing that special about it.

It makes AI nonsense. Sometimes that's fun. Sometimes that's funny. It always burns down rainforest, blah, blah, blah. We know the pros and the cons. But the point that I'm getting at here is if you use Leonardo.ai and you say something like, for example, five screaming possums tightly clustered,

what you see happen is the actual prompt below that gets expanded to like a whole paragraph about like angry feral possums in a tight cluster snarling and hissing ferociously, blah, blah, blah. And that's what gets fed to the actual image generation model. So yeah, you are absolutely gonna start seeing one large language model trying to prompt engineer another one to bypass its safeguards, try to change what it's doing, the whole nine.

It's going to get ugly. What we're really talking about here, the end game of this, it's not even dead internet theory so much as it is the September that never ended. Yeah, but really, unleashing AI in this way reminds me of the old nursery rhyme about the lady who swallowed a fly. And it's, she swallowed a bird to get the fly, and then a cat to get the bird, and it's on and on. But it really just did seem like...

This sounds like a great idea with the best of intentions. What could possibly go wrong? And one finger curls on the monkey's paw. And, you know, just to kind of bring it back home, why I'm referencing the September that never ended and not simply dead Internet theory. I am absolutely not trying to argue that modern AI technology or any AI technology that I expect to see within the next 10 years is even close to human level.

But when we say that, it's kind of like the difference between, like, hard drive vendors' benchmarks versus my benchmarks for storage systems, which are very real-world and are actually looking for pain points rather than just trying to pass. And when you compare even ChatGPT from, like, two years ago to some random mouth breather that just got on the Internet for the first time in 1994...

Yeah, we've already well bypassed, you know, the apparent intelligence of the dumbest people showing up on the Internet. Old ChatGPT already looks, even though it's not, it looks smarter than most normal people you find on the Internet. So this is dangerous. This is a problem. This is not just going to be an issue because it's stupid bots constantly yammering at each other. It's also an issue because most people won't know what is a bot and what isn't.

Let's do some free consulting then. But first, just a quick thank you to everyone who supports us with PayPal and Patreon. We really do appreciate that. And if you want to send in your questions for Jim and Allan or your feedback, you can email show at 2.5admins.com. Tyler writes, how would you go about managing SSH keys for a team on a large group of machines, 100 plus? I know that setting up an SSH certificate authority is one way to tackle the problem. And another might be to manage the keys with automation tools like Ansible, Puppet, Chef, etc.,

But what way would the hosts go about it, as managing a team of keys on 100+ machines by hand would not be a reasonable task when a key needs to be added or revoked immediately? Yeah, so like you said, SSH certificate authority is an option, or you can use something like Puppet or Chef or whatever to push out the keys. One other way is OpenSSH has a config setting where it can basically run a command and use that to get the key for a user.
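The OpenSSH knob being described here is the AuthorizedKeysCommand directive in sshd_config (paired with AuthorizedKeysCommandUser): sshd runs your command with the username and treats every line it prints as an authorized_keys entry. Here's a minimal sketch of such a helper pulling keys from LDAP; the DNs, server name, and the sshPublicKey attribute (openssh-lpk style) are placeholders and assumptions, and it assumes the ldap3 Python library.

```python
#!/usr/bin/env python3
# Sketch of an AuthorizedKeysCommand helper.
#
# sshd_config (illustrative):
#   AuthorizedKeysCommand /usr/local/sbin/ldap-ssh-keys %u
#   AuthorizedKeysCommandUser nobody
import sys
from ldap3 import Server, Connection
from ldap3.utils.conv import escape_filter_chars

BASE_DN = "ou=people,dc=example,dc=org"             # placeholder
BIND_DN = "cn=sshd,ou=services,dc=example,dc=org"   # placeholder read-only bind account
BIND_PW = "changeme"                                # placeholder; use a secrets store in practice

def main() -> int:
    user = escape_filter_chars(sys.argv[1])         # avoid LDAP filter injection
    conn = Connection(Server("ldaps://ldap.example.org"), BIND_DN, BIND_PW, auto_bind=True)
    conn.search(BASE_DN, f"(&(objectClass=posixAccount)(uid={user}))",
                attributes=["sshPublicKey"])
    for entry in conn.entries:
        for key in entry.sshPublicKey.values:       # one "ssh-ed25519 AAAA..." line per key
            print(key)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

By default sshd still consults the usual authorized_keys files as well, so a helper like this adds a source of keys rather than replacing one, and disabling the account in the directory immediately stops new logins everywhere the helper is consulted.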

And so at the FreeBSD cluster, where we have 250-plus developers that all have to have SSH keys, and we have to decide who has access to what host based on roles and so on, we use LDAP. And then the integration with SSH basically runs this command saying, hey, ask LDAP, what SSH keys should I accept for user Allan? And it will go into LDAP and pull out the list of keys from the directory service that has all of our

team members, and we'll decide which machines you're allowed to have that authorized keys file on, basically. And that allows us to decide all of Allan's keys work on any of these boxes, but not those boxes. And generally, if you have a very large team, you're going to have some kind of directory already, probably LDAP or LDAP-compatible.

Basically, you can have people put their list of public keys, what you would normally put in an authorized keys file, per user, into that LDAP database, and be able to pull it out that way. And then, that way, as soon as you deactivate their account for all the other stuff that uses your single sign-on, it also immediately takes effect for SSH. And it's not some separate step like with Puppet, where, you know, just because somebody got fired doesn't mean they told you to go and delete that person's key from Puppet.

Whereas with LDAP, when they get disabled in one place, it'll affect everything. I think we probably should also mention that if you're in an environment with a whole lot of Windows infrastructure, you might be tempted to manage certificates with Active Directory. That is a thing that you can do. It's a lot of work, and I wouldn't necessarily recommend doing it that way. But if you're a real big AD shop and that's what you want to do, yeah, it's possible.

Another option, if you're in a heavy Windows environment, and I'm actually going to pitch a proprietary tool that I've used at a couple of enterprise clients before, something called JumpCloud. It's not something that I would want to budget at most of my clients, to be honest. But once you're in an enterprise environment and the tolerance for like monthly expenditure starts going up, JumpCloud can be extremely useful in helping you manage large numbers of machines at scale,

policies of, like, what they are or aren't allowed to do, you know, ways to remote control them to offer support. JumpCloud can offer all this kind of stuff. And it also has a facility for managing SSH keys specifically, which is a whole lot easier than diving directly into raw Active Directory and trying to find a way to get SSH to use Active Directory's LDAP-compatible interface, yada, yada, yada, in the way that Allan was just talking about the FreeBSD team using, like, raw LDAP.

Right, well, we'd better get out of here then. Remember, show at 2.5admins.com if you want to send any questions or feedback. You can find me at joelrest.com slash mastodon. You can find me at mercenariesysadmin.com. And I'm at Allan Jude. We'll see you next week.