Two and a half admins episode 232. I'm Joe. I'm Jim. And I'm Alan. And here we are again. Before we get started, you've got a plug for a webinar, Alan. Yes, on February 13th at 1pm Eastern, myself and Colin Percival, the founder of Tarsnap, the backup service, will be hosting a webinar, RAID is not a backup and other hard truths about disaster recovery.
So if you've ever wondered how to make sure you're doing your backups right and some of the pitfalls that we've seen other people fall into, you should definitely tune in and join the webinar. Right. Well, link in the show notes as usual.
MasterCard DNS error went unnoticed for years. Yeah, this one was pretty amusing. A relatively innocent typo. So when they were setting up the DNS delegation for one of their subdomains that was going to use Akamai, which is one of the big CDNs, they basically delegated all the DNS for az.mastercard.com to Akamai.
a set of servers at Akamai. But when creating the five NS records that would delegate that zone, one of them they typoed and put, you know, the server name dot a-k-a-m dot n-e. It was missing the T of the dot-net part. And so that's been there like that for a while, and it basically meant one in five DNS requests would go to a name that didn't resolve, then retry against one of the other four, resolve, and everything was fine. Until somebody finally noticed.
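A minimal sketch of the kind of delegation check that would have caught this, assuming the dnspython library; the zone name here is just a placeholder:

```python
# Sketch: confirm every NS record for a delegated zone actually resolves to an
# address, so a typo like "akam.ne" instead of "akam.net" stands out.
# Assumes dnspython (pip install dnspython); the zone is a placeholder.
import dns.resolver

ZONE = "az.example.com"  # placeholder for the delegated subdomain

def check_delegation(zone: str) -> None:
    for ns in dns.resolver.resolve(zone, "NS"):
        host = str(ns.target).rstrip(".")
        try:
            addrs = [a.to_text() for a in dns.resolver.resolve(host, "A")]
            print(f"OK   {host} -> {addrs}")
        except Exception as exc:  # NXDOMAIN, timeouts, etc.
            print(f"FAIL {host}: {exc}")

if __name__ == "__main__":
    check_delegation(ZONE)
```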
And a security researcher realized that for about $300 and a bunch of hassle, you can actually register a .ne domain. And .ne is the country code TLD for Niger. Nebraska. Yes, exactly. And so this typo meant that suddenly a fifth of the DNS lookups trying to go to az.mastercard.com could go to the security researcher instead.
Except the researcher quickly noticed that MasterCard wasn't the only one who had made this typo before, and that's just for Akamai, so for akam.net. I'm sure if you ran the .ne TLD, you'd probably get more traffic from people typoing .net than from people actually trying to go to your country. Which I can assume is a bit annoying at times. But what this means is, because there's no DNSSEC on that domain...
the researcher could get one fifth of the people trying to go to those websites to go to his website instead. Now, he might even be able to get a Let's Encrypt certificate for that, if he just retries enough times that the validation lookup resolves to his name server. Well, he'd probably have to try about five times, wouldn't he? Well, maybe 10 to make sure you got one or two of them to go through. And I don't know how
picky Let's Encrypt is about that, and whether MasterCard does any certificate pinning to make sure that only certificates from the right CA work on their website. But yeah, the amount of damage you could do with this is pretty bad for basically anybody who makes a typo like this in their DNS. And this is not just limited to MasterCard and things with Akamai and what have you. Honestly, even without the infosec issue here, which is major,
You're talking about leaving a loophole that could potentially give an attacker access to millions of credit card transactions.
That's huge. But even without that, this is already pissing me off because it's forcing a failure and retry for roughly 20% of all the traffic going to the service. That's not only wasting bandwidth. It's not only, you know, wasting all sorts of resources. It's also introducing latency into the actual processing time when essentially one out of five tries fails.
Hopefully your client devices will eventually remember, you know, I can't reach that non-existent server or whatever, and you'll be okay as long as, again, nobody's actually registering it and it's not actually going through. But this still represents a huge burr in your pants as far as this goes. You know, it's a major problem. And it's not an uncommon one. Even if you're not doing something like this, if you are just
a lone IT person who has found yourself getting pulled into a larger team with an existing Active Directory infrastructure, man, check your Active Directory DNS. I can't tell you how many times I've been brought into a new enterprise-sized or, you know, larger mid-market-sized client
and found that their Active Directory has like eight different name servers listed, all in different sites. And at any given site, three or four of those freaking name servers are unreachable, because that site doesn't have a VPN to the site where the other one is, or there aren't permissions for this VLAN to access the other VLAN where that one was. But because there are a couple that go through,
eventually your lookups will work, even if it had to fail accessing three out of eight servers before it got to one that functioned.
So you wind up with a whole organization of people who are cranky and bitter and angry about how weird the computers are and how sometimes everything works the first time and sometimes it just hangs forever before it works. It's all because nobody bothered to sit down and say, hey, can I actually reach all of these servers that I'm telling everybody in my organization are valid DNS servers that they should use?
Shouldn't monitoring have caught that? Yes, this exposes a deficiency in their monitoring in that they weren't checking that every one of the listed name servers actually answered queries. A deficiency in their what now? Monitoring. Their what? What's the thing that you think they were doing? Well, yes, in this case, maybe they weren't doing any monitoring at all, most likely. Similar to how you have to make sure you're monitoring that your domain isn't about to expire and that your SSL certificate isn't about to expire, you do want to make sure that every one of the listed name servers is willing to answer.
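A rough sketch of what that check could look like: query each listed name server directly and confirm it answers authoritatively for the zone, rather than just trusting that the NS records exist. Again this assumes dnspython, and the zone and record names are placeholders:

```python
# Sketch: ask each listed name server directly and confirm it answers
# authoritatively for the zone, so a retired or misconfigured server stands out.
# Assumes dnspython; zone/record names are placeholders.
import dns.flags
import dns.message
import dns.query
import dns.resolver

ZONE = "example.com"        # placeholder zone
PROBE_NAME = "example.com"  # a name the zone should answer for

def audit_nameservers(zone: str) -> None:
    for ns in dns.resolver.resolve(zone, "NS"):
        ns_host = str(ns.target).rstrip(".")
        try:
            ns_addr = dns.resolver.resolve(ns_host, "A")[0].to_text()
            query = dns.message.make_query(PROBE_NAME, "SOA")
            reply = dns.query.udp(query, ns_addr, timeout=3)
            authoritative = bool(reply.flags & dns.flags.AA)
            print(f"{ns_host} ({ns_addr}): answered, AA={authoritative}")
        except Exception as exc:
            print(f"{ns_host}: FAILED ({exc})")

if __name__ == "__main__":
    audit_nameservers(ZONE)
```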
As Jim alluded to with the Windows organization, sometimes it's just like, that server's been retired for three years. Why are people still pointing at it? And what can happen that's worse is something else comes up at the IP address someday and does have a DNS server, but is not for the same zones and so won't authoritatively answer and will say, yeah, there's no such thing or something and cause even worse problems. Now, see, I thought this was going to be a really short segment, Alan, but then you had to go and bring up the whole monitoring thing and...
Now you've got me getting my teeth into this, and I want to push back on that as well, because another issue that I see a lot in larger organizations is even when they do have active monitoring, the active monitoring is taking place from a trusted VLAN that gets to touch everything.
It is going to have absolutely no bloody idea when this VLAN over here, the unprivileged one that only, you know, the dogsbody rank-and-file employees have access to, can't get to half of the servers it's trying to reach, because all it sees is, oh, hey, I'm monitoring all my DNS servers. And you know what? They all respond. I'm in perfect health. Yeah, this is where something like the RIPE Atlas probes, or I think there's some other device we talked about on the show in the past, comes in: you run a probe on your network that monitors the health of your network, and
other people do the same, and you earn credits you can use to ask other people's probes to try DNS queries or pings or traceroutes for you and give you the results. So if your monitoring system can ask random other devices on the Internet, hey, can you resolve this hostname, and make sure that they can? That can do a lot to make sure that you're getting real-world results, not just, like Jim said, your privileged view of the network.
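If you don't have Atlas credits handy, even just asking a handful of public resolvers gets you a view that bypasses your own internal DNS, which is a crude approximation of the same idea; a sketch, with the hostname as a placeholder:

```python
# Sketch: resolve a name via several public resolvers to get a view from
# outside your own resolver chain, not just your privileged monitoring VLAN.
# Assumes dnspython; the hostname is a placeholder.
import dns.resolver

HOST = "www.example.com"
PUBLIC_RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1", "Quad9": "9.9.9.9"}

def resolve_from_elsewhere(host: str) -> None:
    for label, ip in PUBLIC_RESOLVERS.items():
        resolver = dns.resolver.Resolver(configure=False)  # ignore local resolv.conf
        resolver.nameservers = [ip]
        resolver.lifetime = 3
        try:
            answers = [a.to_text() for a in resolver.resolve(host, "A")]
            print(f"{label:10s} {ip}: {answers}")
        except Exception as exc:
            print(f"{label:10s} {ip}: FAILED ({exc})")

if __name__ == "__main__":
    resolve_from_elsewhere(HOST)
```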
Or from the internal perspective, like what I was talking about with Active Directory setups, internal enterprises where half the stuff is broken. The thing that we're looking for here is you need some kind of probe that's in each of your VLANs. When you've got a complex network environment where you've got VLANs that aren't allowed to talk to one another –
part of your monitoring is you've got to have a probe in each one of them so that you can verify that each segment is allowed to talk to all the segments it's supposed to and everything works from there. Because again, the fact that you can reach something from your server VLAN tells you absolutely nothing about whether your employees can reach it from the untrusted VLAN where, you know, they're scrutinized from everything to how many potty breaks they take to which websites they visit to, you know, what infrastructure they're allowed to touch. Yep. And I
think the culprit, most of the time when people complain about slow internet, is actually DNS. Because they're hitting some DNS server that's either terrible or not actually there, waiting for it to time out and then trying one that works. It means that, yeah, every time you try to load a website, there's a long stall while it tries the server that's not there and then goes to one that is. And that's an excellent chance to hit a note that I love to hit as frequently as possible, which is: what's the thing that really pisses humans off? Is it a throughput problem?
No, it's a latency problem. Latency is what really pisses humans off. And that's exactly what you get with DNS issues. Latency. Yeah. And 20% of all of MasterCard's DNS lookups had added latency, and nobody bothered to notice or look into it. I'm sure people probably complained at some point and they were just like, ah, and brushed that kind of stuff off.
What the security researcher pointed out that they could have done, which would have been quite evil, is they could have set the TTL on their responses. So when they hijacked this domain and were giving out invalid results, they could have set a TTL of like a year. So anything that ever cached it, and because the TTL on MasterCard's real servers is short...
In the one-in-five chance of you getting the other server, over a little bit of time, every host that was going to the site frequently would have cached only the bad result for a year. And then you start screwing with it and see how long it takes MasterCard to figure out what's going on. Or at that point, you could even stop answering from the hijacked domain, and you've just captured a bunch of banks' connections back to the MasterCard API.
Okay, this episode is sponsored by ServerMania. Go to servermania.com slash 25A to get 15% off dedicated servers recurring for life.
ServerMania is a Canadian company with over a decade of experience building high-performance infrastructure hosting platforms for businesses globally. ServerMania has up to 20 Gbps network speeds in 8 locations worldwide for optimized global reach. They have flexible custom server configurations tailored to unique needs as well as a personal account manager offering free consultations.
With 24-7 live chat support and support ticket response times under 15 minutes, you can always keep your systems running smoothly. Alan's been a happy ServerMania customer for over seven years, so support the show and join him at a hosting provider that truly delivers. Get 15% off dedicated servers recurring for life at servermania.com slash 25A and use code 25ADMINS. That's servermania.com slash 25A and code 25ADMINS.
Let's do some free consulting then. But first, just a quick thank you to everyone who supports us with PayPal and Patreon. We really do appreciate that. If you want to join those people, you can go to 2.5admins.com slash support. And remember that for various amounts on Patreon, you can get an advert-free RSS feed of either just this show or all the shows in the Late Night Linux family. And if you want to send in your questions for Jim and Alan or your feedback, you can email show at 2.5admins.com.
Another perk of being a patron is you get to skip the queue, which is what Eric has done. He writes: "I use ZFS for about 4TB of storage on my home NAS, which locally backs up to an external hard drive using Syncoid. I'm trying to figure out a practical offsite backup solution. I've tried putting a small NAS at the houses of friends and family,
But the maintenance issues become too much for me to handle. It's hard to debug why a WireGuard tunnel suddenly stops working on the remote end. One alternative I contemplated was simply rotating two external hard drives between my NAS and an offsite location on a regular basis. The problem, of course, is that new data in between rotations won't be backed up offsite. I don't expect to generate that much data in between rotations though, say less than a gigabyte, which is easy and cheap enough to store in the cloud.
Would it be advisable to, say, generate an incremental stream with ZFS send daily and store that in the cloud? Is there an easy way to do that with existing tools? Or should I forget about this and just pay for rsync.net or zfs.rent?" So for storing just the incremental: if you use ZFS send,
and just do an incremental between basically what is already on the external hard drive and what today's image is, you can just have ZFS send direct that to a file. And you'll just have a single file that is what you'd have to ZFS receive on top of the external hard drive to catch it up.
And so because it's just one file, you could store that in the cloud in any of the different kinds of clouds. You could use, you know, S3 type thing with object storage, which can be cheaper or a regular file storage thing like a Google Drive or Dropbox or whatever. It's just if you end up with 30 of these, then that might get a little cumbersome.
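A sketch of that daily routine, writing the incremental stream to a single file and pushing it up with Rclone; the dataset, snapshot names, and remote name are all placeholders for whatever naming scheme you actually use:

```python
# Sketch: write today's incremental zfs send stream to a single file and push
# it to a cloud remote with rclone. Dataset, snapshot names, and the rclone
# remote are all placeholders; adjust to your own scheme.
import subprocess
from datetime import date

DATASET = "tank/data"                  # placeholder dataset
BASE_SNAP = f"{DATASET}@offsite-base"  # what's already on the external drive
TODAY_SNAP = f"{DATASET}@daily-{date.today()}"
OUT_FILE = f"/var/backups/{date.today()}.zfsinc"
REMOTE = "cloud:zfs-incrementals"      # placeholder rclone remote

def make_and_upload_incremental() -> None:
    # zfs send -i <base> <today> written straight to a file; compression
    # could be added in the pipe if desired.
    with open(OUT_FILE, "wb") as out:
        subprocess.run(
            ["zfs", "send", "-i", BASE_SNAP, TODAY_SNAP],
            stdout=out,
            check=True,
        )
    subprocess.run(["rclone", "copy", OUT_FILE, REMOTE], check=True)

if __name__ == "__main__":
    make_and_upload_incremental()
```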
And you have to decide whether you want each one of those to be just built on the day before, meaning that when you want to catch up the backup, you'd have to take the backup and then restore every day's incremental up to the day where the failure happened. But, you know, if at most that is 30 days, it's probably not really that difficult to do. So yeah, that can definitely work. It sounds like you've already tried the easier solution, which is, you know, having that
remote backup be online somewhere where it can receive those incrementals. Because, you know, if it's less than a gigabyte of difference, then it is totally feasible to send that over the internet on a regular basis. But as you said, debugging other people's internet connections remotely is not easy. So I don't have a specific tool to recommend, but I think Rclone has pretty good support for a bunch of different clouds. We've helped a couple of customers configure it to back up
their files as their non-ZFS-based backups. So they have a ZFS primary and then a ZFS replica, and then they want a backup that's not on ZFS. And we've used that with a couple of different providers. But whatever fits your style and your budget properly is fine. I think the bigger issue here is that once you start talking about saving individual images of individual incremental streams,
you're not really performing what we think of as a ZFS replication workflow anymore. You are still using the same tools, but the workflow is very different and the limitations of it are very different. And at that point, you're treating it like you would a tape drive with all of those limitations.
And as Alan alluded to, you know, you have the question of how frequently you rebase your incrementals. The part that Alan just kind of left on the table as obvious is that you have to make new fulls periodically on which to rebase your incrementals.
Even if you're basing each daily incremental on the daily before it, and the daily before that, and the daily before that, there has to be a full back there somewhere. And the longer you make that chain, the more fragile your restoration process is going to be, and the more difficult and time-consuming it's going to be.
So one way around that is to take a full and say, perhaps for the next 30 days, each day I take a new incremental. But the incremental is not just for today versus yesterday. It's for today versus that last full that I took. At that point, all you need to have a successful restoration is the latest incremental and the full that it's based off of. But you still have to generate that full pretty frequently. And you have to have a place to store it and you have to get it there and you have to manage all that.
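To make the distinction concrete, the only thing that changes between the chained and the full-anchored approach is which snapshot you pass as the incremental base; the snapshot names below are placeholders:

```python
# Sketch: chained vs full-anchored incrementals. Only the -i base differs.
# Snapshot names are placeholders.
DATASET = "tank/data"

# Chained: each day is based on the previous day; a restore needs every link
# in the chain to be intact.
chained = ["zfs", "send", "-i", f"{DATASET}@day-14", f"{DATASET}@day-15"]

# Full-anchored: each day is based on the last monthly full; a restore needs
# only that full plus the single newest incremental.
anchored = ["zfs", "send", "-i", f"{DATASET}@monthly-full", f"{DATASET}@day-15"]

print(" ".join(chained))
print(" ".join(anchored))
```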
If you have a relatively trivial amount of data, you know, no more than a few gigs, it may be acceptable to do this, you know, with a cloud-based type service. But at that point, I would have to question whether using ZFS tooling is really the right answer at all. If you have a small enough amount of data that you can afford everything from the IOPS to the bandwidth to the you name it, that you take up with the less efficient approach of, you know, the incrementals and very occasional fulls,
At that point, maybe it makes more sense just to use something like Borg or restic or whatnot that's more directly supported by the cloud backend that you want to back up to in the first place. Maybe this becomes, you know, like the two or the three of a 3-2-1-type backup strategy, where you say, okay, well, I am doing full-on replication between this host and this host, and I love that that means all my checksums stay intact, and, you know, I've got all the various ZFS bells and whistles there.
But also, just in case ZFS itself has some horrible bug that takes out both of these ZFS-based backups, I've got my Borg or restic or whatever in the cloud, and you don't have to jump through any extra hoops to restore it. Now, again, that won't necessarily work for everybody. Personally, I back up a ton of large VM images, and that approach just plain doesn't work for me. But if it can work for your dataset, maybe that's the right answer. Yeah, I kind of glossed over
what Eric was talking about where he's got a pair of external drives that he's doing the full replication to and swapping them once a month. So when the month rollover comes, he takes the second external drive that is now a month out of date, brings it home, does the differential from the last full backup on that one to the whole 30 days that have changed. And so now the external drive has a full on it.
And then he's storing just the incrementals in the cloud because the full is basically offline somewhere where he can go and get it. Right. And I'm glad that you went over that in more detail for the listeners. Personally, I did recognize that. And that's exactly what I was talking about. You now have this fragile situation where you're relying on both a full in one format in one place and an incremental that's stored somewhere else. And you've got to put the two together or you have nothing.
There's a lesson that I learned in my early 20s, which is when you're on the way to the party, never, ever, ever put the keg and the tap in different vehicles because one of them is very likely not to show up. Yeah, you basically double the number of ways things can go wrong by having these different mechanisms.
But this kind of hybrid can be practical when the size of the full is so big that the cloud doesn't make sense, but the size of the incrementals is so small that it does. But as Jim said, yeah, if you put the keg and the tap in different cars, the party's probably not going to be fun. That really was not a good party. Just returning to the bit where Eric said it's hard to debug why a WireGuard tunnel suddenly stops working on the remote end.
That did make me wonder about Tailscale. Now, for full disclosure, they are a long-time sponsor of various Late Night Linux Family shows, so keep that in mind. But that was the first thing that jumped into my head. I think that's a really reasonable thing to suggest. I have quite a bit of experience with people struggling to troubleshoot why their WireGuard connection isn't working.
And it's usually not something that just, like, magically stops working. It's that you didn't get something right to begin with or, you know, somebody monkeyed around inside a config file, something like that. And once it breaks, it can get pretty arcane. There's not a lot of logging going on. It essentially either works or it doesn't, in part because of the ultra-secure way that it's intended to operate to begin with. The two ends don't really talk to each other much.
They try not to leak data to each other, so either you got your secrets in your config right or you didn't, and you don't really know a lot about what you got wrong if you got it wrong. Now, Tailscale is a much friendlier alternative. It's not something that I personally usually choose, because I don't need the friendliness and I like the really stripped-down leanness of pure WireGuard.
But if you're a person who finds this kind of thing, you know, intimidating, and you want something that's a little bit more handholdy, where you can click around in a GUI, add devices, and have kind of a single-pane-of-glass overview of your management, Tailscale is a really good fit there. And instead of being like, oh, no, you know, I don't see the thing on the other end of the WireGuard tunnel, you know, what's going on?
If you're running Tailscale instead, then you're probably going to be a lot more confident that if you don't see the device on the other end, it's because your friend turned it off, and you can just tell your friend, hey, did you have a power outage or something? Please turn it back on.
And even in the event that it becomes something more serious than that, you've got a better shot. I know this is still not going to be true for everybody, but you've got a better shot of being able to just ask your friend, hey, can you look at the Tailscale thing on that side and see what it says? Whereas asking some random buddy to look through, like, /var/log/syslog hoping for some kind of output about what happened with a WireGuard connection, that's a huge exercise in fail. Yeah, like as Jim said, WireGuard...
If everything's not right, it will refuse to even try to interpret the packet, so it can't interpret it and tell you what's wrong with it. It will just be like, nope, it's not right, I'm just ignoring these messages. And unless you turn the debug verbosity all the way up, it won't even tell you that it got one it couldn't decrypt. And so yeah, as a trade-off for the security, it is a little harder to debug than a lot of other similar tools. Yeah, there's a reason why our half-admin uses Tailscale and not pure WireGuard.
Okay, this episode is sponsored by Automox. Are you prepared for whatever shitstorm may hit your desk during the workday? Automox has your back. Check out the brand new Autonomous IT Podcast. Listen in as various IT experts discuss the latest Patch Tuesday releases, mitigation tips, and custom automations to help with CVE remediations. Make new work friends. Listen now to the Autonomous IT Podcast on Spotify, Apple, or wherever you tune into your podcasts.
Harold, who's a patron, also skipped the queue. He writes, You mentioned on a show recently that the FBI recommends using encrypted messaging. Does that mean I should stop using SMS for two-factor authentication and use something like an authentication app instead?
I do use an authentication app for some of my most critical web activity, but I'm going to be whipping out my authentication app every 10 minutes. Get used to it. The issue with SMS authentication isn't even necessarily about whether it's encrypted or not. The big issue is that it's pretty easily compromised. You don't actually know who you're getting a text message from. You don't know whether it's a legitimate text message that has actually come across your
telco's network. You don't know if you're connected to a Stingray device somewhere. SMS is a lot like email. It was never designed with security in mind. And unlike email, there really hasn't even been, to the best of my knowledge, any serious attempt to try to bolt security on afterward. The biggest issue with SMS, obviously, is in order to intercept your SMS two-factor code, all the attacker has to do is social engineer someone at your phone company.
They don't need any technical sophistication at all. They just need a recording of a screaming baby to convince somebody to port your phone number to their phone. And now they're getting the two-factor code instead of you, and they can log in as you. Whereas the app requires pairing and makes sure that, you know, only the devices that have been authorized are actually going to get the prompt or have the TOTP code or whatever the secret is. And it makes sure that there's a lot more required to actually get around the two-factor.
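For comparison, the TOTP flavour of authenticator app is nothing more than a shared secret enrolled once plus the current time, so there's nothing travelling over SMS to intercept; a sketch using the pyotp library, with the secret generated on the spot as a stand-in:

```python
# Sketch: how a TOTP authenticator app derives its code -- a shared secret
# paired once with the site, plus the current time. Nothing is sent over SMS,
# so there's nothing to intercept in transit. Assumes the pyotp library;
# the secret here is a stand-in generated on the fly.
import pyotp

secret = pyotp.random_base32()   # normally generated once, at enrollment time
totp = pyotp.TOTP(secret)

code = totp.now()                # what the app displays every 30 seconds
print("Current code:", code)

# The server holds the same secret and simply recomputes and compares:
print("Server accepts it:", totp.verify(code))
```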
And that's assuming that we're even talking about SMS two factor that actually requires you to type in a number on your computer that you got an SMS text about. I have seen some attempts at SMS two factor authentication that actually want you to click a link that's in the text in order to validate that you were the person to blah, blah, blah. And at that point,
that becomes really easy for somebody to just send you those things. And with a lot of humans, the odds are pretty good that if they get hit with something that looks like it might've been part of their workflow,
They're just going to say, okay, and they're going to click the thing. So SMS authentication is incredibly dangerous. Don't do it. Yeah, like, training people that clicking on links that come in over SMS is a normal thing to do is bad for all kinds of reasons, and will catch lots of people out. But even people that are more security-aware: if you go to the website and you use the thing and you get the text message right then, because you were expecting it, your guard's not going to be as up as if it was an unexpected text message. And...
Yeah, you could probably get even people that wouldn't normally fall for most phishing or smishing or whatever you want to call these attacks, just because if the attacker is in the chain somewhere and they're actually able to
send you one that coincides with when you were expecting one, they could get a lot more people that way. And as for having to whip out the app every 10 minutes: well, if you're having to look at your phone to copy the number off an SMS every 10 minutes anyway, it's probably not that big of a difference. I do know some people that have gone to smartwatches specifically to have those numbers pop up on their watch, so they don't have to pick up their phone every time they want to do it.
Because of my work, I work with customers that have every different possible version of this. I use almost all of the apps. And I will say that the Duo one I really like, where I just get a prompt that's like, was this you? And I just say yes. Or the Microsoft one's not so bad. It pops up, I then have to authenticate with my biometrics or my PIN, and then it tells me a two-digit number I have to type into the website to prove that I'm answering the correct 2FA prompt.
Specifically, like Jim was talking about, if someone could send you one around the time you're expecting one, you could maybe get confused that way, so Microsoft made sure that there's this extra little check. But it's just two digits, not having to type out a six-digit thing.
Mikael, who's a patron, also skipped the queue. He writes: "Do you guys have any suggestions for Dropbox alternatives? I'd like to move my wife's document files which are currently inside a cloud service called Synology Drive, perhaps to a ZFS pool. She uses two machines with macOS, so syncing between the two is a requirement. I've used SyncThing before, with the third instance running backed by ZFS, but maybe there's a better option with high spousal approval."
I think realistically you're going to need a cloud-based service of some kind because trying to directly bidirectionally sync between two different sources of truth is just an exercise of never-ending frustration. And it will never have decent spousal approval. Hell, it doesn't have decent me approval, let alone spousal approval, because you've always got conflicts you've got to manually resolve where somebody has modified a file on each end. You have to decide which one you want to keep and which one you want to throw away.
And usually the answer is neither. So you need a single source of truth that then you can replicate downstream, you know, to other locations. And Dropbox works well for that. If you're a Google person, there's G drive. If you're a Microsoft person, there's one drive. You could perhaps set up, you know, stand up your own VM in the cloud somewhere and, you know, run NextCloud on it and use that as your single source of truth.
But ultimately, I think that's probably what you're going to have to do to keep the spousal approval factor. There's iCloud as well, of course, if you're trying to sync two Macs. That is very true, and I can't believe I didn't think of that. But in my defense, I'm not a Mac person, and I was trying to politely ignore the macOS part. Yeah, especially if Syncthing isn't getting you the spousal approval level you need. I think Jim's point stands: use whatever cloud works best for that platform, and knowing it's macOS, that might be iCloud,
And then using a tool like Rclone or something to pull a copy down from that cloud to ZFS on a regular basis, so that you have basically this not-in-the-cloud backup of it as well. And so it gives you kind of the same thing. So whether that's Dropbox or whatever service you want, just pulling a copy of it out of there and putting it on ZFS will make sure that you have a local one to bootstrap from if you ever need to resynchronize because the device dies or whatever. You can save yourself redownloading all of those
media files again and know that you have a copy that doesn't go away if the cloud does or your subscription lapses or whatever might happen to disrupt the cloud. Well, if you're using one of the major cloud services, you're generally not going to need something like Rclone, because they're already designed to be the single source of truth and automatically sync the files up and down as you modify them. You're talking about like a OneDrive or a G drive. Right. So I'm saying pull an extra copy down to your ZFS machine without having to install Google's thing on your NAS. Yeah.
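A sketch of that pull-down, assuming an Rclone remote is already configured; the remote, local path, and dataset names are placeholders:

```python
# Sketch: pull a copy of the cloud folder down onto a ZFS dataset and snapshot
# it, so there's a local, versioned copy that survives the cloud going away.
# The rclone remote, local path, and dataset are placeholders.
import subprocess
from datetime import date

REMOTE = "dropbox:Documents"       # placeholder rclone remote and path
LOCAL = "/tank/backups/documents"  # directory on a ZFS dataset
DATASET = "tank/backups"

# Mirror the cloud folder locally, then snapshot the dataset for versioning.
subprocess.run(["rclone", "sync", REMOTE, LOCAL], check=True)
subprocess.run(["zfs", "snapshot", f"{DATASET}@cloud-pull-{date.today()}"], check=True)
```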
That's fair. I hadn't quite realized that you were talking about like a third copy as a backup rather than just the wife's two Mac OS machines. Yeah, so it's her two Mac machines and then he wants a copy on ZFS because he doesn't trust the Macs or the cloud. Gotcha. Another alternative, if you're using a cross-platform cloud service, you know, Dropbox or what have you,
and you're feeling extra froggy, you could spin up a VM with a better-known file system inside it, backed by ZFS underneath, and you can just install your OneDrive client or what have you directly on that. I have absolutely helped small businesses use OneDrive that way before, where a folder on the server is designated as the OneDrive folder. And if you're in the office, you use the file share on the server. And if you're out of the office, you use OneDrive.
Looking through the documentation for Dropbox, it looks like they support ZFS, so you wouldn't even need a VM potentially. Which might be a great reason to choose Dropbox as your, you know, cloud sync service of choice. That would be really nice to be able to say, hey,
Dropbox, here you go. Whereas I've set up OneDrive situations for clients before, and you can't get off the ranch at all with something like OneDrive. If it doesn't see what it expects, it won't work. Therefore, VM, Windows, and you're good to go. Yeah, but if the Dropbox Linux app is giving you trouble, then as Jim said, you could always run a VM and make it have Windows if that's what it wants to work better.
Alternatively, you could just directly back up from one of your spouse's Macs. Yeah, that's kind of how we do it at my house here. My spouse doesn't really understand that the S drive on her computer actually
lives on a server in the rack, not on her computer. It just works, so she doesn't care. I see you're also a fan of the S-for-server drive. In this case, it's S because it's her first initial. Fair enough. I set up an awful lot of clients with mapped S drives because the S is for server. Yeah, makes sense. On my computer, it's S for Steam, actually, because that's where I put all the games. My computer doesn't have drive letters.
It's my gaming computer. My gaming computer doesn't have drive letters. Busted. Yep, you got me there. Right, well, we better get out of here then. Remember, show at 2.5admins.com if you want to send any questions or feedback. You can find me at joerest.com slash mastodon. You can find me at mercenariesysadmin.com. And I'm at Alan Jude. We'll see you next week.