
> no government should keep critical data on foreign cloud storage

Primary? No. Back-up?

These guys couldn’t provision a back-up for their on-site data. Why do you think it was competently encrypted?



They fucked up, that much is clear, but they should not have kept that data on foreign cloud storage regardless. It's not like there are only two choices here.


> they should not have kept that data on foreign cloud storage regardless. It's not like there are only two choices here

Doesn't have to be an American provider (though anyone else probably increases Seoul's security cross-section; America is already its security guarantor, with tens of thousands of troops stationed in Korea).

And doesn't have to be permanent. Ship encrypted copies to S3 while you get your hardened-bunker domestic option constructed. Still beats the mess that's about to come for South Korea's population.
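
To be concrete, "encrypted copies" means encrypting client-side so the foreign provider only ever holds ciphertext. A minimal Python sketch, assuming boto3 and the cryptography package; the bucket and file names are made up for illustration:

    # Rough sketch, not a real implementation: the key stays on domestic
    # soil, only ciphertext ever reaches the foreign provider.
    import boto3
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()                     # store/escrow this domestically
    ciphertext = Fernet(key).encrypt(open("citizens.db", "rb").read())

    boto3.client("s3").put_object(
        Bucket="example-offsite-backup",            # hypothetical bucket
        Key="2025/citizens.db.enc",
        Body=ciphertext,
    )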


I'm aware of a big cloud services provider (I won't name any names but it was IBM) that lost a fairly large amount of data. Permanently. So that too isn't a guarantee. They simply should have made local and off-line backups, which is the gold standard, and verified that those backups are complete and can be used to restore a complete working service from scratch.
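
The verification half is the part most shops skip. A minimal sketch of the cheap part of it, assuming a manifest of SHA-256 hashes written at backup time (the real test is still restoring into a clean environment and bringing the service up):

    # Checks every file in a hypothetical backup_manifest.json for
    # presence and integrity. Completeness check only -- not a restore test.
    import hashlib, json, pathlib

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    manifest = json.loads(pathlib.Path("backup_manifest.json").read_text())
    missing = [p for p in manifest if not pathlib.Path(p).exists()]
    corrupt = [p for p, digest in manifest.items()
               if p not in missing and sha256(p) != digest]
    print("missing:", missing, "corrupt:", corrupt)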


>I'm aware of a big cloud services provider (I won't name any names but it was IBM) that lost a fairly large amount of data. Permanently. So that too isn't a guarantee.

Permanently losing data at a given store point isn't relevant to losing data overall. Data store failures are assumed, or else there'd be no point in backups. What matters is whether failures at multiple points happen at the same time, which means a major issue is whether "independent" repositories are actually truly independent or whether (and to what extent) they have some degree of correlation. Using one or more completely separate systems run by someone else entirely is a pretty darn good way to break accidental correlations with your own stuff, including human factors like the same tech people making the same sorts of mistakes or reusing the same components (software, hardware or both). For government that also includes political factors (like any push towards using purely domestic components).
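
Back-of-envelope illustration, with numbers that are entirely made up: independence does almost all of the work, and even a small shared failure mode swamps it.

    # Toy numbers, purely illustrative.
    p = 0.01                        # chance any one repository is dead this month

    independent = p ** 3            # three genuinely independent copies
    print(independent)              # ~1e-06

    # Add a shared failure mode (same ops team, same software stack, same
    # political mandate) that takes all three out at once 0.1% of the time.
    p_shared = 0.001
    correlated = p_shared + (1 - p_shared) * p ** 3
    print(correlated)               # ~0.001 -- roughly a thousand times worse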

>They simply should have made local and off-line backups

FWIW there's no "simply" about that though at large scale. I'm not saying it's undoable at all but it's not trivial. As is literally the subject here.


> Permanently losing data at a given store point isn't relevant to losing data overall.

I can't reveal any details but it was a lot more than just a given storage point. The interesting thing is that there were multiple points along the way where the damage would have been recoverable but their absolute incompetence made matters much worse to the point where there were no options left.

> FWIW there's no "simply" about that though at large scale. I'm not saying it's undoable at all but it's not trivial. As is literally the subject here.

If you can't do the job you should get out of the kitchen.


>I can't reveal any details but it was a lot more than just a given storage point

Sorry, brain not really clicking tonight and I used lazy, imprecise terminology here; it's been a long one. But what I meant by "store point" was any single data repository that can be interacted with as a unit, regardless of implementation details, that's part of a holistic data storage strategy. So in this case the entirety of IBM would be a "store point", and then your own self-hosted system would be another, and if you also had data replicated to AWS etc. those would be others. IBM (or any other cloud storage provider operating in this role) effectively might as well simply be another hard drive. A very big, complex and pricey magic hard drive that can scale its own storage and performance on demand, granted, but still a "hard drive".

And hard drives fail, and that's ok. Regardless of the internal details of how the IBM-HDD ended up failing, the only way it'd affect the overall data is if that failure happened simultaneously with enough other failures at local-HDD and AWS-HDD and rsync.net-HDD and GC-HDD etc. that it exceeded the available parity to rebuild. If these are all mirrors, then only simultaneous failure of every single last one of them would do it. It's fine for every single last one of them to fail... just separately, with enough of a time delta between each one that the data can be rebuilt on another.
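
Toy numbers again, just to show the shape of it: individual outages are routine, and only simultaneous ones count.

    # Illustrative only: each "HDD" (local, IBM, AWS, rsync.net) is assumed
    # down on any given day with probability 0.003, which loosely folds in
    # the few days it takes to rebuild from a surviving copy.
    p_down, copies = 0.003, 4

    p_any_one_down = 1 - (1 - p_down) ** copies        # ~1.2%: happens all the time, and that's fine
    p_all_down     = p_down ** copies                   # ~8e-11: the only case that actually loses data
    p_loss_decade  = 1 - (1 - p_all_down) ** (365 * 10)

    print(p_any_one_down, p_all_down, p_loss_decade)    # last one ~3e-07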

>If you can't do the job you should get out of the kitchen.

Isn't bringing in external entities as part of your infrastructure strategy precisely that? You're getting out of the kitchen and letting them do the cooking.


Ah ok, clear. Thank you for the clarification. Some more interesting details: the initial fault was triggered by a test of a fire suppression system; that on its own would have been recoverable. But someone thought they were being exceedingly clever and was going to fix it without any downtime, and that's when a small problem became a much larger one, more so when they found out that their backups were incomplete. I still wonder if they ever did an RCA/PM on this and what their lessons learned were. It should be a book-sized document given how much went wrong. I got the call from one of their customers after their own efforts had failed, and after hearing them out I figured it wasn't worth my time because it just wasn't going to work.


Thanks in turn for the details; it's always fascinating (and useful for lessons... even if not always for the party in question, dohoho) to hear a touch of inside baseball on that kind of incident.

>But someone thought they were exceedingly clever and they were going to fix this without any downtime and that's when a small problem became a much larger one

The sentence "and that's when a small problem became a big problem" comes up depressingly frequently in these sorts of post mortems :(. Sometimes sort of feels like, along all the checklists and training and practice and so on, there should also simply be the old Hitchhiker's Guide "Don't Panic!" sprinkled liberally around along with a dabbing of red/orange "...and Don't Be Clever" right after it. We're operating in alternate/direct law here folks, regular assumptions may not hold. Hit the emergency stop button and take a breath.

But of course management and incentive structures play a role in that too.


In this context the entirety of IBM cloud is basically a single storage point.

(If IBM was also running the local storage then we're talking about a very different risk profile from "run your own storage, back up to a cloud" and the anecdote is worth noting but not directly relevant.)


If that’s the case, then they should make it clear they don’t provide data backup.

A quick search reveals IBM does still sell backup solutions, including ones that back up from multiple cloud locations and can restore to multiple distinct cloud locations while maintaining high availability.

So, if the claims are true, then IBM screwed up badly.


DigitalOcean lost some of my files in their object storage too: https://status.digitalocean.com/incidents/tmnyhddpkyvf

Using a commercial provider is not a guarantee.


DO Spaces, for at least a year after launch, had no durability guarantees whatsoever. Perhaps they do now, but I wouldn’t compare DO in any meaningful way to S3, which has crazy high durability guarantees as well as competent engineering effort expended on designing and validating that durability.


They should have kept encrypted data somewhere else. If they know how to use encryption, it doesn't matter where. Some people even use steganographic backups on YouTube.



