I agree with this somewhat. The other day I was driving home and saw a sprinkler head that had broken on the side of the road and was spraying water everywhere. It made me think: why aren't sprinkler systems designed with HA in mind? Why aren't there dual water lines with dual sprinkler heads everywhere, with an electronic component that detects a break in a line and automatically switches to the backup water line? It's because the downside of water spraying everywhere, or of the grass becoming unhealthy or dying, costs less than deploying the system with HA would.
In the software/tech industry it's commonplace to just accept that your app can't be down for any amount of time, no matter what. No one checks how much more it would cost (engineering time & infra costs) to deploy the app so it would be HA, so no one checks whether it would be worth it.
I blame this logic on the low interest rates for a decade. I could be wrong.
This week we had a few minutes of downtime on an internal service because of a node rotation that triggered an alert. The responding engineer started to put together a plan to make the service HA (which would have tripled the cost to serve). I asked how frequently the service went down and how many people would be inconvenienced if it did. They didn't know, but when we checked the metrics it had single-digit minutes of downtime this year and fewer than a dozen daily users. We bumped the threshold on the alert to longer than it takes for a pod to be re-scheduled and resolved the ticket.
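For anyone who wants to put rough numbers on a call like that, here is a minimal sketch of the comparison. Only the "tripled" cost multiplier, the single-digit minutes of downtime, and the under-a-dozen users come from the story above; the function name, the dollar figures, and the cost-per-user-minute are made-up assumptions for illustration.

```python
def compare_ha_cost(downtime_min_per_year, users_affected,
                    cost_per_user_minute, base_cost_per_year,
                    ha_cost_multiplier):
    """Compare the expected yearly cost of downtime with the extra cost of HA.

    Returns (expected_loss, extra_ha_cost, ha_is_worth_it).
    """
    # What the downtime costs if nothing changes.
    expected_loss = downtime_min_per_year * users_affected * cost_per_user_minute
    # What HA adds on top of the current cost to serve.
    extra_ha_cost = base_cost_per_year * (ha_cost_multiplier - 1)
    return expected_loss, extra_ha_cost, expected_loss > extra_ha_cost


# Hypothetical inputs: 9 minutes of downtime a year, 11 daily users,
# $2 per user-minute of lost time, $10,000/year to run the service,
# and HA tripling the cost to serve.
loss, extra, worth_it = compare_ha_cost(9, 11, 2.0, 10_000.0, 3)
print(f"expected loss: ${loss:.0f}/yr, extra HA cost: ${extra:.0f}/yr, worth it: {worth_it}")
```

With these assumed numbers the expected loss is a couple of hundred dollars a year against tens of thousands for HA. The point is not the specific figures but that the inputs are cheap to look up before committing to the redesign.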
This is the most sensible thing I’ve read on here in a while. Engineers’ obsession with tinkering and perfection is the slow death of many startups. If you’re doing something important like banking or air traffic control, fair enough, but a CRUD app for booking hair appointments will survive a bit of downtime.
You assume that the teams running these systems achieve acceptable uptime and companies aren't making refunds for missed uptime targets when contracts enforce that, or losing customers. There is definitely a vision for HA at many companies, but they are struggling with and without k8s.
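For context on what a contractual uptime target actually allows, the conversion from a percentage to a yearly downtime budget is simple arithmetic. A minimal sketch, assuming a 365-day year; the function name and the example targets are illustrative, not taken from any particular contract.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(uptime_percent):
    """Minutes of downtime per year permitted by an uptime target in percent."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime allows about {allowed_downtime_minutes(target):.0f} min/year")
```

A 99.9% target leaves roughly 526 minutes a year (under nine hours), while 99.99% leaves roughly 53 minutes, which is the range where a single bad deploy can blow the budget.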
It depends on the cost of complexity you're adding. Adding another database or whatever is really not that complex so yeah sure, go for it.
But a lot of companies are building distributed systems purely because they want this ultra-low downtime. Distributed systems are HARD. You get an entire set of problems you don't get otherwise, and the complexity explodes.
Often, in my opinion, this is not justified. Saving a few minutes of downtime in exchange for making your application orders of magnitude more complex is just not worth it.
Distributed systems solve distributed problems. They're overkill if you just want better uptime or crisis recovery. You can do that with a monolith and a database and get 99.99% of the way there. That's good enough.
Redundancy, like most engineering choices, is a cost/benefit tradeoff. If the costs are distorted, the result of the tradeoff study will differ from the decision that would be made in "more normal" times.