I agree with this somewhat. The other day I was driving home and I saw a sprinkl...

loire280 · on Aug 9, 2024

This week we had a few minutes of downtime on an internal service because a node rotation triggered an alert. The responding engineer started putting together a plan to make the service HA (which would have tripled the cost to serve). I asked how often the service went down and how many people would be inconvenienced when it did. They didn't know, but when we checked the metrics, it had single-digit minutes of downtime this year and fewer than a dozen daily users. We raised the alert threshold above the time it takes for a pod to be rescheduled and resolved the ticket.
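The fix described above amounts to debouncing the alert: only page when the service has been continuously down for longer than a pod reschedule normally takes. The sketch below is illustrative only. The `Sample` type, the `should_alert` function, and the 30-second sampling and 120-second threshold used in the example are all assumptions, not details from the comment. In a real setup this is roughly what a `for:` duration on a Prometheus alerting rule does.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float   # seconds since the start of observation (hypothetical)
    up: bool   # whether the health check passed at time t


def should_alert(samples: list[Sample], threshold_s: float) -> bool:
    """Return True only if the service stayed down for longer than threshold_s.

    threshold_s is assumed to be set above the typical pod reschedule time,
    so a brief blip during a node rotation does not page anyone.
    """
    down_since = None
    for s in sorted(samples, key=lambda s: s.t):
        if s.up:
            down_since = None  # recovered: reset the outage clock
        else:
            if down_since is None:
                down_since = s.t  # first failing check of a new outage
            if s.t - down_since > threshold_s:
                return True  # outage outlasted the threshold
    return False
```

With checks every 30 seconds, a blip that fails at 30, 60 and 90 seconds and recovers at 120 does not page under a 120-second threshold, while an outage that is still failing at 210 seconds does.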

jack_riminton · on Aug 10, 2024

This is the most sensible thing I’ve read on here in a while. Engineers’ obsession with tinkering and perfection is the slow death of many startups. If you’re doing something important like banking or air traffic control, fair enough, but a CRUD app for booking hair appointments will survive a bit of downtime.

zerkten · on Aug 9, 2024

You're assuming that the teams running these systems achieve acceptable uptime, and that companies aren't issuing refunds for missed uptime targets where contracts require them, or losing customers. There is definitely a vision for HA at many companies, but they are struggling with and without k8s.

fragmede · on Aug 9, 2024

Why would wanting redundancy be a ZIRP? Is blaming everything on ZIRP like blaming Mercury being in retrograde, but for economics dorks?

consteval · on Aug 9, 2024

It depends on the cost of the complexity you're adding. Adding another database or whatever is really not that complex, so yeah, sure, go for it.

But a lot of companies are building distributed systems purely because they want this ultra-low downtime. Distributed systems are HARD. You get an entire set of problems you don't get otherwise, and the complexity explodes.

Often, in my opinion, this is not justified. Saving a few minutes of downtime in exchange for making your application orders of magnitude more complex is just not worth it.

Distributed systems solve distributed problems. They're overkill if you just want better uptime or disaster recovery. You can do that with a monolith and a database and get 99.99% of the way there. That's good enough.
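For a sense of scale, here is the yearly downtime budget implied by a few common availability targets. This is a quick sketch that assumes a 365-day year. It treats the comment's "99.99%" as if it were a literal availability target, which is a reading the comment doesn't strictly make. The function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target such as 0.9999."""
    return (1 - availability) * MINUTES_PER_YEAR


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%}: {allowed_downtime_minutes(target):.1f} min/year")
```

This prints about 5256.0, 525.6 and 52.6 minutes per year respectively. Even four nines leaves roughly 50 minutes of slack a year, which is far more than the single-digit minutes mentioned earlier in the thread.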

addaon · on Aug 9, 2024

Redundancy, like most engineering choices, is a cost/benefit tradeoff. If the costs are distorted, the result of the tradeoff study will be distorted from the decisions that would be made in "more normal" times.

felixgallo · on Aug 9, 2024

Because the company overhired to the point where people were sitting around dreaming up useless features just to justify their workday.