The big dogs absolutely do phase config rollouts as a general rule.
There are still two weaknesses:
1) Some configs are inherently global and cannot be phased. There's only one place to set them. E.g. if you run a webapp, this would be configs for the load balancer as opposed to configs for each webserver
2) Some configs have a cascading effect -- even though a config is applied to only 1% of servers, it affects the other servers they interact with, and a bad thing spreads across the entire network. (A sketch of how such a percentage-based cohort is typically chosen follows below.)
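For reference, the phased side of this is usually just deterministic percentage bucketing. A minimal sketch, assuming nothing about any particular company's tooling (the host naming scheme and salt below are made up):

    import hashlib

    def in_rollout_cohort(host_id: str, rollout_percent: float, salt: str = "cfg-v2") -> bool:
        # Hash the host id with a per-rollout salt to get a stable bucket in
        # [0, 100); the same hosts stay in the cohort as the percentage is
        # ramped 1 -> 10 -> 50 -> 100.
        digest = hashlib.sha256(f"{salt}:{host_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) % 10000 / 100.0
        return bucket < rollout_percent

    # Start with 1% of the fleet; widening the rollout is a one-number change.
    hosts = [f"web-{n:03d}" for n in range(200)]
    canaries = [h for h in hosts if in_rollout_cohort(h, rollout_percent=1.0)]
    print(f"{len(canaries)}/{len(hosts)} hosts get the new config first")

Neither weakness above is touched by this: a truly global config has no per-host id to bucket on, and a cascading failure escapes the cohort no matter how carefully it was chosen.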
> Some configs are inherently global and cannot be phased
This is also why "it is always DNS". It's not that DNS itself is particularly unreliable, but rather that it is the one area where you can really screw up a whole system by running a single command, even if everything else is insanely redundant.
Sure, but that doesn't really help for user-facing services where people expect to either type a domain name in their browser or click on a search result, and end up on your website every time.
And the access controls of DNS services are often (though not always) too coarse-grained to actually prevent someone from ignoring the procedure and changing every single subdomain at once.
> Sure, but that doesn't really help for user-facing services where people expect to either type a domain name in their browser or click on a search result, and end up on your website every time.
It does help. For example, at my company we have two public endpoints:
company-staging.com
company.com
We roll out changes to company-staging.com first and have smoke tests which hit that endpoint. If the smoke tests fail, we stop the rollout to company.com.
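A minimal sketch of that gate, for concreteness; the health path, success criteria, and exit-code convention are assumptions, since the comment doesn't describe the actual tooling:

    import sys
    import urllib.request

    STAGING_CHECKS = [
        "https://company-staging.com/healthz",  # hypothetical health endpoint
        "https://company-staging.com/",         # landing page should still render
    ]

    def smoke_test(url: str, timeout: float = 10.0) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except Exception as exc:
            print(f"FAIL {url}: {exc}")
            return False

    if not all(smoke_test(u) for u in STAGING_CHECKS):
        sys.exit(1)  # non-zero exit halts the pipeline before company.com is touched
    print("staging smoke tests passed; continuing rollout to company.com")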
That doesn’t help with rolling out updates to the DNS for company.com, which is the point here. It’s always DNS because your pre-production smoke tests can’t test your production DNS configuration.
If I'm understanding it right, the idea is that the DNS configuration for company-staging.com is identical to that for company.com: same IPs and servers, same DNS provider, same domain registrar.
Literally the only difference is s/company/company-staging/; every access should hit the same server with the same request, other than the Host header.
Then you can update the DNS configuration for company-staging.com, and if that doesn't break, there's very little scope for the update to company.com to go differently.
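If you want to verify that "identical except for the name" assumption rather than trust it, the check is easy to automate. A minimal sketch, assuming dnspython (the thread doesn't say what tooling is in use) and the placeholder domains from upthread; it just compares the A records the two names resolve to:

    import dns.resolver  # pip install dnspython

    def a_records(name: str) -> set[str]:
        return {rdata.address for rdata in dns.resolver.resolve(name, "A")}

    staging = a_records("company-staging.com")
    prod = a_records("company.com")

    if staging != prod:
        print(f"zones have drifted: staging={staging} prod={prod}")
    else:
        print(f"both names resolve to the same hosts: {prod}")

It only checks parity of the data, though; the actual edit to company.com's records still happens once, globally, which is the objection upthread.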
The purpose of a staged rollout is to test things with some percentage of actual real-world production traffic, after having already thoroughly tested things in a private staging environment. Your staging URL doesn't have that. Unless the public happens to know about it.
The scope for it to go wrong lies in the differences between the real world and the simulation.
It's a good thing to have, but not a replacement for the concept of staged rollout.
But users are going to example.com. Not my-service-33.example.com.
So if you've got some configuration that has a problem that only appears at the root-level domain (say, a record type like CNAME that isn't allowed at the zone apex), no amount of subdomain testing is going to catch it.