The thing that made Google SRE scale was in the fine print and covered in this a...

Cthulhu_ · on Oct 13, 2021

I heard about that; amongst other things, if the number of times SRE gets pinged because of a software fault exceeds X, they basically give the pager back to the team that built the software. They aren't taking responsibility for reliability if the software is not reliable.
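
To make the rule concrete, here is a minimal sketch of that handback logic as I understand it. The class name, the owner labels, and the threshold value are all hypothetical; the real X and the real process were not given.

```python
from dataclasses import dataclass


@dataclass
class PagerPolicy:
    # The "X" from the comment above. The actual value is unknown,
    # so this is a placeholder the caller has to supply.
    max_software_fault_pages: int

    def owner(self, software_fault_pages: int) -> str:
        """Return who should hold the pager for the given page count."""
        # Only pages caused by software faults count against the budget.
        if software_fault_pages > self.max_software_fault_pages:
            return "dev-team"
        return "sre"


if __name__ == "__main__":
    policy = PagerPolicy(max_software_fault_pages=5)
    print(policy.owner(3))
    print(policy.owner(6))
```

Note the strict comparison: hitting X exactly keeps the pager with SRE, since the rule says "exceeds".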

I really appreciated working with a team of SRE types; it gave me a newfound appreciation for quality in software development. I remember one instance where a colleague in my team went up to the 'ops' team and wanted SSH access to a production server to check on some settings (environment variables, I believe). He thought it was a trivial thing, you know, "just lemme have a peek" kinda thing, but the ops team flat-out told him no: if he needs to print env vars, he can do it in his own code - we did continuous releases, so a patch could be deployed within minutes if it passed all the checks. I loved that the ops team had the mandate from higher up to say no to requests like this, and I found them a lot more professional than the software developers, including myself.
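
For what it's worth, "print env vars in your own code" can be a few lines at startup. This is a rough sketch, not what we actually shipped: the `APP_` prefix and the list of sensitive markers are assumptions I made up for illustration.

```python
import os
from typing import Mapping

# Hypothetical markers; any variable whose name contains one is redacted.
SENSITIVE_MARKERS = ("SECRET", "TOKEN", "PASSWORD", "KEY")


def describe_env(environ: Mapping[str, str] = os.environ,
                 prefix: str = "APP_") -> list[str]:
    """Return NAME=value lines for variables with the given prefix."""
    lines = []
    for name in sorted(environ):
        if not name.startswith(prefix):
            continue  # skip anything that is not ours
        value = environ[name]
        if any(marker in name.upper() for marker in SENSITIVE_MARKERS):
            value = "<redacted>"  # never write secrets to logs
        lines.append(f"{name}={value}")
    return lines


if __name__ == "__main__":
    # Log the effective configuration once at startup.
    for line in describe_env():
        print(line)
```

The redaction step is the part that matters: once the values land in logs, anyone with log access can read them, so the dump should be no more revealing than the SSH session would have been.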

_ktx2 · on Oct 13, 2021

Ideally, what that developer wanted should be a function of the platform. That's likely not in an operations team's scope because they're mostly just firefighters. Also, ideally, in a situation like SOX where devs can't have access to production, their preproduction environments share the same interface that production has, and if their values differ, that much is documented.

bytelines · on Oct 13, 2021

> the ops team flat-out told him no, if he needs to print env vars, he can do it in his own code

This is still the wrong attitude from a "productization" of infrastructure perspective. Configuring the environment a program runs in is a core responsibility of infrastructure. So is being able to query it. Build that infrastructure product feature.

Or select a platform that makes this trivial, e.g. Kubernetes.
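
As a sketch of what "being able to query it" could look like if you build it yourself: a read-only endpoint that returns an allow-listed slice of the environment. The variable names, the `/env` path, and the port are hypothetical, and a real version would need authentication in front of it.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical allow-list: only these names are ever exposed.
ALLOWED_NAMES = {"APP_MODE", "APP_REGION"}


def env_snapshot(environ=os.environ, allowed=ALLOWED_NAMES) -> dict:
    """Return the allow-listed variables that are actually set."""
    return {name: environ[name] for name in sorted(allowed) if name in environ}


class EnvHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/env":
            self.send_error(404)
            return
        body = json.dumps(env_snapshot()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Bind to localhost only; exposing this more widely is a policy decision.
    HTTPServer(("127.0.0.1", 8080), EnvHandler).serve_forever()
```

An allow-list rather than a deny-list is deliberate here: a new secret added to the environment stays hidden by default instead of leaking until someone remembers to block it.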