Yep. SRE is not a substitute for high-level, overarching architects and designers.
One pattern I see is that, as the company grows, development gets split into different product groups, which will organically diverge unless there is rigid enforcement of design patterns. In some places SRE does this implicitly, because they will only support X, Y, or Z, but in others each product group will have their own group of SREs.
There becomes a point when you need one or a small group of people who are the opinionated developers who can make design decisions and who have the authority to cause everyone else to course correct. If you don't have this, you'll wind up with long migrations and legacy stuff that never seems to go away.
I don't find having high-level architects to be a good pattern. They can make mistakes like anyone else; indeed having people who are no longer day-to-day coding make decisions that they don't feel the effects of makes wrong decisions more likely.
SRE exists to support product functions and like everything else should be attached to and understood in terms of those functions. Yes, every product group probably should have their own SREs, so that product group can own its whole lifecycle. Yes, different groups will do their own things and there will be mismatches and duplicated effort. That's less bad than the alternative.
I'm not saying that they shouldn't be active developers, just that they need to be people who can enforce change across the entire organization.
Previous Job (2500+ devs by the time I left) had an in-house RPC system that was being moved over to gRPC. That project was taking years because teams had no coordination on the process. The decision was made at some level and trickled out to everyone else. There was no single person or group who was in charge of:
- How services would be discovered
- Implementation patterns for how services and methods would be defined
- Standardization of which libraries to use
- Examples and prebuilt shared libraries that provide things like tracing, monitoring, retries, etc.
- Advocating for the changes
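To make the shared-library bullet concrete, here is a rough sketch of the kind of helper a central group could publish so every team gets retries and basic call metrics the same way. Everything in it (`CallStats`, `call_with_retries`, the parameter names) is made up for illustration; it is not what Previous Job had, and it is not tied to gRPC or any real RPC library.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class CallStats:
    """Hypothetical in-memory metrics; a real library would export these."""
    attempts: int = 0
    failures: int = 0
    durations: List[float] = field(default_factory=list)


def call_with_retries(
    fn: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.1,
    stats: Optional[CallStats] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run fn, retrying on any exception with exponential backoff.

    Records one duration per attempt in stats. Re-raises the last
    exception once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    if stats is None:
        stats = CallStats()

    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        stats.attempts += 1
        try:
            result = fn()
        except Exception:
            stats.failures += 1
            stats.durations.append(time.monotonic() - start)
            if attempt == max_attempts:
                raise
            # Backoff doubles each attempt: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
        else:
            stats.durations.append(time.monotonic() - start)
            return result

    raise AssertionError("unreachable")
```

A team would wrap its own client call, e.g. `call_with_retries(lambda: client.get_user(user_id))`, where `client.get_user` stands in for whatever stub that team already has. The point is less the code than that one group owns it, so every team doesn't write a slightly different version.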
SRE seems to fall into the position of advocating for the business value of development practices that compete with business objectives that also provide value. At large organizations, you need a central point that can set development objectives, be the one teams can go to with "this pattern doesn't work for us, we can do this but we need buy in from other teams" issues, and hand down directives.
Unless you operate in an environment where the only cross-team communication is well-versioned public APIs, you will run into issues where you have conflicting needs between teams and need someone to set a vision (this can be a group of people, rotating people, or a single person; how is not the issue).
The whole idea of enforcing technical mandates across the entire organisation is something I'm very sceptical about. No-one can hope to understand the constraints and requirements that 2500+ other devs are working under. Realistically the cross-team bandwidth is low, so if you don't have well versioned public APIs then you have barely understood interactions and no clear responsibility when they break.
There are probably some things that do need to be standardised, but if there's a business need for standardisation then product teams should be able to understand and advocate for that (whether that means agreeing something with their directly adjacent product team, publishing something for clients to use, or something else). But in a lot of cases I think just accepting that different parts of the organization will work differently is the best way forward.
We recently decided as a company that the horizontal responsibilities structure doesn't work well at all, at least not at small scale. This was not in the software/infra teams but in our operations but I think there's some general truth here. The more vertically responsible your teams are, the better the final product is, and the more inefficiencies and impedance mismatches you can track down and fix.
For us it meant that the data processing teams have been made part of the drone operators team, so whenever we fly a mission a photography/3d rendering expert will also be part of the team that operates the drone. On paper it's more expensive to have office workers in the field, but in practice it leads to fewer reflights and happier and more productive employees.
I imagine that for the software departments, it could mean that every app development team has at least one member who has good operating system and network infrastructure knowledge, and/or maybe database expertise, so that the team as a whole can largely operate a feature without having to depend on an outside SRE specialist.
And then the SREs that you do have can focus on the site reliability, instead of having to constantly tell developers how the way they coded something is bad or whatever.
When you get very large you need both enterprise wide SREs that are responsible for consulting and approving architecture for reliability purposes AND localized per service or other small breakdown SREs to support small sets of services. How you break this down is tricky and there is definitely overlap.
Crappy attitudes towards support and reliability don’t scale; what you can get away with when a few people are keeping things barely held together stops working as you grow.
> There becomes a point when you need one or a small group of people who are the opinionated developers who can make design decisions and who have the authority to cause everyone else to course correct.
How many times have you been in the "everyone else" camp and were course corrected? How did those efforts work out for the firm in the long run? Any experiences to share that could be useful for the community?
Previous Job had an in-house RPC system and there was a desire to move to gRPC.
This process largely turned into "name and shame" because there was no incentive for some teams to put in the effort to make the changes. They had other objectives to complete, and swapping RPC frameworks was not one of them. The only way the change happened was by putting a hard deadline in place before the old system was shut down (by SRE), which is not the right way to do it.
There were a lot of stories like this. One team owned user information, but the business needs shifted this ownership to another team. This resulted in the ownership being split between three teams, and applications turning into transparent proxies for other applications. One service was a REST interface that provided a bit of logic on top of forwarding the request to a gRPC service.
The makeup of the company was a bunch of loosely coupled product teams, and the only common connection was SRE and Data, who worked with everyone. SRE became the team that had to work to resolve these "what the fuck, why can't you just sit down and figure out who should own this" issues. There really needed to be an architect or someone who could look at the big picture and say "Why do we have this one internal REST service? Ok. Team A and B. You have this quarter to stop using Service Q and migrate to Service W."
At $NewCompany, SRE is doing the course correcting (just due to small size), but we have a Principal Developer who is dictating that "Yes, we're going to implement new business logic in lambdas following this pattern." He works to make sure that everything is done in a standard way, but at the same time takes ideas for new patterns and makes sure they are done in a smart way and don't conflict with anything else. He doesn't stand as a roadblock, but as someone who can make sure teams are not going off and doing weird things (like using MySQL when we're a Postgres shop) that can cause issues later.