Hello, CTO of Terrateam here, the creators of Stategraph.
As you said, the common practice is to use locks on state to guarantee that operations don't step on each other. This works, however the cost is that if it takes 5 minutes to perform an operation, only one person can be doing an operation at a time, so if 5 devs are modifying infrastructure, the last one has to wait 25 minutes just to get back the plan, even if those 5 people are not changing overlapping resources in the state.
The way that most people deal with this is they take their infrastructure and break it up across multiple root modules, and then when those root modules, break it up again, etc.
Stategraph is solving the problem of getting all of the performance benefits of breaking up your root modules without breaking up your root modules. It dynamically determines which resources each of those 5 devs are operation on and, if the resources do not overlap, can run them in parallel.
That means Stategraph is manipulating state in a bit more sophisticated way than standard Terraform/Tofu, and we need to be careful we don't get it wrong.
I'm not sure I would want this even if I could have it TBH. Engingeering org size is about ~200 with infra/sre/ops around ~25.
Different teams want to move at difference cadences. At a certain scale splitting up things feels a little more natural (maybe I am stockholmed by prior limitations with TF though or just used to this way of operating now).
But even then, we're moving to k8s operators to orchestrate a bunch of things and moving off terraform apart from the stuff that doesn't change much (which will eventually get retired as well). Something like https://www.youtube.com/watch?v=q_-wnp9wRX0
Terraform variable management is our larger problem (now/nearterm) when we have to deploy numerous cells of infra that use the same project/TF files with different variables. Given the number of projects/layers of TF getting cell specific variables injected is meh.
Those variables are instance size, volume size, addresses, IAM policy, keys etc.
This is in the b2b saas world with over a million MAU. We've got islands of infra for data soverignty, some global cells where each cell can communicate back / host some shared services (internal data analytics, orchestration tooling, internal management tooling and the like).
The way I look at it is that TF has a limitation on state size. And when you hit that limit, you have to either slow down a ton or do a (big) refactoring.
As comparison, if a programming language forced you to split your software into multiple executables when you got to a certain number of functions, I think, almost universally, we would say that it's not a production language. That is a stupid limitation and forcing development work on users because of stupid limitations is disqualifying.
But for TF, even if we are refactoring it because the tool is doing it, we tell ourselves that it's a good idea anyways because of good software practices. But splitting infrastructure over multiple root modules is, in my analogy, the same as being forced to do it over multiple executables. It comes with a lot of unnecessary limitations.
With Stategraph, you can choose to split your infrastructure over multiple root modules, if that is what you want to do, not because you don't have a choice.
V1 of Stategraph is a drop-in TF/Tofu replacement, but once it's there, you can see a path to something more like k8s operators, without having to do any migration of infrastructure.
As you said, the common practice is to use locks on state to guarantee that operations don't step on each other. This works, however the cost is that if it takes 5 minutes to perform an operation, only one person can be doing an operation at a time, so if 5 devs are modifying infrastructure, the last one has to wait 25 minutes just to get back the plan, even if those 5 people are not changing overlapping resources in the state.
The way that most people deal with this is they take their infrastructure and break it up across multiple root modules, and then when those root modules, break it up again, etc.
Stategraph is solving the problem of getting all of the performance benefits of breaking up your root modules without breaking up your root modules. It dynamically determines which resources each of those 5 devs are operation on and, if the resources do not overlap, can run them in parallel.
That means Stategraph is manipulating state in a bit more sophisticated way than standard Terraform/Tofu, and we need to be careful we don't get it wrong.