I do think the 'run all tests on CI' part is the weak one; it bites a lot earlier than the others do. Git is totally fine for a few hundred engineers and ~100 services (assuming nobody does anything really bad to it, but then it fails for 10 engineers anyway), but running all tests rapidly becomes an issue even with tens of engineers.
That is mitigated a lot by a really good caching system (and even more by full remote build execution) but most times you basically end up needing a 'big iron' build system to get that, at which point it should be able to run the changed subset of tests accurately for you anyway.
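The caching idea can be sketched minimally: key every build/test result on a content hash of its inputs, so a step whose inputs haven't changed is skipped entirely. This is a toy illustration (the class and file names are made up), not how any particular "big iron" build system implements it:

```python
import hashlib

class ResultCache:
    """Toy build/test result cache keyed on a hash of all inputs.
    If no input file changed, the cached verdict is reused and the
    step is skipped entirely. Hypothetical sketch, not a real tool."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def key(input_files: dict) -> str:
        h = hashlib.sha256()
        for path in sorted(input_files):   # sort for a stable hash
            h.update(path.encode())
            h.update(input_files[path])
        return h.hexdigest()

    def get(self, k):
        return self._store.get(k)

    def put(self, k, verdict):
        self._store[k] = verdict

cache = ResultCache()
inputs = {"src/app.py": b"print('v1')", "tests/test_app.py": b"assert True"}
k1 = ResultCache.key(inputs)
if cache.get(k1) is None:
    cache.put(k1, "PASS")   # pretend we actually ran the tests here
# A second build with identical inputs hits the cache and runs nothing.
```

Real systems (Bazel-style remote caches and the like) do this per target over the whole dependency graph, which is exactly why they can also tell you which tests a change can possibly affect.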
There are also so many types of slow tests in web systems. Any kind of e2e test like Cypress or Playwright can easily take a minute. Integration tests that render components and potentially even access a DB take many times longer than a basic unit test. It doesn't take very many of the slow group to really start slowing your system down. At that point, what matters is how much money you're willing to pay to scale your build agents either vertically or (more likely) horizontally.
Well no, it's not just build agent size; if you have 10 tests that take 3-4 minutes each, you're not gonna go any faster than the slowest of them (plus the time to build them, which is also typically bad for those kinds of tests, although a bigger build agent may be faster there). Having a system that can avoid running the test for many PRs because it can prove it's not affected means in those cases you don't have to wait for that thing to run at all.
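Both halves of that point fit in a small sketch: a reverse-dependency map from source files to test targets lets a determinator skip tests a PR provably can't affect, and for the tests it does select, wall-clock time with unlimited parallel agents is still bounded below by the slowest one. All the file names and durations here are invented for illustration:

```python
# Hypothetical reverse-dependency map: which test targets depend on each source file.
deps = {
    "billing.py": {"test_billing_e2e", "test_billing_unit"},
    "search.py":  {"test_search_e2e"},
    "shared.py":  {"test_billing_e2e", "test_search_e2e"},
}
durations_min = {"test_billing_e2e": 4, "test_billing_unit": 0.5, "test_search_e2e": 3}

def affected_tests(changed_files):
    """Union of all test targets reachable from the changed files."""
    out = set()
    for f in changed_files:
        out |= deps.get(f, set())
    return out

# A PR touching only search.py skips both billing tests entirely...
sel = affected_tests({"search.py"})
# ...and even with infinite agents, wall time >= the slowest selected test.
wall = max(durations_min[t] for t in sel)
```

A real determinator computes `deps` transitively from the build graph rather than by hand, but the skip-vs-wait trade-off is the same.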
Although, time is money, so often scaling build agents may be cheaper than paying for the engineering time to redo your build system...
I have hundreds of tests that take 15-30 minutes each. These tests tend to be whole-system tests, so there is no useful way to say a change won't touch them (75% will). Despite an extensive unit test suite (that runs first), these tests catch a large number of real production bugs, and most of them are things that a quicker-running test couldn't catch.
Which is to say that trying to avoid running tests isn't the right answer. Make them as fast as you can, but be prepared to pay the price - either a lot of parallel build capacity, or lower quality.
It's a bit of a tangent and I agree with your point, but wanted to note that for one project our e2e tests went from ~40 min to less than 10, just by moving from Cypress to Playwright. You can go pretty far with Playwright and a couple of cheap runners.
I appreciate the point, but I've heard this kind of thing several times before - last time around was hype about how Cypress would have exactly this effect (spoiler: it did not live up to the hype). I don't believe the new framework du jour will save you from this kind of thing, it's about how you write & maintain the tests.
I wish I had hard evidence to show because my normal instinct would be similar to yours, but in this case I'm a total Playwright convert.
Part of it might be that Playwright makes it much easier to write and organize complex tests. But for that specific project it was as close to a 1-to-1 conversion as you get; the speedup came without significant architectural changes.
The original reason for switching was flaky tests in CI that were taking way too much effort to fix over time, likely due to oddities in Cypress' command queue. After the switch, and in new projects using Playwright, I haven't had to deal with any intermittent flakiness.
I think that discussions in this area get muddied by people using different definitions of “rapidly”. There are (at least) two kinds of speed WRT tests being run for a large code base.
First, there is “rapidly” as pertains to the speed of running tests during development of a change. This is “did I screw up in an obvious way” error checking, and also often “are the tests that I wrote as part of this change passing” error checking. “Rapid” in this area should target low single digits of minutes as the maximum allowed time, preferably much less. This type of validation doesn’t need to run all tests—or even run a full determinator pass to determine what tests to run; a cache, approximation, or sampling can be used instead. In some environments, tests can be run in the development environment rather than in CI for added speed.
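The "approximation or sampling" option for the inner loop can be as dumb as a naming convention: for a changed source file, run its sibling test file and nothing else. A toy sketch, assuming made-up `src/` and `tests/` layouts (no real tool implied):

```python
from pathlib import PurePosixPath

# Hypothetical test inventory; a real setup would glob the repo.
known_tests = {"tests/test_billing.py", "tests/test_search.py"}

def quick_tests(changed_files):
    """Cheap inner-loop heuristic: for src/foo.py run tests/test_foo.py,
    and always run any test file that was itself edited. No dependency
    analysis at all -- deliberately fast and approximate."""
    picked = set()
    for f in changed_files:
        p = PurePosixPath(f)
        if p.parts[0] == "src" and p.suffix == ".py":
            candidate = f"tests/test_{p.stem}.py"
            if candidate in known_tests:
                picked.add(candidate)
        elif p.parts[0] == "tests":
            picked.add(f)
    return picked
```

This will miss cross-module breakage by design; that's what the full pre-deployment pass is for.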
Then there is “rapidly” as pertains to the speed of running tests before deployment. This is after the developer of a change thinks their code is pretty much done, unless they missed something—this pass checks for “something”. Full determinator runs or full builds are necessary here. Speed should usually be achieved through parallelism and, depending on the urgency of release needs, by spending money scaling out CI jobs across many cores.
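"Speed through parallelism" mostly means sharding the suite across CI jobs. A minimal greedy longest-processing-time sketch, with invented test names and durations, shows how shard count trades money for makespan:

```python
import heapq

def shard(durations, n_shards):
    """Greedy LPT scheduling: assign each test (longest first) to the
    currently least-loaded shard. Returns (load, shard_id, tests) tuples."""
    heap = [(0.0, i, []) for i in range(n_shards)]
    heapq.heapify(heap)
    for test, dur in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, i, tests = heapq.heappop(heap)
        tests.append(test)
        heapq.heappush(heap, (load + dur, i, tests))
    return list(heap)

# Made-up durations in minutes.
durations = {"e2e_a": 4.0, "e2e_b": 3.5, "unit_pack": 1.0, "integ_c": 2.5, "integ_d": 2.0}
shards = shard(durations, 2)
makespan = max(load for load, _, _ in shards)   # wall-clock time across 2 shards
```

Real CI sharding (Playwright's `--shard`, Bazel's `shard_count`, etc.) uses historical timing data the same way; the point is only that adding shards shrinks the makespan until the single slowest test dominates.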
Now the hot take: in nearly every professional software development context it is fine if “rapidly” for the pre-deployment category of tests is denominated in multiple hours.
Yes, really.
Obviously, make it faster than that if you can, but if you have to trade away “did I miss something” coverage, don’t. Hours are fine, I promise. You can work on something else or pick up the next story while you wait—and skip the “but context switching!” line; stop feverishly checking whether your build is green and work on the next thing for 90min regardless.
“But what if the slow build fails and I have to keep coming back and fixing stuff with a 2+ hour wait time each fix cycle? My precious sprint velocity predictability!”—you never had predictability; you paid that cost in fixing broken releases that made it out because you didn’t run all the tests. Really, just go work on something else while the big build runs, and tell your PM to chill out (a common organizational failure uncovered here is that PMs are held accountable for late releases but not for severe breakage caused by them pushing devs to release too early and spend less time on testing).
“But flakes!”—fix the flakes. If your organization draws a hard “all tests run on every build and spurious failures are p0 bugs for the responsible team” line, then this problem goes away very quickly—weeks, and not many of them. Shame and PagerDuty are powerful motivators.
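The detection half of that policy is mechanical: rerun a test in isolation and flag it if identical runs disagree. A toy sketch (the "flaky" test here fakes nondeterminism with a counter so the example is reproducible):

```python
def is_flaky(test_fn, runs=20):
    """Run a test repeatedly with no code changes in between; differing
    verdicts across identical runs mark it flaky. Toy detector -- real CI
    would also isolate state and randomize ordering between runs."""
    verdicts = {bool(test_fn()) for _ in range(runs)}
    return len(verdicts) > 1

def stable_test():
    return True

_calls = {"n": 0}
def flaky_test():
    # Simulated flake: deterministically fails every 4th invocation.
    _calls["n"] += 1
    return _calls["n"] % 4 != 0
```

The hard part isn't detection, it's the organizational line the parent describes: treating any nondeterministic verdict as a p0 for the owning team.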
“But what if production is down?” Have an artifact-based revert system to turn back the clock on everything, so you don’t need to wait hours to validate a forward fix or cherry-picked partial revert. Yes, even data migrations.
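"Artifact-based revert" boils down to: every release is an immutable, already-validated artifact, so rollback is a redeploy of the last healthy one rather than a rebuild plus an hours-long test run. A toy sketch with invented artifact names:

```python
class DeployHistory:
    """Toy artifact-pinned deploy log. Revert picks the newest release
    still marked healthy -- no rebuild, no re-running the slow suite.
    Hypothetical sketch; real systems also pin config and schema versions."""
    def __init__(self):
        self.releases = []   # list of [artifact_id, healthy]

    def deploy(self, artifact_id):
        self.releases.append([artifact_id, True])

    def mark_bad(self, artifact_id):
        for rel in self.releases:
            if rel[0] == artifact_id:
                rel[1] = False

    def revert_target(self):
        for artifact_id, healthy in reversed(self.releases):
            if healthy:
                return artifact_id
        raise RuntimeError("no healthy release to revert to")

h = DeployHistory()
h.deploy("artifact-v1")     # good release
h.deploy("artifact-v2")     # turns out broken in prod
h.mark_bad("artifact-v2")
# revert_target() now points back at artifact-v1, instantly.
```

The "yes, even data migrations" part is the expensive bit: it means every migration ships with a tested down-migration pinned to the same artifact.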
You are of course entitled to your opinion, and I do appreciate going against the grain, but having worked in an “hours” environment and a “minutes” environment, I couldn’t disagree more. The minutes job is so much more pleasant to work in, in nearly every way. And ironically it ended up being higher quality, because you couldn’t lean on a giant integration test suite as a crutch: automated business-metric-based canary rollbacks, sophisticated feature flagging and gating systems, contract tests, etc. And these run in production, so they’re accurate where integration tests often aren’t in a complicated service topology.
There are also categories of work that are so miserable with long deployment times that they just don’t get done at all in those environments. Things like improving telemetry, tracing, observability. Things like performance debugging, where lower envs aren’t representative.
I would personally never go back, for a system of moderate or greater distributed complexity (i.e. > 10 services, 10 total data stores).
All very fair points! I think it is perhaps much more situational than I made it out to be, and that functioning in an “hours” environment is only possible as described if some organizational patterns are in place to make it work.
Yeah, I realized as I wrote that out that my personal conclusions probably don't apply in a monoservice-type architecture. If you have a mono- (or few-) service architecture with a single (or few) DB, it is actually feasible to have integration tests that are worth the runtime. The bigger and more distributed you get, the more the costs of integration tests go up (velocity, fragility, maintenance, the burden of mirroring production config) and the equation doesn't pencil out anymore. Probably other scenarios where I'm wrong also.