I've had coworkers in the past that treat code like it needs to be compressed. Like, in the huffman coding sense. Find code that exists in two places, put it in one place, then call it from the original places. Repeat until there's no more duplication.
It results in a brittle nightmare because you can no longer change any of it: the responsibility of the refactored functions is simply "whatever the original code was doing before it was de-duplicated", and they don't represent anything logical.
Then, if two places that had "duplicated" code before the refactoring need to start doing different things, the common functions get new options/parameters to cover the different use cases, until those get so huge that they start needing to get broken up too, and then the process repeats until you have a zillion functions called "process_foo" and "execute_bar", and nothing makes sense any more.
I've since become allergic to any sort of refactoring that feels like this kind of compression. All code needs to justify its existence, and it has to have an obvious name. It can't just be "do this common subset of what these 2 other places need to do". It's common sense, obviously, but I still have to explain it to people in code review. The tendency to want to "compress" your code seems to be strong, especially in more junior engineers.
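A tiny, hypothetical Python sketch of how this plays out (all names here are invented for illustration): two call sites get "compressed" into one helper, and each later divergence grows the helper another flag instead of a real concept.

```python
# Hypothetical sketch of "compression"-style de-duplication: the helper's
# only responsibility is "whatever the two original call sites shared",
# so every later divergence becomes a new flag.

def validate(record):
    if "id" not in record:
        raise ValueError("record missing id")

def to_legacy(date):
    # e.g. "2024-01-02" -> "2024/01/02"
    return date.replace("-", "/")

saved = []

def process_record(record, skip_validation=False, legacy_dates=False):
    if not skip_validation:      # flag added when the import path diverged
        validate(record)
    if legacy_dates:             # flag added when billing wanted old dates
        record["date"] = to_legacy(record["date"])
    saved.append(record)
    return record

# Each caller now has to know which flag combination reproduces the
# behavior its code had before the "de-duplication".
process_record({"id": 1, "date": "2024-01-02"})
process_record({"id": 2, "date": "2024-01-02"}, legacy_dates=True)
```

The helper's name says nothing, and every new flag multiplies the combinations that have to keep working.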
Yup. People are taught DRY very early on, as an introductory "engineering" practice above the nuts and bolts of writing code.
But nobody really teaches the distinction between two passages that happen to have an identical implementation vs two passages that represent an identical concept, so they start aggressively DRY'ing up the former even though the practice is only really suited for the latter subset of them.
As you note, when you blindly de-duplicate code that's only identical by happenstance (which is a lot), it's only a matter of time before the concepts making them distinct in the first place start applying pressure for differentiation again and you end up with that nasty spaghetti splatter.
> But nobody really teaches the distinction between two passages that happen to have an identical implementation vs two passages that represent an identical concept, so they start aggressively DRY'ing up the former even though the practice is only really suited for the latter subset of them.
Even identical implementations might make more sense to keep duplicated once you factor in variables like the organizational coupling of different business groups and their change-management cycles/requirements.
I had the pleasure of Sandi Metz coming to a company I worked for and giving us a “boot camp” of sorts on all of the engineering principles she espouses, and it had a profound impact on how I view software development. Whatever the company paid for her to come, it was worth every penny.
POODR is one of the best programming books ever written. Even if you don’t program in Ruby you should read it anyway (and pick up a bit of Ruby just for fun) because there are lots of great concepts to internalize that are useful in almost all programming languages.
You develop a sense for when the time is right over the years, by maintaining over engineered pieces of shit, many written by yourself.
To beginners it seems like coming up with the idea and building it is the difficult part; it isn't, not even close. The only difficult parts worth mentioning are keeping complexity on a tight leash and maintaining conceptual integrity.
I'm a fan of the Go proverb "a little copying is better than a little dependency"[1] and also the "rule of three"[2] when designing a shared dependency.
I think the JS developers could take a lesson from the Go proverb. I often write something from scratch to avoid a dependency because of the overhead of maintaining dependencies (or dealing with dependencies that cease to be maintained). If I only need a half dozen lines of code, I'm not going to import a dependency with a couple hundred lines of code, including lots of features I don't need.
The "rule of three" helps avoid premature abstractions. Put the code directly in your project instead of in a library the first time. The second time, copy what you need. And the third time, figure out what's common between all the uses, and then build the abstraction that fits all the projects. This avoids over-optimizing for a single use case and refactoring/deprecating APIs that are already in use.
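As a toy Python illustration of the rule of three (all names invented): write or copy the few lines at the first and second use, and only extract once a third use reveals what is genuinely common.

```python
# First and second use: just write/copy the handful of lines you need.
def slugify_title(title):
    return title.strip().lower().replace(" ", "-")

# Third use reveals the genuinely common shape, so *now* extract it,
# parameterized only by what the real call sites actually vary.
def slugify(text, separator="-"):
    return text.strip().lower().replace(" ", separator)

print(slugify("Hello World"))            # hello-world
print(slugify("My File", separator="_")) # my_file
```

The point is that the parameter list is derived from observed variation, not guessed up front.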
In terms of code & data, I would say that duplication is mostly upside because the cost of refactoring is negligible. If all call sites are truly duplicate usages, then normalizing them must be trivial. Otherwise, you are dealing with something that seems like duplication but is not. The nuance of things being so similar that we would prefer they be the same (but unfortunately they cannot be) is where we will find some of the shittiest decisions in software engineering. We should not be in a rush to turn the problem domain into a crystalline structure. The focus should be about making our customer happy and keeping our options open.
That said, I have found other areas of tech where duplication is very costly. If you are doing something like building a game, avoiding use of abstractions like prefabs and scriptable objects will turn into a monster situation. Failure to identify ways to manage common kinds of complexity across the project will result in rapid failure. I think this is what actually makes game dev so hard. You have to come up with some concrete domain model & tool chain pretty quickly that is also reasonably normalized or no one can collaborate effectively. The art quality will fall through the basement level if a designer has to touch the same NPC in 100+ places every time they iterate a concept.
When a colleague told my father that "duplication is always bad" he grabbed a random memo from that colleague's desk and said "I bet there's at least 3 copies of this piece of paper in this building". That drove the point home alright.
I think the riposte is against the word "always" and it worked precisely because one could blindly pick a counter example from the physical space of the discussion.
I.e. it worked because it smashed the broad statement and forced a discussion about particulars. Now who was right about those, I have no idea, since I wasn't even present.
If you have only one copy of the code then you only have to fix the bug in one place, as opposed to a dozen. So there is significant cost savings. But there is a problem: when you make a bug fix you have to test all the different places it is used. If you don't then you could be breaking something while fixing something. If you have comprehensive automated tests then you can have just one copy of the code--if you introduce a bug while fixing a bug the automated tests will catch it.
If you don't have comprehensive test automation then you have to consider whether you can manually test all the places it is used. If the code is used in multiple products at your company--and you aren't even familiar with some of those products then you can't manually test all the places it is used. Under such circumstances it may be preferable for each team to have duplicate copies of some code. Not ideal, but practical.
Right, but you have to consider the cost of incorporating bug fixes without fully testing them. That too can introduce new failures that are noticed by customers first.
"Don't Repeat Yourself" is a great rule of thumb which, at least in writing Terraform configuration, became absolute tyranny. Strange webs of highly coupled code with layers of modules, all in an effort to be DRY - almost everywhere I've seen Terraform.
Trying to explain why a little duplication is preferable to bad abstractions, and specifically preferable to tightly coupling two unrelated systems together because they happened to sort-of interact with the same resource, was endless and tiring and - ultimately - often futile.
On the terraform comment, things that change together ship together is a good mantra.
If you keep having to make edits in two independent systems every time you want to make one change, something is wrong. If you’re leaving footguns around because changing one thing affects two or more systems, but you aren’t at liberty to change them both in production, that’s also something wrong.
I don't do too much terraform. But isn't the DRY really happening on provider level? And when you are using it, most of times it really doesn't make too much sense to try to not repeat yourself. Unless you are dealing with actual identical resources. Or deploying multiple times say dev, test and prod.
Testability and developability: ideally you structure your terraform/terragrunt code in a way that you can bootstrap an almost equivalent test environment. For example, when using the "AWS Well-Architected" method you would be able to bootstrap a similar environment on a separate AWS account that's part of your organization.
Unfortunately, terraform module system is extremely lacking and in many ways you're totally right - if your module is just replicating all the provider arguments it just feels wrong.
Bit of both, really. There are some common techniques that would be a lot simpler or more robust if terraform would support variables and expressions like lambdas in more places (tofu is getting there…) but it’s also a failure to realize that terraform is meant to compose many small modules together and not just pass 150 different inputs into one omni-module.
Explaining is hard. Examples often work better. You need to be able to show an example where the code would be made worse by applying DRY; otherwise it's hard to argue using just vague descriptions.
I totally agree about duplication, but only when the case for it is shown with an example. Otherwise it's too easy, and I've seen people try to use this argument to justify slop many times.
I had a lengthy argument about this in our architecture forum. I argued that "re-use" shouldn't be included as an Enterprise (keyword here) Architecture principle because they are clear use-cases where duplication is preferable to alternatives. e.g. deployment and testing decoupling etc etc. I had a lot of resistance, and eventually we just ended up with an EA principle with a ton of needless caveats.
It's unfortunate that so many people end up parroting fanciful ideas without fully appreciating the different contexts around software development.
> It's unfortunate that so many people end up parroting fanciful ideas without fully appreciating the different contexts around software development.
Of course that's true of both sides of this discussion too.
I really value DRY, but of course I have seen cases where a little duplication is preferable. Lately I've seen a steady stream of these "duplication is ok" posts, and I worry that newer programmers will use it to justify copy-paste-modifying 20-30-line blocks of code without even trying to create an appropriate abstraction.
The reality of software is, as you suggest, that there are many good rules of thumb, but also lots of exceptions, and judgment is required in applying them.
I have to agree, it's much easier to remove and consolidate duplicative work than unwind a poor abstraction that is embedded everywhere.
And I think it's easy to see small companies lean on the duplication because it's too easy to screw up abstractions without more engineering heads involved to get it right sometimes.
> it's much easier to remove and consolidate duplicative work than unwind a poor abstraction that is embedded everywhere.
It's not easy to deduplicate after a few years have passed, and one copy had a bugfix, another got a refactoring improvement, and a third copy got a language modernization.
With poor abstractions, at least you can readily find all the places the abstraction is used and improve them. Whereas copy-paste-modified code can be hard to even find.
With poor abstractions I can improve abstractions and ensure holistic impact because of the reuse. Then I’m left with well factored reusable code full of increasingly powerful abstractions. Productivity increases over time. Abstractions improve and refine over time. Domain understanding is deepened.
With duplicated messes you may be looking at years before a logical point to attack across the stack is even available because the team is duplicating and producing duplicated efforts on an ongoing basis. Every issue, every hotfix, every customer request, every semi-complete update, every deviation is putting pressure to produce and with duplication available as the quickest and possibly only method. And there are geological nuances to each copy and paste exercise that often have rippling effects…
The necessary abstractions often haven't even been conceived of in rough form. Domain understanding is buried under layers of incidental complexity. Superstition around troublesome components takes over decision making. And a few years of plugging the same dams with the same fingers drains and scares off proper IT talent. Up-front savings transmute into tech debt, with every incentive for every actor at every point to make the collective situation worse by repeating the same short-term reasoning.
Learning to abstract and modularize properly is the underlying issue. Learn to express yourself in maintainable fashion, then Don’t Repeat Yourself.
I feel AI does decent at fixing the dupes and consolidating it as one instance. Abstractions can have far deeper connections and embeddings making it really hard to undo and reform but to each their own on what works for them.
I’ve been working on a new framework for the last five years. White paper dropping soon. It’s called “Write Everything Thrice” (WET). Lmk if want the link to my substack where I’m cooking up more stuff like this.
I once had to deal with a couple dozen data imports that would run regularly.
I advocated for just having a script for each, even if they were 80% alike, to handle the variations... Another developer created a massive set of database tables and coded abstractions for flexible, configuration-driven imports.
My solution was done in a couple days... The other dev spent months on their solution, which didn't work for half the imports, and nobody could follow it. When they left the company the next year, the imports under the complex solution were switched to scripts and the beast was abandoned entirely.
I've been writing up a similar piece for my own personal blog (though as much to collect my own thoughts on this) that touches on this idea, particularly as it applies to shared code/modules, re-usable components in general, and also any kind of templater or builder-type tool, and the costs of over-eager abstraction, sharing and re-use, and when (if ever) to pivot to get a net positive result.
As it's only a draft piece at the moment I'll lay out some of the talking points:
- All software design and structure decisions have trade-offs (no value without some kind of cost, we're really shifting what or where the cost is to a place we find acceptable)
- 'Dont Repeat Yourself' as a principle taught as good engineering practice and why you should think about repeating yourself; don't take social proof or appeal to authority-type arguments without solid experience
- There is a difference between things that are actually the same (or should be the same for consistency, such as domain facts and knowledge) versus ones that happen to be the same at the time of creation but are only that way by coincidence
- Effective change almost always (if not always) comes from actual, specific use-cases; a reusable component not derived from such cases can't demonstrate them
- Re-usable components themselves are not necessarily deployed or actually used, so by definition can't drive their own growth
- If they are deployed, it's N+1 things to maintain, and if you can't maintain N how are you going to maintain N+1?
- The costs of creation and ongoing maintenance - quite simply there's a cost to doing it and doing it well, and if it costs more to develop than the value gained then it's a net loss
- Components/modules that are used in the same places their use cases are get naturally tested and have specific use-cases; taking them out removes the opportunity for organic use cases
- What happens when we re-use components to allow easy upgrades but then pin those for stability? You still have to update N places. The best case scenario might be you have to update N places but the work to do that is minimised for each element of N
- Creating an abstraction without enough variety of uses, in both location and kind of use (a single use-case is essentially a layer that adds no value)
- Inherent contradictions in software design principles: you're taught to 'avoid coupling', but any shared component is by definition coupled. The value of duplication is that it supports independent growth or change
- The cost of service templates and/or builders (simple templated text or entire builder-type tools that need to be maintained and used just to bootstrap something) - these almost never work for you after creation to support updates
- The cost of fast up-front creation (if you're doing this a lot, maybe you have a different problem) over supporting long-term maintenance
- The value of friction - some friction that makes you question whether a 'New thing' is even needed is arguably good as a screening/design decision analysis step; having to do work to make shared things should help to identify if it's worth doing as the costs of that should be apparent; this frames friction as a way of avoiding doing things that look easy or cost-free but aren't in the long term
- As a project lives longer, any fixed up-front creation time diminishes to a minuscule fraction of the overall time spent
- Continuous, long-term drift detection (and update assistance) is more powerful and useful than a fixed-time upfront bootstrap time saving for any project with a significant-enough lifetime
> - There is a difference between things that are actually the same (or should be for consistency (such as domain facts, knowledge) versus ones that happen to be the same at the time of creation but are only that way by coincidence
For my money, this is the key point that people miss.
A test I like to use for whether two things are actually or just incidentally related is to think about “if I repeat this, and then change one but not the other, what breaks?”
Often the answer is that something will break. If I repeat how a compound id “<foo>-<bar>” is constructed when I insert the key and lookup, if I change the insert to “<foo>::<bar>” but not the lookup, then I’m not going to be able to find anything. If I have some complicated domain logic I duplicate, and fix a bug in one place but not the other, then I’ve still got a bug but now probably harder to track down. In these cases the duplication has introduced risk. And I need to weigh that risk against the cost of introducing an abstraction.
If I have a unit test `insert(id=1234); item = fetch(id=1234); assert item is not nil`, if I change one id but not the other, the test will fail.
But if I have two separate unit tests, and both happen to use the same id 1234, if I change one but not the other, absolutely nothing breaks. They aren’t actually related, they’re just incidentally the same.
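A minimal Python sketch of that "change one, what breaks?" test (store and key names invented): the compound key genuinely couples insert and fetch, whereas two tests that both happen to use the literal 1234 are only incidentally alike.

```python
# Minimal sketch of actual vs incidental duplication.
store = {}

def insert(foo, bar, value):
    store[f"{foo}-{bar}"] = value      # key format duplicated below

def fetch(foo, bar):
    return store.get(f"{foo}-{bar}")   # must match the insert's format

insert("user", 42, "alice")
assert fetch("user", 42) == "alice"

# If only insert() switched to f"{foo}::{bar}", fetch() would start
# returning None: the two key constructions are *actually* related, so
# they deserve a single shared make_key() helper.  Two unit tests that
# both happen to use id=1234, by contrast, break nothing when one
# changes; they're only incidentally the same, and deduplicating the
# literal buys nothing.
```

The risk of drift is concentrated exactly where the duplication is load-bearing, which is where the abstraction pays for itself.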
> A test I like to use for whether two things are actually or just incidentally related is to think about “if I repeat this, and then change one but not the other, what breaks?”
I really like this question as a way of figuring out whether things happen to look the same or actually should be the same for correctness, plus it feels like it should be an easy question to answer concretely without leading you down the path of 'Well we might need this as a common component in the future'.
I also think you can frame it as a same value or same identity type question.
> Abstractions can be costly, and it is often in a programmer’s best interest to leave code duplicated instead. Specifically, we have identified the following general costs of abstraction that lead programmers to duplicate code (supported by a literature survey, programmer interviews, and our own analysis). These costs apply to any abstraction mechanism based on named, parameterized definitions and uses, regardless of the language.
> 1. *Too much work to create.* In order to create a new programming abstraction from duplicated code, the programmer has to analyze the clones’ similarities and differences, research their uses in the context of the program, and design a name and sequence of named parameters that account for present and future instantiations and represent a meaningful “design concept” in the system. This research and reasoning is thought-intensive and time-consuming.
> 2. *Too much overhead after creation.* Each new programming abstraction adds textual and cognitive overhead: the abstraction’s interface must be declared, maintained, and kept consistent, and the program logic (now decoupled) must be traced through additional interfaces and locations to be understood and managed. In a case study, Balazinska et al. reported that the removal of clones from the JDK source code actually increased its overall size [4].
> 3. *Too hard to change.* It is hard to modify the structure of highly-abstracted code. Doing so requires changing abstraction definitions and all of their uses, and often necessitates re-ordering inheritance hierarchies and other restructuring, requiring a new round of testing to ensure correctness. Programmers may duplicate code instead of restructuring existing abstractions, or in order to reduce the risk of restructuring in the future.
> 4. *Too hard to understand.* Some instances of duplicated code are particularly difficult to abstract cleanly, e.g. because they have a complex set of differences to parameterize or do not represent a clear design concept in the system. Furthermore, abstractions themselves are cognitively difficult. To quote Green & Blackwell: “Thinking in abstract terms is difficult: it comes late in children, it comes late to adults as they learn a new domain of knowledge, and it comes late within any given discipline.” [20]
> 5. *Impossible to express.* A language might not support direct abstraction of some types of clones: for instance those differing only by types (float vs. double) or keywords (if vs. while) in Java. Or, organizational issues may prevent refactoring: the code may be fragile, “frozen”, private, performance-critical, affect a standardized interface, or introduce illegal binary couplings between modules [41].
> Programmers are stuck between a rock and hard place. Traditional abstractions can be too costly, causing rational programmers to duplicate code instead—but such code is viscous and prone to inconsistencies. Programmers need a flexible, lightweight tool to complement their other options.