What should you do with a pile of legacy code you hate?
This was the central challenge of my last job. I was partially successful at solving it, and unsuccessful in ways that I want to share with you so you can do better than I did.
Let’s start by clarifying the problem.
“Hate” is a spongy word and we can be more descriptive about why you dislike the code base you’re presented with. Maybe it’s bad code: tangled, hard to maintain, failure-prone. Maybe it was written long ago by people who’ve long since left the company (burned out by having to maintain it) and nobody left understands it. Maybe only a few people are able to change how it behaves, and it takes those people far longer than anybody likes. Maybe it doesn’t scale in the ways you need it to. Maybe it’s also written in a language you don’t like or don’t know, or maybe it’s written on top of a framework you don’t like or don’t know.1
Whatever the reason, you’re very done with this pile of code and so is everybody around you. It needs to be replaced and you all know it. And yet, it’s the money engine for your company.
What to do?
Don’t rewrite immediately.
The temptation will be to rewrite the whole thing. You already know that you shouldn’t.
We all know that second system syndrome is a thing, and we all know that big-bang rewrites are notoriously difficult to pull off. As Gall famously said:
“A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.” – John Gall
The other reality is that companies rarely have the time and resources to devote to rewrites, even if they have run themselves deep into tech debt. They never want to pay down that debt, and they never like the idea of giving up new feature work for a rewrite project that doesn’t move them forward.
And yet, this code needs to end up being rewritten somehow, because it’s a disaster that is costing the organization dearly, and perhaps even driving it to the brink of failure.2 This is true at the same time that you can’t dive into a big bang rewrite.
Reframe the problem and shift the goal: you want to be able to rewrite in useful pieces. Small pieces allow you to make incremental progress that can be seen to be “delivering business value” or at least measurable progress toward the end goal. Small pieces are also small systems on their own, each of which is simple enough to be kept working.
Now you’ve shifted your task to identifying useful pieces and rewriting those, and that task is more achievable. How do you identify useful pieces to rewrite? Well, this is both bad news (because you hate the code) and good news (because understanding complex systems is fun): you need to spend a lot of time with the system you have. You need to invest in it.
You need to understand it, even if you hate it.
It’s your money engine. It has to keep working.
You cannot hope to replace what you do not understand.
Understanding it deeply will allow you to find the cracks you can hammer a wedge into.3
Understanding it deeply will allow you to know when you’ve finished replacing it.
So how do you understand it?
The understanding you need here isn’t the same as the theory of the program, which is about how the code is constructed. You do need to know what problems the code solves for the system around it, what “affair of the world” it exists to model. More important for this task is understanding the details of its current behavior as part of a larger working system. This entire system is not just the code in the thing you want to replace. It is all the systems around that code as well: the web site, the analytics pipeline downstream from it, the internal admin workflows, the profusion of microservices that we all persist in writing around everything.
This whole system is an evolved, complex working system. It probably doesn’t have detailed specifications. It probably also does not have comprehensive tests. (If it did, you might not be in this mess.)
If the system does not have comprehensive automated tests, invest in writing those tests before doing anything else. It’s hard to talk people into writing specs for features that have existed unspecified for years, but everybody understands why tests are useful. (If somebody doesn’t, then there are many excellent books you can drop on their head to enlighten them.)
Not kicking off a testing project the moment I was in charge of this problem is my number one regret from my last job. We eventually did it, and it was so valuable that I was angry with myself for waiting. There were organizational reasons why it was difficult for the team to commence that work earlier, including work that was genuinely urgent, but we could have started writing tests sooner! My advice to you would be to prioritize testing higher than I did, and defer what work you can until afterward.
Start by investing time in the test framework and tooling. Your goal is to make it easy for everyone on the team to write tests and to understand their results. People do what is easiest to do, so you must make the right thing easy. The importance of this work deserves a blog post all its own.4 However, anything is better than nothing.
Don’t negotiate on these tasks:
- Write integration tests. Test how the Hated Code™ calls out to everything around it. Test that the expectations of the code around the Hated Code™ are being met.
- Don’t accidentally pour glue over implementation details that should be hidden. If unit tests don’t exist at all, you might want some, but they’re not as important as integration tests that validate overall system behavior.
- Involve the whole organization in the test-writing effort. Prioritize this work alongside feature work and make on-going test-writing part of regular maintenance.
- Automate running the tests. Do not rely on humans doing anything by hand. Run them continuously against an integration environment, or in whatever context is sensible for your setup. The important thing is to have the tests run against every change intended to land in the production environment.
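To make the first item concrete, here is a minimal sketch of a test that pins down both what the code returns and what it sends to a neighbor. Everything in it is a hypothetical stand-in: `legacy_place_order` plays the part of an entry point into the Hated Code™, and `RecordingAnalytics` is a test double for the downstream analytics pipeline. A real integration test would call your actual system in an integration environment rather than a function defined in the test file.

```python
import unittest


class RecordingAnalytics:
    """Test double for the downstream analytics pipeline (hypothetical)."""

    def __init__(self):
        self.events = []

    def publish(self, event):
        self.events.append(event)


def legacy_place_order(order, analytics):
    """Hypothetical stand-in for an entry point into the legacy code."""
    total = sum(item["price_cents"] * item["qty"] for item in order["items"])
    analytics.publish(
        {"type": "order_placed", "order_id": order["id"], "total_cents": total}
    )
    return {"status": "accepted", "total_cents": total}


class PlaceOrderContractTest(unittest.TestCase):
    def test_response_and_downstream_event(self):
        analytics = RecordingAnalytics()
        order = {
            "id": "o-1",
            "items": [
                {"price_cents": 250, "qty": 2},
                {"price_cents": 100, "qty": 1},
            ],
        }

        response = legacy_place_order(order, analytics)

        # What the caller of the legacy code expects to get back.
        self.assertEqual(response, {"status": "accepted", "total_cents": 600})
        # What the system downstream of the legacy code expects to receive.
        self.assertEqual(
            analytics.events,
            [{"type": "order_placed", "order_id": "o-1", "total_cents": 600}],
        )
```

Run it with `python -m unittest`. The point of the shape is that the assertions describe observable behavior at the boundary, not how the total was computed inside.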
You’ll find bugs in the overall system while doing this. It’s a judgement call whether you should invest time in fixing them. Some bugs might be difficult to fix because of the problems that led you to want to replace the mess; don’t waste your time. Some bugs are load-bearing because the system will have grown around them, like a tree growing around a bicycle. You can cut the bike out, but at what cost to the tree? Fixing bugs that are easy to fix gives everybody dopamine cookies and shows people around the project that the investment in testing has started to pay off, so let yourself do some of that.
If you and your team didn’t understand your system going into the testing effort, you will afterward. The tests will support any refactoring or replacement work by verifying that the entire system continues to work. They are the scaffolding around your new construction project.
Identify and exploit wedge points.
Now you can start thinking about changing the system.
Your goal here is to split up your monolithic code base by identifying good points to hammer in wedges to use to split off chunks.
Where are you going to hammer in your wedge first? Have you identified a modular boundary you can exploit to split off a chunk of functionality for a rewrite? Look for clean lines of separation: data, access methods, business logic all must come out in one piece. The common approach is to put a proxy in front of the monolith-ish thing you want to start replacing and redirect traffic from it to your rewrite. One popular term for this is “the strangler fig pattern”. I often call it “divide and conquer”.
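The proxy idea can be sketched in a few lines. This is an illustration under assumptions, not a production router: the handlers, the `/invoices` prefix, and the `StranglerRouter` name are all made up for the example, and a real setup would more likely live in a reverse proxy or API gateway than in application code.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_handler(path: str) -> str:
    """Hypothetical stand-in for the monolith."""
    return f"legacy:{path}"


def rewritten_invoices_handler(path: str) -> str:
    """Hypothetical stand-in for the first chunk that has been rewritten."""
    return f"rewrite:{path}"


class StranglerRouter:
    """Sends migrated path prefixes to new code and everything else to legacy."""

    def __init__(self, legacy: Handler):
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        self._migrated[prefix] = handler

    def handle(self, path: str) -> str:
        # Longest prefix first, so a more specific migration wins.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            # Match on whole path segments so "/invoices" does not
            # accidentally capture "/invoices-archive".
            if path == prefix or path.startswith(prefix + "/"):
                return self._migrated[prefix](path)
        return self._legacy(path)


router = StranglerRouter(legacy_handler)
router.migrate("/invoices", rewritten_invoices_handler)
```

With this in place, `router.handle("/invoices/42")` goes to the rewrite while `router.handle("/orders/7")` still goes to the monolith. Each call to `migrate` is one more wedge driven in, and removing the entry sends traffic back to legacy if the rewrite misbehaves.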
The advantage of this approach is that it keeps the pressure on the system to remain working at all times, allowing you to pay full respects to Gall. The tests are your latch on this working state: they validate that your replacement is behaving properly in context. You might find yourself writing even more tests at this point to support the validation; this is fine!
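One kind of extra test you might write at this stage is a parity check: feed the same inputs to the old and new implementations and list every disagreement. The sketch below assumes a pure function is being replaced; the shipping-cost functions and their pricing rule are invented for illustration. Code with side effects needs the heavier integration tests described earlier.

```python
from typing import Callable, Iterable, List, Tuple


def legacy_shipping_cents(weight_grams: int) -> int:
    """Hypothetical legacy behavior we need to preserve."""
    if weight_grams <= 0:
        return 0
    kilos = weight_grams // 1000
    if weight_grams % 1000 != 0:
        kilos = kilos + 1
    return 500 + 150 * kilos


def new_shipping_cents(weight_grams: int) -> int:
    """Hypothetical replacement for the function above."""
    if weight_grams <= 0:
        return 0
    kilos_rounded_up = -(-weight_grams // 1000)  # integer ceiling division
    return 500 + 150 * kilos_rounded_up


def find_mismatches(
    legacy: Callable[[int], int],
    replacement: Callable[[int], int],
    inputs: Iterable[int],
) -> List[Tuple[int, int, int]]:
    """Return (input, legacy_result, replacement_result) for each disagreement."""
    mismatches = []
    for value in inputs:
        expected = legacy(value)
        actual = replacement(value)
        if expected != actual:
            mismatches.append((value, expected, actual))
    return mismatches
```

An empty list from `find_mismatches(legacy_shipping_cents, new_shipping_cents, range(-5, 5001))` is the signal that the replacement matches the legacy behavior over that range. It says nothing about inputs you did not try, so choose the inputs from real traffic where you can.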
The disadvantage of this approach is that you need to have good split points, and you probably don’t. Good division points indicate where good modularity already exists and if you had that you’d probably be less unhappy with the mess.
Create split points if they don’t exist.
This is important: don’t rewrite anything yet.
Don’t proceed until you can find a good location to drive that wedge in and split off a chunk. Don’t take half measures. Some of the worst tech debt I encountered recently was in functionality that was half implemented inside the Hated Code™ monolith and half outside. The implementation details were spewed out everywhere. Changing functionality was extra difficult because it needed to be changed in two places, and one of the places was a code base that was very hard to work within. Also, once we’d fixed the primary performance bottlenecks, the secondary ones were all in how the Hated Code™ treated these satellite services as databases that it owned. Important working data vital to the operation of the system was a mashup of data from other microservices plus the monolith.
Don’t do this to yourself.
No, really, modularity is important. Parnas’s 1972 paper, “On the Criteria To Be Used in Decomposing Systems into Modules” points right at the important thing, which is that hiding information and implementation details allows you to change both. Modularity allows change.
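A small sketch of what information hiding buys you, with invented names throughout: two price books store their data differently, and the caller cannot tell which one it has been given. That is the property that lets you swap one for the other later.

```python
import bisect
from typing import Iterable, List, Tuple


class DictPriceBook:
    """Stores prices in a dict. Callers never see that."""

    def __init__(self, rows: Iterable[Tuple[str, int]]):
        self._prices = dict(rows)

    def price_cents(self, sku: str) -> int:
        return self._prices[sku]


class SortedPriceBook:
    """Stores prices in two parallel sorted lists. Callers never see that either."""

    def __init__(self, rows: Iterable[Tuple[str, int]]):
        ordered = sorted(rows)
        self._skus: List[str] = [sku for sku, _ in ordered]
        self._cents: List[int] = [cents for _, cents in ordered]

    def price_cents(self, sku: str) -> int:
        index = bisect.bisect_left(self._skus, sku)
        if index == len(self._skus) or self._skus[index] != sku:
            raise KeyError(sku)
        return self._cents[index]


def order_total_cents(book, skus: Iterable[str]) -> int:
    """Depends only on price_cents(), so either implementation works."""
    return sum(book.price_cents(sku) for sku in skus)
```

If `order_total_cents` reached into `book._prices` directly, the second implementation could not be dropped in without editing the caller. That kind of reach-in is what makes a chunk impossible to split off cleanly.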
Premature modularity is a form of premature optimization, and it hurts, but I’ve more often seen no modularity at all. Gotta go fast and break things, right? Side effects everywhere, code that has been DRYed to disastrous levels, the details of specific data structures in one place used to make decisions somewhere else, extreme cleverness that relies on implementation details in distant locations in the system. Rushed people make short-term decisions, and their hacks pile up into tangles of code.
An aside: “Don’t Repeat Yourself” aka DRY has been misunderstood and misapplied to disaster so often I would like to stop saying it to newer programmers. Often much better advice is to repeat yourself to find patterns.
If you find yourself with a function or method that has an enormous parameter list to distinguish the six different ways it might be called, you have a case of DRY madness that has broken modularity. One technique that might help if you’re in this situation is to do the least DRY thing possible: refactor to expand each code flow into one large function for each, replacing each call out to an overused long-parameter list function with the same code, inline. Simplify as you write. Strive for branchless programming as an antidote!6 The real patterns that support a better split-up of responsibility will emerge as you do this work.
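Here is a toy version of that expansion, with made-up names and message text. `render_message` is the over-DRYed function whose flags select between unrelated flows; the three functions after it are the same flows written out separately, with the shared code repeated inline on purpose.

```python
def render_message(name, amount_cents, *, is_reminder=False, is_refund=False,
                   overdue_days=0, include_support_line=True):
    """Before: one function, with flags choosing among three different flows."""
    if is_refund:
        subject = "Your refund"
        body = f"Hi {name}, we refunded ${amount_cents / 100:.2f}."
    elif is_reminder:
        subject = "Payment reminder"
        body = f"Hi {name}, ${amount_cents / 100:.2f} is {overdue_days} days overdue."
    else:
        subject = "Your receipt"
        body = f"Hi {name}, you paid ${amount_cents / 100:.2f}."
    if include_support_line and not is_refund:
        body += " Questions? Reply to this email."
    return subject, body


# After: one function per flow. Each takes only the parameters it uses.

def render_receipt(name, amount_cents):
    body = f"Hi {name}, you paid ${amount_cents / 100:.2f}."
    body += " Questions? Reply to this email."
    return "Your receipt", body


def render_reminder(name, amount_cents, overdue_days):
    body = f"Hi {name}, ${amount_cents / 100:.2f} is {overdue_days} days overdue."
    body += " Questions? Reply to this email."
    return "Payment reminder", body


def render_refund(name, amount_cents):
    return "Your refund", f"Hi {name}, we refunded ${amount_cents / 100:.2f}."
```

Once the flows are separate, the repetition that is actually shared (here, the dollar formatting and the support line) is easy to see, and you can decide whether it deserves a helper. Keeping the old function around until the tests show the new ones agree with it is a cheap safety net.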
Once again, your tests are going to have your back as you go. You’ll know if that flow stays working or not. You might find that your Hated Code™ is less hate-worthy after you’ve cleaned it up. Maybe you’re more in sympathy with it now? Or maybe not.
Time to drive those wedges in with a sledgehammer.
Now you can strangler-fig/divide-and-conquer/split those rocks as you go. You’ll probably get the modularity boundaries closer to right than your predecessors, because you have a lot more information than they did: you have a far more developed system to study!
If you’re tight on resources, you might choose to do nothing about any specific modular chunk of code. Leave it where it is, and make incremental improvements opportunistically. If this segment is not performing well, or is doing the wrong thing, or is hard to maintain, or if the team is far more comfortable with working in some other language ecosystem, then replace it. Prioritize potential rewrites by how much you hate the current implementation; that is, by how many ways it is failing to do what good code does.
Here’s where I remind you that modularity in your system does not require splitting its components into separate microservices. Microservice APIs are strong module boundaries; these API boundaries resist change unless you plan carefully. On the other hand, these boundaries do resist attempts at clever end-runs around that modularity.7 I like to bundle together data that is roughly similar in size and changes at similar rates or in a similar style. CRUD data that is infrequently destructively updated and all lives in the same kind of database might all belong together. Geographical data that all uses PostGIS belongs with other data like that. This is itself a gigantic topic, so I won’t go further other than to remind you that microservices have tradeoffs. The important goal is to leave behind a system that can be more easily rewritten.
Plan to rewrite next time.
All code has a lifespan.
Your designs make tradeoffs (always) that suit the context you’re working in:
- What language ecosystem is the current team comfortable using?
- Do you need to get this project done rapidly, so some shortcuts are okay?
- What performance characteristics are acceptable today?
- What task does this component have to perform today?
The context around working code changes over time. The business context the code exists in is guaranteed to change. Product requirements change. The tools your team is happy with today might make the team unhappy three years from now. Other parts of the system will change around it.
Make it easier for your future self or your successors to rewrite any given component of a system. If you know the lifespan of a decision, or when a scaling shift will make a component a good candidate for a rewrite, record that information right next to the code.
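One lightweight way to record that information is a comment beside the code, optionally backed by a runtime nudge. The sketch below is only an illustration: the threshold, the design note, and the `find_items` function are all invented, and a warning is just one of several reasonable signals (a metric or a log line would do as well).

```python
import warnings

# DESIGN NOTE (hypothetical example)
# Decision: keep the whole catalog in memory and scan it linearly.
# Why: the catalog is small today and this was the fastest thing to ship.
# Revisit when: the catalog grows past CATALOG_REWRITE_THRESHOLD items.
# What to do then: this function is a good candidate for a rewrite
#   on top of an indexed store.
CATALOG_REWRITE_THRESHOLD = 10_000


def find_items(catalog, predicate):
    """Return the catalog items for which predicate(item) is true."""
    if len(catalog) > CATALOG_REWRITE_THRESHOLD:
        # Tell whoever is running this that the recorded assumption
        # no longer holds.
        warnings.warn(
            "catalog has outgrown the linear-scan design; see DESIGN NOTE",
            stacklevel=2,
        )
    return [item for item in catalog if predicate(item)]
```

The comment carries the context your successors will not have: what tradeoff was made, and what change in circumstances should trigger the rewrite.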
It’s okay to hate that code base. It is hate-able. It’s okay to want to replace it. You can replace it! But you have to put in the work first. The work I’ve had to do in this situation looks like this:
- Understand it even if you dislike it. ⬅️ treat it like a puzzle
- Write tests. For the system. Mostly integration. ⬅️ helps everything
- Identify or create wedge points. ⬅️ most of the time will go here
- Split off chunks and rewrite. ⬅️ the fun part
- Shrink the mess until it’s tolerable. ⬅️ satisfying!
- Plan so rewriting the new chunks is easier next time. ⬅️ pay it forward
Anyway, this is what I’ve learned from trying to do this work with limited resources. It’s best not to be in this situation: instead devote time to maintaining the system as a system and every bit of code in it. But most of us don’t have time machines to prevent past technical leaders from making these mistakes.
Being written in a language ecosystem you don’t like is not, all by itself, enough of a reason to rewrite something. If you’ve landed on a team that doesn’t know the language ecosystem the company’s money engine is written in, your first task is to correct the hiring mistake of the past. You might have to become an expert in the thing you don’t know; you might (like me) discover that you dislike the thing you had to become an expert in. Probably the real takeaway is to do better due diligence than I did, and discover in advance what flavor of mess you’re expected to clean up. But sometimes, the ecosystem mismatch is the last misery on top of a pile of miseries. ↩︎
This was literally true in my case. ↩︎
This is the part of the rock-splitting video where you haul out the drill and bore a hole to stick a wedge into. The metaphor is now out of control because you can’t drill holes in big balls of mud, but, uh, let’s pretend the mud has been pressurized into rock over many thousands of years? ↩︎
DRY is misunderstood, IMO. The original principle is “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.” This is a good principle! It does not mean that you need to collapse any two bits of code that look mostly the same. As with everything, advice has contexts. Everything in moderation. Tef is right. ↩︎
Though I have seen people manage to do that. E.g., replicating an entire db to get at a subset of its data rather than using the API that was put in front of the db specifically to hide the implementation details of the db schema. Sigh. But even this is a case of people doing what feels easiest: if the replication tools are right there and calling an API feels harder, they’ll reach for replication. The right solution is to make doing the right thing the easiest thing for everybody. This is more work for you, which you needed, right? ↩︎