It’s time we admit we have a problem with technical debt. Everyone knows what it is, everyone is talking about it, but not enough is being done about it. Time and again I have seen teams and systems end up swamped in technical debt. Swimming and eventually drowning in the stuff. Until entire projects have to be spun up to untangle the mess and start all over. This has gone beyond technical debt. Let’s call it what it really is: technical bankruptcy. The point at which a system is so governed by entropy, and the cost of change curve so out of control, that it possesses negative value and consumes all of a team’s effort just to stop the thing from collapsing.
How does technical debt get so out of control?
Technical debt gets out of control for a number of reasons. Please note that “developers suck” is not one of them. Developers vary in experience and competency but every single one I have met wants to do a good job and knows enough to build a system that is not riddled with problems – if they are given space and time to keep it clean. The causes are organizational, political and financial.
Developers are not empowered
In many organizations, decisions at all levels are made by “the business”, who see themselves as fundamentally at odds with developers. Engineering is seen as an ugly but necessary cost-centre, instead of a value-centre. Engineers are constantly under pressure to cut costs and corners, follow orders, and continuously report on and improve their “efficiency”. No wonder they don’t have a chance to keep tech debt under control. In the really strong technology companies like Google or Facebook, engineers are powerful and respected. This wasn’t done through vague “culture change programs” and pointless documents, but through formal power structures, HR policies and organizational change.
Organizations fall victim to project finance and activity based costing
We need to start admitting it: projects are a cancer. Not in every organization, not in every context, but when building a software product, the traditional “project” way of funding and managing the work is catastrophically stupid. It encourages teams thrown together and then disbanded, bad product-market fit, death marches to meet arbitrary deadlines, and leaving behind piles of technical debt. After all, cleaning that mess will come out of someone else’s project budget, right? Mine came in on time and on budget, which are the only criteria of project success, right?
Activity-based Costing is another degenerative disease that fools companies into thinking they can and should determine their opex cost-base by examining every hour logged into complex timesheeting systems. Paying down tech debt doesn’t look good in these accounting systems, so it is discouraged, punished or forbidden. The truth is, activity-based costing is provably wrong when used in a manufacturing context (read Throughput Accounting if you want to know why), and when used in abstract knowledge-based work like software development, it is far more stupid. We should be aiming to maximize throughput instead, as per XSCALE’s principles of Exponential Product Management.
Developers don’t pull work, it is pushed to them
I used to think that project financing was the cause of this horrible mess, and if we got away from it, everything would be fine. It turns out I was wrong. I went to an organization that didn’t have project financing, no project budgets or project managers, just a continual value stream of funding. Hooray! But everything was still screwed. Why? Because this “continuous value stream” was a continuous stream of work, spewed at the hapless developers like a firehose.
The poor teams had no say in what the work was, when it was due, what its priority was or how it was to be built. It was a classic “push” system: push work down the pipeline and tell them when you need it delivered, then push the next piece of work down, with no gap in between. The idea that we have to be “efficient” and achieve “100% utilization of resources” is completely misguided and inefficient, as proven by the Theory of Constraints.
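To put rough numbers on the “100% utilization” fallacy, here is a minimal sketch. It assumes the simplest textbook queueing model (a single team acting as an M/M/1 queue); that model is my assumption for illustration and does not come from the Theory of Constraints literature itself. The function name slowdown is hypothetical.

```python
def slowdown(utilization: float) -> float:
    """Average time a piece of work spends in the system, divided by its
    hands-on work time, under the assumed M/M/1 model: 1 / (1 - utilization)."""
    if not 0 <= utilization < 1:
        # At 100% utilization the queue never drains, so there is no finite answer.
        raise ValueError("utilization must be in [0, 1)")
    return 1 / (1 - utilization)


# Print how lead time balloons as the team is kept busier.
for u in (0.5, 0.8, 0.9, 0.99):
    print(f"{u:.0%} utilized -> work takes about {slowdown(u):.0f}x its hands-on time")
```

Under these assumptions a team kept 50% busy delivers work in about twice its hands-on time, while a team kept 99% busy takes about a hundred times as long. Real teams are messier than this model, but the shape of the curve is the point: the last few percent of “efficiency” are paid for in waiting.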
People who know Lean and Kanban know that “pull systems” are far better. Work isn’t “pushed” downstream by the “business”, developers “pull” work that they feel is sensible and in good shape, when they are good and ready to do so. Teams have the slack and breathing space they need to really be efficient. Arbitrary deadlines are replaced with reasonable targets based on consensus, and upstream systems are encouraged to get their work in better shape (because teams are empowered to not pull work that isn’t ready).
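The difference between push and pull can be sketched in a few lines. This is an illustration only, not a real Kanban tool: WorkItem, PullBoard, the ready flag and the wip_limit parameter are all hypothetical names I have made up for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    title: str
    ready: bool  # has upstream got this item into good shape?


@dataclass
class PullBoard:
    wip_limit: int  # how much work the team agrees to have in progress at once
    backlog: deque = field(default_factory=deque)
    in_progress: list = field(default_factory=list)

    def offer(self, item: WorkItem) -> None:
        """Upstream can only offer work; it cannot start it."""
        self.backlog.append(item)

    def pull(self):
        """The team takes the oldest ready item, and only if it has capacity."""
        if len(self.in_progress) >= self.wip_limit:
            return None  # no slack: nothing new starts
        for item in list(self.backlog):
            if item.ready:
                self.backlog.remove(item)
                self.in_progress.append(item)
                return item
        return None  # nothing is in good enough shape to pull

    def finish(self, item: WorkItem) -> None:
        self.in_progress.remove(item)


board = PullBoard(wip_limit=1)
board.offer(WorkItem("Vague request, no acceptance criteria", ready=False))
board.offer(WorkItem("Well-defined checkout fix", ready=True))
started = board.pull()    # the team takes the ready item
blocked = board.pull()    # None: the WIP limit has been reached
```

Notice that the only way for upstream to get the vague request started is to get it into better shape. Nothing in the board lets anyone force it into progress.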
Developers are not trusted and are micro-managed
When developers are not empowered and are instead working in a command-and-control bureaucracy run by project finance accountants, marketing executives and career politicians, they have little chance of creating powerful and scalable systems. Their work is scrutinised and micro-managed, often by certificate-wielding “scrum masters” (who just days ago, after completing their two-day Scrum training, were seen frantically hiding their old PMP certificates down the back of a filing cabinet). Any task that doesn’t move the current trendy graph (usually velocity or burndown, neither of which tells you anything about genuinely meaningful measures like Throughput, Net Promoter Score or Lifetime Customer Value) is discouraged.
There is a solution (well, a set of solutions)
Believe it or not, it doesn’t have to be this way. There are solutions to this problem. Their effectiveness increases exponentially if they are all used in combination rather than one in isolation.
Merciless refactoring
Ron Jeffries exhorts people to “refactor mercilessly” – I like the alternate way to put this, “merciless refactoring”. Refactoring is not something people do half an hour of here or there. It is not something that gets put into a “technical user story”, that then gets dumped to the bottom of the backlog. It is something that gets done by everybody, all the damn time. If you don’t do it continuously at a micro-level, it will end up being done in a big slab, at a macro level. You’ll find yourself doing “refactoring projects” and “legacy upgrades” and “application refreshes” that have huge risks and cost millions of dollars. Save the money by refactoring mercilessly as you go.
Refactoring should not be split out as user stories. There should be no prioritisation choices about refactoring. Engineers should not have to continuously explain the benefits of refactoring to “business people”. Refactoring is simply a way of working. It is a fundamental principle of how software engineering is done. Developers should not have to justify writing tests, choosing an IDE or using sensible variable names. They should not have to justify refactoring either. Do it early and often and never stop doing it.
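For anyone unsure what refactoring “at a micro-level” looks like, here is a small hypothetical sketch. The invoice example, the function names and the discount rule are invented for illustration. The behaviour stays identical; only the shape of the code improves, which is why it needs no user story.

```python
# Before: works, but the discount rule is buried in a loop as magic numbers.
def invoice_total_before(items):
    t = 0
    for i in items:
        if i["qty"] > 10:
            t += i["price"] * i["qty"] * 0.9
        else:
            t += i["price"] * i["qty"]
    return t


# After: same behaviour, but the rule now has a name and a single home.
BULK_THRESHOLD = 10   # quantities above this get the bulk discount
BULK_DISCOUNT = 0.9   # multiplier applied to bulk lines


def line_price(item):
    gross = item["price"] * item["qty"]
    return gross * BULK_DISCOUNT if item["qty"] > BULK_THRESHOLD else gross


def invoice_total_after(items):
    return sum(line_price(item) for item in items)


# A quick check that the refactoring changed nothing observable.
sample = [{"price": 5, "qty": 2}, {"price": 3, "qty": 20}]
assert invoice_total_before(sample) == invoice_total_after(sample)
```

A change like this takes minutes when it is done on the spot. Left alone and multiplied across a codebase for a few years, it becomes one of those “refactoring projects”.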
Stop and fix it
Continuous improvement and kaizen are terms thrown about regularly, but they are rarely properly understood and even more rarely properly practised. It doesn’t mean holding a retro every two weeks and complaining about poor tools or lack of support or environment downtime. It means every time something goes wrong, everybody stops what they are doing and fixes it. Toyota has “Andon cords” – every time someone sees a problem on the production line, they pull the cord and production stops. It doesn’t start again until the problem is identified and fixed. That is proper kaizen. Every time someone is asked to cut corners, or put up with a shoddy tool or a sloppy process or accept a bad design decision, they refuse to participate and move on. It will slow things down at first, but vastly reduce errors and waste down the track. You need to go slower to go faster. That is not counter-intuitive; it is fundamentally obvious.
Full-stack DevOps
If a team builds a product, they own the product. They own it full-stack, in both space and time. Full-stack in space means the team is responsible for the entire tech stack, from user interface through to API through to datastore. No handovers, no middleware specialists, no component teams. We all know this way is better, so there are no excuses for not doing it this way. Full-stack in time means the team owns the feature or product from its birth through to its death. No handovers, no maintenance group, no operations or BAU team. You build it, you run it, you fix it. Developers will put a lot more care and a lot less technical debt into something if they know they will be getting the call at 2 AM (though I believe they all fundamentally want to do a good job anyway).
Of course, and this cannot be emphasised enough, you cannot ask developers to own and maintain a system if they are not empowered to refactor it mercilessly and pay down technical debt. Making them responsible for on-call support and then not letting them properly care for the system and manage its tech debt is not mismanagement, it is cruelty.
Full-stack test automation
If you are continuously refactoring, you need to be continuously testing and continuously integrating, and if you are doing that, you might as well be continuously delivering. Refactoring every few hours (or even minutes!) means a massive rate of change in the system’s codebase. The risks and regression testing footprint of that rate of change demand automated testing. There is no escaping it. Any system with a non-trivial amount of complexity cannot handle dozens or hundreds of merges per day without a battery of automated tests. This automation needs to be on every layer, not just API. It needs to be full stack: from the user interface all the way through to the datastore/s.
It should be done with proper Test Driven (or even better, Behaviour Driven) Development, i.e. done outside in. Start with your acceptance or behaviour criteria, build up tests, watch them break, build up unit tests, watch them break. Then work back from the inside out: code until your unit tests pass, then code until your acceptance tests pass. Then run the full battery of regression tests. There are plenty of books on how to do this. It is not rocket science or arcane wizardry, it is simple craftsmanship and everyone should be doing it.
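As a rough sketch of that loop, assume a hypothetical checkout feature with a made-up rule: orders of 5000 cents or more ship free, otherwise shipping costs 500 cents. None of these names or numbers come from a real system. The tests at the bottom are written first and fail first. The code above them is then filled in from the innermost unit outwards until everything passes.

```python
# Step 3: make the unit test pass (innermost piece first).
def line_total(unit_price_cents: int, quantity: int) -> int:
    """Price of a single order line, in cents."""
    if quantity < 0:
        raise ValueError("quantity must not be negative")
    return unit_price_cents * quantity


# Step 4: make the acceptance tests pass by building on the unit.
def checkout_total(lines, free_shipping_over_cents=5000, shipping_cents=500):
    """Total for a list of (unit_price_cents, quantity) pairs, with shipping."""
    subtotal = sum(line_total(price, qty) for price, qty in lines)
    if subtotal >= free_shipping_over_cents:
        return subtotal
    return subtotal + shipping_cents


# Step 1: acceptance tests, written first, straight from the behaviour criteria.
def test_order_over_threshold_ships_free():
    assert checkout_total([(2000, 3)]) == 6000


def test_small_order_pays_shipping():
    assert checkout_total([(1000, 1)]) == 1500


# Step 2: unit test for the inner piece the acceptance tests depend on.
def test_line_total_multiplies_price_by_quantity():
    assert line_total(250, 4) == 1000
```

The functions are named so that a test runner such as pytest would collect them, but they are plain functions and can equally be called by hand. In a real codebase the acceptance tests would drive the user interface or API rather than call a function directly.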
But didn’t Agile create this technical debt in the first place?
Some would say that Agile has built a rod for its own back. That it has created this problem for itself in the first place, by not doing sufficient up-front design, and fumbling around instead with “emergent design” and haphazard practices. I would disagree strongly with this argument. Some of the worst examples of technical debt are in the large, complex, brittle monolithic systems so common in over-scoped feature-creeped Waterfall projects.
Proper agile follows the principles of simplicity (maximising the amount of work not done) and YAGNI (You Ain’t Gonna Need It – avoid premature optimizations and unnecessary future-proofing). It recognises that source code is a liability, not an asset, and always seeks to reduce, not increase, the amount of code and complexity. Waterfall projects had technical debt too; they just didn’t talk about it and left it for the next poor project manager to try and sort out. Agile is about transparency and inspection, so it identifies the technical debt earlier. This can give the impression that Agile has caused the debt, but that is not at all the case.
Conclusion: it’s called debt for a reason
The fact that we have so many systems so badly riddled with technical debt is embarrassing. It almost always happens because developers are not allowed to do their jobs properly. They are mistrusted and micro-managed by people with little real understanding of IT finance and less understanding of engineering. We have to stop technical bankruptcy and we can do it by following simple, trusted practices of software craftsmanship. If we don’t, the technical debt will come back to haunt us, and cost us many times more than whatever money we saved by cutting corners in the first place.