One year for a one-line fix

This blog post is harder to write than you might think, because it goes right into a number of people problems and the behavioral patterns of toxic organizations. I wish to preface all of this by noting that toxic organizations warp the behavior of everyone in them. People who might behave in healthy ways in healthy orgs find themselves behaving badly inside toxic systems. The only thing to do is fix the organization first. So I have sympathy for everybody involved in this story, both my unknown predecessors and the people who were right next to the problem the whole time. I am most interested in what I will do differently next time.

With that preface in mind, let’s tell a story.

Let me tell you a war story.

Once upon a time there was a dot-net monolith, one that had been poorly maintained for a long time, hacked upon by rushed people who were evaluated only by how fast they pumped out the next feature the CEO wanted. This dot-net monolith was in a poor state and everybody around it knew that. It was expensive to run (its AWS costs were enormous), expensive to work on (making changes was time-consuming and dangerous), expensive to deploy (deploys often broke and took hours to resolve), difficult to test (the test suite was a mechanical turk service that ran overnight), and difficult to understand (re-entrant side-effect-heavy functions and in-memory caches on top of external Redis caches made for some fun race condition factories). The team around it knew it had trouble.

Enter me, somebody who didn’t know a dot-net from a dot-product. I was brought in to scale out the system, which had a lot of new code written in JavaScript around that dot-net thing. I was fairly confident in my ability to make node jump through hoops. I knew that C# was Microsoft’s proprietary version of Oracle’s proprietary Java, so at least I could read the code. Mostly. At my request, I started out fixing bugs on the team that touched the most varied parts of the system, so I could get my hands dirty first, learn how things fit together in reality, and earn credibility with the overall team before I had to start making changes.

Two weeks into my new job, the entire system fell over on a regular weeknight, under regular load. And by “falling over”, I mean it became non-functional. All API endpoints began to fail to respond. The site was down. Nobody could purchase widgets and have them delivered.

Why? Nobody could say. It looked like it was Redis. At least, the CPU on the Redis cache instance was hitting 100% and when it did, everything stopped.

Now, like many of us, I was very familiar with Redis. I trusted Redis. It is often the most reliable piece of software in my stack. I’d pumped a lot of traffic through Redis at the world’s JavaScript registry, a lot more than this single-state retail outfit could possibly be sending through it. What was this system doing to Redis that was making it thrash so badly? Nobody knew. What were we putting into Redis? Nobody knew. How many objects were in it? Nobody knew. How big were they? Nobody knew.

“Where are your metrics?” I asked. There was an expensive hosted Graphite service, but nobody was looking at it. There was an expensive APM product wired up to the monolith, but nobody knew how to interpret it. There were Cloudwatch graphs! Only infra had access to these or the ability to make dashboards.

At this point I knew what my job needed to be first. I went on an observability tear over the next months, among other tears inspired by this outage.

We got through that initial outage by upgrading to AWS’s largest Elasticache, which was ruinously expensive but seemed to hold up under the load. We then mitigated the problem around the edges by taming some problems with the website hitting endpoints more than it needed to, and at the core (most meaningfully) by splitting up the cache into several different cache instances. (Two very thoughtful engineers had already browbeaten their way into being given time to refactor the code enough to make this split happen, because they knew this was a problem area before the outage happened. The implications of this sentence are entirely intentional, and we’ll come back to them.)

We limped through the weeks remaining until the day that was the big sales day for the industry, the one that was going to be the biggest day ever with $X of revenue, for some record-breaking value of X. The entire company prepped for months for this event, with marketing and incentives and ordering stock to be sold.

Three hours after opening, the system went down. Adding more instances of the monolith brought the system down harder. In the end, we had 3 hours of downtime in the middle of the hottest business day of the year, the equivalent of Black Friday, and this downtime ruined the work of everybody at the company who’d prepared for that day. It was bad. Very bad. Company-harming bad.

One year later, my colleague Chris and I identified the problem and fixed it with a one-liner.

That’s a heck of a war story.

Right? The one-line mistake that nearly killed a company, and the one-line fix that saved it. Except, well, it’s more complicated than that.

In the end, none of the observability instrumentation I added mattered.1 It was the default Cloudwatch Redis graphs that identified the problem. When we stood up an idle cluster of the service using our new deploy system (deploy times down from 30 minutes minimum to less than 3 minutes tops)– ahem–

Hold on, you rewrote the deploy system?

Yeah. When the entire infra team was laid off we were finally able to fix probably the worst cause of daily development friction–

Hold on, the entire infra team was laid off? And this let you fix things?

Yes, as I was saying, we finally had access to everything and freedom to fix what had been obviously broken for a long time.

Hold on.

I know. There’s a lot to unpack here, and I’ve been struggling for some time to find a way to unpack it that remains kind to the people who were trapped in this toxic organization alongside me, doing bad things because that’s what the organization wanted them to do. For some of that story, read “Dysfunction junction” first.

I’ll talk about those things in a minute.

The punchline to the story.

Back to the bug. Redis. AWS’s largest Redis. CPU hitting 100%. That bug.

As I was saying, testing our new deployment system was what made us look at this problem again with fresh eyes. We used the new deployment system to stand up an idle production cluster, ready to be swapped in for the older prod cluster that used the old deploy system. This was something we’d done for every microservice in the system, so we had a lot of practice doing it, and at last we were doing the hard one, the dot-net one.

The moment we brought the new, idle cluster into existence with terraform, we watched the Redis instance’s CPU spike and cause trouble for the production system. That was surprising! The new prod cluster wasn’t live yet! We looked at the full set of Redis graphs and noticed an oddity: the new-connections-per-minute count was absurdly high even under normal conditions, and the idle cluster had just spiked it higher.

First, no way should any process be generating new Redis connections except on restart. That graph should be sitting at zero. Second, idle clusters shouldn’t be creating any new load on Redis except at process start.
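
For context, the way this is supposed to work is one shared Redis connection per process, created lazily on first use and reused until the process exits. Here is a minimal sketch of that pattern, assuming StackExchange.Redis as the client and a placeholder endpoint, not the monolith’s actual code:

// One multiplexer for the whole process: new connections show up on the Redis
// graphs only when a process starts or reconnects, never per request.
// Assumes StackExchange.Redis; the endpoint string is a placeholder.
using System;
using StackExchange.Redis;

public static class RedisConnection
{
    private static readonly Lazy<ConnectionMultiplexer> Connection =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect("my-cache-endpoint:6379"));

    public static IDatabase Db => Connection.Value.GetDatabase();
}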

“Huh,” we said. “Could pings from the load balancer be creating new connections? Because pings are the only traffic it’s taking. That would be fundamentally broken but it would explain this.”

So we fired up a video chat, shared a screen with the code that injected Redis into the dot-net frammistans, and set about understanding what it was doing. We learned the word “lifestyle” and read the docs on the various kinds of lifestyles: transient, scoped, and singleton. Nearly all of the Redis connection managers were registered with the singleton lifestyle, which creates one instance for the lifespan of the application. Seems good! Then we noticed one line that didn’t look like the others, injecting connections for the general-use cache:

container.Register<ICacheConnectionManager, RedisConnectionService>(Lifestyle.Scoped);

“Scoped” means to create an instance of the thingie once per request lifecycle. Once per request.

Every. Single. Endpoint. Invocation. Created. A. New. Redis. Connection. Pool.

All requests, not just requests that needed to use a Redis. Requests like the health check endpoint, the one that should be near-zero cost because load balancers hit it frequently, requests like that. This is why the Redis CPU graphs looked like some kind of exponential function on the “active thingies in the system” count, because it literally was. The system had been DOSsing itself into downtime for years.
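
If you want to see the scoped-versus-singleton difference in action, here is a toy sketch using SimpleInjector, whose registration API the line quoted above matches; the service type is a stand-in for anything that opens a connection pool in its constructor, not the monolith’s real connection manager:

// A toy demonstration of scoped versus singleton lifestyles in SimpleInjector.
using System;
using SimpleInjector;
using SimpleInjector.Lifestyles;

public class ExpensiveConnectionHolder { }  // stand-in for something that opens a connection pool

public static class LifestyleDemo
{
    public static void Main()
    {
        var container = new Container();
        container.Options.DefaultScopedLifestyle = new AsyncScopedLifestyle();

        // The bug in miniature: Lifestyle.Scoped hands out one instance per
        // scope, and in a web app every request gets its own scope.
        container.Register<ExpensiveConnectionHolder>(Lifestyle.Scoped);

        ExpensiveConnectionHolder first, second;
        using (AsyncScopedLifestyle.BeginScope(container))
        {
            first = container.GetInstance<ExpensiveConnectionHolder>();
        }
        using (AsyncScopedLifestyle.BeginScope(container))
        {
            second = container.GetInstance<ExpensiveConnectionHolder>();
        }

        // Two scopes, two instances. With Lifestyle.Singleton the container
        // would hand back the same object both times and this would print True.
        Console.WriteLine(ReferenceEquals(first, second)); // False
    }
}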

We changed that one line to give it a singleton lifestyle and deployed the change to our (new, shiny, one of many cattle) integration environment. We observed that the new connections graph began behaving as we expected, and everything kept working. So we deployed it to production.
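
Spelled out, the one-line fix was that same registration with the lifestyle swapped:

// One connection manager, and therefore one connection pool, for the
// lifetime of the application.
container.Register<ICacheConnectionManager, RedisConnectionService>(Lifestyle.Singleton);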

Really easy fix. It let us stop running the largest Elasticache AWS sells, collapse all the split-out caches into the new much more modest not-clustered cache, and made everything go faster. Scaling horizontally no longer caused the system to punch itself in the face. That plus the new fully-terraformed ALBs made dealing with big days completely routine, and engineering commenced a very quiet two years of rebuilding with an AWS bill that was a fraction of what it was before, and I mean holy heck we cut that bill down to something that was [Ceej’s editor has deleted a lot of ranting here].

However. I remain unhappy about this fix.

I should have spotted this a year before, during the original Redis-caused outages. If I had seen that graph– and I should have demanded to look at all those graphs– I would have known immediately that something was very wrong, because nothing should be creating new connections like that. But I didn’t. Why not?

Why did it take a year?

Expertise, ownership, and trust. Each of these concepts is a two-edged sword and each cut me with its second edge.

Expertise.

Expertise. I knew I did not have C# or dot-net expertise. I had to rely on the people who had it. I was also not familiar with the code that did this work, especially at two weeks in. I had to trust the people who knew dot-net and knew the code to assure me that there were no obvious howling bugs in it.

Where expertise is assumed but is not present, bad code goes unchecked. People get angry when you review their work and ask for changes, or even when you only ask questions about the work. Defensiveness can arise in low-trust environments, but it can also mask situations where people don’t have the expertise you need them to have. Or situations where people have the expertise but are so pressured, stressed, and burned out that they’re not operating at full capacity.

Here the second edge cut me because I assumed without pushing that the people with dot-net expertise had already investigated the obvious possibilities. But also! I lacked this expertise myself. When we finally hired people who were expert with dot-net and comfortable with it, they laughed at this bug, because it was familiar territory for them. They’d have looked for and found it immediately.

Ownership.

When a single human or a team owns something, I feel I need to let them own it and trust their expertise. Meddling in their work can destroy their self-confidence or make them feel undermined. A feeling of ownership is good! It means you feel responsibility for that thing, and know that the burden of maintaining it rests on you.

The other edge of ownership is gatekeeping. The deploy system was obviously a block to all development by all teams. The team had a Slack channel where they negotiated who was going to merge which code for the single deploy window available four days a week, with no deploys allowed on Fridays. Deploys were flaky and could take up to three hours to resolve. A colleague with a technical leadership role was in fact working on a better deploy system, but the infra team manager instructed their team to ignore the work.2

The infra team also jealously guarded access to things they thought belonged to them, such as access to an Athena search setup for production logs. At one point one of them locked down commits to the main monolith’s repo, announcing to the team that they no longer got to merge into “my repo”. To be clear, this was a human being who’d been burned to an absolute crisp by overwork; the blame flows upward.

Management can of course be the worst gatekeeper of all, and it was in this case. I mentioned briefly before that Redis had been identified as a problem area by some informed engineers, and they had to push hard to be allowed the time to work on it. They might have been more successful if supported by management instead of being treated as if they were wasting time that would be better spent on cranking out this month’s pet feature for the CEO.

From a distance, I can say with some confidence that gatekeeping was the worst block to diagnosing this Redis bug.

Trust.

Trust. I said the word “trust” in each of the two preceding sections, because I had to extend trust to my colleagues. You earn trust by granting trust. People live up or down to your expectations of them, and I prefer to expect the best.

You can see the downside of all of these. Where expertise is assumed but not present, bad things happen. Where ownership turns into gatekeeping, other people are blocked from fixing things even if they could help. When trust is not warranted, things get into bad states and stay that way.

“Trust, but verify.” – a Russian proverb, made famous in English by Reagan

Then the layoffs happened.

The gatekeepers were all gone. The experts were also (mostly) gone. The ownership and responsibility were all on me and a much smaller but motivated team.

When the ownership fell to me, I felt both responsibility and empowerment. I was no longer politely taking people at their word, because those people weren’t there any more. I was investigating and experimenting on my own, and ruthlessly testing all of my own hypotheses. I knew I didn’t have expertise, and even when I do have expertise I have learned the hard way to double-check all my own work.

I was also not bound by the past. I did not care if something had always been that way. I was okay with doing things differently. I don’t much trust myself, but I did trust the people working alongside me in that moment. And most especially, I trusted the work we did together, because we verified it together.

The sad thing is that the ownership-turned-into-gatekeeping problem was the difficult one to surmount, the one that in retrospect I’m not sure I could have solved in any other manner than parting ways with the gatekeeping team. I am going to tentatively state a thesis: operations/infra teams as teams separate from engineering always turn into walled-off defensive gatekeepers. You cannot allow them to exist in healthy orgs. You must practice some variation on devops by embedding people with this expertise into project teams.

Maybe there’s a way to do it if you frame their goal as developer experience not as “operations” or making AWS go brrrrrrr. The goal has to be to keep people focused on their customers– the engineers building the project– and not on defense against their colleagues. The same goes for security teams: embed those experts where they can have sympathy for the problems their colleagues are trying to solve and improve their solutions early.

But I digress. Expertise, ownership, and trust are a big part of this story, but they’re not everything.

The context also mattered.

Years later I learned that one of the two engineers who’d started working on Redis before the outages had some suspicions that there was a lifestyle problem, but he was afraid to change code that had been that way the entire time he’d worked there. I had no such fears, because we had removed reasons to fear experimentation by completely rewriting the infrastructure and deployment environment to make experimentation low-cost. We’d also invested in full tracing via Honeycomb. We knew what was going on with the system in ways that we didn’t in that first outage.

The highly-contended integration environment had become many environments. (The link goes into detail about that project and tells the story of this bug from another perspective.) Deploys had been made fast and reliable. Access to information and metrics was available to everybody. Full access to AWS was available to all engineers. If a change broke production, the fix was three minutes away.

We weren’t scared to make changes any more.

I also want to call out that the team had moved past taking people’s past claims about how things worked, and whether or not things were feasible, on faith. We had the space and the support to read the code to see if it genuinely behaved as described or if it worked differently, and to experiment with changing things. Management was no longer blocking people from investigating or fixing technical debt.

What can we learn from this story?

This is what I’d like you and my future self to take away from this war story:

  • Assume nothing. The people around you might be wrong! You might be wrong too!
  • Test all hypotheses. Each test gives you more information.
  • Eliminate gatekeeping. No team can afford the damage done by people who want to keep information or access away from their colleagues.
  • Observability, even humble standard metrics, is invaluable.
  • You (o fellow technical leader) own everything. You must always feel the responsibility of that ownership. You can share it, but it’s always partly yours.
  • Trust but verify. Especially team superstitions.
  • Ruthlessly eliminating developer friction pays unexpected dividends.

Also, it was totally not Redis’s fault.


  1. For this specific problem. It was and remains invaluable for other reasons. ↩︎

  2. Most toxic behavior is driven by toxic organizations, but some toxic behavior is individual and creates that toxic organization. ↩︎



By C J Silverio, 2022-05-24