command</th>	style</th>	origin</th></tr></thead>
`sha256sum README.md`</code></td>	linux mode</td>	compiled</td></tr>
`sha256 README.md`</code></td>	bsd mode</td>	compiled, same exec</td></tr>
`shasum -a 256 README.md`</code></td>	linux mode</td>	perl, CPAN</td></tr>
`shasum -a 256 --tag README.md`</code></td>	bsd mode</td>	</td></tr> </tbody></table> Note that Homebrew wants the bare hex string, without filename decoration of any kind. If you use the GitHub-generated shasum, you’ll need to trim the `sha256:`</code> prefix.</p> Modern terminal environment Unknown — Sun, 12 Jan 2025 13:17:11 +0000 I read Julia Evans’s blog post on “What’s involved in getting a modern terminal environment”</a> and got very excited because there are lots of great comments in that blog post, and I have a few more of my own, and a little meta-commentary.</p> The terminal and the shell</h2> Are there good modern answers for terminal software? Sort of. There are certainly many options. I still use iTerm2</a>. I try other fancy new terminal software but end up back at iTerm2 every time for the combination of Macintosh features and overall performance. The Electron ones are sluggish enough that I feel it. Warp</a> is the magical one I tinker with sometimes. I loathe the idea of online sharing features or LLM auto-completion in my terminal, so I pretend those don’t exist and that it’s free software. The existence of those features for money in Warp might either please or distress you. It is, however, truly a modern take on what a terminal experience can be.</p> Windows didn’t have any good terminal programs other than the built-in one in VSCode until recently. The default Terminal program is now just fine. I use wezterm</a> when I’m using Windows. This requires customization with Lua and is not particularly modern or magical or command-aware or anything like that, but it is zippy and therefore better than VSCode. (I note, in passing, that Windows is a pretty important environment for lots of programmers, and Rust treats Windows as a first-class target, so Rust projects can</em> do nice things for Windows users if their authors wish.)</p> The shell is no contest. Use the fish shell</a>, as Julia recommends. 
You nushell users are a special sort of human; you may continue being you.</p> If your fingers type !!</code> and !$</code> enough to miss those bash-isms, install oh-my-fish</a> and get the bang-bang</code> package. There are some other nice things there to snag.</p> Like Julia, I install every base16 theme there is. I change my theme colors and monospaced typeface about once a year, to keep my brain thinking everything looks different. I have no idea if this is helpful to anything or not, but I like to pretend it is. Get a Nerd font variation</a> to keep the prompt looking right. Some good ones to consider: Cascadia Code, Iosevka, Fira Code, Monaspace.</p> Oxidize everything</h2> Rust gave us all a systems programming language with ergonomics considerably better than the hand-held table saw that is C++, and the terminal experience is better for it. Look for modern Rust variations of everything, and for a handful of neat Golang tools as well.</p> Start by aliasing ls</code> to eza</a>. You will not look back.</p> I stopped fussing about prompt setup when I found Starship</a>. Well, you have to fuss, but only once and then you’re done forever. Customize with toml once, put the toml into your dotfiles repo, then you have a prompt that works with any shell you are using at the moment, in whatever terminal in whatever environment.</p> You have lots of options for reading text in the terminal that aren’t just cat with one of two pager variations we’ve had for 35 years. Why not read styled markdown</a>?</p> Install the fuzzy-finder fzf</a> and its integration with your chosen shell (which of course is fish</code>, finally a shell for the 90s). This is a subtle enhancer for everything if you start thinking of it as a part of how you find and select things. Many other tools come with fzf</code> integrations built in directly, or you can shell-script it on up. 
You can get ripgrep</a> (which you should be using too, come to think of it) into the mix to get find file in project</a> with fzf.</p> Editors</h2> I don’t edit text in the terminal. For most editing, I use zed</a> and zed</code>’s terminal integration. I don’t have any interest in “collaborating” with stochastic parrots, but I do have a very high interest in snappy editors with excellent language server integration. I switched to zed</code> from VSCode a couple of years ago and haven’t looked back.</p> I do still edit some files in the terminal, out of very old habit. When I edit dotfiles and other system configuration files, I type vi</code> and open them up right there. Why vi</code>? Because I am old and why type three characters when two is enough? vi</code> dates back to Bill Joy, if I recall, and that’s 40 years ago.</p> Editing text is not a modern thing to do inside the terminal, I think. But is modernity really what we’re going for? After all, people still use vim</code> and neovim</code> to write software effectively every day.</p> I say something about goals</h2> “Everything affects everything else” is true and so is the fact that nothing is perfectly consistent with everything else. It was all implemented at different times over decades by cats who resisted herding, or who were working in a slightly different context than the other cats, with different libraries. And all these grumpy cats were doing the essentially stupid thing of kind-of emulating a dead hardware terminal from a dead microcomputer company that turned into ANSI eventually. That it works at all is nice; that we can get it to do decent things is surprising. I’m not sure what it means that it’s still by far the most effective way to get programming work done for many people.</p> So the terminal is a weird mess, yup. Changing your setup is disruptive if you do it rarely. You get practice in setting things up if you throw things up into the air often, though, and that’s why I do it. 
I try new things far more often than I choose to integrate them into my daily shell workflows. I can’t tell you how many times I’ve tried shell history things that promise to revolutionize my shell experience that ended up driving me to distraction within 15 minutes. It’s at least, um, twice? Once? Once for sure. I found all of the above things and a lot more that I am restraining myself from listing here by experimenting a little bit every so often with things I hear about.</p> Changing my setup isn’t the goal, really. Finding things that are worth integrating from the seething stew of modern software, that’s my goal. What makes them worth integrating? Well, it’s not modernity, not primarily anyway, as I say above. Modernity doesn’t select modal editors. Modernity doesn’t reach for the terminal. Modernity abandons the VT100 and sets up a mouse and windows.</p> So we’re not trying to be modern. We’re trying to be effective. Terminals and vim</code> are effective</em> and, in the hands of experts, powerful</em>. Aim for the set of tools that make you effective. Select new tools based on how they’ll make you more effective at whatever it is you’re doing in the shell. Try things, reject some, integrate others. Tell your friends about the good stuff. (Tell me about the good stuff, too, thanks.)</p> Understanding Software Unknown — Fri, 29 Mar 2024 21:58:26 +0000 Nothing I said in this presentation will be shocking to any readers of this blog, but my audience here was the entire company. I wanted to let a group of non-programmers know what we do and how everybody contributes to the work of making useful software.</p> PDF version of the rendered slides</a></p> The Markdown version follows! The SN:</code> indicates my speaker notes.</p> Understanding software</strong></h1> and how it comes to be</h2> – @ceejbot</p> SN: A note about the slides: they’re anchor points to call out important words or to remind you where we are in the presentation. 
You don’t have to let them fill a whole screen if you don’t want to. There aren’t any flashing lights or animations in the presentation, either.</p> </p> SN: How does this sketch turn into a company that had thousands of developers, millions of daily users, and an effect on the entire world? At its starting point was nothing, and then software happened, and for about 15 years we had something</em>. Politics, news, culture– all happened because of this software. I’ve always found this amazing– somebody has an idea, and this THING appears out of nothingness.</p> What is software</strong>?</h2> How do we build</strong> it?</h2> What happens afterward</strong>?</h2> SN: Carl Sagan said if you want to make an apple pie, first you must invent the universe. We aren’t going to go that far back, but we are going to talk about these three questions. I need to caveat all of this: My answers to these questions come from a specific perspective– me & my career experiences. I am not going to talk about how it’s done at Google or Facebook or other weird gigantic companies. Going to talk about how software is built by small to medium sized teams, in the Silicon Valley, ones that happen to have a lot of ex-Apple product influence.</p> This company writes software</strong></h1> Everyone here contributes to this work.</li> Everyone here would benefit from understanding how we do it.</li> </ul> SN: We write a lot of software. I counted X meaningful lines the other day.</p> What</strong> is programming</strong> anyway?</h1> SN: A traditional answer is that programming is typing long text files with instructions to make a computer do things. But when I’m sitting with my feet up on my desk, or when I’m pacing around my house muttering, or when I’m scribbling in my notebook, I’m also programming. 
I’m going to go to one of my favorite essays of all time for another answer.</p> “[P]rogramming properly should be regarded as an activity by which the programmers form or achieve a certain kind of insight, a theory, of the matters at hand.</em> This suggestion is in contrast to what appears to be a more common notion, that programming should be regarded as a production of a program and certain other texts.”</p> — Peter Naur, “Programming as Theory-Building”</a>, 1985</p> SN: Peter Naur is the Naur of Backus-Naur Form, which some of the programmers in the audience might remember, and one of the designers of Algol, the extremely influential programming language. This is from a 1985 essay about what he’d learned about how to write and maintain and operate software. I think this is right on target. Let’s spend a moment looking at Naur’s theory of the program.</p> Naur says a programmer who has the “theory of the program” can:</p> Explain how the solution relates to the affairs of the world that it helps to handle.</li> Explain why each part of the program is what it is.</li> Respond constructively to any demand for a modification of the program so as to support the affairs of the world in a new manner.</li> </ol> SN: Naur was writing in an earlier era, so he talks about single programs here. Today, we write many programs and connect them all together into software systems. What he called “the theory of the program” is what I would call “the model of the system”, but both phrases get at the heart of the concept.</p> Software</strong> is:</h1> a lot of text files with instructions to computers (they matter!)</li> that express the authors’ understanding of a real-world problem</li> and their solution to that problem</li> (and the same for every building block they needed along the way)</li> </ul> Programming is how we get there.</p> SN: And this is what we have to understand to function effectively. 
Let’s zero in on one part of that.</p> To write</strong> software effectively</h1> you must understand:</h1> the affair of the world</em></li> how</em> the program goes about solving it</li> </ul> SN: The how is mind-bogglingly complex, and very few people working on any team project understand the whole thing. Some people who’ve been involved with it for a long time might have a better understanding than others, but it’s possible that nobody understands the whole thing.</p> SN: Now, I want to back up from the theory a little bit to talk about those text files. They do matter!</p> Code is communication</strong> with computers and humans.</h1> Code defines data (nouns) and functions (verbs).</li> We name things carefully because the names are meaningful to humans.</li> A program becomes a language of its own. (Hat-tip to Dijkstra.)</li> </ul> SN: In the jargon of programmers, every complex system is a domain-specific language expressing our understanding of the problem.</p> Can you guess what this code is supposed to do?</p>
fn ch() -> Result<usize, Error> {
    let a = a()?;
    Ok(a.f(S::H6).len())
}</code></pre>
SN: The programmers in the audience all guess that it’s getting the length of something, but they have no idea what that’s the length of, or what any of the other stuff does.</p> Can you guess what this code is supposed to do?</p>
/// Count how many hedgies are in our zoo.
fn count_hedgehogs() -> Result<usize, ZooInventoryError> {
    let animals = fetch_all_animals()?;
    let hedgie_list = animals.filter_for(Species::Hedgehog);
    Ok(hedgie_list.len())
}</code></pre>
SN: You probably have a good guess about what this means, even if you don’t know the specific programming language I’m using or any programming language at all. This code communicates to humans as well as computers. This might do the exact same thing as the previous code when run, but this version has an additional layer of useful meaning, and supports Naur’s theory-building better. (It could lie, and be about counting numbats, but we try not to do that.)</p> How do we invent</strong> that specific language to express a problem?</h1> SN: One thing that I have learned is that no two software solutions of a problem ever look alike. I know what little I know about sudoku solving from the talk $colleague gave at a lunch and learn a couple of weeks ago. But if you gave me and $colleague the task of writing a sudoku solver, we’d write completely</em> different programs. If you gave us the task of writing a solver together, we’d write something different again. This, btw, is very cool, because it says something about human minds that fascinates me. BUT despite the differences in end result, both of us would use a similar heuristic to get there.</p> Understand</strong> the real-world problem.</li> Analyze</strong> it from a software point of view.</li> Imagine</strong> a solution.</li> Align</strong> a team on the problem, the solution, and the values that shape the solution.</li> Coordinate</strong> to express that understanding in code.</li> Get feedback</strong> and iterate.</li> SHIP IT.</strong></li> </ol> SN: There are no secrets here. It works this way for all problems in software, whether small or large. Some things are easier when you’re a team of one– it’s easy to align with yourself. 
That might be hard with a team of 20, and very hard indeed when your team is larger than Dunbar’s number. But this is how it works. Let’s look at a simple example.</p> </p> SN: This is VisiCalc, the first spreadsheet software anybody remembers. 1979. (The first one was LANPAR in 1969.) This one invention sold personal computers to millions of small businesses and is a huge part of Microsoft’s revenue even today. Spreadsheets ate the world and run many businesses and are part of critical workflows everywhere. But somebody had to make the first one.</p> Understand:</strong> the workflow of accountants.</li> Analyze:</strong> These numbers and dates are data a computer can store; doing arithmetic on columns of data is something a computer can do.</li> Imagine:</strong> What if we let people type numbers into boxes and the computer automatically did the math?</li> Coordinate:</strong> 2 people in a room!</li> Ship:</strong> LANPAR was 1969. It didn’t ship as we understand it; but VisiCalc did.</li> </ul> SN: The word “spreadsheet” comes directly from accounting. Let’s go broader, and apply the process to our shared endeavor.</p> Step 1: Understand</strong> the real-world problem</h1> Who are our customers? What are they trying to do?</li> This is difficult! Our industry is complex!</li> This is why every company needs its subject-matter experts.</li> Everybody involved in designing and implementing the software does better the more they understand the people who’ll use that software and what they’re trying to do.</li> </ul> SN: Our experts and our customer contact people keep programmers like me in touch with who we’re making tools for. I believe I speak for every person on the engineering team when I say that we all desperately want more understanding of our customers. Please! 
Talk to us!</p> We share</strong> what we understand.</h1> Writing and reading documents.</li> Talking to each other.</li> </ul> SN: Once we understand something, we don’t leap to writing code. Instead we share that understanding.</p> Step 2: Analyze</strong> the problem</h1> “To a person with a pencil, everything looks like a sentence. To a person with a TV camera, everything looks like an image. To a person with a computer, everything looks like data.”</p> —Neil Postman, “Five Things We Need to Know About Technological Change”</p> SN: Or more succinctly, the medium is the message, and the medium of software is data.</p> Study the data</strong></h1> The medium of software is information, or data. Software collects or generates data, then transforms that data via rules. The process of describing the data and writing the rules is what occupies us all day.</p> SN: Call out some of the nouns we track in data.</p> Study what people do</strong> with that data</h1> Data by itself is uninteresting. People are using it to do something. What?</p> SN: Talk about how our customers use their data.</p> Step 3: Ask how automating</strong> that with software would help.</h1> What if… we took a process that takes weeks right now, and made it take minutes instead because software does the correlation for you?</p> SN: Marc Andreessen described this as “software eating the world”, and he should know. 
He invented the image tag, and that was enough for the web to eat the world.</p> Deepen</strong> that computer-focused analysis</h1> What data would the software need to have available?</li> How will we get that data in a form we can use?</li> What would we need to do with that data to present useful information to humans?</li> </ul> Nouns</strong>: how we structure</strong> our data</h1> long list of nouns</em>: so much data!</p> SN: Talk about how subject-matter experts help us identify the data.</p> Verbs</strong>: how we transform</strong> that data</h1> we receive a lot of data, transform it, and run some truly complex analyses on it</li> we present that information to human beings in a form designed to help them make important decisions</li> server engineers, UI engineers, UX designers, data engineers, and data scientists are all involved in doing this</li> </ul> SN: This is most of the work, right here. This is what the software does</em>, its verbs.</p> Step 4: Align</strong> a team</h1> on how you understand the problem</li> on the shape of your solution</li> on the values you bring to your solution</li> </ul> SN: This is what our company meeting does. Every week, we talk about what our customers are trying to do and how well we’re solving their problems.</p> Align technically</strong> on the details of our solution</h1> technical design choices</li> the details of how we represent our data</li> the building blocks of our software</li> what our architecture is</li> the values we use to decide among our options</li> </ul> SN: What programming languages are we using? How are we storing our data? Of the countless ways we might write this, which way are we picking?</p> Technical</strong> alignment comes from:</h1> Writing and reading documents.</li> Talking to each other.</li> Over and over (you don’t stop).</li> </ul> SN: Alignment is an ongoing task. 
We must constantly communicate in person and via design documents to make sure we all understand the direction we’re going.</p> No one person ever understands the whole thing</strong></h1> Each one of us makes decisions that push the system in the right direction.</p> We must be in alignment, or those decisions might be at cross-purposes.</p> SN: Alignment is critical, because complex software is too big for any one person.</p> Step 5: Coordinate</strong> to write all those text files.</h1> SN: DEEP SIGH. This is where all the trouble is. I could give an entire presentation on what we know about this part of it, from books people have written about their face-plants through the years. Today I’ll stick to sharing a couple of insights I hope will be useful.</p> Software development methodologies are under-studied</strong>.</h1> agile, scrum, kanban, waterfall, extreme programming, spiral, chaos, shape up, behavior-driven, lean, that weird UML-based thing, slow programming…</p> SN: All of those are real names for methodologies. Which ones result in measurable, repeatable productivity improvements? No idea. Nobody has studied this. There are a few things we do know, from looking at past projects. We do know it’s a team sport, and that communication is the core.</p> “Adding [human] power to a late software project makes it later</strong>.” — Fred Brooks, The Mythical Man-Month</em>, 1975.</p> </blockquote> SN: Why? Because communication is, as we nerds like to say, an order N squared problem. Adding the 10th person to a project team adds 9 new lines of communication to worry about. This is a great book with a lot of great project insight, including the nugget that if it takes one woman nine months to deliver a baby, it does not follow that it would take 9 women one month to do it. 
And yet this is something the software industry keeps trying to do…</p> We know some things are bad</strong></h1> micromanagement is awful</li> long periods of crunch are actively destructive (and we have research here)</li> projects that never end wear people out</li> </ul> SN: These things fall into the category of yeah, people are people.</p> … and some things are good</strong></h1> Do write</strong> things down.</li> Do give people and teams appropriate autonomy.</strong></li> Do collaborate</strong> on the hardest work.</li> Do treat each other with kindness</strong> and respect.</strong></li> Do create emotional safety</strong>, so people can experiment and learn.</li> </ul> SN: Huh, none of those things are about process meetings. All of these things are about enabling smart people to do their best work. Strange. Okay, let’s talk process for two more slides.</p> Most healthy projects do something agile-ish</strong>.</h1> Teams do best when they understand what they’re building, why they’re building it, and who they’re building it for.</li> Self-organization and autonomy are good.</li> Delivering working software frequently turns out to be good.</li> Communicating with the customer a lot is also good.</li> The details don’t matter much, so long as you’re talking to each other.</li> </ul> SN: The Agile Manifesto is actually good.</p> “There is no silver bullet.</strong>” — Fred Brooks again</p> </blockquote> SN: There is no single solution that works for every team in every moment.</p> Step 6. Get feedback</strong>.</h1> Feedback tells us if we’re on target or not. Spoiler: You’re almost never perfectly on target.</p> SN: Feedback loops are pretty important. We need to check on how we’re doing. We run retrospectives on incidents and on projects to see how we’re doing with our processes, and learn from our experiences. Do more of this? Less of that? 
Feedback loops are how learning happens.</p> Can’t we just get it right the first time?</strong></h1> Nope.</p> SN: And there’s a reason why we can’t.</p> “The map</strong> is not the territory.</strong>” — Alfred Korzybski</p> </blockquote> SN: Your mental model is not reality. The map is a model of the real world– the mountain and the terrain, and the trails across it. The map tells you a trail is there, but it does not tell you that the trail was washed out in a mudslide three days ago. We make our plans with the information we have, and then we learn from feedback how we’re wrong.</p> Ways our map is wrong</strong></h1> We didn’t understand the customer’s workflow.</li> We got our data models wrong.</li> We’re transforming our data incorrectly (or inefficiently).</li> We figured out a new approach along the way.</li> Teams didn’t align with each other, and their software doesn’t work together.</li> Software we rely on behaves unexpectedly.</li> We made mistakes while building things.</li> </ul> SN: All of these things are guaranteed to happen, mostly at a small level, but sometimes with very big concepts. So we need feedback and take active steps to get that feedback.</p> Feedback from testing</strong></h1> We test for many reasons!</p> Does this one piece do what we want it to do?</li> Are all the complex pieces working together?</li> Does the system do what we expected?</li> (Did we get lost despite following our map?)</li> </ul> SN: This is why we have QA.</p> Feedback from our customers</strong></h1> Is our system doing what our customers need?</li> (Did we reach our planned destination or did our map lie?)</li> </ul> SN: The people who regularly talk to our customers are invaluable.</p> Step 7. Ship it</strong>.</h1> Get it into the hands of customers as soon as it would be useful to them. 
Get revenue as soon as you’re able.</p> SN: The reality of Silicon Valley style software companies is that we all go into debt immediately to be able to pay salaries and AWS bills. We want to get out of that situation as soon as possible, so the company can keep doing its thing.</p> “Ship or die.</strong>” — Danger, Inc, internal motto, 2002</p> </blockquote> SN: Before the team shipped the first Sidekick in 2002, we said this often to each other. This over-dramatic motto came from a maniacal focus on shipping, getting our product done and out there into people’s hands. But the catch is that you’re not done when you ship.</p> What happens after</strong> you ship?</h1> Staying alive with more software.</p> SN: So it’s great we shipped instead of dying, but now we gotta keep the software alive too. Software is never finished! We continue to modify it after we release it to the world.</p> Most of the cost</strong> of software is maintaining</strong> it</h1> Every line of code we write has a maintenance cost: people, time, thinking.</p> SN: Those half-million lines of code represent complexity that has to be understood.</p> Living software systems must be operated</strong>.</h1> Software must be run to have meaning!</li> Keeping software running is an entire area of expertise.</li> Operations teams tend the software that runs the software to run the… oh no.</li> </ul> SN: Text files on GitHub don’t do much by themselves.</p> Living software systems must be changed</strong>.</h1> the world around us changes</li> new laws & regulations, new practices from our customers</li> the context in which the software runs changes</li> the team maintaining the software changes over time</li> </ul> The software must change in response.</p> Changing software requires understanding</strong> it</h1> Naur’s third point: A programmer with the theory of the system can “respond constructively to any demand for a modification of the system so as to support the affairs of the world in a new 
manner.”</p> SN: Let’s call back to Naur again– changing software requires understanding it. The more complex and voluminous the software, the more there is to understand.</p> Success can be a catastrophe</strong>.</h1> we need to scale up from a few customers to many</li> we learn where we need to be flexible</li> we learn where our models were incomplete</li> </ul> SN: A friend who was at Twitter during its early years describes implementing things that would get them through the next six months, by which time they’d have its replacement ready to go.</p> All software has a lifespan</strong></h1> the changes made to it slowly build up like plaque in arteries</li> the software in a big system usually gets replaced in pieces to keep the system itself working</li> the system of software itself lives a long time</li> </ul> Congratulations.</h1> Now do it all over again</strong> for the next product.</h1> SN: You figured out how to eat this thing with software. You shipped. Your customers grumble sometimes, but they’re mostly happy. PHEW. 
Let’s do a fast recap.</p> recap: what is software</strong>?</h1> software is, yes, text files with instructions to computers</li> it’s also an expression of our understanding of a real-world problem</li> and an expression of our analysis from a computing perspective</li> </ul> recap: how do we build</strong> it?</h1> there’s no perfect answer to this</li> building software requires a team to align on their understanding</li> plan an approach</li> coordinate with each other</li> iterate in response to feedback</li> </ul> </li> </ul> recap: what happens after we ship</strong>?</h1> software lives on long after we build it</li> most of its cost is maintenance</li> you have to understand it to maintain it</li> eventually we need to replace it</li> </ul> And that’s how we turn a napkin sketch</strong> into something that affects the physical world.</h1> Questions?</strong></h1> SN: Stop sharing screen now.</p> Accepting Work Unknown — Tue, 19 Dec 2023 14:10:00 +0000 For “you” in this document, read “you and your team”.</p> I link to some interesting reading on some of these anvils, but mostly I don’t. These are things I generally take as facts about the world, with the usual squishy “it depends sometimes” about some of them. I use agile methodology language, mostly, even though I like to say I really hate agile processes. Do I hate agile? Really? Let the anvils commence!</p> Don’t let people outside the team assign work to the team. They may propose work, but you decide if you accept that work.</p> The rate at which you accept work must be less than the rate at which you finish work, or you will have infinite work.</p> Operational incidents and meetings count as work.</p> Bug-fixing counts as work.</p> Don’t accept work that you don’t understand. 
“Figure out this project well enough to estimate it” is acceptable work, as is “cooperate with a product designer to get design documents into a state where they describe acceptable work”.</p> Only rarely should you say no outright to work. If it’s not well-defined, push to define the work better. (Unclear requirements make for misery on both sides.) If your team has too much work already, push for prioritization. (Something has to give. It will always give in reality, whether people admit that in advance or not.)</p> Sometimes you need to communicate the consequences of your team taking on disruptive work and let your customer decide if the cost is worth it.</p> Technical design and research counts as work.</p> Estimation counts as work. The more time you spend on accurate estimation, the less time you spend on other work, such as implementation. This is often worth the time anyway, because sometimes the business needs it.</p> Tools are not a substitute for communication.</p> One point of the retro is to figure out what your true rate of finishing work is. If you finished less than you took on, then next time take on less work. If you finished more, cautiously take on a little more.</p> You probably do not spend enough time doing retros and planning for your next sprint. One hour every two weeks isn’t enough.</p> The more you understand the work, the better you do estimating it.</p> Corollary: You do best estimating work very similar to work you’ve done before. 1</a></sup></p> Another corollary: Estimates you make at the start of a project, when you know the least about it, are the most likely to be wrong. Build in feedback loops for estimates! Communicate with your customers as estimates change.</p> Don’t let your early estimates get turned into deadlines.</p> Sometimes the business itself has deadlines. Frequent delivery of working software is a survival tactic for deadlines.</p> Sometimes you get it wrong. 
Use the retro to figure out what you can learn from the mistake. Remember, the map is not the territory</a>. Sometimes that clearly-marked trail turns out to have been destroyed by a mudslide.</p> High-uncertainty projects dominate software schedules.</a> The thing that’s late because the trail was washed out ends up making everything late. Maybe it was worth an advance scout? 2</a></sup></p> If hitting a date you provide matters, invest time in lowering uncertainty.</p> Fred Brooks</a> spoke truth about how communication overhead dominates work. Most complex software work can’t be parallelized or sped up the way businesses want to speed it up.</p> The fastest way to get projects done is to have an aligned team take on chunks at their own pace, without doing any planning other than technical planning. Nobody likes hearing this, but it’s a consequence of the overhead of estimating and bookkeeping.</p> Agile™</h2> Agile™ as practiced has little to do with the original principles of the movement. Those original principles might be summarized roughly as:</p> You are building things for a customer. Talk to your customer.</li> Deliver frequently.</li> Build in feedback loops so you can figure out what you’re doing that’s working and what’s not.</li> Let teams self-organize. Trust them.</li> Change is inevitable, so plan for it.</li> </ul> I wrote those points off the top of my head, so I went to the original to see how well I did at capturing its spirit. Not bad! This is what the Agile Manifesto says.</a> Go read it! It’s short! Then weep at how far we have strayed from it. Also note what isn’t there: any rigidity about sprint lengths, planning poker, burndown charts, anybody other than the team itself deciding how to do things. It sounds pretty sensible to me, to be honest. (Maybe it’s only Agile™ that I dislike?)</p> What I also like about that original manifesto is the focus on sustainability. “Sprinting” isn’t mentioned. 
Communication</em> sure is, though, and I’m 100% aligned with that. Talk to people involved in the project. Frequently. Even about bad news. 3</a></sup> It’s all about the communication.</p> And as you know, communication has its own overhead. I point back at Brooks, who says there’s no silver bullet.</p> Did I have a thesis?</h2> Mostly I wanted to write down some things I take as fact about planning and estimation that are often at odds with how software organizations behave. I’ve been itching whenever I hear about teams “doing agile” or “getting scrum training”. My theory is that processes are never one size fits all. You can’t be dogmatic about them. Teams vary, and so do projects. Some teams write a lot; some teams talk a lot; some teams demo a lot. Some teams pair; some teams mob program. What’s more, teams vary over time even when their membership is mostly stable, because people change and learn.</p> The best process for any project is probably one you design in the moment for the team. You never have to do this in a vacuum, because there are lots of good processes to steal from, and your team is probably doing some set of things already that are effective for them.</p> Dogmatic adherence to a half-understood Agile methodology probably ain’t it. So go back to the original! It’s pretty good.</p> The good news is that later in your career, after you’ve seen a lot and accumulated amusing war stories, you have many past projects to compare the current one to. It gets easier. ↩</a></p> </li> Okay, okay, I’ll stop abusing this poor metaphor. But if it’s high-uncertainty and important, it’s probably worth a code spike or a couple of weeks spent on research. ↩</a></p> </li> There’s a Michael Pollan “mostly plants” joke lurking here. ↩</a></p> </li> </ol> </section> A systems analysis rubric Unknown — Sun, 10 Dec 2023 11:30:56 +0000 This is a systems analysis document rubric I’ve written several variations on in recent years. 
I’ve genericized it a bit and updated it with my current thinking. The form of this document is something a team would have in their official processes library somewhere, as a guide to how to do analysis of a fresh problem. I’ve had this blog post sitting 90% finished for a year now, so hey, here it is!</p> NB:</strong> I have come to believe that there is no one process that works for every team. The process that makes a team most effective is a process designed for that team, for their current project. Don’t be dogmatic about anything! Think about the true goal, which is to write good software that does what it needs to do, making its users happy while its authors have chill weekends. Take the ideas here and adapt them to what your team needs.</p> I no longer call this document an RFC, because I think this term comes with the implication of a slow-moving process, which has to solicit a lot of feedback because of its importance. This is perfect when you’re designing the fundamental protocols of the Internet; it is not quite what I find myself wanting my colleagues to do. I am using the term “system analysis rubric” as I think about this task right now, because systems analysis is where my head is, and what I see missing from a lot of problem-solving.</p> “Problem statement” might also be a good name for this document, although I think it’s good to explore possible solutions in them as well as problems. Coming to a clear problem statement is possibly the most important task you have when you’re thinking about changing something or making something new.</p> Design documents: a systems analysis rubric</h1> A design document is a structured way to have and record a conversation about a problem. It is not appropriate for all problems you might be solving. The formality and length of the conversation depends on the scope and complexity of the problem. For a bug fix, you might need a short conversation with a single colleague, plus commentary in a commit message. 
For a major project, this process might take weeks to complete and you might write several of these documents.</p> While the process does</em> produce a document, the document is not the most important result. The important result of the design process is the exploration</em> of the problem that writing the document encourages. The conversation that accompanies the exploration aligns you and your team on an understanding of the problem. Yes, a design doc might describe a proposed solution, but this proposal is secondary to a team’s collective understanding of the problem to be solved.</p> I’m going to hammer on this point as I go here. The document exists to promote exploration and shared understanding of the problem. The document is a tool in service of a more important goal.</p> The widening conversation</h2> My design documents start as notes to myself. I attempt to structure my own thoughts about a problem by writing down what I’m thinking. The stakes are low; the document is so informal that it’s likely nothing more than bulleted lists of things that come to mind. As you go, your writing should tighten up and be more complete, but remember: the document is not the point. Don’t stress about sentence perfection. 1</a></sup></p> The audience for the design document changes as it matures. When you are writing your first notes about a problem, you might share them only with a pairing partner to get immediate feedback. As you gain confidence in your understanding of the problem, widen the audience for your document. Seek out feedback from domain experts and from your team as a whole.</p> Show your design document to its stakeholders in advance of any public discussion, to give them a chance to think and give you feedback. Follow the principle of least surprise. People can react badly to surprises even if they agree with the proposal in the main. If you can, avoid introducing complex technical topics in meetings. 
Meetings are best used to solidify alignment or discuss specific known open questions.</p> When you reach the step of sharing your proposal with the entire engineering organization, it will be a solid document that you feel confident about.</p> The process of exploration</h2> Step one: Research.</p> Investigate the background of the problem & document the current solutions, if they exist.</li> Document why the current solutions are inadequate, if relevant.</li> Gather relevant product documentation, if it exists. A product requirements document is ideal, and this phase might be focused on collaborating on requirements with a product team.</li> </ul> Step two: Write a clear problem statement.</p> What change would you like to effect upon the system?</li> What is happening today that you’d like to be different after the work you’re considering?</li> What are the properties of a successful solution? How will you know it’s successful?</li> Identify constraints on the solution space. Development time? Budget? Performance? A fixed point of integration?</li> Why is this the right problem to solve now?</li> What problems are you choosing not</em> to solve right now?</li> Refine your problem statement until the team aligns on it.</li> </ul> Step three: Explore possible solutions.</p> Identify and consider possible solutions.</li> Discuss tradeoffs inherent in the solutions. Evaluate them against the constraints.</li> Estimate costs of the solutions, in time / effort / complexity / maintenance / hiring.</li> If necessary, do spike implementations to test the validity of assumptions or the viability of a specific approach.</li> </ul> Step four: Reach consensus on a solution that solves the stated problem while making acceptable tradeoffs.</p> Sometimes step four does not</em> end in consensus on a solution, but instead ends in a decision to do further research. 
This is a good result and should not be treated as a negative by the team.</p> The design document should now be a document describing the problem, the research, and the possible solutions, and conclude with a plan of action. Congratulations! Archive the final version in the corporate wiki or in a docs folder for the resulting project. Its next audience is the person working on its replacement, who you’ve just given a good head start.</p> Now let’s review the parts again, in more detail.</p> The problem statement</h2> You’ll start with something you think is a good problem statement, but you will often</em> find that it doesn’t go into enough detail to support a good technical decision. Constraints might be missing. Stakeholders might disagree on what success looks like. Important implicit requirements might need to be unearthed.</p> The initial problem statement informs your research, but expect to change it. Push on it and iterate until you have something the team agrees on.</p> Among the constraints you implicitly take on for any project are your team’s shared values</em>. If your team hasn’t discussed those values, now is a good time to do so. Your shared values are partly a reflection of your team’s personality and culture, and partly a reflection of where your business is. A team at a new startup trying to ship something quickly for survival might value a minimal solution that can be produced rapidly. The same team following up after a successful first ship might value flexibility instead. Make the implicit explicit and state any values that might affect this project.</p> Detail on the research step</h2> Do not short-change this step! This is critical to understanding the problem. Do the background research if there is extant code. Summarize that research, with relevant links, so your readers can also understand the context.</p> Answer scaling questions if they’re relevant. 
Gather numbers for today, a year from now, and as far in advance as a reasonable guess can be made. Does your solution to the problem have a lifespan? Don’t look beyond that lifespan if so.</p> For data being stored and manipulated, you might ask questions like these:</p> How much data is being discussed? Is it large in total size or in quantity?</li> What actions are taken on this data? How often does it change? In what quantity?</li> Who is changing this data?</li> What are the constraints on data changes? Are there any conflict resolution requirements? Do operations need to be serializable (expensive) or will idempotency suffice (cheap)?</li> What happens if data mutations are lost?</li> How is this data expected to grow over time? Is it shardable if massive growth is expected?</li> If the data is very very large, the questions become more specialized. If you are not a data engineer, you might want to consult one.</li> </ul> For APIs, the questions might look like this:</p> What other systems are expected to call this API? To do what tasks?</li> What are the latency requirements?</li> Is this operation write heavy or read heavy?</li> How many requests/sec do we experience at peak? How will this number change over time in relation to business growth?</li> Does peak load differ from steady state load? When is the load heaviest? Does this correlate with other usage patterns in the system?</li> If you’re caching expensive work product, identify how you’ll be invalidating that cache. What fails if the cache is stale? (Do you really need a cache? Really?)</li> </ul> Failure analysis is next. This topic can be where engineers shine, because we love discussing how things fall over.</p> How might this system fail?</li> What are the consequences of failure for this system?</li> Should any of these failures be visible to or actionable by the end-user? 
If so, how should they be presented?</li> How should we handle the most important or unusual invisible-to-users</em> errors? Retry? Escalate to human beings? Log and move on?</li> </ul> What are the security concerns? Do a threat modeling exercise with security experts early, particularly if you’re doing something new or not handled by existing tools.</p> Are you accepting untrusted user input? How do you need to handle it?</li> Who is allowed to perform these operations or see this data?</li> Are you managing data that needs to be protected or encrypted?</li> What would an attacker gain if one got access to your data or your API?</li> What would a person with bad motives do if they have normal access to this new functionality?</li> </ul> The appropriate questions to ask depend on what your area of work is and what “affair of the world” it addresses. These questions are intended to get you started.</p> Problem statement (slight return)</h2> Come back to your initial problem statement. Can you sharpen it? Can you clearly define what a successful solution might look like now? If you’ve done the research, you probably can.</p> Don’t move forward until you have consensus that the problem statement is good.</p> Solutioneering</h2> This is where programmers love to be. We are problem-solvers and we want to jump right to solving problems, especially if we can write code to do it. Resist this urge.</em> Your solutions have a better chance of success if they are informed by a solid grasp of the problem you need to solve. Your second and third refinements of a solution are likely to be better than your first.</p> This step is often focused on navigating tradeoffs. The problem statement, if it’s sharp enough, gives you a good razor</a> to use to evaluate solutions against your success criteria.</p> What are the costs of a possible solution? How complex is it?</p> Give the solution a t-shirt size. 
Does it match the time budget the project has?</p> What are the risks in the solution? How might it fail to solve the problem or otherwise fail as a project?</p> What’s the solution’s blast radius? That is, how many other systems would be affected by the work? How many teams?</p> Does the solution introduce new technologies to the overall system, or does it leverage tools your team understands well? If it spends novelty points, do they buy you something worth the expense?</p> Does the solution align with the team’s values?</p> In many cases the right solution will feel good to the team discussing it, and you’ll reach consensus smoothly. When information, values, and understanding of the problem are shared, alignment is easy. If consensus is not happening, make an attempt to figure out why the team is not aligned. Is there a disagreement about values? An information disparity? Is more research needed? Bring in senior staff to help break stalemates. Bring in somebody from another team who has relevant experience. Remember that the project might need to move forward anyway because of business needs, and a half-good solution might be better than no solution in the short term.</p> The document’s final home</h2> I end up making a design</code> or docs</code> subfolder in the code repo for these documents. Your organization might have an official home for documents that isn’t the repo. I suggest that you at least store a copy next to the code, where it will survive as long as the code does. The document will drift out of sync with reality the instant anybody starts implementing the plan, but that is fine. The document exists to help future maintainers understand what their predecessors were thinking at the time.</p> Remember: the act of writing the document is more important than the document. The sharp problem statement and shared understanding of the solution were the goals of the exercise. 
If it got you there, it was good enough.</p> Additional reading</h2> The Rust RFC process</a> discusses the importance of the conversations.</li> Architectural Decision Records</a></li> A Structured RFC Process</a> by Phil Calçado talks about the benefits of widening the circle of review.</li> </ul> Correctness in these details can make an unconscious impression on readers that matters, so if you have the time, hey, spell-check yourself. The opposite side of this is that you as a reader of design documents need to set aside your own fussiness about spelling and grammar, should you have any, especially if the author is not a native speaker of the language they’re writing in. These things are to the side of the problem. ↩</a></p> </li> </ol> </section> Multi-factor panacea Unknown — Mon, 10 Oct 2022 10:00:00 +0000 Context: @substack</a> 1</a></sup> deleted his github account, which includes a lot of foundational source code from the early days of node. The speculation (and it is only speculation as far as I know, though with some foundation in his tweets) is that he did so because of the MFA requirements being imposed on some NPM package maintainers.</p> Here’s my take. It’s not very hot and is probably marginally more informed than many, but it’s also probably worth what you paid for it. I started writing it as a series of tweets, which is why there are some extremely terse phrasings here.</p> Proxies</h2> Okay, I’ll weigh in on this one, because I have spent time thinking about it, and because @isntitvacant</a>, @i_a_r_n_a</a>, and I made MFA happen for NPM originally.</p> Companies freeloading off of open source are worried about intentional security compromises in the software they’re benefitting from. Let’s walk through their threat model: Somebody gets access to the account of somebody who works on a package that company X uses and uses that access to publish a deliberate compromise. 
The update gets taken automatically by the downstream consumer, and then they are shipping their environment variables out to a third party, or running a cryptocurrency miner, or allowing an attacker to get shell access.</p> Does forcing maintainers of “important” packages to enable MFA and never turn it off help protect companies from this threat?</p> Kinda. Restrictions on package authors protect against one category of supply chain attacks. They protect against account hijacking. The state of the world used to be that some NPM users with critical spots in the dependency graph had passwords like “password”. No, I’m not joking. Taking away that easy attack vector seems helpful, so I’m glad we shipped what we did when we did. One good design choice we made was to not</em> implement MFA via SMS, which would have been no protection at all against these threats because social engineering makes SMS not secure at all.</p> Requiring that a maintainer enable MFA for their account does not, however, protect the source you use from all supply-chain attacks. Legit maintainers have been responsible for some of the worst. Case in point: the infamous left-pad deletion was done by the package maintainer and MFA would not have helped one bit.</p> MFA is still a good thing to do</em>, but it’s not protection against what the companies freeloading on open source maintainers are worried about.</p> Why not? Because the account level is only a proxy for the level you care about. You’re far more interested in audit trails that are at the package level and then at the source level. What changed in this release? Did maintainers change? What source changed?</em></p> An historical aside</h2> @isntitvacant</a> did think about MFA from the package perspective when he designed the back end support for this! He also thought about restricting tokens to CIDR ranges, though I don’t know if that has ever been exposed. 
We missed the chance to go even finer-grained on access token permissions than we did. And we definitely should have done audit logs on package ownership changes.</p> My only excuse is that at the time it felt like a triumph to be able to get the feature shipped at all.</p> A side comment about the mess we’re in: NPM was designed to maximize engagement from publishers, not to be a good package manager at the scale that it reached. It was designed deliberately to be viral, not to be secure or auditable.</p> For example: The default is to take updates without thinking about them, to the point where bots do all that work of dependency updating for you. Downloads number gotta go up.</p> For example: The tarball as unit of deploy: huge, contains weird stuff that you don’t care about plus whatever silly things the package maintainer put in, hides the deltas. But it was very easy to implement and good enough. 2</a></sup></p> NPM pushes you into not thinking about your software supply chain, through design decisions made when winning a war among competing node package managers was important to somebody.</p> Stop being pushed. Stop taking updates by default. Think about your supply chain differently.</p> Problems not proxies</h2> What are</em> you interested in when thinking about this threat? The source itself. The software you’re relying on.</p> Inspectable audit trails for changes are far more interesting, and this is not something requiring MFA for package maintainers gets you</em>. Looking at maintainers is looking at a proxy for the threat, not the threat itself.</p> Protecting against the proxy does not give you a free pass on looking at the source you’re depending on and deciding it’s okay. It does not give you a free pass to take every update there is without thinking.</p> Questions it helps to know the answers to:</p> Who published this?</li> How do you know they were that person? (Same as controller of repo? Controller of other accounts? 
What’s the web of identity?)</li> What was the chain of control of the source?</li> Is the source that was published the same as the source in the advertised repo?</li> What was the source delta from the last publication?</li> What does the source do?</li> </ul> And given the possibilities of bugs and</em> of some maintainer with bad goals playing the long game, only the questions about the source are on target. The rest are proxies.</p> The tech industry relies on software they do not take the time to inspect, written by strangers they mostly choose not to pay. Sometimes the industry pays people to work on very critical projects, such as Linux itself! But the web dev world rarely stops to pay the people who were around the node scene at the beginning, writing tiny modules because that was their philosophy, which then got bricked together without their participation into the foundations of modern web development.</p> Because tech industry companies still don’t want to pay for the work they build on top of–with either their time or their money–they impose requirements on those strangers to attempt to protect themselves from a proxy for the threat, with zero cost to themselves. Those strangers have every right not to participate; it wasn’t what they signed up for back then. Any access to their work you had was a gift.</p> tl;dr Use Feross’s Socket</a> to scan the source itself; you won’t catch them all; pay people to write any software you truly rely on.</p> substack the good human, not substack the objectionable paid newsletter company. ↩</a></p> </li> The tarball is a case of worse is better. And yes, that’s a complex statement itself. It was good enough and easy enough to implement that it satisfied the true requirements of the problem in the moment. Knowing the problem space as well as I do now, and in the current package manager landscape, I would design it quite differently were I to take on the project myself, today. 
↩</a></p> </li> </ol> </section> Goodbye Cloudflare; hello Fastly! Unknown — Sat, 27 Aug 2022 16:53:01 +0000 KiwiFarms is a harassment website, sort of like a terrorism-only variation on the *chan sites. It specializes in harassing trans people. It doxxes them, SWATs them and their families, and does its best to drive its victims off the internet. It also has a body count. They are a troll farm</a>.</p> KiwiFarms gets to do this and stay on the internet because they’re being protected by Cloudflare</a>. Cloudflare has a long history of protecting incredibly vile content: they were recently infamous for hosting Daily Stormer</a>.</p> Cloudflare is exceptional in its position. From the Time article:</p> “We find anecdotally that sites prefer Cloudflare because of its lax acceptable use policies and its free DDoS protection services that help protect against vigilante attacks,” the researchers write. They note that AmmoLand, a popular guns rights blog, has praised the company “for its self-described ‘content-neutral’ stance.”</p> </blockquote> Cloudflare takes a freezepeach position on free speech: they do not acknowledge the reality that in order to protect the free speech of the many, we cannot tolerate the abusive behavior of the few. Cloudflare protects the abusers instead.</p> Liz Fong-Jones has been leading the current pressure campaign against Cloudflare most effectively.</p> Why this matters to me</h2> When I set up my blog, I hosted it in an S3 bucket behind Cloudflare, using their free plan because I have very simple needs for it. I do not want to lend them even that little support, so today I moved my blog to Fastly.</p> I moved my last two employers to Cloudflare from other CDNs. I won’t be repeating that mistake until they shape up and start removing Nazis and troll sites without</em> needing pressure campaigns to move them. 
I treasure my friends and it is unacceptable to me that some of them go through their lives afraid for their personal safety because of sites like KiwiFarms.</p> You might decide that freeloading off of Cloudflare is fine, because you’re siphoning resources from them. You might also be unable to pay for another CDN. Only you know your circumstances. I have the disposable income to spend a little more than I spend now on my AWS hosting bill on a CDN provider who doesn’t have to be pressured over and over again to boot sites like Daily Stormer and KiwiFarms.</p> This is how I did it, very short version:</p> Steps were:</p> set everything up in Fastly</li> tell Fastly about my certs</li> verify that their test URL worked</li> duplicate all my DNS setup in Route 53</li> cut over name servers with my registrar to Route 53</li> </ul> The rest of this blog post goes into what I did in more detail, in the hope that I can reassure you it’s very do-able.</p> By hand, in more detail</h2> Create a Fastly account</a>. (Set up 2FA!)</li> Scan through Fastly’s getting started guide</a>. The concepts here are different from Cloudflare’s concepts. Fastly is (oversimplifying a bit) a nice front end to Varnish & VCL</a> plus a lot of POPs around the world to reduce latency to your users. VCL can do a lot. You end up with a lot more control over how things get routed, but the cost is more complexity to cope with.</li> Give Fastly a credit card so you can enable TLS.</li> </ol> Now let’s do the switch:</p> Find a new home for all of your DNS records. I used AWS’s Route 53</a> as my name server because I am very comfortable with it. Your domain registrar might provide name service; all the major cloud providers also do.</li> Duplicate all of the DNS you’ve set up in Cloudflare over in your new DNS provider. Your goal is to avoid downtime when you cut over from Cloudflare to your new nameservers.</li> Now set up a delivery service</em> in Fastly. 
This is a backend – the place the data comes from – plus a domain that is the face of the service – the hostname people type into their browsers. You’re setting up a mapping from domain to data source. For me, the back end is the AWS S3 bucket that holds my blog assets, and the domain name is what you see in your browser right now.</li> Make the service active. Fastly now gives you a test domain name, like blog.ceejbot.com.global.prod.fastly.net</code>, to verify that your content is available as you expect.</li> </ol> Now the slowest part: do something about your TLS certs. Because I am old-fashioned and I haven’t automated all this yet, I buy certs from my name registrar. You can also use AWS ACM, which automates things pretty well. Fastly will help you set up Let’s Encrypt</a>, which is probably the best option for most people.</p> Once Fastly is aware of your cert material somehow, you are ready to cut over. You can do this in two phases. The first phase is a double-CDN phase:</p> Update your domain in Cloudflare to point to the Fastly TLS domain they gave you when you set up TLS.</li> Turn off proxying in Cloudflare. Make the orange cloud gray.</li> </ol> Now your content should be served by Fastly instead of Cloudflare. 
You should see the headers change to something like this:</p>
<pre><code>$ http HEAD https://blog.ceejbot.com
HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 518
Connection: keep-alive
Content-Length: 7149
Content-Type: text/html
Date: Sat, 27 Aug 2022 23:22:43 GMT
ETag: "464e0930f8616e07530366dfa7ba0567"
Last-Modified: Sat, 23 Jul 2022 20:01:40 GMT
Server: AmazonS3
Via: 1.1 varnish
X-Cache: HIT
X-Cache-Hits: 1
X-Served-By: cache-pao17472-PAO
X-Timer: S1661642564.606369,VS0,VE35
x-amz-id-2: E2eXc0YfBq4rX2rGhwOWZMbU26NYxzGcaAzlQ7+E/zHhcp19RIpct8WwFIaDQEy6TWuhluNf1ng=
x-amz-meta-md5chksum: 464e0930f8616e07530366dfa7ba0567
x-amz-request-id: 9V54SNWKBG95XYY3</code></pre> Varnish is serving my content! It’s working! Now you can safely take the last step: switch your domain’s name servers over to something other than Cloudflare. It might take a day or so for the global cache of caches that is DNS to update itself. If all went well, you switched without downtime!</p> Automation</h2> I did not use the websites to do this: I used Terraform</a> because I automate all of my personal infrastructure. It’s good practice, I tell myself. To use terraform you need to make an API key on the Fastly dashboard. Save it in your favorite password manager, then export it in the environment variable FASTLY_API_KEY</code>.</p> Set up the official terraform provider. 
Here’s my providers.tf</code> file:</p>
<pre><code>terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
    fastly = {
      source  = "fastly/fastly"
      version = ">= 2.2.1"
    }
  }
}</code></pre> Here’s the important part of my blog service in Terraform:</p>
<pre><code>resource "fastly_service_vcl" "blog" {
  name     = "blog.ceejbot.com"
  activate = true

  domain {
    name    = "blog.ceejbot.com"
    comment = "the blog"
  }
  backend {
    address = "blog.ceejbot.com.s3-website-us-west-2.amazonaws.com"
    name    = "the s3 bucket"
    port    = 80
    shield  = "pdx-or-us"
  }
}</code></pre> This terraform fragment sets up DNS so Fastly handles requests to my content:</p>
<pre><code>resource "aws_route53_zone" "ceejbot-com" {
  name = "ceejbot.com"
}

resource "aws_route53_record" "blog" {
  zone_id = aws_route53_zone.ceejbot-com.zone_id
  name    = "blog.ceejbot.com"
  type    = "CNAME"
  records = [
    "n.sni.global.fastly.net",
  ]
  ttl = 3600
}</code></pre> This isn’t all of the terraform in my setup. I also have a policy on the S3 bucket restricting access to it to Fastly’s public IP list</a>. That’s a reasonable practice to prevent an accidental gigantic AWS egress cost.</p> You can do a lot more with VCL and Varnish if you feel inclined. For a while through the mid teens, all of NPM’s registry traffic was proxied through Fastly, with a carefully maintained custom varnish file routing things as much as possible at the edge. Most of us won’t need that power, but it’s available if you do.</p> Reduce Friction Unknown — Sat, 23 Jul 2022 12:54:39 +0000 The topic of reducing friction exhausts me: Do people still need to be persuaded to help their developers go faster? Really? In this, the year 2022? But yes, in this, the year 2022, many teams require persuasion on this topic. Or rather, their leaders require persuasion that they have to do more than give lip service to this principle, and that they must invest resources in making it so, and that those resources will not be “wasted” resources, not even for that</em> person, you know the one, the official VP of Feature Factory.</p> Some leaders are not worried about wasting time, but are instead worried that devoting brains to this work will slow teams down</em>. They admit that current processes are full of friction, but claim that they have to finish whatever they’re in the middle of before they should try to fix things. They think that reducing friction is a distraction from the real</em> work. This approach is short-sighted. The best time to reduce friction for your team was the moment it came into being, and the second best time is now.</p> I’m going to cover three topics in this post. First, I’ll define what we mean by “developer friction”. Then I’ll make the case about why reducing friction is beneficial to engineering organizations, including benefits in areas I didn’t expect. 
And then I’ll go into concrete suggestions about how to do it, and the mindset that you need to bring to thinking about it. As is true with many other posts in this blog series, its audience is people who are technical leaders in their organization, but I hope anybody who wants to help their engineering org do better work can get something out of this.</p> Defining our terms</h2> Let’s start by defining “process”. Process is the way you habitually do things</em>. Do not confuse process with ceremony or formality, or any other term you’d like to use to describe overhead added to the core of the thing you want to get done. You always have process.</em> You might not have thoughtfully-designed, intentional process.</p> “Ceremony” is a thing you do every time, ritualistically, usually involving other people. Regular meetings are a kind of ceremony. “Formality” refers to how prescribed and enforced a process is. When people react to “process” as a bad thing, they’re usually thinking of processes with heavy formality or more ceremony than they’re worth.</p> An example of a team process: “We prefer to have code PRs reviewed before we land them in main. It’s okay if docs or other non-functional changes don’t get reviewed and go directly into main.”</p> Adding ceremony: “All changes need to go through PRs, though we don’t require review.”</p> Adding more ceremony: “All changes must go through PRs with review, but we are okay if reviews are a rubber stamp.”</p> Adding formality: “We require that all PRs be reviewed & all CI tests pass before they can land in main, and we enforce this with settings in our source code repo that only administrators can change.”</p> Here’s a non-tech example of ceremony that might help you recognize it: pointing and calling</a>. This is a ceremony that helps operators of dangerous equipment (most often trains) confirm to each other what the status of important indicators is. 
Station guards will point at an indicator showing which side of the train to open the doors on, and call out as they do so, making sure the train conductor knows which set of doors to open. Adding a ceremony to the process helps the operators avoid opening the wrong set of doors. Another example of this would be lockout-tagout</a>. This formal ceremony ensures that people know when dangerous equipment is deactivated and can be worked on safely.</p> Let’s talk about “friction”, the main thing this post is worried about. Friction</a> is increased in a process in each of the examples above. “Friction” is a useful metaphor here because each of those examples opposes motion</em>: they demand more energy be invested in moving the project than would be required if they weren’t there. This might be a good idea! Lockout-tagout makes equipment safer to maintain. The lowest possible friction version of the PR example above is “we don’t care if code gets reviewed; merge right into that production branch.” You can see why adding friction in requiring PRs might be good for that team.</p> Adding friction is just fine when it buys you something worthwhile.</em></p> Teams with high levels of trust don’t need more than that first version of the PR process. Teams that don’t trust each other–or are perhaps required not to trust each other because of mandated security processes–need something more like the fully-formal version. A team that needs that fully-formal version will move more slowly than the first team. Is this worth the cost? It depends on the situation! Your goal is to identify your team’s work habits and work environment and identify things that are slowing everybody down without buying you something worthwhile</em>.</p> Sometimes process is… well, ludicrous and obviously causing harm. This Twitter thread is full of pure, wasteful friction. 
Merely reading it raises my stress levels.</p> Let’s share tech stack horror stories: what’s the worst workflow or most absurd limitation you’ve hit with a codebase?</p> </blockquote> I’ll start: while working as a subcontractor, I wasn’t able to submit code directly for review. I had to attach the updated files to an email. 🥲</p> </blockquote> What’s yours?</p> </blockquote> — Jason Lengstorf (@jlengstorf) July 21, 2022</p> </blockquote> Process isn’t the only source of optional friction, and it might not be the most painful source. Instead, the work environment is often the worst source. The tools. The platform. CI workflows. Automation or, more likely, the absence</em> of automation. Things that break and require human intervention. Buggy tools. Slow tools. Things people need to do often that are flaky. Builds that take forever and slow down develop-test loops. Continuous integration testing that takes a long time to run and slows down landing all work. Slow deploy processes that make the cost of pushing changes live high, and therefore make pushing changes dangerous.</p> The other term we need to define is “toil”. The English word means “labor that tires you out”. In the context of tech world jargon, we use it to mean work that’s draining or time-consuming that doesn’t seem to be related to the core of what we need to get done. Repeated work. Predictable routine work. A process that is predictable and time-consuming but has to be done by hand is toil</em>. Resolving Dependabot PRs to your repos is toil</em>: it feels like work but accomplishes nothing worthwhile.</p> You shouldn’t tolerate either toil or tools misery. They are entirely avoidable, and they’re killing your team’s velocity and making everybody unhappy. Take stock of problems in this category, prioritize them, and eliminate them.</p> Making the case</h2> You might think it would be easy to point to these sources of slow-down and say, “let’s fix things”. In practice, you might get pushback. Why? 
What can we, as technical leaders, do about the resistance to making things better?</p> First we must acknowledge that changing any system is difficult: systems are self-reinforcing for many reasons. People within the system see the cost</em> of change clearly, but they often don’t have good ways to measure the rewards</em> of change. Also (and let’s be honest here) all of us have lived through having change promoted to us as unalloyed good, then seen it turn out to be not so great. Or actively awful. People proposing change have a higher bar to jump over than people who want the status quo. So if you want change to happen, you have to invest energy yourself. You’ll need to make the case for action.</p> Why hasn’t anyone else made the case? Why is your team stuck here? Good questions! Remember that the people next to you in this situation probably hate the friction just as much as you do. If they could stop it, they would. Once again, we have to go to the system they’re in and look at what it reinforces. You, as an analyst of that system, have an easier time popping out of it and changing it.</p> Let’s look at some reasons why people around you might resist the push to make things go faster.</p> It didn’t happen overnight</h3> The team might be unaware of how bad the problem truly is. They might not have noticed it was happening, because it probably didn’t get bad all at once; the slowdowns and the trouble got worse slowly over time.</p> To show how bad it is and break people out of denial, you might go to the data. How costly is the friction? Measure it! Count the number of times tool X</em> explodes and the team wastes a day on cleanup. Graph how much time people spend waiting for slow builds. The data will help you prioritize, so it is not a waste. (I think gathering metrics on internal tools is a good habit for teams even when everybody’s happy.)</p> Ownership</h3> The resistance to change might come from a far more human and emotional place. 
People might be attached to the things they built in the past, and reluctant to retire them. Don’t be a jerk about the software past versions of the team wrote. People do the best they can given the circumstances they’re in. Solutions that solved the problems of the past might no longer be good at solving the problems of the present. Honor the work done earlier, and let people feel good about it even as you’re coaxing them into replacing it. If you can, let them own the work</em> of making their thing better. If that’s not possible, at least seek out their feedback and ask them what they’d do differently this time around. They probably have good ideas.</p> Sometimes people will block whatever work happens. They might want to retain control. They might be unable to admit they were wrong about something. The worst case I’ve seen was somebody who simply resented all authority telling them what to do about anything. Toxic orgs probably feature several people like that. Do I have to tell you what to do here? You don’t want to do it, because you’re a human being with empathy, but sometimes you have to fire people.</p> Stress</h3> Organizations with a lot of friction might have people stressed by the work of pushing things forward despite the friction. Your most dedicated and motivated colleagues might be working the hardest to do this, and suffering the worst stress as a result. Stressed people can’t imagine adding to their workload by revamping existing systems that work, however poorly. They will resist change to protect themselves from their burdens getting worse.</p> This is an own-goal on the part of the organization. Leaders can prevent this, and indeed must. Stressed people don’t do their best work. Full stop.</p> Stressed people need to have their immediate needs honored and work shifted away from them. You must not listen to their opinions about what can and cannot happen until you’ve fixed their immediate emergency. 
Indeed, removing friction might give them the space to imagine a better world.</p> Don’t ask them to do the work of fixing their desperate situation. Fix it for them. This one’s on management, and maybe on you, o fellow technical leader.</p> Learned helplessness</h3> The most depressing resistance to change comes from people who say that this is how bad it always is. They can’t imagine things being better.</p> Anecdote time! I once worked for a moderately successful but not quite successful enough startup that made a hardware thingie you might even have heard of. Eventually it was acquired by ConHugeCo Software, Inc, a very very very large company indeed that you’ve definitely heard of. The new corporate owners wanted their newly-acquired software team to work on project Foobar, already in motion. Foobar had a lot of existing process and tooling and a team that was already pushing it forward. They were behind. They were engaged in weird political machinations to create excuses, they were so behind. Surely this acquihired team could help!</p> Um.</p> Eventually I joined project Foobar, and I learned why it was behind. Getting a single commit into the source repo for project Foobar took at least half a day and sometimes an entire day. You had to get into line to check in. When you were head of the line, you had to resolve any merge conflicts that were caused by the people who merged in since you got into the line. (And no, this was not</em> git.) You then had to build the full thing, and that was slow. Hours slow. Then you had to test. Then you could merge. Heaven help you if you broke the build: there were people who would get mad at you about that and penalties for it were discussed.</p> “Why,” I asked somebody, “do we not have a build team making this faster and better?”</p> The answer stayed with me. It was: “Nobody wants to be on a build team. They get laid off when their work is finished.”</p> Laid off. Their work. Finished. Uh. 
What?</p> The culture gap was epic and unbridgeable. The project turned out to be a famous disaster. Are you surprised? No? None of us at $acquiredCompany were surprised, either. The acquiring team could not imagine healthier processes. The cudgel was their only tool. They did not fix anything because that’s the way things were.</p> This is learned helplessness. Reject it. Things can be better than that. It is not only possible but normal</em> for things to be better. I know that. You know that. Stand up for it.</p> If you can’t, leave.</p> The positive argument</h3> Let’s make the case with more positive arguments. What will you get by relentlessly reducing developer friction? The obvious benefit: the whole team will go faster. I have to call this out explicitly, because a lot of the pushback to the idea of reducing friction comes from not thinking about what this means.</p> Everybody. Goes. Faster.</p> Reducing the amount of time it takes to do something by a couple orders of magnitude can have radical effects not just in kind but in category. When it took many minutes to download a single MP3 file, nobody was streaming movies. Now that gigabit fiber is an option for many homes, we’re streaming high-definition movies on a whim. Things you couldn’t imagine happening before become normal. You can probably think of more examples like this.</p> Here’s a modern example I’ve lived a couple of times now:</p> Deploys become fast: the cost of making changes is now low. The cost of making changes is low: people become less fearful of making changes. Less fear: changes get smaller and more frequent. Small, frequent changes: less dangerous inherently, so failures happen less often. Failures happen less often: the team becomes more confident. A confident team experiments and pushes themselves into trying new things. Everything gets better.</p> This is a virtuous cycle. 
This particular virtuous cycle can be promoted in lots of ways–great CI for instance–but hey, even CI benefits from running fast. And frequently. And easily from a developer’s laptop and not just a remote process if you can wrangle that one. A barrier to doing something is a kind of friction too!</p> Friction is frustrating</em>. It generates stress. Nobody enjoys slogging through a ceremony they can’t see the benefits of. Nobody enjoys watching a deploy fail again</em> in the same way as the previous five times this week. Friction without payoff makes people unhappy. To my mind, this is reason enough for fixing it. Content people who are comfortable and talking regularly with their colleagues do great work; unhappy teams spend their time fretting about their unhappiness. The world is stressful. Don’t add to it. This is ethically good as well as pragmatic for whatever your shared venture is.</p> Let’s make a more banal, money-based argument next.</p> Salary is, for most companies, the single biggest cost they have. Stop wasting that money! Why are you spending money making your programmers do things by hand that could be done by a small shell script? This is overall a complex topic, and a lot of things factor into your decision to build, buy, or do nothing. Here, we’re most likely talking about build OR buy vs doing nothing at all. A fast calculation of salary hours vs payoff is useful for deciding when to act as well as when not</em> to act. Make a rough estimate of how much time your team wastes waiting for builds (fixing something, pushing a repeated process by hand, etc.) for the entire year</em>, then compare that to what you’d invest into a single push into making that faster.</p> Once again, measurements help to inform your decisions. If you don’t have data, do something lightweight to get it.</p> Things to try</h2> You are convinced! You have convinced others! You are able to act to reduce your team’s friction! 
How do you do it?</p> Start by asking your team what is slowing them down. They will straight-up tell you what’s wrong. Listen to reports of irritation; if the irritation rises to the level of frustration pay special attention. You might not take your team’s proposed solutions</em> at face value. Here your team is like any software user, who will tell you all about the solution they’ve imagined, not the best solution you might provide. Listen to what people are trying to do and why they’re being prevented. Pay attention to the reality of their stories. Question everybody’s assumptions about the way things have to be, including your own.</p> Imagine what you would do in the ideal case, if you were designing the thing from scratch today. Take a step toward that ideal from where you are now. This is</em> possible.</p> If you’re using bad software, stop.</h3> Is your system configuration software driving you nuts? Switch to something else. (It will drive you nuts too, but perhaps less nuts.)</p> Is X</em> famous SAAS thing that was super-cheap to buy driving your team nuts? (I’m looking at you, ubiquitous but relentlessly mediocre famous suite of tools.) Switch to something else.</p> Has your team staged a revolt and started using something that isn’t the official choice? Listen to the pain of your team. Honor the pain. Switch to their choice. This isn’t about allowing chaos to reign, but about paying attention to existing signals, and paying especial</em> attention to strong signals.</p> Make team software changes definitively and without half-measures. Commit to the change. Retire the old stuff. Plan a cutover if necessary so you don’t leave mess behind: do any required data migrations. Get feedback on the results. 
You shouldn’t make changes like this on a whim unless the cost of change is pretty low, but doing it on the worst offenders can be a huge morale boost.</p> Treat internal tools as important software.</h3> Work on internal tools is highly-leveraged: every one of your developers will write better software when their tools are good. It is worth</em> devoting senior engineering brains to them. It is worth devoting your</em> brain to them if there is nobody else. Your job, o fellow technical leader, is to make your team successful at building the widgets your organization wants to build. We must do the things nobody else can do.</p> If using an off-the-shelf tool isn’t possible, then the tool you’re building is critical to your product. Treat it like that. Take the work seriously. Design it thoughtfully. Do your usual requirements analysis! Who’s using this tool? What are they trying to do? What are the performance and latency requirements? How should errors be handled or reported?</p> Sweat the output of internal tools. Don’t bury important results of CI in a rubbish heap of uninteresting compiler output. Tufte’s design principles</a> apply here too.1</a></sup></p> Doing this analysis on testing system output was super-fulfilling and helpful for the consumers of the test output.</p> Common tool areas for you to think about:</p> Chat and video conferencing software: is it reliable and high-quality?</li> Bug/issue/task trackers: help or administrative burden?</li> Source control software and tooling around it.</li> Development environments: setup of any common software that your team needs to use. Examples would be specific versions of a language runtime or compiler needed to develop software.</li> Internal tools that solve problems specific to your internal workflows.</li> Build systems, both for the develop/test loop and for release processes.</li> Deploying software. Is it fast? 
Is it reliable?</li> The substrate upon which software gets deployed.</li> Automated testing, particularly integration testing.</li> </ul> Distribute internal tools in compiled, packaged form. Don’t make people build/install them every time they need to use them. Have enough release process for these tools to ensure they work. Consult user</em> convenience, not developer convenience here. (The needs of the many, etc etc.)</p> Treat your processes as worthy of thoughtful design.</h3> I mentioned earlier that you always have process, because process is the way you usually do things. Think about your processes</em> and tweak them as needed to remove unnecessary friction from them.</p> Water runs downhill. People always do the thing that’s easiest to do. Your goal is therefore to make the right thing to do the easiest thing to do. If people are regularly doing any end-run around a process to get work done (say, regularly asking for rubber-stamp PRs so they can be unblocked), you have a process that’s not earning back its energy cost. Fix it.</p> What are the goals you want a habitual-way-of-doing-things in an area to achieve? What values do you want to express? Be clear about them. Be clear about the priorities of your values. You might need to honor high priorities and let lower priorities go unfulfilled.</p> Make sure you have a feedback loop</em> somewhere helping you evaluate your new processes. Designing processes without feedback from the lived reality is possibly worse than not designing them, because you’ll have people held accountable for doing things that turn out to be bad ideas. Iterate. Improve. Nothing need be set in stone. It’s okay to change! It’s okay to look at where people are walking right now and pave those paths. It’s a decent starting point.</p> Jump out of the system and examine its assumptions. 
One way of reframing the “I’m blocked by no PR reviewer here” problem is to notice that the person who’s blocked did the work alone and has no team or buddy who shares context about the work. If they paired, they would have an instant PR review, and a pretty high quality one.2</a></sup> If the work was planned work and review was blocked, perhaps time for reviews should be budgeted into your team’s plans.</p> The best process is one that your team doesn’t even think of as a process because it’s been automated into invisibility.</p> Automate.</h3> Obliterate toil: automate it.</p> Automate ruthlessly. This is where I have seen the most surprising</em> pushback. We’re programmers. Automating processes is what we do! People will flinch about this, afraid of time spent automating things that won’t pay off. Yes, we’ve all been there. So don’t do that.</em> Don’t automate things that are really one-offs. If there’s any chance you have to do the same thing more than five times3</a></sup>, automate it. If it’s complex and difficult for a human to do, automate it. If the blast radius of the explosion caused by a human doing it wrong is large, automate it. If the end results need to be the same every time, automate it.</p> Infrastructure should be automated as far as you can push it.</p> The upside of automation is that the software that does the work for you can be instrumented.</p> Measure and observe.</h3> This is a corollary of deciding to treat your tools as important software, but it’s worth calling out.</p> Measure everything, and make the results of the measurement visible.</em> Measure how long a process takes. Measure how long PRs sit unreviewed. How long each step of a deploy takes and how many deploys fail. Make all of this data easy to look at.</p> Instrument your tools so you know how often people are using them, how long the runs take, and whether they succeed or fail. 
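As a rough illustration of how little code that kind of instrumentation needs, here is a minimal Python sketch. The function name `run_instrumented` and the record fields are made up for this example; they are not part of any particular tracing product. It runs a command, times it, and writes one JSON line with the tool name, duration, and exit code to whatever sink you hand it.

```python
import json
import subprocess
import sys
import time


def run_instrumented(argv, sink=sys.stderr):
    """Run a command, then emit one JSON line describing the run.

    `argv` is the command as a list of strings. `sink` is any object with
    a write() method; stderr is only a stand-in for a real metrics or
    tracing backend (an assumption of this sketch).
    """
    started = time.monotonic()
    completed = subprocess.run(argv)  # inherits stdin/stdout/stderr
    record = {
        "tool": argv[0],
        "duration_seconds": round(time.monotonic() - started, 3),
        "exit_code": completed.returncode,
        "succeeded": completed.returncode == 0,
    }
    sink.write(json.dumps(record) + "\n")
    return record
```

You might wrap an existing build step as `run_instrumented(["make", "build"])` (any command your team already runs would do) and ship the resulting line to wherever your team looks at data. The overhead is one timer read and one small write per run.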
(Don’t instrument so heavy-handedly that you slow them down.)</p> My favorite way to do this is to use Honeycomb</a> to trace everything, not just our production software. At a recent job we instrumented builds, deploys, and CI runs this way. The output of those runs prominently included links to Honeycomb’s visualizations of the traces. Every build and deploy report included a link to a view like this about how long it took:</p> </p> Is this deep? No. Did it take a long time to do? Also no. Is it helpful? Definitely yes</em>. Imagine this, for everything. Imagine this, telling you about timings for every single internal tool you run, including the exit code returned and who ran it. Imagine how much better you can make every single tool your team uses with data like this.</p> You might have another tool you like to use here, which is great! Please tell me about it on Twitter!</p> The deer, they are teal</h2> Here’s what I’d like you to take away from this blog post.</p> Friction is slowing down your team.</li> The energy cost of overcoming friction needs to buy you something worthwhile, or it needs to be reduced.</li> Investigate friction by talking to your team. Frustration is an important signal.</li> Observability isn’t just for your production software: measure everything. Use data to inform your decisions.</li> Order of magnitude changes in cost result in entirely new behaviors.</li> Design your processes.</li> Design your tools.</li> Automate ruthlessly.</li> Set up feedback loops so you learn what’s working and what’s not.</li> </ul> Most importantly, you can</em> fix it. Every little bit you fix gives you more energy back so you can fix the next thing. It will</em> be worth the investment.</p> My thanks to Chris Dickinson</a> for the lockout-tagout and pointing-and-calling examples! 
Also my thanks to David Zink for editing my prose into a tighter form.</p> Tufte’s design principles, recapped because they are so good:</p> Above all else show the data.</li> Maximize the data-ink ratio.</li> Erase non-data-ink.</li> Erase redundant data-ink.</li> Revise and edit.</li> </ol> He’s talking about visual design, but this works for writing as well. ↩</a></p> </li> To repeat myself: PRs are best used to socialize work that’s already in a good state, not to find bugs in work somebody has already decided is finished. In other words, the useful review and tightening should happen before</em> the PR process, in some earlier phase. Pairing is good. Strong testing is good. Team discussions about ways of solving a problem are good, so the approach taken in a PR doesn’t need to be debated. The PR is to say to a wider audience: hey, this thing happened. An exception to my own approach: small, uncontroversial bug fixes are perfect for review in PRs. ↩</a></p> </li> I kinda want to say “three times” here instead of five, but you know, use your judgement. Do a little basic arithmetic on how long a thing takes and how often it’ll need to happen. Think how important getting it done consistently is. Prioritize to match. ↩</a></p> </li> </ol> </section> Against dogmatism Unknown — Sun, 29 May 2022 13:34:29 -0700 Sometimes I think that my next conference talk ought to be nothing more than a live read-through of Tef’s blog post, “Repeat yourself, do more than one thing, and rewrite everything”</a>. This is a bad idea because Tef should do that, in some post-pandemic future when international travel is safe again and I can attend and buy him a drink. 
So I’m going to let Tef’s blog post push me off into my own direction instead, and attempt to add something useful to his wisdom bombs.</p> Tef’s main point—worked through via examples of common advice given to programmers that is sometimes bad advice—is that all advice has a context</em>.</p> When you hear a piece of advice, you need to understand the structure and environment in place that made it true, because they can just as often make it false. Things like “Don’t Repeat Yourself” are about making a tradeoff, usually one that’s good in the small or for beginners to copy at first, but hazardous to invoke without question on larger systems. – Tef</p> </blockquote> “Don’t repeat yourself” is the advice I rail against in a recent post in this series</a> because I saw how application of the advice damaged a particular code base. Any two code paths that looked at all similar were collapsed into single methods with long parameter lists, with flags and null checks to determine mid-flow which one of the five different entry points was in use this time. This made the code difficult to understand, debug, and change, because any change had to be verified as appropriate to make for many different entry points. Every bug we worked on required careful documentation of the many ways a specific code path could be invoked and careful mental simulation of execution for each.</p> Was the maintenance cost worth whatever was saved by not duplicating some smaller sections of code? No. But probably it didn’t start out that way: it started out with somebody adding an entry point and not</em> copying code, because, well, don’t repeat yourself. And then do that a few more times, each time adding a parameter while scrupulously not repeating code, until the programmers who understood each path through were all gone.</p> I grind an axe here, of course. My point is that following the DRY advice dogmatically was a bad idea</em>. 
It’s the dogmatism that gets you.</p> Dogmatism says: Don’t repeat yourself means don’t repeat any code, ever.</p> Dogmatism says: This particular one project management methodology is the one true methodology! Every team at this company will do agile/scrums and always-pair-program/never-pair-program while fibonacci-pointing/playing-planning-poker.</p> Dogmatism says: Object orientation is the only way people should structure code and therefore this programming language only has classes.</p> Dogmatism says: All software must follow one of the named design patterns in the Gang of Four book/some other book and if you can’t name the pattern you’re doing something wrong.</p> Put that way it sounds silly, right? So why do we keep doing it?</p> Because we don’t like the reality that we must always do the work</em> to find the right solution to the specific problem in front of us. It’s much easier to fall back on a set of rules that we don’t have to think about or make hard decisions about. But this compromises our solutions.</p> There’s a blog post in me about how making tradeoffs well requires understanding clearly the values you’re using to select among possibilities. Every value is a razor you can use to make decisions. Dogmatism is a value! It makes decisions for you.</p> I suspect dogmatism is a value we often hold without self-reflection. That is, we can hold it as a value without being aware that it’s a value and that it is influencing our decision-making. I think it makes bad decisions. Dogmatism doesn’t let you weigh tradeoffs. And friend, it’s tradeoffs all the way down.</p> One year for a one-line fix Unknown — Tue, 24 May 2022 14:59:59 +0000 This blog post is harder to write than you might think, because it goes right into a number of people problems and the behavioral patterns of toxic organizations. I wish to preface all of this by noting that toxic organizations warp the behavior of everyone in them. 
People who might behave in healthy ways in healthy orgs find themselves behaving badly inside toxic systems. The only thing to do is fix the organization first. So I have sympathy for everybody involved in this story, both my unknown predecessors and the people who were right next to the problem the whole time. I am most interested in what I will do differently</em> next time.</p> With that preface in mind, let’s tell a story.</p> Let me tell you a war story.</h2> Once upon a time there was a dot-net monolith, one that had been poorly maintained for a long time, hacked upon by rushed people who were evaluated only by how fast they pumped out the next feature the CEO wanted. This dot-net monolith was in a poor state and everybody around it knew that. It was expensive to run (its AWS costs were enormous), expensive to work on (making changes was time-consuming and dangerous), expensive to deploy (deploys often broke and took hours to resolve), difficult to test (the test suite was a mechanical turk service that ran overnight), and difficult to understand (re-entrant side-effect-heavy functions and in-memory caches on top of external Redis caches made for some fun race condition factories). The team around it knew it had trouble.</p> Enter me, somebody who didn’t know a dot-net from a dot-product. I was brought in to scale out the system, which had a lot of new code written in JavaScript around that dot-net thing. I was fairly confident in my ability to make node jump through hoops. I knew that C# was Microsoft’s proprietary version of Oracle’s proprietary Java, so at least I could read the code. Mostly. At my request, I started out fixing bugs on the team that touched the most varied parts of the system, so I could get my hands dirty first, learn how things fit together in reality, and earn credibility with the overall team before I had to start making changes.</p> Two weeks into my new job, the entire system fell over on a regular weeknight, under regular load. 
And by “falling over”, I mean it became non-functional. All API endpoints began to fail to respond. The site was down. Nobody could purchase widgets and have them delivered.</p> Why? Nobody could say. It looked like it was Redis. At least, the CPU on the Redis cache instance was hitting 100% and when it did, everything stopped.</p> Now, like many of us, I was very familiar with Redis. I trusted Redis. It is often the most reliable piece of software in my stack. I’d pumped a lot of traffic through Redis at the world’s JavaScript registry, a lot more than this single-state retail outfit could possibly be sending through it. What was this system doing to Redis that was making it thrash so badly? Nobody knew. What were we putting into Redis? Nobody knew. How many objects were in it? Nobody knew. How big were they? Nobody knew.</p> “Where are your metrics?” I asked. There was an expensive hosted Graphite service, but nobody was looking at it. There was an expensive APM product wired up to the monolith, but nobody knew how to interpret it. There were Cloudwatch graphs! Only infra had access to these or the ability to make dashboards.</p> At this point I knew what my job needed to be first. I went on an observability tear over the next months, among other tears inspired by this outage.</p> We got through that initial outage by upgrading to AWS’s largest Elasticache, which was ruinously expensive but seemed to hold up under the load. We then mitigated the problem around the edges by taming some problems with the website hitting endpoints more than it needed to, and at the core (most meaningfully) by splitting up the cache into several different cache instances. (Two very thoughtful engineers had already browbeaten their way into being given time to refactor the code enough to make this split happen, because they knew this was a problem area before the outage happened. 
The implications of this sentence are entirely intentional, and we’ll come back to them.)</p> We limped through the weeks remaining until the day that was the big sales day for the industry, the one that was going to be the biggest day ever with $X of revenue, for some record-breaking value of X. The entire company prepped for months for this event, with marketing and incentives and ordering stock to be sold.</p> Three hours after opening, the system went down. Adding more instances of the monolith brought the system down harder. In the end, we had 3 hours of downtime in the middle of the hottest business day of the year, the equivalent of Black Friday, and this downtime ruined the work of everybody at the company who’d prepared for that day. It was bad. Very bad. Company-harming bad.</p> One year later, my colleague Chris and I identified the problem and fixed it with a one-liner.</p> That’s a heck of a war story.</h2> Right? The one-line mistake that nearly killed a company, and the one-line fix that saved it. Except, well, it’s more complicated than that.</p> In the end, none of the observability instrumentation I added mattered.1</a></sup> It was the default Cloudwatch Redis graphs that identified the problem. When we stood up an idle cluster of the service using our new deploy system (deploy times down from 30 minutes minimum to less than 3 minutes tops)– ahem–</p> Hold on, you rewrote the deploy system?</em></p> Yeah. When the entire infra team was laid off we were finally able to fix probably the worst cause of daily development friction–</p> Hold on, the entire infra team was laid off? And this let you fix things?</em></p> Yes, as I was saying, we finally had access to everything and freedom to fix what had been obviously broken for a long time.</p> Hold on.</em></p> I know. 
There’s a lot to unpack here, and I’ve been struggling for some time to find a way to unpack it that remains kind to the people who were trapped in this toxic organization alongside me, doing bad things because that’s what the organization wanted them to do. For some of that</em> story, read “Dysfunction junction”</a> first.</p> I’ll talk about those things in a minute.</p> The punchline to the story.</h2> Back to the bug. Redis. AWS’s largest Redis. CPU hitting 100%. That bug.</p> As I was saying, testing our new deployment system was what made us look at this problem again with fresh eyes. We used the new deployment system to stand up an idle production cluster</em>, ready to be swapped in for the older prod cluster that used the old deploy system. This was something we’d done for every microservice in the system, so we had a lot of practice doing it, and at last we were doing the hard one, the dot-net one.</p> The moment we brought the new, idle cluster into existence with terraform, we noticed the Redis instance CPU spike and cause trouble to the production system. That was surprising! The new prod cluster wasn’t live yet! We looked at the full set of Redis graphs and noticed an oddity. The new connections per minute was absurdly</em> high normally, and the idle cluster had just spiked it higher.</p> First, no way should any process be generating new Redis connections except on restart. That graph should be sitting at zero. Second, idle clusters shouldn’t be creating any new load on Redis except at process start.</p> “Huh,” we said. “Could pings from the load balancer be creating new connections? Because pings are the only traffic it’s taking. That would be fundamentally broken but it would explain this.”</p> So we fired up a video chat, shared a screen with the code that injected Redis into the dot-net frammistans, and set about understanding what it was doing. 
We learned the word “lifestyle” and read the docs on the various kinds of lifestyles: transient, scoped, and singleton. Nearly all of the Redis connection managers were singleton lifestyle, which is for the lifespan of the application. Seems good! Then we noticed one line that didn’t look like the others, injecting connections for the general-use cache:</p> container.Register&lt;ICacheConnectionManager, RedisConnectionService&gt;(Lifestyle.Scoped);</code></pre> “Scoped” means to create an instance of the thingie once per request lifecycle</em>. Once per request.</p> Every. Single. Endpoint. Invocation. Created. A. New. Redis. Connection. Pool.</p> All requests, not just requests that needed to use Redis. Requests like the health check endpoint, the one that should be near-zero cost because load balancers hit it frequently, requests like that. This is why the Redis CPU graphs looked like some kind of exponential function on the “active thingies in the system” count, because it literally was. The system had been DoSing itself into downtime for years.</p> We changed that one line to give it a singleton lifestyle and deployed the change to our (new, shiny, one of many cattle) integration environment. We observed that the new connections graph began behaving as we expected, and everything kept working. So we deployed it to production.</p> Really easy fix. It let us stop running the largest ElastiCache AWS sells and collapse all the split-out caches into the new much more modest not-clustered cache, and it made everything go faster. Scaling horizontally no longer caused the system to punch itself in the face. 
That plus the new fully-terraformed ALBs made dealing with big days completely routine, and engineering commenced a very quiet two years of rebuilding with an AWS bill that was a fraction of what it was before, and I mean holy heck we cut that bill down to something that was [Ceej’s editor has deleted a lot of ranting here].</p> However. I remain unhappy about this fix.</p> I should have spotted this a year before, during the original Redis-caused outages. If I had seen that graph– and I should have demanded to look at all those graphs– I would have known immediately that something was very wrong, because nothing should be creating new connections like that. But I didn’t. Why not?</p> Why did it take a year?</h2> Expertise, ownership, and trust. Each of these concepts is a two-edged sword and each cut me with its second edge.</p> Expertise.</h3> Expertise. I knew I did not have C# or dot-net expertise. I had to rely on the people who had it. I was also not familiar with the code</em> that did this work, especially at two weeks in. I had to trust the people who knew dot-net and knew the code to assure me that there were no obvious howling bugs in it.</p> Where expertise is assumed but is not present, bad code goes unchecked. People get angry when you review their work and ask for changes, or even when you only ask questions about the work. Defensiveness can arise in low-trust environments, but it can also mask situations where people don’t have the expertise you need them to. Or situations where people have the expertise but are so pressured, stressed, and burned out that they’re not operating at full capacity.</p> Here the second edge cut me because I assumed without pushing that the people with dot-net expertise had already investigated the obvious possibilities. But also! I lacked this expertise myself. When we finally hired people who were expert with dot-net and comfortable with it, they laughed at this bug, because it was familiar territory for them. 
They’d have looked for and found it immediately.</p> Ownership.</h3> When some one human or a team owns something, I feel I need to let them own it and trust their expertise. Meddling in their work can destroy their self-confidence or make them feel undermined. A feeling of ownership is good! It means you feel responsibility for that thing, and know that the burden of maintaining it rests on you.</p> The other edge of ownership is gatekeeping. The deploy system was obviously a block to all development by all teams. The team had a Slack channel where they negotiated who was going to merge which code for the single deploy window</em> available on four days a week, with no deploys allowed on Fridays. Deploys were flaky and could take up to three hours to resolve. A colleague with a technical leadership role was in fact working on a better deploy system, but the infra team manager instructed their team to ignore the work.2</a></sup></p> The infra team also jealously guarded access to things they thought belonged to them, such as access to an Athena search setup for production logs. At one point one of them locked down commits to the main monolith’s repo, announcing to the team that they no longer got to merge into “my repo”. To be clear, this was a human being who’d been burned to an absolute crisp by overwork; the blame flows upward.</p> Management can of course be the worst gatekeeper of all, and it was in this case. I mentioned briefly before that Redis had been identified as a problem area by some informed engineers, and they had to push hard to be allowed the time to work on it. They might have been more successful if supported</em> by management instead of being treated as if they were wasting time that would be better spent on cranking out this month’s pet feature for the CEO.</p> From a distance, I can say with some confidence that gatekeeping was the worst block to diagnosing this Redis bug.</p> Trust.</h3> Trust. 
I said the word “trust” in each of the two preceding sections, because I had to extend trust to my colleagues. You earn trust by granting trust. People live up or down to your expectations of them, and I prefer to expect the best.</p> You can see the downside of all of these. Where expertise is assumed but not present, bad things happen. Where ownership turns into gatekeeping, other people are blocked from fixing things even if they could help. When trust is not warranted, things get into bad states and stay that way.</p> “Trust, but verify.” – unknown origin, but possibly Khrushchev</p> </blockquote> Then the layoffs happened.</h2> The gatekeepers were all gone. The experts were also (mostly) gone. The ownership and responsibility were all on me and a much smaller but motivated team.</p> When the ownership fell to me, I felt both responsibility and empowerment. I was no longer politely taking people at their word, because those people weren’t there any more. I was investigating and experimenting on my own, and ruthlessly testing all of my own hypotheses. I knew I didn’t have expertise, and even when I do</em> have expertise I have learned the hard way to double-check all my own work.</p> I was also not bound by the past. I did not care if something had always been that way. I was okay with doing things differently. I don’t much trust myself, but I did trust the people working alongside me in that moment. And most especially, I trusted the work we did together, because we verified it together.</p> The sad thing is that the ownership turned into gatekeeping problem was the difficult one to surmount, the one that in retrospect I’m not sure I could have solved in any other manner than parting ways with the gatekeeping team. I am going to tentatively state a thesis: operations/infra teams as teams separate from engineering always turn into walled-off defensive gatekeepers. You cannot allow them to exist in healthy orgs. 
You must practice some variation on devops by embedding people with this expertise into project teams.</p> Maybe there’s a way to do it if you frame their goal as developer experience</em> not as “operations” or making AWS go brrrrrrr. The goal has to be to keep people focused on their customers– the engineers building the project– and not on defense against their colleagues. The same goes for security teams: embed those experts where they can have sympathy for the problems their colleagues are trying to solve and improve their solutions early.</p> But I digress. Expertise, ownership, and trust are a big part of this story, but they’re not everything.</p> The context also mattered.</h2> Years later I learned that one of the two engineers who’d started working on Redis before the outages had some suspicions that there was a lifestyle problem, but he was afraid to change code that had been that way the entire time he’d worked there. I had no such fears, because we had removed reasons to fear experimentation</em> by completely rewriting the infrastructure and deployment environment to make experimentation low-cost. We’d also invested in full tracing via Honeycomb</a>. We knew what was going on with the system in ways that we didn’t in that first outage.</p> The highly-contended integration environment had become many environments</a>. (The link goes into detail about that project and</em> tells the story of this bug from another perspective.) Deploys had been made fast and reliable. Access to information and metrics was available to everybody. Full access to AWS was available to all engineers. If a change broke production, the fix was three minutes away.</p> We weren’t scared to make changes any more.</p> I also want to call out that the team had progressed past trusting the word of people in the past about how things worked and whether or not things were feasible. 
We had the space and the support to read code to see if it genuinely behaved as described or if it worked differently, and experiment with changing things. Management was no longer blocking people from investigating or fixing technical debt.</p> What can we learn from this story?</h2> This is what I’d like you and my future self to take away from this war story:</p> Assume nothing. The people around you might be wrong! You might be wrong too!</li> Test all hypotheses. Each test gives you more information.</li> Eliminate gatekeeping. No team can afford to cope with the damage done by people who want to keep information or access away from their colleagues.</li> Observability, even humble standard metrics, is invaluable.</li> You (o fellow technical leader) own everything. You must always feel the responsibility of that ownership. You can share it, but it’s always partly yours.</li> Trust but verify. Especially team superstitions.</li> Ruthlessly eliminating developer friction pays unexpected dividends.</li> </ul> Also, it was totally not Redis’s fault.</p> For this specific problem. It was and remains invaluable for other reasons. ↩</a></p> </li> Most toxic behavior is driven by toxic organizations, but some toxic behavior is individual and creates</em> that toxic organization. ↩</a></p> </li> </ol> </section> Legacy you hate Unknown — Tue, 24 May 2022 13:17:00 +0000 What should you do with a pile of legacy code you hate?</p> This was the central challenge of my last job. I was partially successful at solving it, and unsuccessful in ways that I want to share with you so you can do better than I did.</p> Let’s start by clarifying the problem.</p> “Hate” is a spongy word and we can be more descriptive about why you dislike the code base you’re presented with. Maybe it’s bad code: tangled, hard to maintain, failure-prone. Maybe it was written long ago by people who’ve long since left the company (burned out by having to maintain it) and nobody left understands it. 
Maybe only a few people are able to change how it behaves, and it takes those people far longer than anybody likes. Maybe it doesn’t scale in the ways you need it to. Maybe it’s also</em> written in a language you don’t like or don’t know, or maybe it’s written on top of a framework you don’t like or don’t know.1</a></sup></p> Whatever the reason, you’re very done with this pile of code and so is everybody around you. It needs to be replaced and you all know it. And yet</em>, it’s the money engine for your company.</p> What to do?</p> Don’t rewrite immediately.</h2> The temptation will be to rewrite the whole thing. You already know that you shouldn’t.</p> We all know that second system syndrome</a> is a thing, and we all know that big-bang rewrites are notoriously difficult to pull off. As Gall famously said:</p> “A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.” – John Gall</p> </blockquote> The other reality is that companies rarely have the time and resources to devote to rewrites, even if they have run themselves deep into tech debt. They never want to pay down that debt, and they never like the idea of giving up new feature work for a rewrite project that doesn’t move them forward.</p> And yet, this code needs to end up being rewritten somehow</em>, because it’s a disaster that is costing the organization dearly, and perhaps even driving it to the brink of failure.2</a></sup> This is true at the same time that you can’t dive into a big bang rewrite.</p> Reframe the problem and shift the goal: you want to be able to rewrite in useful pieces</em>. Small pieces allow you to make incremental progress that can be seen to be “delivering business value” or at least measurable progress toward the end goal. 
Small pieces are also</em> small systems on their own, each of which is simple enough to be kept working.</p> Now you’ve shifted your task to identifying useful pieces and rewriting those, and that task is more achievable. How do you identify useful pieces to rewrite? Well, this is both bad news (because you hate the code) and good news (because understanding complex systems is fun): you need to spend a lot of time with the system you have. You need to invest in it.</p> You need to understand it, even if you hate it.</h2> It’s your money engine. It has to keep working.</p> You cannot hope to replace what you do not understand.</p> Understanding it deeply will allow you to find the cracks you can hammer a wedge into.3</a></sup></p> Understanding it deeply will allow you to know when you’ve finished replacing it.</p> So how do you understand it?</p> This isn’t the same as the theory of the program</a>, which is about how the code is constructed. You do need to know what problems the code solves for the system around it, what “affair of the world” it exists to model. More important for this task is understanding the details of its current behavior</em> as part of a larger working system. This entire system is not just the code</em> in the thing you want to replace. It is all the systems around that code as well: the web site, the analytics pipeline downstream from it, the internal admin workflows, the profusion of microservices that we all persist in writing around everything.</p> This whole system is an evolved, complex working system. It probably doesn’t have detailed specifications. It probably also does not have comprehensive tests. (If it did, you might not be in this mess.)</p> Write tests.</h2> If the system does not have comprehensive automated tests, invest in writing those tests before doing anything else. It’s hard to talk people into writing specs for features that have existed unspecified for years, but everybody understands why tests are useful. 
(If somebody doesn’t, then there are many excellent books you can drop on their head to enlighten them.)</p> Not kicking off a testing project the moment I was in charge of this problem is my number one regret from my last job. We eventually did it and it was so valuable I was angry with myself. There were organizational reasons why it was difficult for the team to commence that work earlier, including work that was genuinely urgent, but we could have started writing tests sooner! My advice to you would be to prioritize testing higher than I did, and defer what work you can until afterward.</p> Start by investing time in the test framework and tooling. Your goal is to make it easy for everyone on the team to write tests and to understand their results. People do what is easiest to do, so you must make the right thing easy. The importance of this work deserves a blog post all its own.4</a></sup> However, anything is better than nothing.</p> Don’t negotiate on these tasks:</p> Write integration tests. Test how the Hated Code™ calls out to everything around it. Test that the expectations of the code around the Hated Code™ are being met.</li> Don’t accidentally pour glue over implementation details that should be hidden. If unit tests don’t exist at all, you might want some, but they’re not as important as integration tests that validate overall system behavior.</li> Involve the whole organization in the test-writing effort. Prioritize this work alongside feature work and make ongoing test-writing part of regular maintenance.</li> Automate running the tests. Do not rely on humans doing anything by hand. Run them continuously against an integration environment, or in whatever context is sensible for your setup. The important thing is to have the tests run against every change intended to land in the production environment.</li> </ul> You’ll find bugs in the overall system while doing this. It’s a judgement call whether you should invest time in fixing them. 
Some bugs might be difficult to fix because of the problems that led you to want to replace the mess; don’t waste your time. Some bugs are load-bearing because the system will have grown around them, like a tree growing around a bicycle. You can cut the bike out, but at what cost to the tree? Fixing bugs that are easy to fix gives everybody dopamine cookies and shows people around the project that the investment in testing has started to pay off, so let yourself do some of that.</p> If you and your team didn’t understand your system going into the testing effort, you will afterward. The tests will support any refactoring or replacement work by verifying that the entire system continues to work. They are the scaffolding around your new construction project.</p> Identify and exploit wedge points.</h2> Now you can start thinking about changing the system.</p> Your goal here is to split up your monolithic code base by identifying good points to hammer in wedges to use to split off chunks.</p> Where are you going to hammer in your wedge first? Have you identified a modular boundary you can exploit to split off a chunk of functionality for a rewrite? Look for clean lines of separation: data, access methods, business logic all must come out in one piece. The common approach is to put a proxy in front of the monolith-ish thing you want to start replacing and redirect traffic from it to your rewrite. One popular term for this is “the strangler fig pattern”</a>. I often call it “divide and conquer”.</p> The advantage of this approach is that it keeps the pressure on the system to remain working at all times, allowing you to pay full respects to Gall. The tests are your latch on this working state: they validate that your replacement is behaving properly in context. 
You might find yourself writing even more tests at this point to support the validation; this is fine!</p> The disadvantage of this approach is that you need to have good split points, and you probably don’t.</em> Good division points indicate where good modularity already exists and if you had that you’d probably be less unhappy with the mess.</p> Create split points if they don’t exist.</h2> This is important: don’t rewrite anything yet.</p> Don’t proceed until you can find a good location to drive that wedge in and split off a chunk. Don’t take half measures. Some of the worst tech debt I encountered recently was in functionality that was half implemented inside the Hated Code™ monolith and half outside. The implementation details were spewed out everywhere. Changing functionality was extra difficult because it needed to be changed in two places, and one of the places was a code base that was very hard to work within. Also, once we’d fixed the primary performance bottlenecks, the secondary ones were all in how the Hated Code™ treated these satellite services as databases that it owned. Important working data vital to the operation of the system was a mashup of data from other microservices plus the monolith.</p> Don’t do this to yourself.</p> No, really, modularity is important. Parnas’s 1972 paper, “On the Criteria To Be Used in Decomposing Systems into Modules”</a> points right at the important thing, which is that hiding information and implementation details allows you to change both. Modularity allows change.</p> Premature modularity is a form of premature optimization, and it hurts, but I’ve more often seen no modularity at all. Gotta go fast and break things, right? Side effects everywhere, code that has been DRYed to disastrous levels, the details of specific data structures in one place used to make decisions somewhere else, extreme cleverness that relies on implementation details in distant locations in the system. 
Rushed people make short-term decisions, and their hacks pile up into tangles of code.</p> Whatever the cause, you might have to start by refactoring internally to bring order and modularity to a ball of mud.</a>5</a></sup> Start hiding details behind interfaces.</p> An aside: “Don’t Repeat Yourself” aka DRY has been misunderstood and misapplied to disaster so often I would like to stop saying it to newer programmers. Often much better advice is to repeat yourself to find patterns</a>.</p> If you find yourself with a function or method that has an enormous parameter list to distinguish the six different ways it might be called, you have a case of DRY madness that has broken modularity. One technique that might help if you’re in this situation is to do the least DRY thing possible: refactor to expand each code flow into one large function for each, replacing each call out to an overused long-parameter list function with the same code, inline. Simplify as you write. Strive for branchless programming as an antidote!6</a></sup> The real patterns that support a better split-up of responsibility will emerge as you do this work.</p> Once again, your tests are going to have your back as you go. You’ll know if that flow stays working or not. You might find that your Hated Code™ is less hate-worthy after you’ve cleaned it up. Maybe you’re more in sympathy with it now? Or maybe not.</p> Time to drive those wedges in with a sledgehammer.</h2> Now you can strangler-fig/divide-and-conquer/split those rocks as you go. You’ll probably get the modularity boundaries closer to right than your predecessors, because you have a lot more information than they did: you have a far more developed system to study!</p> If you’re tight on resources, you might choose to do nothing</em> about any specific modular chunk of code. Leave it where it is, and make incremental improvements opportunistically. 
If this segment is not performing well, or is doing the wrong thing, or is hard to maintain, or if the team is far more comfortable with working in some other language ecosystem, then replace it. Prioritize potential rewrites by how much you hate the current implementation; that is, how many ways they’re failing to do what good code does.</p> Here’s where I remind you that modularity in your system does not require splitting its components into separate microservices. Microservice APIs are strong module boundaries; these API boundaries resist change unless you plan carefully. On the other hand, these boundaries do</em> resist attempts at clever end-runs around that modularity.7</a></sup> I like to bundle together data that is roughly similar in size and changes at similar rates or in a similar style. CRUD data that is infrequently destructively updated and all lives in the same kind of database might all belong together. Geographical data that all uses PostGIS belongs with other data like that. This is itself a gigantic topic, so I won’t go further other than to remind you that microservices have tradeoffs. The important goal is to leave a system that can be more easily rewritten</em> behind yourself.</p> Plan to rewrite next time.</h2> All code has a lifespan.</p> Your designs make tradeoffs (always) that suit the context you’re working in:</p> What language ecosystem is the current team comfortable using?</li> Do you need to get this project done rapidly, so some shortcuts are okay?</li> What performance characteristics are acceptable today?</li> What task does this component have to perform today?</li> </ul> The context around</em> working code changes over time. The business context the code exists in is guaranteed to change. Product requirements change. The tools your team is happy with today might make the team unhappy three years from now. 
Other parts of the system will change around it.</p> Make it easier for your future self or your successors to rewrite any given component of a system. If you know the lifespan of a decision, or when a scaling shift will make a component a good candidate for a rewrite, record that information right next to the code.</p> The tl;dr.</h2> It’s okay to hate that code base. It is hate-able. It’s okay to want to replace it. You can replace it! But you have to put in the work first. The work I’ve had to do in this situation looks like this:</p> Understand it even if you dislike it. ⬅️ treat it like a puzzle</em></li> Write tests. For the system. Mostly integration. ⬅️ helps everything</em></li> Identify or create wedge points. ⬅️ most of the time will go here</em></li> Split off chunks and rewrite. ⬅️ the fun part</em></li> Shrink the mess until it’s tolerable. ⬅️ satisfying!</em></li> Plan so rewriting the new chunks is easier next time. ⬅️ pay it forward</em></li> </ul> Anyway, this is what I’ve learned from trying to do this work with limited resources. It’s best not to be in this situation: instead devote time to maintaining the system as a system and every bit of code in it. But most of us don’t have time machines to prevent past technical leaders from making these mistakes.</p> Being written in a language ecosystem you don’t like is not enough of a reason to rewrite something all by itself. If you’ve landed on a team that doesn’t know the language ecosystem the company’s money engine is written in, your first task is to correct the hiring mistake of the past. You might have to become an expert in the thing you don’t know; you might (like me) discover that you dislike the thing you had to become an expert in. Probably the real takeaway is to do better due diligence than I did, and discover in advance what flavor of mess you’re expected to clean up. But sometimes, the ecosystem mismatch is the last misery on top of a pile of miseries. 
↩</a></p> </li> This was literally true in my case. ↩</a></p> </li> If you have never seen rocks split by hand with the wedge and feather technique, check out this video showing somebody breaking up a big boulder</a>. ↩</a></p> </li> I am nudging Chris Dickinson into blogging about how he approached this project at our mutual former employer, but until he does here’s a link to a tweet about his approach to the work</a>. ↩</a></p> </li> This is the part of the rock-splitting video where you haul out the drill and bore a hole to stick a wedge into. The metaphor is now out of control because you can’t drill holes in big balls of mud, but, uh, let’s pretend the mud has been pressurized into rock over many thousands of years? ↩</a></p> </li> DRY is misunderstood, IMO. The original principle is “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.” This is a good principle! It does not mean that you need to collapse any two bits of code that look mostly the same. As with everything, advice has contexts. Everything in moderation. Tef is right</a>. ↩</a></p> </li> Though I have seen people manage to do that. E.g., replicating an entire db to get at a subset of its data rather than using the API that was put in front of the db specifically to hide the implementation details of the db schema. Sigh. But even this is a case of people doing what feels easiest: if the replication tools are right there and calling an API feels harder, they’ll reach for replication. The right solution is to make doing the right thing the easiest thing for everybody. This is more work for you, which you needed, right? 
↩</a></p> </li> </ol> </section> Why Rust's postfix await syntax is good Unknown — Fri, 13 May 2022 16:00:31 +0000 The other day on Twitter Kat Marchán said this:</p> my strongest opinion on programming languages is that postfix .await is the single greatest innovation in the past 70+ years of programming language theory and history and you can’t convince me otherwise. — Kat Marchán has permanently left this site (@zkat__) May 12, 2022</p> </blockquote> And Jan Lehnardt asked:</p> I’ve seen a few folks say this. Do you know of a “here is how this compares to async/await keywords” for someone who barely rusts? — Jan Lehnardt is on Mastodon: @janl@narrativ.es (@janl) May 12, 2022</p> </blockquote> I didn’t know of any, and a little searching didn’t turn one up. So here’s one? I hope? If you are not programming Rust a lot, and want to know why not-language-designers like me think that Rust’s await syntax</a> is good, this is the blog post for you.</p> While looking around for somebody explaining why this is nice syntax, I found one of the discussions</a> about possibilities before it was selected. That’s a pretty long conversation, and I enjoyed skimming it. This comment</a> examining what Rust might look like with a number of the syntax possibilities was particularly neat. It immediately jumped out to me that the one they landed on (postfix field) and the close relation to it (postfix method) felt more Rust-y. But why? You might have to be a Rust user already to feel that.</p> So in order to explain why Rust’s .await</code> is a nice bit of syntax, I will start by explaining two other things: how chaining calls is idiomatic Rust, and how error propagation with another nice bit of syntax, ?</code>, supports this.</p> Hoo hah back on the chain gang</h2> Chaining is a very common idiom for taking one collection and transforming it into another, perhaps even one of a different type. 
This snippet takes a collection of id-having-things (any collection type, so long as it is iterable), iterates through them, plucks out the ids, and re-collects them into a Vec:</p> let</span> ids</span>:</span> Vec</span><</span>usize</span>></span> =</span> things_with_ids</span>.</span>iter</span>(</span>)</span>.</span>map</span>(</span>|</span>xs</span>|</span> xs</span>.</span>id</span>)</span>.</span>collect</span>(</span>)</span>;</span></span></code></pre> Here’s a slightly-edited real-world example, which chains some stuff to end up with a string:</p> let</span> malformed_kinds</span> =</span> requested_kinds</span></span> .</span>iter</span>(</span>)</span></span> .</span>filter</span>(</span>|</span>xs</span>|</span> !</span>is_valid</span>(</span>xs</span>)</span>)</span></span> .</span>cloned</span>(</span>)</span></span> .</span>collect</span>::</span><</span>Vec</span><</span>_</span>></span>></span>(</span>)</span></span> .</span>join</span>(</span>"</span>,</span>"</span>)</span>;</span></span></code></pre> And the idiom is seen in other areas of API design. 
Here’s how my little Skyrim mod tool</a> sends posts (with some editing to make it a useful example):</p> let</span> agent</span> =</span> ureq</span>::</span>AgentBuilder</span>::</span>new</span>(</span>)</span></span> .</span>timeout_read</span>(</span>Duration</span>::</span>from_secs</span>(</span>50</span>)</span>)</span></span> .</span>timeout_write</span>(</span>Duration</span>::</span>from_secs</span>(</span>5</span>)</span>)</span></span> .</span>build</span>(</span>)</span>;</span></span> let</span> maybe_response</span> =</span> agent</span></span> .</span>post</span>(</span>uri</span>)</span></span> .</span>set</span>(</span>"</span>apikey</span>"</span>,</span> &</span>self</span>.</span>apikey</span>)</span></span> .</span>set</span>(</span>"</span>user-agent</span>"</span>,</span> "</span>modcache: github.com/ceejbot/modcache</span>"</span>)</span></span> .</span>send_form</span>(</span>body</span>)</span>;</span></span></code></pre> All of this is to say: chaining like this is common in Rust.</p> Don’t break the chain</h2> Now, there isn’t any error handling visible in the above code. What does error handling look like with chaining? Does it break the chains? It used to! The ?</code> error propagation symbol is new to Rust since I first started using it, and its introduction has made writing error handling a lot nicer.</p> Rust allows you to express that an operation might fail by returning a Result</code></a>. This is a sum type:</p> enum</span> Result</span><</span>T</span>,</span> E</span>></span> {</span></span> Ok</span>(</span>T</span>)</span>,</span></span> Err</span>(</span>E</span>)</span>,</span></span> }</span></span></code></pre> If all went well, you get the Ok</code> variant with your data in it. If it did not, you get the Err</code> variant with your error type. 
The Rust compiler makes you handle both variants in your code.</p> Here’s a faked example of getting some data from a function that might fail, and doing something with that if we can.</p> //</span> Our fetch talks to a db so it might fail for reasons</span></span> //</span> beyond our control, so we return a result type.</span></span> fn</span> fetch_all_animals</span>(</span>)</span> -></span> Result</span><</span>Vec</span><</span>Animal</span>></span>,</span> SomeErrorType</span>></span> {</span></span> //</span> blocking call to a db here</span></span> }</span></span> </span> //</span> We depend on a fallible function, so we are fallible too.</span></span> fn</span> count_hedgehogs</span>(</span>)</span> -></span> Result</span><</span>usize</span>,</span> SomeErrorType</span>></span> {</span></span> //</span> this is a Result</span></span> let</span> maybe_animals</span> =</span> fetch_all_animals</span>(</span>)</span>;</span></span> //</span>... so we match on it to see if we succeeded or not</span></span> match</span> maybe_animals</span> {</span></span> Ok</span>(</span>animals</span>)</span> =></span> {</span></span> //</span> we got some animals! let's find the hedgies</span></span> let</span> count</span> =</span> animals</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>len</span>(</span>)</span>;</span></span> Ok</span>(</span>count</span>)</span></span> }</span></span> Err</span>(</span>e</span>)</span> => {</span></span> //</span> We failed to get animals. We handle the error in whatever</span></span> //</span> way makes sense for the program. Here we just propagate</span></span> //</span> the error on up to the caller.</span></span> Err</span>(</span>e</span>)</span></span> }</span></span> }</span></span> }</span></span></code></pre> This error handling pattern was everywhere in my Rust code, being verbose all over the place. It’s also predictable! This makes it a good candidate for sugar. 
So the ?</code> syntax</a> for this was added in Rust v1.13 at the end of 2016</a>. If all you want to do is return immediately if you have an error and carry on if you got an OK result, use ?</code>.</p> fn</span> count_hedgehogs</span>(</span>)</span> -></span> Result</span><</span>usize</span>,</span> SomeErrorType</span>></span></span> {</span></span> let</span> animals</span> =</span> fetch_all_animals</span>(</span>)</span>?</span>;</span> //</span> <-- note the ?</span></span> //</span> if the fallible function failed, we have bopped that</span></span> //</span> error on out & can proceed</span></span> let</span> count</span> =</span> animals</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>len</span>(</span>)</span>;</span></span> Ok</span>(</span>count</span>)</span></span> }</span></span></code></pre> You can see that error handling is a lot less verbose when it can fit into this pattern. In fact, the idiomatic Rust way to implement the above function is to chain it all together:</p> let</span> count</span> =</span> fetch_all_animals</span>(</span>)</span>?</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>len</span>(</span>)</span>;</span></span></code></pre> Which is super-compact and might not need its own function at all. This stays super-compact if our hedgehog filter is fallible as well, though I’m not sure why it would be fallible. It would look like this:</p> let</span> count</span> =</span> fetch_all_animals</span>(</span>)</span>?</span>.</span>filter_for_hedgehogs</span>(</span>)</span>?</span>.</span>len</span>(</span>)</span>;</span></span></code></pre>Finally we get to async</code> and await</code></h2> Now! 
Let’s suppose we have moved to the magic land of async Rust programming and have a non-blocking db fetch for our animals.</p> //</span> we must say the magic word</span></span> async</span> fn</span> fetch_all_animals</span>(</span>)</span> -></span> Result</span><</span>Vec</span><</span>Animal</span>></span>,</span> SomeErrorType</span>></span> {</span></span> //</span> we do all the same work as before</span></span> //</span> and maybe call some async functions here too</span></span> }</span></span></code></pre> Now when we call that function, what we get back is actually a Future</code></a>. To use it, we have to call poll</code> on it, or more idiomatically, we await</code> it to resolve it to a value. (There’s a link in the further reading section if you want to learn more.) This is a lot like what happens in Javascript when we get a promise back from an async function:</p> const</span> animals</span> =</span> await</span> fetch_all_animals</span>(</span>)</span>;</span></span></code></pre> But Rust’s chosen syntax uses a field-like postfix on a Future, and this is the specific thing I think is neat:</p> let</span> animals</span> =</span> fetch_all_animals</span>(</span>)</span>.</span>await</span>;</span></span></code></pre> Look at what happens if we’re calling fallible functions and want our error handling in-line! 
We stick ?</code> on the .await</code> to propagate any errors and unwrap a result in-line:</p> let</span> count</span> =</span> fetch_all_animals</span>(</span>)</span>.</span>await</span>?</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>len</span>(</span>)</span>;</span></span> //</span> and if our hedgehog filter were both async and fallible....</span></span> let</span> count</span> =</span> fetch_all_animals</span>(</span>)</span>.</span>await</span>?</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>await</span>?</span>.</span>len</span>(</span>)</span>;</span></span></code></pre> That is the use case that shows why I think this specific syntax choice is brilliant. Precedence is clear. We don’t have to wrap things in parens for human readability or to control precedence. If we read a chain, the operations are mentioned in the order that they happen. It works with the existing idioms rather than against them.1</a></sup></p> Another thing that’s interesting to me here is that this choice is not</em> what most modern languages made for their syntax. Lots of them use a prefixed await</code> keyword. In javascript if we were chaining it would look like:</p> const</span> count</span> =</span> (</span>await</span> fetch_all_animals</span>(</span>)</span>)</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>length</span>;</span></span> //</span> and errors will throw exceptions that we're letting bubble up</span></span> //</span> and if we're chaining more than one async thing...</span></span> const</span> count</span> =</span> (</span>await</span> (</span>await</span> fetch_all_animals</span>(</span>)</span>)</span>.</span>filter_for_hedgehogs</span>(</span>)</span>)</span>.</span>length</span>;</span></span></code></pre> But I’d probably never write either of those and definitely never the second. 
I am far more likely to write:</p> let</span> count</span> =</span> 0</span>;</span></span> try</span> {</span></span> const</span> animals</span> =</span> await</span> fetch_all_animals</span>(</span>)</span>;</span></span> count</span> =</span> animals</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>length</span>;</span></span> }</span> catch</span> (</span>ex</span>)</span> {</span></span> //</span> handle the error at this level</span></span> //</span> I'd omit the try/catch if I wanted the error to propagate</span></span> }</span></span></code></pre> My aversion to chaining partly comes from the fact that I must use the parens to express my intent. The syntax of any programming language shapes what code feels idiomatic and most readable and what code feels like patting a cat tail to head. There’s nothing right or wrong about any of it, because all of them have found a way to express the concept.</p> Is postfix await a small bit of syntax? Yes. Is it thoughtfully chosen out of many possibilities? Yes. Is it very much in tune with the Rust syntax around it? Also yes. This is what I appreciate most about the Rust project: its concern for the experience of the human beings using the language.</p> Further reading</h2> If you would really like to understand Rust futures, you should read @fasterthanlime</a>’s article “Understanding Rust futures by going way too deep”</a>.</p> If you are comfortable reading Rust, and want to know more about async executors and how they work, check out whorl</a>. This repo walks you through the implementation of an async executor and shows you what await</code> desugars to. (Hat tip to Chris Dickinson for telling me about this!)</p> If you have any pointers to other posts about why this syntax is neat, please send them to me and I will link! 
Also, if you have your own reasons about why this syntax is nice, please do write them up and I will link those too!</p> You might say that the people who chose the await syntax had a strong hold on the theory of Rust</a>. ↩</a></p> </li> </ol> </section> Programming as Theory-Building Unknown — Thu, 12 May 2022 10:00:00 +0000 A couple of years ago now I read Peter Naur’s “Programming as Theory-Building”</a> (alternative PDF link</a>) and it was a mind-blower. Yes, this Naur is the same Naur you know from Backus-Naur Form aka BNF</a>. Anyway, you should go and read this essay now. It’s not very long. I’ll wait while you read it.</p> Back? Okay! Let’s do a bit of a crawl through some main points.</p> Highlights of the essay</h2> Naur opens with his thesis statement, the argument he’s about to make:</p> [P]rogramming properly should be regarded as an activity by which the programmers form or achieve a certain kind of insight, a theory, of the matters at hand. This suggestion is in contrast to what appears to be a more common notion, that programming should be regarded as a production of a program and certain other texts.</p> </blockquote> Programming isn’t about writing the code; it’s about understanding the problem and expressing that understanding through code. That understanding is what allows us to modify the code without harming its design. Naur discusses three real-world cases of existing programs being modified over time by a team closely connected to the original team, a team that had only documentation to go on, and the same team. 
The team with the closest connection was most successful at making additions that worked with the existing design, and not against it:</p> The conclusion seems inescapable that at least with certain kinds of large programs, the continued adaption, modification, and correction of errors in them, is essentially dependent on a certain kind of knowledge possessed by a group of programmers who are closely and continuously connected with them.</p> </blockquote> Naur then goes into what “theory” means, philosophically and in practice. Here are his three points about what a programmer having the theory can do, lightly edited:</p> Explain how the solution relates to the affairs of the world1</a></sup> that it helps to handle.</li> Explain why each part of the program is what it is, in other words is able to support the actual program text with a justification of some sort.</li> Respond constructively to any demand for a modification of the program so as to support the affairs of the world in a new manner.</li> </ol> </blockquote> Naur is talking about “programs” here, and today we add on top of that the systems in which many programs interoperate to do complex things. So the demands are a little higher: we need to have theories of the systems we build and maintain, as well as theories of the pieces of that system.</p> The term I’m more likely to use myself for what Naur calls “the theory of the program” would be “a mental model of the system”. Some accumulation of facts in my head has allowed me to build a map of the territory– where things are implemented or “happen” in the system– and to predict behaviors given inputs. My mental model might be shallow in places where I haven’t had to make changes and very deep and detailed where I have recently worked. I need to keep that model constantly refreshed through review and rehearsal. 
While I’m model-building I find myself making small changes to the code, like renaming variables for clarity once I understand them, tweaking log lines, or adding comments.</p> The ability to fluidly make functional changes</em> to the system I’ve modeled requires even deeper understanding of how it’s implemented, because there are patterns and structures in the code that I have to work with rather than against. This gets closer to what Naur means when he talks about the theory of a program. It’s not just the what, but the how and the why. Why does having a theory of the program matter? Because this enables rapid and effective modification of the program to respond to changing requirements without piling up technical debt or hacks.</p> It must be obvious that built–in program flexibility is no answer to the general demand for adapting programs to the changing circumstances of the world.</p> </blockquote> Alas, it is not obvious, we say, looking at all the premature generalization happening around us.</p> I’m going to quote this entire paragraph because of how important it feels to me, with some commentary.</p> On the basis of the Theory Building View the decay of a program text as a result of modifications made by programmers without a proper grasp of the underlying theory becomes understandable. As a matter of fact, if viewed merely as a change of the program text and of the external behaviour of the execution, a given desired modification may usually be realized in many different ways, all correct. At the same time, if viewed in relation to the theory of the program these ways may look very different, some of them perhaps conforming to that theory or extending it in a natural way, while others may be wholly inconsistent with that theory, perhaps having the character of unintegrated patches on the main part of the program.</p> </blockquote> We might call these “unintegrated patches” technical debt or hacks, but either way, we know it’s a problem when we see it. 
Somebody has worked against the grain of the wood when carving a new feature into the system, and it feels wrong. The hacks get in the way when you need to make the next change. They might not work well because they’re at odds with other design choices. They might be sitting right next to an already-existing affordance to add that new behavior!</p> Continuing on in this paragraph:</p> This difference of character of various changes is one that can only make sense to the programmer who possesses the theory of the program. At the same time the character of changes made in a program text is vital to the longer term viability of the program. For a program to retain its quality it is mandatory that each modification is firmly grounded in the theory of it. Indeed, the very notion of qualities such as simplicity and good structure can only be understood in terms of the theory of the program, since they characterize the actual program text in relation to such program texts that might have been written to achieve the same execution behaviour, but which exist only as possibilities in the programmer’s understanding.</p> </blockquote> People who do have the theory of the program can make changes that work with what’s there already. They know where the affordances are. Naur says that simplicity and quality only make sense in the context of that code to begin with, and this point is a good one. Let’s try another metaphor: Writing a program is like finding a domain-specific language to express the problem and its solution, a language that expresses your understanding of the domain. It’s a truism that code is communication. It is primarily communication with other humans, not with a compiler, with a set of verbs and nouns chosen by you as the best expression of your understanding of the problem. Other people reading your code must learn to read your new language, and to make changes they need to write it.</p> How do programmers learn a theory of the system? 
Naur says documentation and the source code are not enough, and someone new to a system needs hands-on mentoring:</p> What is required is that the new programmer has the opportunity to work in close contact with the programmers who already possess the theory, so as to be able to become familiar with the place of the program in the wider context of the relevant real world situations and so as to acquire the knowledge of how the program works and how unusual program reactions and program modifications are handled within the program theory.</p> </blockquote> Humans learn through guided practice with others. Spend time working with people who understand a system, and you’ll begin to understand it too.</p> This is great if the people who have the theory of a system are still around to talk to.</p> Nobody’s around any more</h2> The real world is often more like Naur’s “group B” case, where further development on software happened without the benefit of close contact with theory-holders. I’ve described this a couple of times as being a software archaeologist, digging out bits of architecture and sorting through refuse pits to figure out how a past software team lived and what the heck they were thinking about. Given reality, let’s ask two practical questions in response to Naur’s article:</p> How can you rebuild the theory of a program or system if its original authors aren’t around to teach you?</li> How can you leave useful information behind yourself to help future maintainers rebuild the theory you have?</li> </ol> In my experience, I had to spend a lot of time reading code and building my own mental model of the software, how it was constructed, and how the pieces worked together– the archaeologist metaphor I mentioned above. I had to reconstruct the theory by looking at the textual artifact and what it was doing in practice.</p> My colleague Chris Dickinson</a> does some interesting things while doing code spelunks. 
He generates other artifacts as he goes, such as textual call diagrams that he can turn into graphs with graphviz. He’ll do this as a vertical slice for a specific code path as well, ending up with a detailed call flow chart showing every network traversal or call made to build a single web page, for instance. There are cognitive reasons why making drawings or notes like this is helpful– you improve your understanding of a concept by expressing it in a different form than you’re receiving it. (This is related to why taking notes in a lecture is helpful. Hear -> write -> read.)</p> I often tried to use commit logs to figure out why specific changes were made, but was more often frustrated than enlightened. The past generations of programmers at this company did not often write helpful commit messages.2</a></sup> One specific commit that introduced an incredibly expensive bug had a commit message like “scope fixes”. Why was the change made? What did the programmer intend? Nobody knows.</p> I had to supplement by talking with people who weren’t familiar with the source code but could tell me what it did and why that was desirable or not. These people might not be the programmer-operators that Naur discusses, but they are expert user-operators. They have a theory of the program too! The one remaining programmer on staff who really understood a particular piece of software was priceless, and I’m grateful they were as amiable about explaining things as they were.</p> An aside about retention</h2> I wish now to point out what might be obvious to you about the cost of team turnover. When there’s one human being left on a team who understands how that pile of legacy code works, you’re in trouble. If there are none, you’re in worse trouble. You have to hire people who can walk into messes cold and figure them out without help, and those people don’t come cheap.</p> Keep people around. Give them raises rather than making them leave to get more money. 
Keep them feeling good in their daily work.</p> Since we haven’t prevented, let’s try curing</h2> Given this experience, and given the lightbulb moment that reading Naur gave me, I changed my development practice. I started thinking about ways I could help other programmers– maybe programmers I’d never meet– build useful theories about the software they inherited to maintain. If I found good ways to do this, I could then socialize those approaches and turn them into team practices.</p> Here are some of the things I’ve started doing.</p> I deliberately distinguished maintainer documentation</em> from user documentation</em>. The people who need to consume a service’s API are a completely different audience from the people who need to maintain that service. They might have different skillsets and programming languages: a person working on a website is probably writing in Typescript, while the service might be in Rust, C#, or anything at all. They have completely different concerns as well. The maintainer needs to dip into the source and needs to read details about the internals. Making the consumer of an API dip into the internals of its implementation would be a waste of their time.</p> Maintainers are the people who need the theory, so I invested time writing documentation for them. That documentation belongs as close to the source code as possible, and at least partly inside it in the form of comments. Don’t waste time documenting what can be seen through simple reading. Document why</em> that function exists and what purpose it serves in the software. When might I call it? Does it have side effects? Is there anything important about the inputs and outputs that I might not be able to deduce by reading the source of the function? 
All of those things are clues about the thinking of the original author of the function that can help their successor figure out what that author’s theory of the program was.</p> Chunk up a level: writing about that function might help a maintainer fix a bug with it, but that isn’t sufficient for getting across the theory of the program. There are structural choices you make as you put together the program, as well as major decisions you make that inform the design. More specifically, the program exists to solve a problem, some “affair of the world” that Naur refers to. What was that problem? Is there a concise statement of that problem anywhere? What approach did you take to solving that problem statement? What tradeoffs did you make and why? What values did you hold as you made those tradeoffs? Why did you organize the source code in that particular way? What belongs where?</p> I started putting design documents, decision records, notes about spikes, and so on into the same repo as the source code. If you’re doing code deep-dives and generating artifacts about what you learn, check those artifacts into the source repo too! Put the videos somewhere durable and link to them. I duplicated some of these documents in the company’s documentation platform of choice, but the duplicates were not as important to me. I can’t guarantee a future maintainer will discover those artifacts.</em> In fact, I can’t predict that any documentation platform other</em> than the source code repo will exist in the future.</p> Toward the end of our joint tenure at our most recent job, my colleague Chris and I recorded videos of us doing code-centric deep dives through specific interesting, important, or especially difficult-to-understand aspects of the system. 
My hope is that these more conversational artifacts substitute in some way for having human mentors present to give people personal guided tours.</p> Oh yeah, and I continued writing tomes in my PR/commit messages, and being fussy about what actually lands into the mainline branch from my PRs.</p> A theory of theory-building</h2> Here are Naur’s points, rephrased:</p> The original authors of a system develop a theory of that system as they work, which comes from their understanding of the problem they’re solving and the decisions they made designing the solution.</li> Programmers need to share that theory in order to make changes to what the system does successfully, without degrading its quality.</li> People learn how complex systems work by being taught about them by other humans.</li> Documentation isn’t enough, even if it a) exists, b) is truthful, and c) is discovered and read.</li> </ul> And here are my reactions:</p> Do as much teaching of other programmers as possible while you’re in a job.</li> Since turnover is a fact of life, and you won’t stay at any one job forever, do your best to leave artifacts behind that help your successors theory-build.</li> Bargain with Naur’s position about documentation not being enough to by investing time into writing documentation aimed at describing the theory of the program</em>.</li> Put that documentation as close to the source code as possible, because the source code is the only artifact guaranteed to survive.</li> Write good commit messages.</li> </ul> The next people to come along will still have to put in the work, but you’ll have at least tried to to make it easier on them.</p> I love this “affairs of the world” phrasing for “the thing the program is supposed to do”. It points right at the fact that the requirements come from external context. ↩</a></p> </li> I think that git commit -m</code> is a culprit here. It encourages people to type very short messages. 
The Github concept of the “pull request” improves on this by giving people a big text box to type in, so they aren’t pushed to keep their messages to 50 characters tops. This isn’t enough, though, if teams don’t have a culture of encouraging each other to write good PR descriptions, or of squashing branches down into single commits with a good message. ↩</a></p> </li> </ol> </section> Dysfunction junction Unknown — Tue, 10 May 2022 10:00:00 +0000 You know Conway’s Law</a>, of course. I’ll recap it here, just in case:</p> Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure. — Melvin E. Conway</p> </blockquote> This law is a deep one, because communication drives everything that humans do. We are deeply social; we cooperate with each other on large projects by communicating with each other. Communication is a reflection of our social structures because it defines our social structures! Of course the things we make reflect those structures. (Put like that it sounds obvious, but Conway’s statement is so pithy.)</p> Conway’s Law has some fun expressions, like:</p> If two teams are communicating well with leaders who get along, their software will work well together. Conversely, if two managers don’t</em> get along, the software written by their teams won’t work together well.</li> Bugs form along organizational fracture lines.</li> Organizational silos become software silos.</li> The engineers in the org right now don’t like the software written by the engineering team two generations ago. 1</a></sup></li> Broken organizations produce broken software.</li> </ul> Broken human systems make broken software systems</h2> The darkest expression of Conway’s Law is that organizational dysfunction is expressed as software dysfunction. If the leader of an organization is failing, the software it builds starts to fail as well. 
If you as a technical leader are tasked with fixing a broken software system, you might need to start by fixing the broken organization that produced it.</p> Is the site falling down on the regular? Is shipping new features impossible? Is the tech debt at meme levels? Look for:</p> A toxic culture.</li> Teams that don’t communicate or cooperate.</li> A rapid feature-shipping pace that devalues software quality.</li> Managers who aren’t caring for their reports.</li> Managers who don’t accept organizational priorities.</li> Leadership that incentivizes bad behavior or unsustainable behavior.</li> </ul> And by “leadership”, I mean all the way up to the CEO.</p> I’ve been struck by the influence of company leadership on overall work quality over and over again. Each company has a culture that it promotes, consciously or unconsciously, suffusing its personality through everybody there. The culture starts with the CEO and cascades downward. Intelligent, thoughtful CEOs inspire the people around them to be the same. Bullying CEOs drive good people away. Outright stupid CEOs (and I’ve seen several of these in recent years) make the teams around them stupid and inspire bad work.</p> The management that’s right next to your engineering team has even more direct effect than an exec team. Management can turn an engineering team into a 0.1x team. I’ve seen leadership that inspired what people told me was the worst work of their careers. This leadership actively made their teams worse. Depressing, yes, but flip it: leadership can turn a team into a 10x team through inspiration, support, good incentives, emotional safety, and all the other things that good management can do. You</em> can provide this leadership.2</a></sup></p> I maintain that you must</em> if you want to be effective as a technical leader.</p> Systems are self-reinforcing</h2> I resisted coming to the conclusion that I had to think at the organizational level to fix technical problems for a long time. 
Surely, I thought, surely</em> the organization will welcome solid, sensible suggestions aimed at fixing the trouble everybody agrees we’re in. The reality was far messier than this. Sensible suggestions might be nodded at as sensible, but their implementation will be resisted overtly and covertly.</p> The reasons why are squishy and human. People respond to incentives, and they do more of what the organization around them rewards them for doing. They get practice at whatever that is. They unconsciously build structure around themselves to support whatever they’re being rewarded for doing. The human organization that produces the software is a system all on its own, with feedback loops and incentive structures. When the org is healthy, those feedback loops promote a virtuous cycle. When the org is unhealthy, the feedback loops promote a doom spiral.</p> If the organization remains dysfunctional and you try to fix its output, it will resist your technical fixes, subvert improvement projects, and continue to drift back to the status quo. Managers will refuse to accept org-wide priorities, refuse to move staff to critical projects, refuse to cooperate with each other. Teams will attempt to maintain their silos. People will dig in rather than change their practices. Even more toxic things can happen in extremely broken organizations.</p> Systems are self-sustaining and self-reinforcing. To repair a software system, you have to repair the human system that built it, and you do that by breaking those self-reinforcing loops3</a></sup>.</p> People will tell you what’s wrong</h2> The first step, therefore, is to discover the incentive loops. Investigate and diagnose organizational dysfunction methodically, the way you’d diagnose software.</p> One exhausting but effective way to figure out what the worst problems are is to sit down in a one-on-one with every single member of the engineering team and</em> with adjacent teams. 
Your team is smart and they’ll tell you what’s wrong if they feel safe with you. Adjacent teams will have insights that people directly in the mess might not have, so include them too.</p> Ask everybody the same questions. I let everybody know in advance that I am doing this and to expect a calendar invite. I give everybody the questions in advance, and message people with personal reminders that they aren’t in trouble and I am meeting with everybody. Not everybody will need that reminder, but the ones who do will appreciate it keenly.</p> What questions should you ask? It depends on the circumstances, but you might try asking people if they feel they can do their jobs effectively, and then ask them to expand on their answer. Keep the questions open-ended and focused on the experiences of the person you’re talking to.</p> Take notes as you talk with people. (I ask for permission as I do this.) Themes will emerge. After talking to everyone in your organization, you’ll know what’s not working and what needs to be repaired.</p> Be kind to people</h2> If you’re like me, you might find yourself angry about some of the things you learn in this exploration of the organization. Anger is an important response to learning that your values have been violated. It’s a signal you should pay attention to. That doesn’t mean you vent it at anybody else. Instead, use it to identify urgent work.</p> Remember also that you’re not the only person who has figured out that things are broken. People in the org know it and have responded in their own ways. People will work very hard to fix things in their own specific corners, with what control they have. These efforts aren’t going to be coordinated and might end up at cross-purposes with each other, unintentionally making things worse. Honor the intent. You are here to coordinate.</p> You must exercise power</h2> You have diagnosed the problem. 
Now you want to fix it, because you are a software engineer and fixing things is what you do.</p> If you have reporting authority and can start making organizational changes, this is the time to use that authority. If you’re a technical leader without that authority, it’s harder. You will need an ally who does</em> have it, and you will need to have the respect and trust of that person. Perhaps this person is an exec who agrees that the company isn’t getting what it needs from engineering. Maybe this person is a new leader brought in at your request.</p> Spend time with this ally getting aligned with them on the values you’re bringing to the problem. What matters most? What does a healthy organization look like?</p> You will also need to invest time to gain the respect and trust of the team around you. How do you gain trust? By behaving in a trustworthy way. Demonstrate that you know your trade. Show leadership in small ways. Handle incidents well (there might be lots of opportunity to do this). You might not have hard power, but you definitely have soft power. Influence. Persuasive skills. Moral authority. The ability to inspire people. Learn to use these as best you can.</p> If all else fails, get that promotion to a pure leadership position you’ve been avoiding so you can stay technical. You can always escape it later (I say, as a person who escaped it).</p> Change is about alignment</h2> Let’s assume the happy path: you have an ally with organizational power, and you’re aligned with that person on values and goals. What happens next is alignment and reinvention for the broader organization. Now you make sure everyone in the org, starting with line managers, shares leadership’s values and understands the end goal. 
Now you clean up or change how the org accepts work and executes on it, giving all your processes a good shakeout.</p> Entire books could be written about this part and have been, of course, because what you’re doing is setting up a healthy engineering organization.</p> This is what happens next:</p> The company around you has to understand that the engineering org is not going to ship features for a bit.</li> Managers need to get into alignment first, or leave.</li> If managers have burned their relationships with their reports too badly, they might have to leave anyway.</li> People stuck in toxic patterns must leave.4</a></sup></li> You need to design a new way for the org to accept work and execute on it.</li> Leadership must rebuild trust by offering trust.</li> Leadership must rebuild respect by offering respect.</li> </ul> Now give the team a project they can succeed with that will improve something noticeably. Give everybody a win. Reduce that technical mess just a tiny bit. The team will be behind you, helping. (Finally! You get to fix some of the technical problems you’ve been itching to fix!)</p> Remember that lots of the people in that org want change too. They’re good engineers who are doing bad work, and they know it. They’ll be relieved to execute on a plan to make things better. They will shine if you let them.</p> Looking back at doing this</h2> Change is difficult to coax into being. You know the saying about how people have to want to change? This is true of organizations as well, as a collective decision from everyone in that org. The organization needs to feel that the cost of changing is less than the cost of continuing to operate as it has. Change is</em> costly and risky; local maxima are attractive even if they’re not very maximal. Also note that moving from a local maximum to a higher spot requires going downhill first. 
That’s something that looks very dangerous to an embattled exec who’s already under pressure because their org is underperforming.</p> Doing this at an organization was the hardest thing I’ve done in my career. I feel I was only partially successful. I have notes from early in the process about things that needed to change that were still applicable two years later. You might not be willing to take this work on, which is okay. If you can’t get traction, it’s okay to leave. Organizations deserve to fail sometimes.</p> It was incredibly satisfying to hear from my colleagues that the changes made their work lives better, though, so the rewards are deep.</p> Acknowledgements</h2> Thanks to Chris Dickinson</a> who read early drafts of this post. Read his blog!</a></p> I keep thinking about that insight @isntitvacant had about Conway’s Law being true over time as well, and the software that felt good in the hands of a past time being uncomfortable to a team that has entirely turned over. — Ceej “oh well” Silverio (@ceejbot) February 8, 2021</p> </blockquote> This is Conway’s Law over time. Teams are immutable: adding or removing a person to a team produces a different team. After enough change, the team is different enough that it no longer recognizes itself in the software system it produces. The result is people being vaguely unhappy about software that might be working perfectly well. This probably deserves its own short blog post. ↩</a></p> </li> It’s always a leadership problem. ↩</a></p> </li> My thinking here is influenced by Donella Meadows</a>’s Thinking in Systems</em>. You should read the book, but this blog post has a great summary</a> if you’d like a teaser. ↩</a></p> </li> Orgs in failure modes can burn people so badly that they can’t get past that emotionally. Burned people will often be perfectly fine and healthy if they get to press a big reset button and go somewhere else. 
↩</a></p> </li> </ol> </section> Problem statement Unknown — Sun, 08 May 2022 10:09:21 -0700 The first problem with blogging is deciding where and how to do it. I’ve written static site generators myself in python, ruby, and javascript. I considered writing another one in Rust, my current language of choice, but I decided that this would be too much of a distraction. I have a month between jobs at best, so I need to focus.</p> If you’re curious, this one is generated with Hugo</a>, which I chose because of how many themes are available without me having to fuss and write one. The theme is a lightly-modified Risotto</a>. The static files are hosted on AWS S3 behind Cloudflare’s free personal plan, because I don’t yet need any of their paid features. I manage the AWS and Cloudflare settings using terraform. I don’t yet have all my personal cloud infrastructure terraformed, but I might take the time to do that over my month+ between jobs. I have flinched from doing that in the past because it felt like too much work, but it was less work than I feared, and I now have reproducible results.</p> It’s been a long time since I blogged regularly. I still have a presence on tumblr</a> but I’ve let it lag behind in the last couple of years. Work became intense and draining, and burnout meant I dropped a number of hobbies. 
I think it’s time to resume this one, though, with a focus on the kinds of topics I would have pitched as conference talks before the COVID pandemic.</p> Here are some topics I’d like to visit:</p> The connection between technical dysfunction and organizational dysfunction, and why you have to deal with the organization first.</li> How reading Peter Naur’s “Programming as Theory-Building”</a> changed my priorities as a technical leader.</li> Why technical correctness is the least useful kind of correctness in the real world (and how none of us ever make decisions based on this anyway).</li> Problem statements + values statements: how to arrive at decent solutions to problems if technical correctness isn’t useful.</li> What to do with a legacy monolith implemented with a language and framework you don’t know and/or dislike.</li> Where “performance” comes from in the messy real world, and where it doesn’t come from.</li> Why it took me a year to arrive at a one-line fix for a massive performance problem, and how I hope to shorten that time should I encounter a similar situation again.</li> Why a relentless focus on reducing developer friction pays off in team productivity, and some ways to do this.</li> The cost of breaking tech hiring so badly that productive, product-shipping engineers feel they have to “grind leetcode problems” to pass an interview.</li> How dogmatism is an enemy: let’s not be dogmatic about methodologies, “best practices”, “design patterns”, or anything, really.</li> When microservices are appropriate and when they’re not, and some varied approaches to slicing systems up.</li> Data-centric analysis for systems design and how to coax people into doing it.</li> </ul> I am not a fan of single right answers for any of these topics. Most real-world tasks are complex enough that many approaches to them are viable, and the “right” approach will depend on the context. What’s your starting point? What does the system around you do right now? 
What are the people working on it comfortable with? What do they do well and what do they struggle with? Which tools fit their hands best? So in my theory, the best advice I can give anybody is about how to ask and answer questions about that context, and tell some stories about what worked and didn’t work for me.</p> I also hope to learn enough that five years from now I’ll be convinced that the Ceej writing these posts in 2022 was a prime chump who got much of this wrong. I’ll write another set of blog posts, and keep the blogging economy chugging.</p> About Unknown — Sat, 07 May 2022 11:38:39 -0700 I’ve been on the internet since 1987, which places me in the early generation of people being idiots online. Every mistake it’s possible to make, I’ve made. Blogging is one of those mistakes, and it’s one I have repeated since back when they were called “web journals”.</p> I decided to start blogging again during an interlude between jobs. I’d like to write about what I learned through my last decade-plus of work in a changed Silicon Valley, in a startup scene that’s quite different from the scene I entered in the early 90s. The technical problems have changed in the cloud era, which is fairly easy to observe, and many people have written about this.</p> I’d like to try writing about the human factors that you need to consider when approaching technical problems. After thirty years in this profession, I seem to have accidentally learned some things, mostly through making mistakes. Maybe I can help you avoid those mistakes!</p> Anyway, here’s another blog.</p>

Ceejbot's notes

Why was this fun (while work was not)?

Unknown — Fri, 16 Jan 2026 02:33:07 +0000

Let’s start by clarifying what I mean by the question.

I’m resurrecting a years-old nearly-finished blog post draft here. What’s funny to me is the time period when I wrote it. I was working for an utter car-crash of a company, one that I am absolutely certain in retrospect existed only to raise money serially and keep its founder’s friends employed. Any time the product looked like it might ship, they pivoted. (I do not care about this. It’s on the VC who falls for it, IMO.)

O Silly Valley, how silly you are.

I was fairly miserable in the brief time I worked there. This side project was a light in the darkness at the time, and apparently I wrote about it.

Last month, over the course of two weekends, I designed, implemented, and shipped a little command-line tool. What this tool does is not particularly important, except to note that it solved a problem I had repeatedly encountered while working on some internal tools for my job. I understood the problem domain well, and it was indeed an irritating problem, and solving it will make my next set of shell scripts easier to deal with. Also I was somewhat annoyed that I’d just spent a month writing tools in bash 1</a> and not in Rust, a language I had wanted to be using. I scratched two itches at once by writing this tool in Rust. I went some extra miles with this one: I wrote tests. I wrote docs. I wrote examples. I automated its release process. I published it on crates.io, something I rarely bother to do with my Rust projects. And then I learned how to create a homebrew tap</a> so I could make pre-built executables available to people who don’t have Rust installed.

I do not expect anyone but myself to use this tool, so all of this work was “wasted work”, in some sense. If I’d done all this in my workplace and then had it all thrown away, I’d have felt discouraged. But here, in a side project, it was pure fun. It was fun to read the documentation for libraries I ended up not using. It was fun to install tools that were vaguely related and try them out.

It was fun to write some code, realize it was all wrong, and rewrite it again more tightly. And then to go back to it again later and tighten it some more. To write comments noting that something I’d written wasn’t great and I’d get back to it. Maybe.

It was fun to polish it all up and finish it. It was fun to hit a level of completeness that I am rarely able to reach in my employment, even when I’m the person running a team and encouraging polish and completeness.

Why was this entire process so much more fun than anything I experience during normal day jobs? 2</a>

I let myself blurt out some answers to that one here:

It scratched my itch, not somebody else’s.</li>

I knew what I was building and why.</li>

I knew who I was building it for.</li>

I didn’t have to deal with bike-shedding</a> and what-about-ery from people second-guessing design decisions.</li>
There was nobody else’s taste to consider as I designed.</li>
I was free to make decisions; no need to socialize them and deal with disagreement.</li>
I could explore the problem space freely. Dead ends were fine.</li>
There was no time pressure.</li>
My work sessions weren’t interrupted by context-switching into different work. (They were interrupted by non-work concerns, like feeding hungry cats.)</li>
The requirements didn’t change unexpectedly.</li>
When the project goals did shift, the entire team of one understood why and were aligned with the need to change.</li>
There was no scope creep. When an acquaintance suggested an interesting possible feature, I was free to reject the feature.</li>
I could pursue my own standards for quality.</li>
How often do I have time in a paid job to write second and third drafts of code? But that’s when code (and prose writing) starts to get good.</li> </ul>
There’s lots of overlap in those, but I’ll let them stand without editing because the repetition helps the themes emerge. None of those themes are surprising to me, by the way, and I suspect they’re mostly not surprising to you.
Why, I ask myself, is work almost never this fun? Should work be this fun? Is it possible, even?
Motivations</h2>
A while ago I watched a video with some very cool illustrations for a TED-talk-ish lecture by Dan Pink: The surprising truth about what motivates us</a>. It’s only about 10 minutes long. Don’t let the TED-talk-ness stop you; go watch it.
You back? Cool. Let’s talk about that lecture. Now, there are things in it I would argue with–twelve years on I’m a lot more cynical about open source than Pink was then–but that’s not the heart of his point. The heart of his point is what motivates humans to do complex tasks:
autonomy</li>
mastery</li>
purpose</li> </ul>
 I note, in retrospect, that all three of those needs show up in my blurts about reasons why this project was more fun than a lot of work I’ve done for pay. I’m going to use them to give structure to the rest of this discussion. How did my personal project provide them? How does work not? Can work provide them?
 Autonomy</h2>
 Autonomy is somewhat at odds with the need for coordination and cooperation that dominates engineering in workplaces.
 When I mention “what-about-ery”, what I mean is objections from somebody whose values are different from mine. Or who is thinking about edge cases I’m not. Maybe they’re right and maybe they’re not; the values misalignment means our calls about what to do are different. Maybe we both put effort into resolving it and both come out feeling okay. That’s more energy invested than I would invest in considering and then rejecting an idea in a solo project. “Maybe I should do this,” I think, then “nah, not right now.” Energy investment over.
 Dealing with disagreements is a normal part of group work. Disagreements are conflict, even if often only mild conflict, and they must be resolved. Resolution takes energy. Sometimes our colleagues are not great at negotiating disagreement; sometimes we’re the ones who aren’t great at it. Either way, in work contexts we must spend this time and energy, because if we do not resolve these conflicts, we as a team don’t commit fully to the decisions the team makes. (I gesture in the direction of The Five Dysfunctions of a Team</a>.)
 Nobody else’s taste: This one is interesting. I have some fairly strong preferences about naming things and how I want to write text that users read, and how I want to present data. I like a particular clean prose style (aside from the parentheticals). I often have to suppress that taste when working in groups. It is not helpful to fuss about variable names in pull request reviews, unless those names are very misleading. I’ll fuss about them if I’m pairing with you, then decline to die on that hill if you push back. This is good manners. But when I’m by myself, I can rename that variable just to see how the code reads, editing to my satisfaction. Or I can leave in that ridiculous joke that will make me laugh when I find the code again in a couple of years.
 Autonomy is not fully at odds with the idea of working on teams. Aligned teams that are clear about their goals and the values they’re holding as they work toward their goals are not going to feel much tension here. The team might feel autonomy as a group and work together on making decisions. Or they might delegate well so that people can work on separate projects with the freedom to make decisions. But I think working in teams does require losing some autonomy. This is a tradeoff; it buys you the satisfaction of working on a larger project than you can do on your own.
 Alignment</h3>
 I noticed a couple of things there about changing requirements. I did change requirements for the tool midway through! A friend made a suggestion–how about json output? I thought about this and realized it would fit nicely into my goal of using toml-sourced data in shell scripts. Select complex data from a toml file, emit it as json, and pipe that into jq</code> for further processing. I accepted the suggestion and added the feature. Doing that work made me think harder about what “emitting values in a form bash can consume immediately” meant, and the whole thing got a little better.
 Aligning a team of one is a lot easier than aligning a team of five. Or a company of fifty.
 Alignment is, however, critical for any group of people working on a software project together. (Probably any project, but software is the thing I’ve spent my life doing.) Leaders have to put in a lot of work here. Dicta from above don’t produce good results.
 Mastery</h2>
 My blurts above call out specifically that I learned to do some new things with this project. I got to practice writing Rust instead of writing bash. Now, all the bash I’ve been writing for my day job has in fact made my bash a lot better. I have quoting bugs less often. But… that wasn’t what I wanted to get better at. It was the right choice for the problem in the moment.
 Purpose</h2>
 This is the scratching-my-own-itch experience. I knew why I wanted it. I giggled my head off when I used the tool as part of its own release process</a>. It was doing what it was supposed to do! I had succeeded! I had the tool I wanted! That feeling of delight is something that’s kept me writing software despite all the nonsense in the industry itself, despite toxic work environments, despite bad project management, despite bad product design, despite sociopathic company founders, despite failures in the market. (That’s a litany of the worst of it; my career has mostly featured environments that were good, with the occasional awful standout.)
 Of course it was fun!</h2>
 I land on: of course it was fun! Of course revising it over the years since I wrote it was fun! I scratched my own itch. I have used it in every open-source project I’ve written since that moment. I made changes in the API and was happy about them. It’s useful to me. I don’t care if it’s useful to anybody else. This is why I write software. The first program I ever wrote was a D&D character generator, written in pen in a spiral bound notebook because I’d just read a BASIC manual but didn’t have a computer to run programs on. I was scratching my own itch about something I was having fun with at the time. When I finally got to type that program into an actual computer—a couple of years later, when my parents bought me an Apple ][+—watching it run was a delight. Years after that, when I was a somewhat older kid at a now-kinda-famous Silicon Valley startup</a>, exercising code for my first feature there, I felt the same absurd joy. I sat there for half an hour, making the device do the thing I’d implemented over and over. That was fun. So why was writing this tool fun? Because it was my work. Done for me. To my level of quality. Solving my problem. At my pace. And I got to giggle watching it work.
 Lesson</h2>
 Are you not having fun? Okay. I get it. Work is a total drag. Maybe you should find another job? But if you can’t find your jouissance in your work-work, find it in your personal projects. Remind yourself why you write software. Go do something silly and useful only to you, and have fun. Find it on the sly at work: do something that makes you happy in that work project. Revise the code to your satisfaction. Pause and look at the feature you just implemented and delight in how cool it is that you made something that didn’t exist before now “happen” in some virtual way. Giggle at it.
 bash is very good at orchestrating other command line tools and connecting output together in the famed unix pipelines. That’s what shell scripting languages are for! It’s a nightmare in many other ways. When’s the last time you messed up bash quoting? Exactly. It was, however, the right choice for the project in front of me; doing the work in any other tool would have been more work with a less-maintainable result. ↩</a> </li>
 I’m not picking on my current day job here. I’m thinking of all of them across the years. ETA much later: lol. That day job was hilariously bad. ↩</a> </li> </ol> </section>



Private homebrew taps
Unknown — Sun, 15 Jun 2025 12:01:30 +0000
I recently built a few things to make it easier for my team to distribute the tools they’ve been building to the entire engineering organization. I had the advantage of knowing that everybody was on Mac laptops running very close to the latest release. I had some idea of their general habits and blind spots after 18 months of observation. There were a handful of problems I needed to solve here, and a private tap looked like it would solve all of them.</p>
The problems were:</p>
Discoverability.</strong> We can document things all we want using Notion or whatever the corporate information system of the moment is, but unless somebody thinks to go hunting, they won’t find the documentation. brew search our-github-org</code> is easy to remember, however, because everybody is already using homebrew.</p>
Automated installation.</strong> It’s straightforward to install the latest GitHub release of something in a shell script, if you can get the person to run the shell script. Running brew install foo</code> is even easier. It’s also easy to say this out loud to somebody if you can’t type it.</p>
Updates.</strong> We can mostly coax people into onboarding into new repos by making a just setup</code></a> recipe conventional in as many of them as we touch, but only mostly. People definitely don’t remember to run setup scripts again to update. brew upgrade</code>, on the other hand, is something they will run occasionally.</p>
Signed Mac executables.</strong> We’re building some tools that are notarized Mac executables with installers. Brew knows how to install all the things.</p>
The obstacles to using a private tap are mostly that distribution of not-open-source software is of zero interest to the Homebrew project; they’re a package manager for open source for MacOS (and Linux too). You have to build this yourself. Fortunately, it’s not hard.</p>
By Fediverse friend request, I share my approach. Here’s the outline:</p>

Use a download strategy</a> that can fetch assets from releases in private repos.</li>
Publish internal tools to your tap with formulas that mark them as using this new strategy.</li>
Show people the slight bit of magic needed to tap a private repo.</li>
</ul>
You can do an Internet search for articles about how to write a download strategy that uses curl</code>. I chose to write a download strategy that uses gh</code>, the GitHub CLI</a>. I’ve been using gh</code> in actions for a while because it’s so easy to use for certain tasks, like, well, downloading artifacts from private repos.</p>
If you’ve maintained your own tap before, skip down to the strategy section</a> and snag that. Read on for the full details if making brew taps is new to you.</p>
Get gh</code> installed</h2>
Get everybody set up with gh</code>, because they’ll need it installed to use this strategy. Using gh</code> as a credentials helper for GitHub makes the next step more convenient, but is not required. GitHub has good installation instructions.</a> You can script it, with some human action still needed, like this:</p>
brew install gh</span></span>
gh auth login</span></span>
gh auth setup-git # optionally</span></span></code></pre>
The tap repo</h2>
Now get your tap repo set up.</p>


Create a private repo in your organization named your-org/homebrew-tap</code>. You can leave out the “homebrew” part of the name in most of the usage instructions, because it is implied. Naming the repo unambiguously is merely a convention.</p>
</li>

Make sure everybody who needs to install tools from it has at least read access. Probably you have a GitHub team that’s “all developers” or the equivalent.</p>
</li>

Make sure everybody who needs to publish tools by hand to it has at least write access.</p>
</li>

Create a fine-grained GitHub token that can clone the repo and write new commits to it. Make this available as an organization</em> secret, so any repo can use it in a release workflow.</p>
</li>

Finally, get everybody to tap it, using one of these invocations:</p>
</li>
</ul>
brew</span> tap</span> your-org/tap</span> git@github.com:your-org/homebrew-tap.git</span></span>
#</span> to use ssh to access it</span></span>
brew</span> tap</span> your-org/tap</span> https://github.com/your-org/homebrew-tap</span></span>
#</span> or if you're authing with the gh helper:</span></span>
brew</span> tap</span> your-org/tap</span></span></code></pre>
Now we have everybody tapping an empty cask. Let’s fill it.</p>
Formula files</h2>
These are predictable. Generating them is a template rendering problem that can be solved in a number of ways. Here’s a template I use for typical formulas.</a> You can write some yourself by looking at examples.</p>
Generating them by hand is, however, de trop. This is what computers are for, and especially what workflows that generate GitHub releases are for.</p>
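To make the template-rendering idea concrete, here is a minimal shell sketch that fills a formula skeleton from a handful of values. Everything in it is an assumption for illustration: the function name render_formula, the tool name mytool, the URLs, and the all-zero checksum are made up, and the skeleton is a simplified stand-in rather than a copy of the template linked above. It also leaves out the custom download strategy discussed further down.

```shell
#!/bin/sh
# Hypothetical helper: render a minimal Homebrew formula to stdout.
# Arguments: class name, description, homepage, version, asset URL,
# sha256 hex string, name of the binary inside the archive.
render_formula() {
  cat <<EOF
class $1 < Formula
  desc "$2"
  homepage "$3"
  version "$4"
  url "$5"
  sha256 "$6"

  def install
    bin.install "$7"
  end
end
EOF
}

# Example call with placeholder values (the checksum is NOT a real digest).
render_formula Mytool "An internal tool" "https://github.com/your-org/mytool" "1.2.3" \
  "https://github.com/your-org/mytool/releases/download/v1.2.3/mytool-aarch64-apple-darwin.tar.gz" \
  "0000000000000000000000000000000000000000000000000000000000000000" \
  mytool
```

A release workflow would redirect that output into a file in the tap checkout. A real template needs one url/sha256 pair per OS and architecture, which is where a proper templating tool starts to earn its keep.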
You can find actions to do this by searching on GitHub</a>. Here’s where you’ll want the API token you generated above with the ability to make commits to the tap repo. You’ll do a set of builds in a workflow, create the release, upload assets, use whatever you’ve chosen to generate a formula file, then finally commit the formula file to your tap repo.</p>
Here’s an example tap update step</a> from one of my repos. Note the two access tokens passed in as env vars. One is that token that can commit to your tap repo; this needs to be provided to gh</code> via the env var GH_TOKEN</code>. The other is the token GitHub generates for the workflow run, which has access to the tool repo. You’ll want to read release information from the GitHub api with gh</code>, which I found a million times easier to do than clunking my way through workflow step inputs and outputs. (Your mileage may vary.)</p>
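The last step of that workflow, committing the formula file to the tap, might look roughly like the sketch below. This is not the linked step from my repo; the function name publish_formula, the bot identity, and the choice of a Formula/ directory are assumptions for illustration. It also assumes git can already authenticate to the tap (for example with the token the workflow supplies), since nothing here handles credentials.

```shell
#!/bin/sh
# Hypothetical helper: commit one formula file to a tap repo and push it.
# Arguments: clone URL of the tap, path to the formula file, commit message.
# Fails (non-zero) if the formula is unchanged, because there is nothing to commit.
publish_formula() {
  tap_url=$1
  formula=$2
  message=$3
  workdir=$(mktemp -d) || return 1

  git clone --quiet "$tap_url" "$workdir/tap" || return 1
  mkdir -p "$workdir/tap/Formula"
  cp "$formula" "$workdir/tap/Formula/" || return 1

  (
    cd "$workdir/tap" || exit 1
    git add Formula
    # Identity is set per command so the runner needs no global git config.
    git -c user.name="release-bot" -c user.email="release-bot@example.invalid" \
      commit --quiet -m "$message" || exit 1
    # Push whatever branch the clone checked out.
    git push --quiet origin HEAD
  )
  status=$?
  rm -rf "$workdir"
  return "$status"
}
```

In a release workflow the call would be something like `publish_formula "$TAP_URL" mytool.rb "mytool 1.2.3"`, where TAP_URL is a placeholder for however your workflow addresses the tap repo.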
In fact, you can probably use the gh</code> built-in Go template formatting feature to make it emit a formula file, if you can cope with really long inline template strings.</p>
gh release view -R org/repo --json tagName,assets --template "some long template string here"</span></span></code></pre>
The twist here is that we want to change up the standard template by adding a custom download strategy to our tap.</p>
Downloading release assets (by strategy)</h2>
Here’s the download strategy itself. You can either embed this into each formula (the lazy way, which I chose), or put it into a file in your tap repo and then require that file in each formula.</p>
require</span> "</span>download_strategy</span>"</span></span>
require</span> "</span>utils/formatter</span>"</span></span>
require</span> "</span>utils/github</span>"</span></span>
require</span> "</span>system_command</span>"</span></span>
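</span>
#</span> Raised below when the gh download fails or leaves no file behind.</span></span>
class</span> GitHubCliDownloadStrategyError</span> <</span> RuntimeError</span>;</span> end</span></span>
</span>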
class</span> GitHubCliDownloadStrategy</span> <</span> CurlDownloadStrategy</span></span>
  def</span> initialize</span>(</span>url</span>,</span> name</span>,</span> version</span>,</span> **</span>meta</span>)</span></span>
    super</span></span>
    #</span> Extract owner and repo from the URL</span></span>
    match_data</span> =</span> %r{</span>^https?://github</span>\.</span>com/</span>(</span>?<owner></span>[</span>^/</span>]</span>+</span>)</span>/</span>(</span>?<repo></span>[</span>^/</span>]</span>+</span>)</span>/releases/download</span>}</span>.</span>match</span>(</span>@</span>url</span>)</span></span>
    return</span> unless</span> match_data</span></span>
</span>
    @</span>owner</span> =</span> match_data</span>[</span>:</span>owner</span>]</span></span>
    @</span>repo</span> =</span> match_data</span>[</span>:</span>repo</span>]</span></span>
    @</span>filename</span> =</span> File</span>.</span>basename</span>(</span>@</span>url</span>)</span></span>
  end</span></span>
</span>
  def</span> fetch</span>(</span>timeout</span>:</span> nil</span>)</span></span>
    ohai </span>"</span>Downloading </span>#{</span>url</span>}</span> using GitHub CLI</span>"</span></span>
    if</span> cached_location</span>.</span>exist?</span></span>
        puts</span> "</span>Already downloaded: </span>#{</span>cached_location</span>}</span>"</span></span>
    else</span></span>
      begin</span></span>
          temporary_path</span>.</span>dirname</span>.</span>mkpath</span></span>
</span>
          #</span> note path hack</span></span>
          system_command</span>(</span>"</span>/opt/homebrew/bin/gh</span>"</span>,</span> args</span>:</span> [</span></span>
                "</span>release</span>"</span>,</span> "</span>download</span>"</span>,</span></span>
                "</span>-R</span>"</span>,</span> "</span>#{</span>@</span>owner</span>}</span>/</span>#{</span>@</span>repo</span>}</span>"</span>,</span></span>
                "</span>--pattern</span>"</span>,</span> "</span>#{</span>@</span>filename</span>}</span>"</span>,</span></span>
                "</span>-D</span>"</span>,</span> "</span>#{</span>temporary_path</span>}</span>"</span></span>
             ]</span>,</span> print_stderr</span>:</span> true</span>)</span></span>
      rescue</span> ErrorDuringExecution</span></span>
          raise</span> GitHubCliDownloadStrategyError</span>,</span> "</span>GitHub CLI download failed for: </span>#{</span>url</span>}</span>"</span></span>
      end</span></span>
      cached_location</span>.</span>dirname</span>.</span>mkpath</span></span>
</span>
      #</span> Find the downloaded file in the temporary path</span></span>
      downloaded_file</span> =</span> Dir</span>[</span>"</span>#{</span>temporary_path</span>}</span>/*</span>"</span>]</span>.</span>first</span></span>
</span>
      if</span> downloaded_file</span></span>
          FileUtils</span>.</span>mv</span>(</span>downloaded_file</span>,</span> cached_location</span>)</span></span>
      else</span></span>
          raise</span> GitHubCliDownloadStrategyError</span>,</span> "</span>Downloaded file not found in </span>#{</span>temporary_path</span>}</span>"</span></span>
      end</span></span>
    end</span></span>
</span>
    symlink_location</span>.</span>dirname</span>.</span>mkpath</span></span>
    FileUtils</span>.</span>ln_s</span> cached_location</span>.</span>relative_path_from</span>(</span>symlink_location</span>.</span>dirname</span>)</span>,</span> symlink_location</span>,</span> force</span>:</span> true</span></span>
  end</span></span>
end</span></span></code></pre>
As you can see from the horrible hack noted in the comment, I didn’t bother figuring out how paths are set up for homebrew scripts. Maybe you know more than I do.</p>
You’ll want to edit the part of your formula file where you list available asset files by OS and architecture to mention the strategy like this:</p>
if</span> OS</span>.</span>mac?</span> &&</span> Hardware</span>::</span>CPU</span>.</span>arm?</span></span>
    url    </span>"</span>https://github.com/ceejbot/codefact/releases/download/v1.0.2/codefact-aarch64-apple-darwin.tar.gz</span>"</span>,</span> using</span>:</span> GitHubCliDownloadStrategy</span></span>
    sha256 </span>"</span>0e0d03a2f787f6d875ff02ce91cf495cc95878ace96b9a9c8f3073a6a9688b44</span>"</span></span>
end</span></span></code></pre>
Test it</h2>
Get some people to test the whole process and make sure everything works. Verify that your docs are clear enough that even the people who break everything can manage to make it work. Test your release workflows. Test updates.</p>
You’re done.</p>
Autogenerating formula files (Rust bins only)</h2>
As I advised you to do above, I looked for GitHub actions to generate formula files for me from a template. There’s one that looks heavily used, but I immediately had difficulties with python’s requests</code> module with it. My patience for debugging python packaging problems is about zero these days, so I banged out a Rust CLI tool to write exactly the formula file that the late cargo-dist</code> used to write for me. This tool, formulaic</code>, also has a flag that tells it to use the gh</code> strategy instead of figuring it out automatically.</p>
This tool reads the latest GitHub release and the Cargo manifest file it’s given the path to, which means it only works for Rust executables. It’s a quite predictable thing that fills out a template. The source is on GitHub</a>. You can see its self-generated formula file</a> in my own Homebrew tap. Snag and edit to taste if it saves you some time, so long as you also share your changes. (See the license.)</p>
Random shasum trivia</h2>
GitHub seems to have recently started adding sha 256 sums to release asset data, so they don’t have to be calculated the way I’m doing it. Unless you’re very cautious, that is. You can use gh</code> to get the shasum of any specific asset:</p>
gh release view -R ceejbot/tomato --json assets | jq -r '.assets[] | .name, .digest'</span></span></code></pre>
Here’s something I learned along the way: for some reason there are three different ways to get a sha256 sum of a file out of the box on MacOS. There are two output variations, because of course there are.</p>
command</th> style</th> origin</th></tr></thead>

sha256sum README.md</code></td> linux mode</td> compiled</td></tr>
sha256 README.md</code></td> bsd mode</td> compiled, same exec</td></tr>
shasum -a 256 README.md</code></td> linux mode</td> perl, CPAN</td></tr>
shasum -a 256 --tag README.md</code></td> bsd mode</td> </td></tr>
</tbody></table>
Note that Homebrew wants the bare hex string, without filename decoration of any kind. If you use the GitHub-generated shasum, you’ll need to trim the sha256:</code> prefix.</p>
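Getting from any of those outputs to the bare hex string is a small text-munging job. Here is one way to sketch it in shell; the helper names strip_digest_prefix and bare_hex are made up for this example, and the bsd-mode line format is assumed to be the `SHA256 (file) = hex` shape that sha256 and shasum --tag print.

```shell
#!/bin/sh
# Hypothetical helpers; the names are invented for this sketch.

# Drop a leading "sha256:" (as in GitHub's asset digest field), if present.
strip_digest_prefix() {
  printf '%s\n' "${1#sha256:}"
}

# Reduce either output style, read from stdin, to the bare hex string:
#   linux mode: "<hex>  README.md"           -> first field
#   bsd mode:   "SHA256 (README.md) = <hex>" -> last field
bare_hex() {
  awk '/^SHA256 \(/ { print $NF; next } { print $1 }'
}
```

Usage would be along the lines of `shasum -a 256 README.md | bare_hex`, or `strip_digest_prefix "$digest"` for a value pulled out of the gh JSON with jq.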


Modern terminal environment
Unknown — Sun, 12 Jan 2025 13:17:11 +0000
I read Julia Evans’s blog post on “What’s involved in getting a modern terminal environment”</a> and got very excited because there are lots of great comments in that blog post, and I have a few more of my own, and a little meta-commentary.</p>
The terminal and the shell</h2>
Are there good modern answers for terminal software? Sort of. There are certainly many options. I still use iTerm2</a>. I try other fancy new terminal software but end up back at iTerm2 every time for the combination of Macintosh features and overall performance. The Electron ones are sluggish enough that I feel it. Warp</a> is the magical one I tinker with sometimes. I loathe the idea of online sharing features or LLM auto-completion in my terminal, so I pretend those don’t exist and that it’s free software. The existence of those features for money in Warp might either please or distress you. It is, however, truly a modern take on what a terminal experience can be.</p>
Windows didn’t have any good terminal programs other than the built-in one in VSCode until recently. The default Terminal program is now just fine. I use wezterm</a> when I’m using Windows. This requires customization with Lua and is not particularly modern or magical or command-aware or anything like that, but it is zippy and therefore better than VSCode. (I note, in passing, that Windows is a pretty important environment for lots of programmers, and Rust treats Windows as a first-class target, so Rust projects can</em> do nice things for Windows users if their authors wish.)</p>
The shell is no contest. Use the fish shell</a>, as Julia recommends. You nushell users are a special sort of human; you may continue being you.</p>
If your fingers type !!</code> and !$</code> enough to miss those bash-isms, install oh-my-fish</a> and get the bang-bang</code> package. There are some other nice things there to snag.</p>
Like Julia, I install every base16 theme there is. I change my theme colors and monospaced typeface about once a year, to keep my brain thinking everything looks different. I have no idea if this is helpful to anything or not, but I like to pretend it is. Get a Nerd font variation</a> to keep the prompt looking right. Some good ones to consider: Cascadia Code, Iosevka, Fira Code, Monaspace.</p>
Oxidize everything</h2>
Rust gave us all a systems programming language with ergonomics considerably better than the hand-held table saw that is C++, and the terminal experience is better for it. Look for modern Rust variations of everything, and for a handful of neat Golang tools as well.</p>
Start by aliasing ls</code> to eza</a>. You will not look back.</p>
I stopped fussing about prompt setup when I found Starship</a>. Well, you have to fuss, but only once and then you’re done forever. Customize with toml once, put the toml into your dotfiles repo, then you have a prompt that works with any shell you are using at the moment, in whatever terminal in whatever environment.</p>
You have lots of options for reading text in the terminal that aren’t just cat with one of two pager variations we’ve had for 35 years. Why not read styled markdown</a>?</p>
Install the fuzzy-finder fzf</a> and its integration with your chosen shell (which of course is fish</code>, finally a shell for the 90s). This is a subtle enhancer for everything if you start thinking of it as a part of how you find and select things. Many other tools come with fzf</code> integrations built in directly, or you can shell-script it on up. You can get ripgrep</a> (which you should be using too, come to think of it) into the mix to get find file in project</a> with fzf.</p>
Editors</h2>
I don’t edit text in the terminal. For most editing, I use zed</a> and zed</code>’s terminal integration. I don’t have any interest in “collaborating” with stochastic parrots, but I do have a very high interest in snappy editors with excellent language server integration. I switched to zed</code> from VSCode a couple of years ago and haven’t looked back.</p>
I do still edit some files in the terminal, out of very old habit. When I edit dotfiles and other system configuration files, I type vi</code> and open them up right there. Why vi</code>? Because I am old and why type three characters when two is enough? vi</code> dates back to Bill Joy in the late 1970s, if I recall, and that’s well over 40 years ago.</p>
Editing text is not a modern thing to do inside the terminal, I think. But is modernity really what we’re going for? After all, people still use vim</code> and neovim</code> to write software effectively every day.</p>
I say something about goals</h2>
“Everything affects everything else” is true and so is the fact that nothing is perfectly consistent with everything else. It was all implemented at different times over decades by cats who resisted herding, or who were working in a slightly different context than the other cats, with different libraries. And all these grumpy cats were doing the essentially stupid thing of kind-of emulating a dead hardware terminal from a dead minicomputer company that turned into ANSI eventually. That it works at all is nice; that we can get it to do decent things is surprising. I’m not sure what it means that it’s still by far the most effective way to get programming work done for many people.</p>
So the terminal is a weird mess, yup. Changing your setup is disruptive if you do it rarely. You get practice in setting things up if you throw things up into the air often, though, and that’s why I do it. I try new things far more often than I choose to integrate them into my daily shell workflows. I can’t tell you how many times I’ve tried shell history things that promise to revolutionize my shell experience that ended up driving me to distraction within 15 minutes. It’s at least, um, twice? Once? Once for sure. I found all of the above things and a lot more that I am restraining myself from listing here by experimenting a little bit every so often with things I hear about.</p>
Changing my setup isn’t the goal, really. Finding things that are worth integrating from the seething stew of modern software, that’s my goal. What makes them worth integrating? Well, it’s not modernity, not primarily anyway, as I say above.  Modernity doesn’t select modal editors. Modernity doesn’t reach for the terminal. Modernity abandons the VT100 and sets up a mouse and windows.</p>
So we’re not trying to be modern. We’re trying to be effective. Terminals and vim</code> are effective</em> and, in the hands of experts, powerful</em>. Aim for the set of tools that make you effective. Select new tools based on how they’ll make you more effective at whatever it is you’re doing in the shell. Try things, reject some, integrate others. Tell your friends about the good stuff. (Tell me about the good stuff, too, thanks.)</p>


Understanding Software
Unknown — Fri, 29 Mar 2024 21:58:26 +0000
Nothing I said in this presentation will be shocking to any readers of this blog, but my audience here was the entire company. I wanted to let a group of non-programmers know what we do and how everybody contributes to the work of making useful software.</p>
PDF version of the rendered slides</a></p>
The Markdown version follows! The SN:</code> indicates my speaker notes.</p>

Understanding software</strong></h1>
and how it comes to be</h2>
– @ceejbot</p>
SN: A note about the slides: they’re anchor points to call out important words or to remind you where we are in the presentation. You don’t have to let them fill a whole screen if you don’t want to. There aren’t any flashing lights or animations in the presentation, either.</p>

</p>
SN: How does this sketch turn into a company that had thousands of developers, millions of daily users, and an effect on the entire world? At its starting point was nothing, and then software happened, and for about 15 years we had something</em>. Politics, news, culture– all happened because of this software. I’ve always found this amazing– somebody has an idea, and this THING appears out of nothingness.</p>

What is software</strong>?</h2>
How do we build</strong> it?</h2>
What happens afterward</strong>?</h2>
SN: Carl Sagan said if you want to make an apple pie, first you must invent the universe. We aren’t going to go that far back, but we are going to talk about these three questions. I need to caveat all of this: My answers to these questions come from a specific perspective– me & my career experiences. I am not going to talk about how it’s done at Google or Facebook or other weird gigantic companies. I’m going to talk about how software is built by small to medium sized teams in Silicon Valley, ones that happen to have a lot of ex-Apple product influence.</p>

This company writes software</strong></h1>

Everyone here contributes to this work.</li>
Everyone here would benefit from understanding how we do it.</li>
</ul>
SN: We write a lot of software. I counted X meaningful lines the other day.</p>

What</strong> is programming</strong> anyway?</h1>
SN: A traditional answer is that programming is typing long text files with instructions to make a computer do things. But when I’m sitting with my feet up on my desk, or when I’m pacing around my house muttering, or when I’m scribbling in my notebook, I’m also programming. I’m going to go to one of my favorite essays of all time for another answer.</p>

“[P]rogramming properly should be regarded as an activity by which the programmers form or achieve a certain kind of insight, a theory, of the matters at hand.</em> This suggestion is in contrast to what appears to be a more common notion, that programming should be regarded as a production of a program and certain other texts.”</p>
— Peter Naur, “Programming as Theory-Building”</a>, 1985</p>
SN: Peter Naur is the Naur of Backus-Naur Form, which some of the programmers in the audience might remember, and one of the designers of Algol, the extremely influential programming language. This is from a 1985 essay about what he’d learned about how to write and maintain and operate software. I think this is right on target. Let’s spend a moment looking at Naur’s theory of the program.</p>

Naur says a programmer who has the “theory of the program” can:</p>

Explain how the solution relates to the affairs of the world that it helps to handle.</li>
Explain why each part of the program is what it is.</li>
Respond constructively to any demand for a modification of the program so as to support the affairs of the world in a new manner.</li>
</ol>
SN: Naur was writing in an earlier era, so he talks about single programs here. Today, we write many programs and connect them all together into software systems. What he called “the theory of the program” is what I would call “the model of the system”, but both phrases get at the heart of the concept.</p>

Software</strong> is:</h1>

a lot of text files with instructions to computers (they matter!)</li>
that express the authors’ understanding of a real-world problem</li>
and their solution to that problem</li>
(and the same for every building block they needed along the way)</li>
</ul>
Programming is how we get there.</p>
SN: And this is what we have to understand to function effectively. Let’s zero in on one part of that.</p>

To write</strong> software effectively</h1>
you must understand:</h1>

the affair of the world</em></li>
how</em> the program goes about solving it</li>
</ul>
SN: The how is mind-bogglingly complex, and very few people working on any team project understand the whole thing. Some people who’ve been involved with it for a long time might have a better understanding than others, but it’s possible that nobody understands the whole thing.</p>
SN: Now, I want to back up from the theory a little bit to talk about those text files. They do matter!</p>

Code is communication</strong> with computers and humans.</h1>

Code defines data (nouns) and functions (verbs).</li>
We name things carefully because the names are meaningful to humans.</li>
A program becomes a language of its own. (Hat-tip to Dijkstra.)</li>
</ul>
SN: In the jargon of programmers, every complex system is a domain-specific language expressing our understanding of the problem.</p>

Can you guess what this code is supposed to do?</p>
fn</span> ch</span>(</span>)</span> -></span> Result</span><</span>usize</span>,</span> Error</span>></span></span>
{</span></span>
    let</span> a</span> =</span> a</span>(</span>)</span>?</span>;</span></span>
    Ok</span>(</span>a</span>.</span>f</span>(</span>S</span>::</span>H6</span>)</span>.</span>len</span>(</span>)</span>)</span></span>
}</span></span></code></pre>
SN: The programmers in the audience all guess that it’s getting the length of something, but they have no idea what that’s the length of, or what any of the other stuff does.</p>

Can you guess what this code is supposed to do?</p>
///</span> Count how many hedgies are in our zoo.</span></span>
fn</span> count_hedgehogs</span>(</span>)</span> -></span> Result</span><</span>usize</span>,</span> ZooInventoryError</span>></span></span>
{</span></span>
    let</span> animals</span> =</span> fetch_all_animals</span>(</span>)</span>?</span>;</span></span>
    let</span> hedgie_list</span> =</span> animals</span>.</span>filter_for</span>(</span>Species</span>::</span>Hedgehog</span>)</span>;</span></span>
    Ok</span>(</span>hedgie_list</span>.</span>len</span>(</span>)</span>)</span></span>
}</span></span></code></pre>
SN: You probably have a good guess about what this means, even if you don’t know the specific programming language I’m using or any programming language at all. This code communicates to humans as well as computers. This might do the exact same thing as the previous code when run, but this version has an additional layer of useful meaning, and supports Naur’s theory-building better. (It could lie, and be about counting numbats, but we try not to do that.)</p>

How do we invent</strong> that specific language to express a problem?</h1>
SN: One thing that I have learned is that no two software solutions of a problem ever look alike. I know what little I know about sudoku solving from the talk $colleague gave at a lunch and learn a couple of weeks ago. But if you gave me and $colleague the task of writing a sudoku solver, we’d write completely</em> different programs. If you gave us the task of writing a solver together, we’d write something different again. This, btw, is very cool, because it says something about human minds that fascinates me. BUT despite the differences in end result, both of us would use a similar heuristic to get there.</p>


Understand</strong> the real-world problem.</li>
Analyze</strong> it from a software point of view.</li>
Imagine</strong> a solution.</li>
Align</strong> a team on the problem, the solution, and the values that shape the solution.</li>
Coordinate</strong> to express that understanding in code.</li>
Get feedback</strong> and iterate.</li>
SHIP IT.</strong></li>
</ol>
SN: There are no secrets here. It works this way for all problems in software, whether small or large. Some things are easier when you’re a team of one– it’s easy to align with yourself. That might be hard with a team of 20, and very hard indeed when your team is larger than Dunbar’s number. But this is how it works. Let’s look at a simple example.</p>

</p>
SN: This is Visicalc, the first spreadsheet software anybody remembers. 1979. (The first one was LANPAR in 1969.) This one invention sold personal computers to millions of small businesses and is a huge part of Microsoft’s revenue even today. Spreadsheets ate the world and run many businesses and are part of critical workflows everywhere. But somebody had to make the first one.</p>


Understand:</strong> the workflow of accountants.</li>
Analyze:</strong> These numbers and dates are data a computer can store; doing arithmetic on columns of data is something a computer can do.</li>
Imagine:</strong> What if we let people type numbers into boxes and the computer automatically did the math?</li>
Coordinate:</strong> 2 people in a room!</li>
Ship:</strong> LANPAR was 1969. It didn’t ship as we understand it; but Visicalc did.</li>
</ul>
SN: The word “spreadsheet” comes directly from accounting. Let’s go broader, and apply the process to our shared endeavor.</p>

Step 1: Understand</strong> the real-world problem</h1>

Who are our customers? What are they trying to do?</li>
This is difficult! Our industry is complex!</li>
This is why every company needs its subject-matter experts.</li>
Everybody involved in designing and implementing the software does better the more they understand the people who’ll use that software and what they’re trying to do.</li>
</ul>
SN: Our experts and our customer contact people keep programmers like me in touch with who we’re making tools for. I believe I speak for every person on the engineering team when I say that we all desperately want more understanding of our customers. Please! Talk to us!</p>

We share</strong> what we understand.</h1>

Writing and reading documents.</li>
Talking to each other.</li>
</ul>
SN: Once we understand something, we don’t leap to writing code. Instead we share that understanding.</p>

Step 2: Analyze</strong> the problem</h1>
“To a person with a pencil, everything looks like a sentence. To a person with a TV camera, everything looks like an image. To a person with a computer, everything looks like data.”</p>
—Neil Postman, “Five Things We Need to Know About Technological Change”</p>
SN: Or more succinctly, the medium is the message, and the medium of software is data.</p>

Study the data</strong></h1>
The medium of software is information, or data. Software collects or generates data, then transforms that data via rules. The process of describing the data and writing the rules is what occupies us all day.</p>
SN: Call out some of the nouns we track in data.</p>

Study what people do</strong> with that data</h1>
Data by itself is uninteresting. People are using it to do something. What?</p>
SN: Talk about how our customers use their data.</p>

Step 3: Ask how automating</strong> that with software would help.</h1>
What if… we took a process that takes weeks right now, and made it take minutes instead because software does the correlation for you?</p>
SN: Marc Andreessen described this as “software eating the world”, and he should know. He invented the image tag, and that was enough for the web to eat the world.</p>

Deepen</strong> that computer-focused analysis</h1>

What data would the software need to have available?</li>
How will we get that data in a form we can use?</li>
What would we need to do with that data to present useful information to humans?</li>
</ul>

Nouns</strong>: how we structure</strong> our data</h1>
long list of nouns</em>: so much data!</p>
SN: Talk about how subject-matter experts help us identify the data.</p>

Verbs</strong>: how we transform</strong> that data</h1>

we receive a lot of data, transform it, and run some truly complex analyses on it</li>
we present that information to human beings in a form designed to help them make important decisions</li>
server engineers, UI engineers, UX designers, data engineers, and data scientists are all involved in doing this</li>
</ul>
SN: This is most of the work, right here. This is what the software does</em>, its verbs.</p>

Step 4: Align</strong> a team</h1>

on how you understand the problem</li>
on the shape of your solution</li>
on the values you bring to your solution</li>
</ul>
SN: This is what our company meeting does. Every week, we talk about what our customers are trying to do and how well we’re solving their problems.</p>

Align technically</strong> on the details of our solution</h1>

technical design choices</li>
the details of how we represent our data</li>
the building blocks of our software</li>
what our architecture is</li>
the values we use to decide among our options</li>
</ul>
SN: What programming languages are we using? How are we storing our data? Of the countless ways we might write this, which way are we picking?</p>

Technical</strong> alignment comes from:</h1>

Writing and reading documents.</li>
Talking to each other.</li>
Over and over (you don’t stop).</li>
</ul>
SN: Alignment is an ongoing task. We must constantly communicate in person and via design documents to make sure we all understand the direction we’re going.</p>

No one person ever understands the whole thing</strong></h1>
Each one of us makes decisions that push the system in the right direction.</p>
We must be in alignment, or those decisions might be at cross-purposes.</p>
SN: Alignment is critical, because complex software is too big for any one person.</p>

Step 5: Coordinate</strong> to write all those text files.</h1>
SN: DEEP SIGH. This is where all the trouble is. I could give an entire presentation on what we know about this part of it, from books people have written about their face-plants through the years. Today I’ll stick to sharing a couple of insights I hope will be useful.</p>

Software development methodologies are under-studied</strong>.</h1>
agile, scrum, kanban, waterfall, extreme programming, spiral, chaos, shape up, behavior-driven, lean, that weird UML-based thing, slow programming…</p>
SN: All of those are real names for methodologies. Which ones result in measurable, repeatable productivity improvements? No idea. Nobody has studied this. There are a few things we do know, from looking at past projects. We do know it’s a team sport, and that communication is the core.</p>


“Adding [human] power to a late software project makes it later</strong>.”
— Fred Brooks, The Mythical Man-Month</em>, 1975.</p>
</blockquote>
SN: Why? Because communication is, as we nerds like to say, an order N squared problem. Adding the 10th person to a project team adds 9 new lines of communication to worry about. This is a great book with a lot of great project insight, including the nugget that if it takes one woman nine months to deliver a baby, it does not follow that it would take 9 women one month to do it. And yet this is something the software industry keeps trying to do…</p>
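The arithmetic behind that N-squared remark can be sketched in a few lines of Python. This is an illustration rather than anything taken from Brooks’s book: the function name `communication_lines` is made up here, and it assumes every pair of people on the team counts as one potential line of communication.

```python
def communication_lines(team_size: int) -> int:
    """Count the pairwise lines of communication in a team.

    Assumes every pair of teammates may need to talk to each other,
    which gives n * (n - 1) / 2 lines for a team of n people.
    """
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


if __name__ == "__main__":
    # Print how quickly the number of lines grows with team size.
    for size in (2, 5, 9, 10, 20):
        print(f"{size:>2} people -> {communication_lines(size):>3} lines")
```

Going from 9 people to 10 moves the count from 36 to 45, which is where the 9 new lines come from.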

We know some things are bad</strong></h1>

micromanagement is awful</li>
long periods of crunch are actively destructive (and we have research here)</li>
projects that never end wear people out</li>
</ul>
SN: These things fall into the category of yeah, people are people.</p>

… and some things are good</strong></h1>

Do write</strong> things down.</li>
Do give people and teams appropriate autonomy.</strong></li>
Do collaborate</strong> on the hardest work.</li>
Do treat each other with kindness</strong> and respect.</strong></li>
Do create emotional safety</strong>, so people can experiment and learn.</li>
</ul>
SN: Huh, none of those things are about process meetings. All of these things are about enabling smart people to do their best work. Strange. Okay, let’s talk process for two more slides.</p>

Most healthy projects do something agile-ish</strong>.</h1>

Teams do best when they understand what they’re building, why they’re building it, and who they’re building it for.</li>
Self-organization and autonomy are good.</li>
Delivering working software frequently turns out to be good.</li>
Communicating with the customer a lot is also good.</li>
The details don’t matter much, so long as you’re talking to each other.</li>
</ul>
SN: The Agile Manifesto is actually good.</p>


“There is no silver bullet.</strong>”
— Fred Brooks again</p>
</blockquote>
SN: There is no single solution that works for every team in every moment.</p>

Step 6: Get feedback</strong>.</h1>
Feedback tells us if we’re on target or not. Spoiler: You’re almost never perfectly on target.</p>
SN: Feedback loops are pretty important. We need to check on how we’re doing. We run retrospectives on incidents and on projects to see how we’re doing with our processes, and learn from our experiences. Do more of this? Less of that? Feedback loops are how learning happens.</p>

Can’t we just get it right the first time?</strong></h1>
Nope.</p>
SN: And there’s a reason why we can’t.</p>


“The map</strong> is not the territory.</strong>”
— Alfred Korzybski</p>
</blockquote>
SN: Your mental model is not reality. The map is a model of the real world: the mountain, the terrain, and the trails across it. The map tells you a trail is there, but it does not tell you that the trail was washed out in a mudslide three days ago. We make our plans with the information we have, and then we learn from feedback how we’re wrong.</p>

Ways our map is wrong</strong></h1>

We didn’t understand the customer’s workflow.</li>
We got our data models wrong.</li>
We’re transforming our data incorrectly (or inefficiently).</li>
We figured out a new approach along the way.</li>
Teams didn’t align with each other, and their software doesn’t work together.</li>
Software we rely on behaves unexpectedly.</li>
We made mistakes while building things.</li>
</ul>
SN: All of these things are guaranteed to happen, mostly at a small level, but sometimes with very big concepts. So we need feedback, and we need to take active steps to get it.</p>

Feedback from testing</strong></h1>
We test for many reasons!</p>

Does this one piece do what we want it to do?</li>
Are all the complex pieces working together?</li>
Does the system do what we expected?</li>
(Did we get lost despite following our map?)</li>
</ul>
SN: This is why we have QA.</p>

Feedback from our customers</strong></h1>

Is our system doing what our customers need?</li>
(Did we reach our planned destination or did our map lie?)</li>
</ul>
SN: The people who regularly talk to our customers are invaluable.</p>

Step 7: Ship it</strong>.</h1>
Get it into the hands of customers as soon as it would be useful to them. Get revenue as soon as you’re able.</p>
SN: The reality of Silicon Valley style software companies is that we all go into debt immediately to be able to pay salaries and AWS bills. We want to get out of that situation as soon as possible, so the company can keep doing its thing.</p>


“Ship or die.</strong>” — Danger, Inc, internal motto, 2002</p>
</blockquote>
SN: Before the team shipped the first Sidekick in 2002, we said this often to each other. This over-dramatic motto came from a maniacal focus on shipping, getting our product done and out there into people’s hands. But the catch is that you’re not done when you ship.</p>

What happens after</strong> you ship?</h1>
Staying alive with more software.</p>
SN: So it’s great we shipped instead of dying, but now we gotta keep the software alive too. Software is never finished! We continue to modify it after we release it to the world.</p>

Most of the cost</strong> of software is maintaining</strong> it</h1>
Every line of code we write has a maintenance cost: people, time, thinking.</p>
SN: Those half-million lines of code represent complexity that has to be understood.</p>

Living software systems must be operated</strong>.</h1>

Software must be run to have meaning!</li>
Keeping software running is an entire area of expertise.</li>
Operations teams tend the software that runs the software to run the… oh no.</li>
</ul>
SN: Text files on GitHub don’t do much by themselves.</p>

Living software systems must be changed</strong>.</h1>

the world around us changes</li>
new laws & regulations, new practices from our customers</li>
the context in which the software runs changes</li>
the team maintaining the software changes over time</li>
</ul>
The software must change in response.</p>

Changing software requires understanding</strong> it</h1>
Naur’s third point: A programmer with the theory of the system can “respond constructively to any demand for a modification of the system so as to support the affairs of the world in a new manner.”</p>
SN: Let’s call back to Naur again: changing software requires understanding it. The more complex and voluminous the software, the more there is to understand.</p>

Success can be a catastrophe</strong>.</h1>

we need to scale up from a few customers to many</li>
we learn where we need to be flexible</li>
we learn where our models were incomplete</li>
</ul>
SN: A friend who was at Twitter during its early years describes implementing things that would get them through the next six months, by which time they’d have the replacements ready to go.</p>

All software has a lifespan</strong></h1>

the changes made to it slowly build up like plaque in arteries</li>
the software in a big system usually gets replaced in pieces to keep the system itself working</li>
the system of software itself lives a long time</li>
</ul>

Congratulations.</h1>
Now do it all over again</strong> for the next product.</h1>
SN: You figured out how to eat this thing with software. You shipped. Your customers grumble sometimes, but they’re mostly happy. PHEW. Let’s do a fast recap.</p>

recap: what is software</strong>?</h1>

software is, yes, text files with instructions to computers</li>
it’s also an expression of our understanding of a real-world problem</li>
and an expression of our analysis from a computing perspective</li>
</ul>

recap: how do we build</strong> it?</h1>

there’s no perfect answer to this</li>
building software requires a team to

align on their understanding</li>
plan an approach</li>
coordinate with each other</li>
iterate in response to feedback</li>
</ul>
</li>
</ul>

recap: what happens after we ship</strong>?</h1>

software lives on long after we build it</li>
most of its cost is maintenance</li>
you have to understand it to maintain it</li>
eventually we need to replace it</li>
</ul>

And that’s how we turn a napkin sketch</strong> into something that affects the physical world.</h1>

Questions?</strong></h1>
SN: Stop sharing screen now.</p>


Accepting Work
Unknown — Tue, 19 Dec 2023 14:10:00 +0000
For “you” in this document, read “you and your team”.</p>
I link to some interesting reading on some of these anvils, but mostly I don’t. These are things I generally take as facts about the world, with the usual squishy “it depends sometimes” about some of them. I use agile methodology language, mostly, even though I like to say I really hate agile processes. Do I hate agile? Really? Let the anvils commence!</p>

Don’t let people outside the team assign work to the team. They may propose work, but you decide if you accept that work.</p>
The rate at which you accept work must be less than the rate at which you finish work, or you will have infinite work.</p>
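A tiny simulation makes the rate argument concrete. This is a sketch under made-up assumptions: work is counted in interchangeable “items”, the rates are constant from sprint to sprint, and the function name and numbers are hypothetical.

```python
def backlog_after(sprints: int, accepted_per_sprint: int,
                  finished_per_sprint: int, starting_backlog: int = 0) -> int:
    """Return the number of unfinished items after the given number of sprints.

    Each sprint the team accepts a fixed number of items, then finishes
    as many as its capacity allows (never more than it actually has).
    """
    backlog = starting_backlog
    for _ in range(sprints):
        backlog += accepted_per_sprint
        backlog -= min(backlog, finished_per_sprint)
    return backlog


if __name__ == "__main__":
    # Accepting 10 items per sprint while finishing 8: the pile only grows.
    print(backlog_after(26, accepted_per_sprint=10, finished_per_sprint=8))
    # Accepting 8 while able to finish 10: an existing pile drains away.
    print(backlog_after(26, accepted_per_sprint=8, finished_per_sprint=10,
                        starting_backlog=20))
```

A gap of only two items per sprint is enough to make the backlog grow without bound in one direction, or drain to zero in the other.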
Operational incidents and meetings count as work.</p>
Bug-fixing counts as work.</p>
Don’t accept work that you don’t understand. “Figure out this project well enough to estimate it” is acceptable work, as is “cooperate with a product designer to get design documents into a state where they describe acceptable work”.</p>
Only rarely should you say no outright to work. If it’s not well-defined, push to define the work better. (Unclear requirements make for misery on both sides.) If your team has too much work already, push for prioritization. (Something has to give. It will always give in reality, whether people admit that in advance or not.)</p>
Sometimes you need to communicate the consequences of your team taking on disruptive work and let your customer decide if the cost is worth it.</p>
Technical design and research counts as work.</p>
Estimation counts as work. The more time you spend on accurate estimation, the less time you spend on other work, such as implementation. This is often worth the time anyway, because sometimes the business needs it.</p>
Tools are not a substitute for communication.</p>
One point of the retro is to figure out what your true rate of finishing work is. If you finished less than you took on, then next time take on less work. If you finished more, cautiously take on a little more.</p>
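That adjustment rule can be written down as a small function. This is one possible reading, not a prescribed formula: the name `next_commitment`, the choice to commit to exactly what was finished after a shortfall, and the default step of one item are all assumptions made for illustration.

```python
def next_commitment(accepted: int, finished: int, cautious_step: int = 1) -> int:
    """Suggest how much work to take on next sprint, based on the last one.

    - Finished less than accepted: take on only what was actually finished.
    - Finished more than accepted: raise the commitment by a small step.
    - Finished exactly what was accepted: keep the commitment unchanged.
    """
    if finished < accepted:
        return finished
    if finished > accepted:
        return accepted + cautious_step
    return accepted


if __name__ == "__main__":
    print(next_commitment(accepted=20, finished=14))  # shortfall: take on less
    print(next_commitment(accepted=20, finished=23))  # surplus: a little more
```

The asymmetry is deliberate: the commitment drops all the way to the observed rate after a miss, but rises only one step at a time after a good sprint.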
You probably do not spend enough time doing retros and planning for your next sprint. One hour every two weeks isn’t enough.</p>
The more you understand the work, the better you do at estimating it.</p>
Corollary: You do best estimating work very similar to work you’ve done before. 1</a></sup></p>
Another corollary: Estimates you make at the start of a project, when you know the least about it, are the most likely to be wrong. Build in feedback loops for estimates! Communicate with your customers as estimates change.</p>
Don’t let your early estimates get turned into deadlines.</p>
Sometimes the business itself has deadlines. Frequent delivery of working software is a survival tactic for deadlines.</p>
Sometimes you get it wrong. Use the retro to figure out what you can learn from the mistake. Remember, the map is not the territory</a>. Sometimes that clearly-marked trail turns out to have been destroyed by a mudslide.</p>
High-uncertainty projects dominate software schedules.</a> The thing that’s late because the trail was washed out ends up making everything late. Maybe it was worth an advance scout? 2</a></sup></p>
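One rough way to see the effect is to compare the gap between best-case and worst-case estimates for each task. This is a deliberately crude sketch, not the model used in the linked article: the task names and day counts are invented, and it assumes the tasks run one after another so their ranges simply add up.

```python
from typing import Dict, List, NamedTuple


class Estimate(NamedTuple):
    name: str
    best_case_days: float
    worst_case_days: float


def uncertainty_share(estimates: List[Estimate]) -> Dict[str, float]:
    """Return each task's share of the plan's total best-to-worst spread."""
    spreads = {e.name: e.worst_case_days - e.best_case_days for e in estimates}
    total = sum(spreads.values())
    if total == 0:
        # No uncertainty anywhere, so nobody owns a share of it.
        return {name: 0.0 for name in spreads}
    return {name: spread / total for name, spread in spreads.items()}


# Hypothetical plan: two well-understood tasks and one leap into the unknown.
PLAN = [
    Estimate("settings page", 2, 3),
    Estimate("billing export", 3, 5),
    Estimate("new sync engine", 5, 30),
]

if __name__ == "__main__":
    for name, share in uncertainty_share(PLAN).items():
        print(f"{name:<16} {share:.0%} of the schedule risk")
```

In this made-up plan the one poorly understood task owns most of the possible slip, which is the argument for spending scouting time there first.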
If hitting a date you provide matters, invest time in lowering uncertainty.</p>
Fred Brooks</a> spoke truth about how communication overhead dominates work. Most complex software work can’t be parallelized or sped up the way businesses want to speed it up.</p>
The fastest way to get projects done is to have an aligned team take on chunks at their own pace, without doing any planning other than technical planning. Nobody likes hearing this, but it’s a consequence of the overhead of estimating and bookkeeping.</p>
Agile™</h2>
Agile™ as practiced has little to do with the original principles of the movement. Those original principles might be summarized roughly as:</p>

You are building things for a customer. Talk to your customer.</li>
Deliver frequently.</li>
Build in feedback loops so you can figure out what you’re doing that’s working and what’s not.</li>
Let teams self-organize. Trust them.</li>
Change is inevitable, so plan for it.</li>
</ul>
I wrote those points off the top of my head, so I went to the original to see how well I did at capturing its spirit. Not bad! This is what the Agile Manifesto says.</a> Go read it! It’s short! Then weep at how far we have strayed from it. Also note what isn’t there: any rigidity about sprint lengths, planning poker, burndown charts, anybody other than the team itself deciding how to do things. It sounds pretty sensible to me, to be honest. (Maybe it’s only Agile™ that I dislike?)</p>
What I also like about that original manifesto is the focus on sustainability. “Sprinting” isn’t mentioned. Communication</em> sure is, though, and I’m 100% aligned with that. Talk to people involved in the project. Frequently. Even about bad news. 3</a></sup> It’s all about the communication.</p>
And as you know, communication has its own overhead. I point back at Brooks, who says there’s no silver bullet.</p>
Did I have a thesis?</h2>
Mostly I wanted to write down some things I take as fact about planning and estimation that are often at odds with how software organizations behave. I’ve been itching whenever I hear about teams “doing agile” or “getting scrum training”. My theory is that processes are never one size fits all. You can’t be dogmatic about them. Teams vary, and so do projects. Some teams write a lot; some teams talk a lot; some teams demo a lot. Some teams pair; some teams mob program. What’s more, teams vary over time even when their membership is mostly stable, because people change and learn.</p>
The best process for any project is probably one you design in the moment for the team. You never have to do this in a vacuum, because there are lots of good processes to steal from, and your team is probably doing some set of things already that are effective for them.</p>
Dogmatic adherence to a half-understood Agile methodology probably ain’t it. So go back to the original! It’s pretty good.</p>



The good news is that later in your career, after you’ve seen a lot and accumulated amusing war stories, you have many past projects to compare the current one to. It gets easier. ↩</a></p>
</li>

Okay, okay, I’ll stop abusing this poor metaphor. But if it’s high-uncertainty and important, it’s probably worth a code spike or a couple of weeks spent on research. ↩</a></p>
</li>

There’s a Michael Pollan “mostly plants” joke lurking here. ↩</a></p>
</li>
</ol>
</section>


A systems analysis rubric
Unknown — Sun, 10 Dec 2023 11:30:56 +0000
This is a systems analysis document rubric I’ve written several variations on in recent years. I’ve genericized it a bit and updated it with my current thinking. The form of this document is something a team would have in their official processes library somewhere, as a guide to how to do analysis of a fresh problem. I’ve had this blog post sitting 90% finished for a year now, so hey, here it is!</p>
NB:</strong> I have come to believe that there is no one process that works for every team. The process that makes a team most effective is a process designed for that team, for their current project. Don’t be dogmatic about anything! Think about the true goal, which is to write good software that does what it needs to do, making its users happy while its authors have chill weekends. Take the ideas here and adapt them to what your team needs.</p>
I no longer call this document an RFC, because I think this term comes with the implication of a slow-moving process, which has to solicit a lot of feedback because of its importance. This is perfect when you’re designing the fundamental protocols of the Internet; it is not quite what I find myself wanting my colleagues to do. I am using the term “systems analysis rubric” as I think about this task right now, because systems analysis is where my head is, and what I see missing from a lot of problem-solving.</p>
“Problem statement” might also be a good name for this document, although I think it’s good to explore possible solutions in it as well as problems. Coming to a clear problem statement is possibly the most important task you have when you’re thinking about changing something or making something new.</p>
Design documents: a systems analysis rubric</h1>
A design document is a structured way to have and record a conversation about a problem. It is not appropriate for all problems you might be solving. The formality and length of the conversation depend on the scope and complexity of the problem. For a bug fix, you might need a short conversation with a single colleague, plus commentary in a commit message. For a major project, this process might take weeks to complete and you might write several of these documents.</p>
While the process does</em> produce a document, the document is not the most important result. The important result of the design process is the exploration</em> of the problem that writing the document encourages. The conversation that accompanies the exploration aligns you and your team on an understanding of the problem. Yes, a design doc might describe a proposed solution, but this proposal is secondary to a team’s collective understanding of the problem to be solved.</p>
I’m going to hammer on this point as I go here. The document exists to promote exploration and shared understanding of the problem. The document is a tool in service of a more important goal.</p>
The widening conversation</h2>
My design documents start as notes to myself. I attempt to structure my own thoughts about a problem by writing down what I’m thinking. The stakes are low; the document is so informal that it’s likely nothing more than bulleted lists of things that come to mind. As you go, your writing should tighten up and be more complete, but remember: the document is not the point. Don’t stress about sentence perfection. 1</a></sup></p>
The audience for the design document changes as it matures. When you are writing your first notes about a problem, you might share them only with a pairing partner to get immediate feedback. As you gain confidence in your understanding of the problem, widen the audience for your document. Seek out feedback from domain experts and from your team as a whole.</p>
Show your design document to its stakeholders in advance of any public discussion, to give them a chance to think and give you feedback. Follow the principle of least surprise. People can react badly to surprises even if they agree with the proposal in the main. If you can, avoid introducing complex technical topics in meetings. Meetings are best used to solidify alignment or discuss specific known open questions.</p>
When you reach the step of sharing your proposal with the entire engineering organization, it will be a solid document that you feel confident about.</p>
The process of exploration</h2>
Step one: Research.</p>

Investigate the background of the problem & document the current solutions, if they exist.</li>
Document why the current solutions are inadequate, if relevant.</li>
Gather relevant product documentation, if it exists. A product requirements document is ideal, and this phase might be focused on collaborating on requirements with a product team.</li>
</ul>
Step two: Write a clear problem statement.</p>

What change would you like to effect upon the system?</li>
What is happening today that you’d like to be different after the work you’re considering?</li>
What are the properties of a successful solution? How will you know it’s successful?</li>
Identify constraints on the solution space. Development time? Budget? Performance? A fixed point of integration?</li>
Why is this the right problem to solve now?</li>
What problems are you choosing not</em> to solve right now?</li>
Refine your problem statement until the team aligns on it.</li>
</ul>
Step three: Explore possible solutions.</p>

Identify and consider possible solutions.</li>
Discuss tradeoffs inherent in the solutions. Evaluate them against the constraints.</li>
Estimate costs of the solutions, in time / effort / complexity / maintenance / hiring.</li>
If necessary, do spike implementations to test the validity of assumptions or the viability of a specific approach.</li>
</ul>
Step four: Reach consensus on a solution that solves the stated problem while making acceptable tradeoffs.</p>
Sometimes step four does not</em> end in consensus on a solution, but instead ends in a decision to do further research. This is a good result and should not be treated as a negative by the team.</p>
The design document should now be a document describing the problem, the research, and the possible solutions, and conclude with a plan of action. Congratulations! Archive the final version in the corporate wiki or in a docs folder for the resulting project. Its next audience is the person working on its replacement, who you’ve just given a good head start.</p>
Now let’s review the parts again, in more detail.</p>
The problem statement</h2>
You’ll start with something you think is a good problem statement, but you will often</em> find that it doesn’t go into enough detail to support a good technical decision. Constraints might be missing. Stakeholders might disagree on what success looks like. Important implicit requirements might need to be unearthed.</p>
The initial problem statement informs your research, but expect to change it. Push on it and iterate until you have something the team agrees on.</p>
Among the constraints you implicitly take on for any project are your team’s shared values</em>. If your team hasn’t discussed those values, now is a good time to do so. Your shared values are partly a reflection of your team’s personality and culture, and partly a reflection of where your business is. A team at a new startup trying to ship something quickly for survival might value a minimal solution that can be produced rapidly. The same team following up after a successful first ship might value flexibility instead. Make the implicit explicit and state any values that might affect this project.</p>
Detail on the research step</h2>
Do not short-change this step! This is critical to understanding the problem. Do the background research if there is extant code. Summarize that research, with relevant links, so your readers can also understand the context.</p>
Answer scaling questions if they’re relevant. Gather numbers for today, a year from now, and as far in advance as a reasonable guess can be made. Does your solution to the problem have a lifespan? Don’t look beyond that lifespan if so.</p>
For data being stored and manipulated, you might ask questions like these:</p>

How much data is being discussed? Is it large in total size or in quantity?</li>
What actions are taken on this data? How often does it change? In what quantity?</li>
Who is changing this data?</li>
What are the constraints on data changes? Are there any conflict resolution requirements? Do operations need to be serializable (expensive) or will idempotency suffice (cheap)?</li>
What happens if data mutations are lost?</li>
How is this data expected to grow over time? Is it shardable if massive growth is expected?</li>
If the data is very very large, the questions become more specialized. If you are not a data engineer, you might want to consult one.</li>
</ul>
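The idempotency question in that list is easier to discuss with an example in hand. Here is a minimal sketch, assuming each mutation arrives with a unique operation ID chosen by the caller. The `Inventory` class and its fields are hypothetical, and a real system would need to persist the set of seen IDs alongside the data.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Inventory:
    """Hypothetical store that applies each operation at most once."""
    counts: Dict[str, int] = field(default_factory=dict)
    applied_ops: Set[str] = field(default_factory=set)

    def add_stock(self, op_id: str, sku: str, quantity: int) -> bool:
        """Apply the change unless op_id was already seen.

        Returns True if the change was applied, False if it was a duplicate.
        """
        if op_id in self.applied_ops:
            return False
        self.counts[sku] = self.counts.get(sku, 0) + quantity
        self.applied_ops.add(op_id)
        return True


if __name__ == "__main__":
    inventory = Inventory()
    inventory.add_stock("op-1", "widget", 5)
    inventory.add_stock("op-1", "widget", 5)  # a retried delivery: ignored
    print(inventory.counts)
```

This makes retries safe, which is often all you need. It does not give you any ordering guarantees between different operations, and that gap is what the more expensive serializable option buys.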
For APIs, the questions might look like this:</p>

What other systems are expected to call this API? To do what tasks?</li>
What are the latency requirements?</li>
Is this operation write heavy or read heavy?</li>
How many requests/sec do we experience at peak? How will this number change over time in relation to business growth?</li>
Does peak load differ from steady state load? When is the load heaviest? Does this correlate with other usage patterns in the system?</li>
If you’re caching expensive work product, identify how you’ll be invalidating that cache. What fails if the cache is stale? (Do you really need a cache? Really?)</li>
</ul>
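For the caching question, it helps to write down both invalidation paths before committing to a cache at all. The sketch below is one simple shape, assuming time-based expiry plus explicit invalidation when the underlying data changes. The `TTLCache` class is hypothetical, and the clock is injectable only so the behavior can be tested without waiting.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache results of an expensive function, with two ways to go stale-free:
    entries expire after ttl_seconds, and callers can invalidate a key early."""

    def __init__(self, compute: Callable[[Hashable], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps key -> (expiry time, cached value).
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        """Return the cached value, recomputing it if missing or expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = self._compute(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop a key immediately, for example right after a write."""
        self._entries.pop(key, None)
```

Whatever sits between a write and the matching `invalidate` call, or inside the TTL window, is the stale-read exposure the question above asks you to think through.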
Failure analysis is next. This topic can be where engineers shine, because we love discussing how things fall over.</p>

How might this system fail?</li>
What are the consequences of failure for this system?</li>
Should any of these failures be visible to or actionable by the end-user? If so, how should they be presented?</li>
How should we handle the most important or unusual invisible-to-users</em> errors? Retry? Escalate to human beings? Log and move on?</li>
</ul>
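The three options for invisible-to-users errors can be combined into a single policy. This is a sketch of one such policy, not a recommendation: `TransientError`, `run_with_policy`, and the `escalate` callback are names invented here, and a real implementation would probably wait between attempts.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("failure_policy")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_with_policy(operation: Callable[[], T], max_attempts: int,
                    escalate: Callable[[Exception], None]) -> Optional[T]:
    """Retry transient failures, log each one, and escalate if all attempts fail.

    Any other exception propagates to the caller untouched.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as err:
            # Log and move on to the next attempt.
            logger.warning("attempt %d of %d failed: %s",
                           attempt, max_attempts, err)
            last_error = err
    # Out of attempts: hand the problem to a human being.
    assert last_error is not None
    escalate(last_error)
    return None
```

Deciding which errors count as transient, and who receives the escalation, is the part the design conversation needs to settle.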
What are the security concerns? Do a threat modeling exercise with security experts early, particularly if you’re doing something new or not handled by existing tools.</p>

Are you accepting untrusted user input? How do you need to handle it?</li>
Who is allowed to perform these operations or see this data?</li>
Are you managing data that needs to be protected or encrypted?</li>
What would an attacker gain if one got access to your data or your API?</li>
What would a person with bad motives do if they have normal access to this new functionality?</li>
</ul>
The appropriate questions to ask depend on what your area of work is and what “affair of the world” it addresses. These questions are intended to get you started.</p>
Problem statement (slight return)</h2>
Come back to your initial problem statement. Can you sharpen it? Can you clearly define what a successful solution might look like now? If you’ve done the research, you probably can.</p>
Don’t move forward until you have consensus that the problem statement is good.</p>
Solutioneering</h2>
This is where programmers love to be. We are problem-solvers and we want to jump right to solving problems, especially if we can write code to do it. Resist this urge.</em> Your solutions have a better chance of success if they are informed by a solid grasp of the problem you need to solve. Your second and third refinements of a solution are likely to be better than your first.</p>
This step is often focused on navigating tradeoffs. The problem statement, if it’s sharp enough, gives you a good razor</a> to use to evaluate solutions against your success criteria.</p>
What are the costs of a possible solution? How complex is it?</p>
Give the solution a t-shirt size. Does it match the time budget the project has?</p>
What are the risks in the solution? How might it fail to solve the problem or otherwise fail as a project?</p>
What’s the solution’s blast radius? That is, how many other systems would be affected by the work? How many teams?</p>
Does the solution introduce new technologies to the overall system, or does it leverage tools your team understands well? If it spends novelty points, do they buy you something worth the expense?</p>
Does the solution align with the team’s values?</p>
In many cases the right solution will feel good to the team discussing it, and you’ll reach consensus smoothly. When information, values, and understanding of the problem are shared, alignment is easy. If consensus is not happening, make an attempt to figure out why the team is not aligned. Is there a disagreement about values? An information disparity? Is more research needed? Bring in senior staff to help break stalemates. Bring in somebody from another team who has relevant experience. Remember that the project might need to move forward anyway because of business needs, and a half-good solution might be better than no solution in the short term.</p>
The document’s final home</h2>
I end up making a design</code> or docs</code> subfolder in the code repo for these documents. Your organization might have an official home for documents that isn’t the repo. I suggest that you at least store a copy next to the code, where it will survive as long as the code does. The document will drift out of sync with reality the instant anybody starts implementing the plan, but that is fine. The document exists to help future maintainers understand what their predecessors were thinking at the time.</p>
Remember: the act of writing the document is more important than the document. The sharp problem statement and shared understanding of the solution were the goals of the exercise. If it got you there, it was good enough.</p>
Additional reading</h2>

The Rust RFC process</a> discusses the importance of the conversations.</li>
Architectural Decision Records</a></li>
A Structured RFC Process</a> by Phil Calçado talks about the benefits of widening the circle of review.</li>
</ul>



Correctness in these details can make an unconscious impression on readers that matters, so if you have the time, hey, spell-check yourself. The opposite side of this is that you as a reader of design documents need to set aside your own fussiness about spelling and grammar, should you have any, especially if the author is not a native speaker of the language they’re writing in. These things are to the side of the problem. ↩</a></p>
</li>
</ol>
</section>


Multi-factor panacea
Unknown — Mon, 10 Oct 2022 10:00:00 +0000
Context: @substack</a> 1</a></sup> deleted his github account, which includes a lot of foundational source code from the early days of node. The speculation (and it is only speculation as far as I know, though with some foundation in his tweets) is that he did so because of the MFA requirements being imposed on some NPM package maintainers.</p>
Here’s my take. It’s not very hot and is probably marginally more informed than many, but it’s also probably worth what you paid for it. I started writing it as a series of tweets, which is why there are some extremely terse phrasings here.</p>
Proxies</h2>
Okay, I’ll weigh in on this one, because I have spent time thinking about it, and because @isntitvacant</a>, @i_a_r_n_a</a>, and I made MFA happen for NPM originally.</p>
Companies freeloading off of open source are worried about intentional security compromises in the software they’re benefitting from. Let’s walk through their threat model: Somebody gets access to the account of somebody who works on a package that company X uses and uses that access to publish a deliberate compromise. The update gets taken automatically by the downstream consumer, and then they are shipping their environment variables out to a third party, or running a cryptocurrency miner, or allowing an attacker to get shell access.</p>
Does forcing maintainers of “important” packages to enable MFA and never turn it off help protect companies from this threat?</p>
Kinda. Restrictions on package authors protect against one category of supply chain attacks. They protect against account hijacking. The state of the world used to be that some NPM users with critical spots in the dependency graph had passwords like “password”. No, I’m not joking. Taking away that easy attack vector seems helpful, so I’m glad we shipped what we did when we did. One good design choice we made was to not</em> implement MFA via SMS, which would have been no protection at all against these threats because social engineering makes SMS not secure at all.</p>
Requiring that a maintainer enable MFA for their account does not, however, protect the source you use from all supply-chain attacks. Legit maintainers have been responsible for some of the worst. Case in point: the infamous left-pad deletion was done by the package maintainer and MFA would not have helped one bit.</p>
MFA is still a good thing to do</em>, but it’s not protection against what the companies freeloading on open source maintainers are worried about.</p>
Why not? Because the account level is only a proxy for the level you care about. You’re far more interested in audit trails that are at the package level and then at the source level. What changed in this release? Did maintainers change? What source changed?</em></p>
An historical aside</h2>
@isntitvacant</a> did think about MFA from the package perspective when he designed the back end support for this! He also thought about restricting tokens to CIDR ranges, though I don’t know if that has ever been exposed. We missed the chance to go even finer-grained on access token permissions than we did. And we definitely should have done audit logs on package ownership changes.</p>
My only excuse is that at the time it felt like a triumph to be able to get the feature shipped at all.</p>
A side comment about the mess we’re in: NPM was designed to maximize engagement from publishers, not to be a good package manager at the scale that it reached. It was designed deliberately to be viral, not to be secure or auditable.</p>
For example: The default is to take updates without thinking about them, to the point where bots do all that work of dependency updating for you. Downloads number gotta go up.</p>
For example: The tarball as unit of deploy: huge, contains weird stuff that you don’t care about plus whatever silly things the package maintainer put in, hides the deltas. But it was very easy to implement and good enough.2</a></sup></p>
NPM’s design pushes you into not thinking about your software supply chain, through decisions made when winning a war among competing node package managers was important to somebody.</p>
Stop being pushed. Stop taking updates by default. Think about your supply chain differently.</p>
Problems not proxies</h2>
What are</em> you interested in when thinking about this threat? The source itself. The software you’re relying on.</p>
Inspectable audit trails for changes are far more interesting, and this is not something requiring MFA for package maintainers gets you</em>. Looking at maintainers is looking at a proxy for the threat, not the threat itself.</p>
Protecting against the proxy does not give you a free pass on looking at the source you’re depending on and deciding it’s okay. It does not give you a free pass to take every update there is without thinking.</p>
Questions it helps to know the answers to:</p>

Who published this?</li>
How do you know they were that person? (Same as controller of repo? Controller of other accounts? What’s the web of identity?)</li>
What was the chain of control of the source?</li>
Is the source that was published the same as the source in the advertised repo?</li>
What was the source delta from the last publication?</li>
What does the source do?</li>
</ul>
And given the possibilities of bugs and</em> of some maintainer with bad goals playing the long game, only the questions about the source are on target. The rest are proxies.</p>
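Here is a minimal sketch of answering the "source delta" question yourself. It assumes you have already unpacked two published versions of a package into two local directories. The function names are mine, for illustration, and are not part of any NPM tooling.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def source_delta(old_root: Path, new_root: Path) -> dict[str, list[str]]:
    """Report which files were added, removed, or changed between two trees."""
    old = file_digests(old_root)
    new = file_digests(new_root)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }
```

The same comparison can be pointed at an unpacked tarball and a checkout of the advertised repo. Expect some noise there, since published packages often contain build output that the repo does not. The list of changed files only tells you where to start reading; it does not tell you what the source does.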
The tech industry relies on software they do not take the time to inspect, written by strangers they mostly choose not to pay. Sometimes the industry pays people to work on very critical projects, such as Linux itself! But the web dev world rarely stops to pay the people who were around the node scene at the beginning, writing tiny modules because that was their philosophy, which then got bricked together without their participation into the foundations of modern web development.</p>
Because tech industry companies still don’t want to pay for the work they build on top of–with either their time or their money–they impose requirements on those strangers to attempt to protect themselves from a proxy for the threat, with zero cost to themselves. Those strangers have every right not to participate; it wasn’t what they signed up for back then. Any access to their work you had was a gift.</p>
tl;dr Use Feross’s Socket</a> to scan the source itself; you won’t catch them all; pay people to write any software you truly rely on.</p>



substack the good human, not substack the objectionable paid newsletter company. ↩</a></p>
</li>

The tarball is a case of worse is better. And yes, that’s a complex statement itself. It was good enough and easy enough to implement that it satisfied the true requirements of the problem in the moment. Knowing the problem space as well as I do now, and in the current package manager landscape, I would design it quite differently were I to take on the project myself, today. ↩</a></p>
</li>
</ol>
</section>


Goodbye Cloudflare; hello Fastly!
Unknown — Sat, 27 Aug 2022 16:53:01 +0000
KiwiFarms is a harassment website, sort of like a terrorism-only variation on the *chan sites. It specializes in harassing trans people. It doxxes them, SWATs them and their families, and does its best to drive its victims off the internet. It also has a bodycount. They are a troll farm</a>.</p>
KiwiFarms gets to do this and stay on the internet because they’re being protected by Cloudflare</a>. Cloudflare has a long history of protecting incredibly vile content: they were recently infamous for hosting Daily Stormer</a>.</p>
Cloudflare is exceptional in its position. From the Time article:</p>

“We find anecdotally that sites prefer Cloudflare because of its lax acceptable use policies and its free DDoS protection services that help protect against vigilante attacks,” the researchers write. They note that AmmoLand, a popular guns rights blog, has praised the company “for its self-described ‘content-neutral’ stance.”</p>
</blockquote>
Cloudflare takes a freezepeach position on free speech: they do not acknowledge the reality that in order to protect the free speech of the many, we cannot tolerate the abusive behavior of the few. Cloudflare protects the abusers instead.</p>
Liz Fong-Jones has been leading the current pressure campaign against Cloudflare most effectively.</p>
Why this matters to me</h2>
When I set up my blog, I hosted it in an S3 bucket behind Cloudflare, using their free plan because I have very simple needs for it. I do not want to lend them even that little support, so today I moved my blog to Fastly.</p>
I moved my last two employers to Cloudflare from other CDNs. I won’t be repeating that mistake until they shape up and start removing Nazis and troll sites without</em> needing pressure campaigns to move them. I treasure my friends and it is unacceptable to me that some of them go through their lives afraid for their personal safety because of sites like KiwiFarms.</p>
You might decide that freeloading off of Cloudflare is fine, because you’re siphoning resources from them. You might also be unable to pay for another CDN. Only you know your circumstances. I have the disposable income to spend a little more than I spend now on my AWS hosting bill on a CDN provider who doesn’t have to be pressured over and over again to boot sites like Daily Stormer and KiwiFarms.</p>
This is how I did it, in the very short version:</p>

set everything up in fastly</li>
tell fastly about my certs</li>
verify that their test url worked</li>
duplicate all my dns setup in route53</li>
cut over name servers with my registrar to route 53</li>
</ul>
The rest of this blog post goes into what I did in more detail, in the hope that I can reassure you it’s very do-able.</p>
By hand, in more detail</h2>

Create a Fastly account</a>. (Set up 2FA!)</li>
Scan through Fastly’s getting started guide</a>. The concepts here are different from Cloudflare’s concepts. Fastly is (oversimplifying a bit) a nice front end to Varnish & VCL</a> plus a lot of POPs around the world to reduce latency to your users. VCL can do a lot. You end up with a lot more control over how things get routed, but the cost is more complexity to cope with.</li>
Give Fastly a credit card so you can enable TLS.</li>
</ol>
Now let’s do the switch:</p>

Find a new home for all of your DNS records. I used AWS’s Route 53</a> as my name server because I am very comfortable with it. Your domain registrar might provide name service; all the major cloud providers also do.</li>
Duplicate all of the DNS you’ve set up in Cloudflare over in your new DNS provider. Your goal is to avoid downtime when you cut over from Cloudflare to your new nameservers.</li>
Now set up a delivery service</em> in Fastly. This is a backend – the place the data comes from – plus a domain that is the face of the service – the hostname people type into their browsers. You’re setting up a mapping from domain to data source. For me, the back end is the AWS S3 bucket that holds my blog assets, and the domain name is what you see in your browser right now.</li>
Make the service active. Fastly now gives you a test domain name, like blog.ceejbot.com.global.prod.fastly.net</code>, to verify that your content is available as you expect.</li>
</ol>
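Before cutting over, it helps to check that the new zone really does match the old one. The sketch below is one way to do that, assuming you have exported both zones into plain lists of records by whatever means your providers offer. The `Record` type and the `example.com` names in the usage note are hypothetical.

```python
from typing import NamedTuple

# Record types whose value is a hostname, so case and a trailing dot don't matter.
HOSTNAME_TYPES = {"CNAME", "MX", "NS"}


class Record(NamedTuple):
    name: str
    type: str
    value: str


def normalize(record: Record) -> Record:
    """Canonicalize a record so that equivalent spellings compare equal."""
    rtype = record.type.upper()
    value = record.value
    if rtype in HOSTNAME_TYPES:
        value = value.lower().rstrip(".")
    return Record(record.name.lower().rstrip("."), rtype, value)


def compare_zones(old: list[Record], new: list[Record]) -> dict[str, list[Record]]:
    """List records present in only one of the two zones."""
    old_set = {normalize(r) for r in old}
    new_set = {normalize(r) for r in new}
    return {
        "missing_in_new": sorted(old_set - new_set),
        "extra_in_new": sorted(new_set - old_set),
    }
```

Anything listed under `missing_in_new` is a record that would stop resolving after the cutover. NS and SOA records are expected to differ, because each provider assigns its own name servers, so leave those out of the lists you compare.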
Now the slowest part: do something about your TLS certs. Because I am old-fashioned and I haven’t automated all this yet, I buy certs from my name registrar. You can also use AWS ACM, which automates things pretty well. Fastly will help you set up Let’s Encrypt</a>, which is probably the best option for most people.</p>
Once Fastly is aware of your cert material somehow, you are ready to cut over. You can do this in two phases. The first phase is a double-CDN phase:</p>

Update your domain in Cloudflare to point to the Fastly TLS domain they gave you when you set up TLS.</li>
Turn off proxying in Cloudflare. Make the orange cloud gray.</li>
</ol>
Now your content should be served by Fastly instead of Cloudflare. You should see the headers change to something like this:</p>
$ http HEAD https://blog.ceejbot.com
HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 518
Connection: keep-alive
Content-Length: 7149
Content-Type: text/html
Date: Sat, 27 Aug 2022 23:22:43 GMT
ETag: "464e0930f8616e07530366dfa7ba0567"
Last-Modified: Sat, 23 Jul 2022 20:01:40 GMT
Server: AmazonS3
Via: 1.1 varnish
X-Cache: HIT
X-Cache-Hits: 1
X-Served-By: cache-pao17472-PAO
X-Timer: S1661642564.606369,VS0,VE35
x-amz-id-2: E2eXc0YfBq4rX2rGhwOWZMbU26NYxzGcaAzlQ7+E/zHhcp19RIpct8WwFIaDQEy6TWuhluNf1ng=
x-amz-meta-md5chksum: 464e0930f8616e07530366dfa7ba0567
x-amz-request-id: 9V54SNWKBG95XYY3</code></pre>
Varnish is serving my content! It’s working! Now you can safely take the last step: switch your domain’s name servers over to something other than Cloudflare. It might take a day or so for the global cache of caches that is DNS to update itself. If all went well, you switched without downtime!</p>
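If you want to script that check rather than eyeball it, here is a small sketch. It is a heuristic based only on the headers in the response above (`Via` and `X-Served-By`), not an official way to detect Fastly, and the function name is made up for illustration.

```python
def served_by_fastly(headers: dict[str, str]) -> bool:
    """Heuristic: True if the headers resemble the Fastly response shown above."""
    # Header names are case-insensitive, so compare them in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    via_varnish = "varnish" in lowered.get("via", "").lower()
    has_cache_node = lowered.get("x-served-by", "").startswith("cache-")
    return via_varnish and has_cache_node
```

Feed it the headers from any HTTP client you like. If it returns `False` after you have grayed out the orange cloud, the old proxy is probably still in the path, or a DNS cache has not caught up yet.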
Automation</h2>
I did not use the websites to do this: I used Terraform</a> because I automate all of my personal infrastructure. It’s good practice, I tell myself. To use terraform you need to make an API key on the Fastly dashboard. Save it in your favorite password manager, then export it in the environment variable FASTLY_API_KEY</code>.</p>
Set up the official terraform provider. Here’s my providers.tf</code> file:</p>
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
    fastly = {
      source  = "fastly/fastly"
      version = ">= 2.2.1"
    }
  }
}</code></pre>
Here’s the important part of my blog service in Terraform:</p>
resource "fastly_service_vcl" "blog" {
  name     = "blog.ceejbot.com"
  activate = true

  # The public hostname this service answers for.
  domain {
    name    = "blog.ceejbot.com"
    comment = "the blog"
  }

  # The origin: the S3 static-website endpoint, reached over plain HTTP.
  backend {
    address = "blog.ceejbot.com.s3-website-us-west-2.amazonaws.com"
    name    = "the s3 bucket"
    port    = 80
    shield  = "pdx-or-us"
  }
}</code></pre>
This terraform fragment sets up DNS so Fastly handles requests to my content:</p>
# Hosted zone for the whole domain.
resource "aws_route53_zone" "ceejbot-com" {
  name = "ceejbot.com"
}

# Point the blog hostname at Fastly.
resource "aws_route53_record" "blog" {
  zone_id = aws_route53_zone.ceejbot-com.zone_id
  name    = "blog.ceejbot.com"
  type    = "CNAME"
  records = [
    "n.sni.global.fastly.net",
  ]
  ttl = 3600
}</code></pre>
This isn’t all of the terraform in my setup. I also have a policy on the S3 bucket restricting access to it to Fastly’s public IP list</a>. That’s a reasonable practice to prevent an accidental gigantic AWS egress cost.</p>
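My actual policy isn't reproduced here, but a policy of that shape can be sketched as follows. The bucket name and the CIDR ranges are placeholders (the ranges are reserved documentation addresses, not Fastly's). The real ranges would come from Fastly's published list.

```python
import json


def cdn_only_policy(bucket: str, cidrs: list[str]) -> dict:
    """Build an S3 bucket policy allowing object reads only from the given CIDRs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadFromCdnRanges",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Requests from any other source address match no Allow statement.
                "Condition": {"IpAddress": {"aws:SourceIp": cidrs}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values: substitute your bucket and the CDN's real ranges.
    policy = cdn_only_policy("example-bucket", ["192.0.2.0/24", "2001:db8::/32"])
    print(json.dumps(policy, indent=2))
```

The resulting JSON is what you would attach to the bucket, for instance through an `aws_s3_bucket_policy` resource in Terraform. The published ranges change over time, so regenerate the policy occasionally.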
You can do a lot more with VCL and Varnish if you feel inclined. For a while through the mid teens, all of NPM’s registry traffic was proxied through Fastly, with a carefully maintained custom varnish file routing things as much as possible at the edge. Most of us won’t need that power, but it’s available if you do.</p>


Reduce Friction
Unknown — Sat, 23 Jul 2022 12:54:39 +0000
The topic of reducing friction exhausts me: Do people still need to be persuaded to help their developers go faster? Really? In this, the year 2022? But yes, in this, the year 2022, many teams require persuasion on this topic. Or rather, their leaders require persuasion that they have to do more than give lip service to this principle, and that they must invest resources in making it so, and that those resources will not be “wasted” resources, not even for that</em> person, you know the one, the official VP of Feature Factory.</p>
Some leaders are not worried about wasting time, but are instead worried that devoting brains to this work will slow teams down</em>. They admit that current processes are full of friction, but claim that they have to finish whatever they’re in the middle of before they should try to fix things. They think that reducing friction is a distraction from the real</em> work. This approach is short-sighted. The best time to reduce friction for your team was the moment it came into being, and the second best time is now.</p>
I’m going to cover three topics in this post. First, I’ll define what we mean by “developer friction”. Then I’ll make the case about why reducing friction is beneficial to engineering organizations, including benefits in areas I didn’t expect. And then I’ll go into concrete suggestions about how to do it, and the mindset that you need to bring to thinking about it. As is true with many other posts in this blog series, its audience is people who are technical leaders in their organization, but I hope anybody who wants to help their engineering org do better work can get something out of this.</p>
Defining our terms</h2>
Let’s start by defining “process”. Process is the way you habitually do things</em>. Do not confuse process with ceremony or formality, or any other term you’d like to use to describe overhead added to the core of the thing you want to get done. You always have process.</em> You might not have thoughtfully-designed, intentional process.</p>
“Ceremony” is a thing you do every time, ritualistically, usually involving other people. Regular meetings are a kind of ceremony. “Formality” refers to how prescribed and enforced a process is. When people react to “process” as a bad thing, they’re usually thinking of processes with heavy formality or more ceremony than they’re worth.</p>
An example of a team process: “We prefer to have code PRs reviewed before we land them in main. It’s okay if docs or other non-functional changes don’t get reviewed and go directly into main.”</p>
Adding ceremony: “All changes need to go through PRs, though we don’t require review.”</p>
Adding more ceremony: “All changes must go through PRs with review, but we are okay if reviews are a rubber stamp.”</p>
Adding formality: “We require that all PRs be reviewed & all CI tests pass before they can land in main, and we enforce this with settings in our source code repo that only administrators can change.”</p>
Here’s a non-tech example of ceremony that might help you recognize it: pointing and calling</a>. This is a ceremony that helps operators of dangerous equipment (most often trains) confirm to each other what the status of important indicators is. Station guards will point at an indicator showing which side of the train to open the doors on, and call out as they do so, making sure the train conductor knows which set of doors to open. Adding a ceremony to the process helps the operators avoid opening the wrong set of doors. Another example of this would be lockout-tagout</a>. This formal ceremony ensures that people know when dangerous equipment is deactivated and can be worked on safely.</p>
Let’s talk about “friction”, the main thing this post is worried about. Friction</a> is increased in a process in each of the examples above. “Friction” is a useful metaphor here because each of those examples opposes motion</em>: they demand more energy be invested in moving the project than would be required if they weren’t there. This might be a good idea! Lockout-tagout makes equipment safer to maintain. The lowest possible friction version of the PR example above is “we don’t care if code gets reviewed; merge right into that production branch.” You can see why adding friction in requiring PRs might be good for that team.</p>
Adding friction is just fine when it buys you something worthwhile.</em></p>
Teams with high levels of trust don’t need more than that first version of the PR process. Teams that don’t trust each other–or are perhaps required not to trust each other because of mandated security processes–need something more like the fully-formal version. A team that needs that fully-formal version will move more slowly than the first team. Is this worth the cost? It depends on the situation! Your goal is to identify your team’s work habits and work environment and identify things that are slowing everybody down without buying you something worthwhile</em>.</p>
Sometimes process is… well, ludicrous and obviously causing harm. This Twitter thread is full of pure, wasteful friction. Merely reading it raises my stress levels.</p>

Let’s share tech stack horror stories: what’s the worst workflow or most absurd limitation you’ve hit with a codebase?</p>
</blockquote>

I’ll start: while working as a subcontractor, I wasn’t able to submit code directly for review. I had to attach the updated files to an email. 🥲</p>
</blockquote>

What’s yours?</p>
</blockquote>

— Jason Lengstorf (@jlengstorf) July 21, 2022</p>
</blockquote>
Process isn’t the only source of optional friction, and it might not be the most painful source. Instead, the work environment is often the worst source. The tools. The platform. CI workflows. Automation or, more likely, the absence</em> of automation. Things that break and require human intervention. Buggy tools. Slow tools. Things people need to do often that are flaky. Builds that take forever and slow down develop-test loops. Continuous integration testing that takes a long time to run and slows down landing all work. Slow deploy processes that make the cost of pushing changes live high, and therefore make pushing changes dangerous.</p>
The other term we need to define is “toil”. The English word means “labor that tires you out”. In the context of tech world jargon, we use it to mean work that’s draining or time-consuming that doesn’t seem to be related to the core of what we need to get done. Repeated work. Predictable routine work. A process that is predictable and time-consuming but has to be done by hand is toil</em>. Resolving Dependabot PRs to your repos is toil</em>: it feels like work but accomplishes nothing worthwhile.</p>
You shouldn’t tolerate either toil or tools misery. They are entirely avoidable, and they’re killing your team’s velocity and making everybody unhappy. Take stock of problems in this category, prioritize them, and eliminate them.</p>
Making the case</h2>
You might think it would be easy to point to these sources of slow-down and say, “let’s fix things”. In practice, you might get pushback. Why? What can we, as technical leaders, do about the resistance to making things better?</p>
First we must acknowledge that changing any system is difficult: systems are self-reinforcing for many reasons. People within the system see the cost</em> of change clearly, but they often don’t have good ways to measure the rewards</em> of change. Also (and let’s be honest here) all of us have lived through having change promoted to us as unalloyed good, then seen it turn out to be not so great. Or actively awful. People proposing change have a higher bar to jump over than people who want the status quo. So if you want change to happen, you have to invest energy yourself. You’ll need to make the case for action.</p>
Why hasn’t anyone else made the case? Why is your team stuck here? Good questions! Remember that the people next to you in this situation probably hate the friction just as much as you do. If they could stop it, they would. Once again, we have to go to the system they’re in and look at what it reinforces. You, as an analyst of that system, have an easier time popping out of it and changing it.</p>
Let’s look at some reasons why people around you might resist the push to make things go faster.</p>
It didn’t happen overnight</h3>
The team might be unaware of how bad the problem truly is. They might not have noticed it was happening, because it probably didn’t get bad all at once; the slowdowns and the trouble got worse slowly over time.</p>
To show how bad it is and break people out of denial, you might go to the data. How costly is the friction? Measure it! Count the number of times tool X</em> explodes and the team wastes a day on cleanup. Graph how much time people spend waiting for slow builds. The data will help you prioritize, so it is not a waste. (I think gathering metrics on internal tools is a good habit for teams even when everybody’s happy.)</p>
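Gathering that data does not need to be fancy. Here is a minimal sketch, assuming you can collect samples of the form (tool name, minutes spent waiting) from CI logs or even a shared spreadsheet. The function name and the sample values in the usage note are invented for illustration.

```python
from collections import defaultdict
from statistics import median


def summarize_waits(samples: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group (tool, minutes_waited) samples by tool and summarize each group."""
    by_tool: dict[str, list[float]] = defaultdict(list)
    for tool, minutes in samples:
        by_tool[tool].append(minutes)
    return {
        tool: {
            "count": len(waits),
            "total_hours": round(sum(waits) / 60, 2),
            "median_minutes": median(waits),
        }
        for tool, waits in by_tool.items()
    }
```

Calling `summarize_waits([("build", 30.0), ("build", 90.0), ("deploy", 15.0)])` reports two build waits totalling 2.0 hours with a median of 60 minutes, and one deploy wait of 0.25 hours. Sorting the result by `total_hours` gives you a first-pass priority list.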
Ownership</h3>
The resistance to change might come from a far more human and emotional place. People might be attached to the things they built in the past, and reluctant to retire them. Don’t be a jerk about the software past versions of the team wrote. People do the best they can given the circumstances they’re in. Solutions that solved the problems of the past might no longer be good at solving the problems of the present. Honor the work done earlier, and let people feel good about it even as you’re coaxing them into replacing it. If you can, let them own the work</em> of making their thing better. If that’s not possible, at least seek out their feedback and ask them what they’d do differently this time around. They probably have good ideas.</p>
Sometimes people will block whatever work happens. They might want to retain control. They might be unable to admit they were wrong about something. The worst case I’ve seen was somebody who simply resented all authority telling them what to do about anything. Toxic orgs probably feature several people like that. Do I have to tell you what to do here? You don’t want to do it, because you’re a human being with empathy, but sometimes you have to fire people.</p>
Stress</h3>
Organizations with a lot of friction might have people stressed by the work of pushing things forward despite the friction. Your most dedicated and motivated colleagues might be working the hardest to do this, and suffering the worst stress as a result. Stressed people can’t imagine adding to their workload by revamping existing systems that work, however poorly. They will resist change to protect themselves from their burdens getting worse.</p>
This is an own-goal on the part of the organization. Leaders can prevent this, and indeed must. Stressed people don’t do their best work. Full stop.</p>
Stressed people need to have their immediate needs honored and work shifted away from them. You must not listen to their opinions about what can and cannot happen until you’ve fixed their immediate emergency. Indeed, removing friction might give them the space to imagine a better world.</p>
Don’t ask them to do the work of fixing their desperate situation. Fix it for them. This one’s on management, and maybe on you, o fellow technical leader.</p>
Learned helplessness</h3>
The most depressing resistance to change comes from people who say that this is how bad it always is. They can’t imagine things being better.</p>
Anecdote time! I once worked for a moderately successful but not quite successful enough startup that made a hardware thingie you might even have heard of. Eventually it was acquired by ConHugeCo Software, Inc, a very very very large company indeed that you’ve definitely heard of. The new corporate owners wanted their newly-acquired software team to work on project Foobar, already in motion. Foobar had a lot of existing process and tooling and a team that was already pushing it forward. They were behind. They were engaged in weird political machinations to create excuses, they were so behind. Surely this acquihired team could help!</p>
Um.</p>
Eventually I joined project Foobar, and I learned why it was behind. Getting a single commit into the source repo for project Foobar took at least half a day and sometimes an entire day. You had to get into line to check in. When you were head of the line, you had to resolve any merge conflicts that were caused by the people who merged in since you got into the line. (And no, this was not</em> git.) You then had to build the full thing, and that was slow. Hours slow. Then you had to test. Then you could merge. Heaven help you if you broke the build: there were people who would get mad at you about that and penalties for it were discussed.</p>
“Why,” I asked somebody, “do we not have a build team making this faster and better?”</p>
The answer stayed with me. It was: “Nobody wants to be on a build team. They get laid off when their work is finished.”</p>
Laid off. Their work. Finished. Uh. What?</p>
The culture gap was epic and unbridgeable. The project turned out to be a famous disaster. Are you surprised? No? None of us at $acquiredCompany were surprised, either. The acquiring team could not imagine healthier processes. The cudgel was their only tool. They did not fix anything because that’s the way things were.</p>
This is learned helplessness. Reject it. Things can be better than that. It is not only possible but normal</em> for things to be better. I know that. You know that. Stand up for it.</p>
If you can’t, leave.</p>
The positive argument</h3>
Let’s make the case with more positive arguments. What will you get by relentlessly reducing developer friction? The obvious benefit: the whole team will go faster. I have to call this out explicitly, because a lot of the pushback to the idea of reducing friction comes from not thinking about what this means.</p>
Everybody. Goes. Faster.</p>
Reducing the amount of time it takes to do something by a couple orders of magnitude can have radical effects not just in kind but in category. When it took many minutes to download a single MP3 file, nobody was streaming movies. Now that gigabit fiber is an option for many homes, we’re streaming high-definition movies on a whim. Things you couldn’t imagine happening before become normal. You can probably think of more examples like this.</p>
Here’s a modern example I’ve lived a couple of times now:</p>
Deploys become fast: the cost of making changes is now low. 

The cost of making changes is low: people become less fearful of making changes. 

Less fear: changes get smaller and more frequent. 

Small, frequent changes: less dangerous inherently, so failures happen less often. 

Failures happen less often: the team becomes more confident.

A confident team experiments and pushes themselves into trying new things. 

Everything gets better.</p>
This is a virtuous cycle. This particular virtuous cycle can be promoted in lots of ways–great CI for instance–but hey, even CI benefits from running fast. And frequently. And easily from a developer’s laptop and not just a remote process if you can wrangle that one. A barrier to doing something is a kind of friction too!</p>
Friction is frustrating</em>. It generates stress. Nobody enjoys slogging through a ceremony they can’t see the benefits of. Nobody enjoys watching a deploy fail again</em> in the same way as the previous five times this week. Friction without payoff makes people unhappy. To my mind, this is reason enough for fixing it. Content people who are comfortable and talking regularly with their colleagues do great work; unhappy teams spend their time fretting about their unhappiness. The world is stressful. Don’t add to it. This is ethically good as well as pragmatic for whatever your shared venture is.</p>
Let’s make a more banal, money-based argument next.</p>
Salary is, for most companies, the single biggest cost they have. Stop wasting that money! Why are you spending money making your programmers do things by hand that could be done by a small shell script? This is overall a complex topic, and a lot of things factor into your decision to build, buy, or do nothing. Here, we’re most likely talking about build OR buy vs doing nothing at all. A fast calculation of salary hours vs payoff is useful for deciding when to act as well as when not</em> to act. Make a rough estimate of how much time your team wastes waiting for builds (or fixing something, pushing a repeated process through by hand, etc.) over the entire year</em>, then compare that to what you’d invest in a single push to make it faster.</p>
Once again, measurements help to inform your decisions. If you don’t have data, do something lightweight to get it.</p>
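Here is a rough sketch of that salary-hours arithmetic in Python. Every name and number in it is an assumption made up for illustration (team size, minutes lost, hourly cost, how much of the waste a fix removes); plug in your own estimates.</p>

```python
from dataclasses import dataclass


@dataclass
class FrictionEstimate:
    """Rough yearly cost of one recurring source of friction (all inputs are guesses)."""
    engineers: int                # how many people are affected
    minutes_lost_per_day: float   # per engineer, waiting or doing the thing by hand
    working_days_per_year: int
    loaded_hourly_cost: float     # salary plus overhead per engineer-hour

    def hours_per_year(self) -> float:
        return self.engineers * self.minutes_lost_per_day / 60 * self.working_days_per_year

    def cost_per_year(self) -> float:
        return self.hours_per_year() * self.loaded_hourly_cost


def payback_weeks(estimate: FrictionEstimate, fix_hours: float, fraction_removed: float) -> float:
    """Weeks until a one-time fix of `fix_hours` has paid for itself."""
    saved_per_week = estimate.cost_per_year() * fraction_removed / 52
    if saved_per_week <= 0:
        raise ValueError("the fix must remove some of the cost")
    return fix_hours * estimate.loaded_hourly_cost / saved_per_week


if __name__ == "__main__":
    # Hypothetical team: 10 engineers each losing 30 minutes a day to slow builds.
    slow_builds = FrictionEstimate(10, 30, 230, 100.0)
    print(f"hours lost per year: {slow_builds.hours_per_year():.0f}")
    print(f"cost per year: {slow_builds.cost_per_year():.0f}")
    # Assume one engineer-month (160 hours) of work removes 80% of the waiting.
    print(f"payback in weeks: {payback_weeks(slow_builds, 160, 0.8):.1f}")
```

The same function tells you when not</em> to act: if the payback period comes out longer than the tool is likely to live, leave it alone.</p>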
Things to try</h2>
You are convinced! You have convinced others! You are able to act to reduce your team’s friction! How do you do it?</p>
Start by asking your team what is slowing them down. They will straight-up tell you what’s wrong. Listen to reports of irritation; if the irritation rises to the level of frustration pay special attention. You might not take your team’s proposed solutions</em> at face value. Here your team is like any software user, who will tell you all about the solution they’ve imagined, not the best solution you might provide. Listen to what people are trying to do and why they’re being prevented. Pay attention to the reality of their stories. Question everybody’s assumptions about the way things have to be, including your own.</p>
Imagine what you would do in the ideal case, if you were designing the thing from scratch today. Take a step toward that ideal from where you are now. This is</em> possible.</p>
If you’re using bad software, stop.</h3>
Is your system configuration software driving you nuts? Switch to something else. (It will drive you nuts too, but perhaps less nuts.)</p>
Is X</em> famous SAAS thing that was super-cheap to buy driving your team nuts? (I’m looking at you, ubiquitous but relentlessly mediocre famous suite of tools.) Switch to something else.</p>
Has your team staged a revolt and started using something that isn’t the official choice? Listen to the pain of your team. Honor the pain. Switch to their choice. This isn’t about allowing chaos to reign, but about paying attention to existing signals, and paying especial</em> attention to strong signals.</p>
Make team software changes definitively and without half-measures. Commit to the change. Retire the old stuff. Plan a cutover if necessary so you don’t leave mess behind: do any required data migrations. Get feedback on the results. You shouldn’t make changes like this on a whim unless the cost of change is pretty low, but doing it on the worst offenders can be a huge morale boost.</p>
Treat internal tools as important software.</h3>
Work on internal tools is highly-leveraged: every one of your developers will write better software when their tools are good. It is worth</em> devoting senior engineering brains to them. It is worth devoting your</em> brain to them if there is nobody else. Your job, o fellow technical leader, is to make your team successful at building the widgets your organization wants to build. We must do the things nobody else can do.</p>
If using an off-the-shelf tool isn’t possible, then the tool you’re building is critical to your product. Treat it like that. Take the work seriously. Design it thoughtfully. Do your usual requirements analysis! Who’s using this tool? What are they trying to do? What are the performance and latency requirements? How should errors be handled or reported?</p>
Sweat the output of internal tools. Don’t bury important results of CI in a rubbish heap of uninteresting compiler output. Tufte’s design principles</a> apply here too.1</a></sup></p>
Doing this analysis on testing system output was super-fulfilling and helpful for the consumers of the test output.</p>
Common tool areas for you to think about:</p>

Chat and video conferencing software: is it reliable and high-quality?</li>
Bug/issue/task trackers: help or administrative burden?</li>
Source control software and tooling around it.</li>
Development environments: setup of any common software that your team needs to use. Examples would be specific versions of a language runtime or compiler needed to develop software.</li>
Internal tools that solve problems specific to your internal workflows.</li>
Build systems, both for the develop/test loop and for release processes.</li>
Deploying software. Is it fast? Is it reliable?</li>
The substrate upon which software gets deployed.</li>
Automated testing, particularly integration testing.</li>
</ul>
Distribute internal tools in compiled, packaged form. Don’t make people build/install them every time they need to use them. Have enough release process for these tools to ensure they work. Consult user</em> convenience, not developer convenience here. (The needs of the many, etc etc.)</p>
Treat your processes as worthy of thoughtful design.</h3>
I mentioned earlier that you always have process, because process is the way you usually do things. Think about your processes</em> and tweak them as needed to remove unnecessary friction from them.</p>
Water runs downhill. People always do the thing that’s easiest to do. Your goal is therefore to make the right thing to do the easiest thing to do. If people are regularly doing an end-run around a process to get work done (say, regularly asking for rubber-stamp PRs so they can be unblocked), you have a process that’s not earning back its energy cost. Fix it.</p>
What are the goals you want a habitual-way-of-doing-things in an area to achieve? What values do you want to express? Be clear about them. Be clear about the priorities of your values. You might need to honor high priorities and let lower priorities go unfulfilled.</p>
Make sure you have a feedback loop</em> somewhere helping you evaluate your new processes. Designing processes without feedback from the lived reality is possibly worse than not designing them, because you’ll have people held accountable for doing things that turn out to be bad ideas. Iterate. Improve. Nothing need be set in stone. It’s okay to change! It’s okay to look at where people are walking right now and pave those paths. It’s a decent starting point.</p>
Jump out of the system and examine its assumptions. One way of reframing the “I’m blocked by no PR reviewer here” problem is to notice that the person who’s blocked did the work alone and has no team or buddy who shares context about the work. If they paired, they would have an instant PR review, and a pretty high quality one.2</a></sup> If the work was planned work and review was blocked, perhaps time for reviews should be budgeted into your team’s plans.</p>
The best process is one that your team doesn’t even think of as a process because it’s been automated into invisibility.</p>
Automate.</h3>
Obliterate toil: automate it.</p>
Automate ruthlessly. This is where I have seen the most surprising</em> pushback. We’re programmers. Automating processes is what we do! People will flinch about this, afraid of time spent automating things that won’t pay off. Yes, we’ve all been there. So don’t do that.</em> Don’t automate things that are really one-offs. If there’s any chance you have to do the same thing more than five times3</a></sup>, automate it. If it’s complex and difficult for a human to do, automate it. If the blast radius of the explosion caused by a human doing it wrong is large, automate it. If the end results need to be the same every time, automate it.</p>
Infrastructure should be automated as far as you can push it.</p>
The upside of automation is that the software that does the work for you can be instrumented.</p>
Measure and observe.</h3>
This is a corollary of deciding to treat your tools as important software, but it’s worth calling out.</p>
Measure everything, and make the results of the measurement visible.</em> Measure how long a process takes. Measure how long PRs sit unreviewed. Measure how long each step of a deploy takes and how many deploys fail. Make all of this data easy to look at.</p>
Instrument your tools so you know how often people are using them, how long the runs take, and whether they succeed or fail. (Don’t instrument so heavy-handedly that you slow them down.)</p>
My favorite way to do this is to use Honeycomb</a> to trace everything, not just our production software. At a recent job we instrumented builds, deploys, and CI runs this way. The output of those runs prominently included links to Honeycomb’s visualizations of the traces. Every build and deploy report included a link to a view like this about how long it took:</p>
[Image: a Honeycomb trace view showing how long each step of a run took]</p>
Is this deep? No. Did it take a long time to do? Also no. Is it helpful? Definitely yes</em>. Imagine this, for everything. Imagine this, telling you about timings for every single internal tool you run, including the exit code returned and who ran it. Imagine how much better you can make every single tool your team uses with data like this.</p>
You might have another tool you like to use here, which is great! Please tell me about it on Twitter!</p>
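If you have no tracing backend yet, a tool-agnostic starting point might look like the sketch below. It wraps any command and records the duration, the exit code, and who ran it as one JSON line. The function names and the record fields are my own assumptions for illustration; this is not Honeycomb’s API, and a real setup would ship the record to whatever backend you use instead of printing it.</p>

```python
import getpass
import json
import subprocess
import sys
import time


def current_user() -> str:
    """Best-effort username; some containers have no passwd entry for the uid."""
    try:
        return getpass.getuser()
    except Exception:
        return "unknown"


def run_instrumented(command: list[str], sink=None) -> dict:
    """Run `command`, then emit one JSON line describing the run."""
    started = time.monotonic()
    completed = subprocess.run(command)
    record = {
        "command": command[0],
        "duration_seconds": round(time.monotonic() - started, 3),
        "exit_code": completed.returncode,
        "user": current_user(),
    }
    # One line per run keeps the overhead tiny; forward these wherever you collect data.
    print(json.dumps(record), file=sink or sys.stderr)
    return record


if __name__ == "__main__":
    # Usage: python instrument.py <tool> [args...]
    result = run_instrumented(sys.argv[1:] or [sys.executable, "--version"])
    sys.exit(result["exit_code"])
```

The wrapper passes the tool’s exit code through unchanged, so it can sit in front of an existing build or deploy script without anybody noticing except the person reading the data.</p>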
The deer, they are teal</h2>
Here’s what I’d like you to take away from this blog post.</p>

Friction is slowing down your team.</li>
The energy cost of overcoming friction needs to buy you something worthwhile, or it needs to be reduced.</li>
Investigate friction by talking to your team. Frustration is an important signal.</li>
Observability isn’t just for your production software: measure everything. Use data to inform your decisions.</li>
Order of magnitude changes in cost result in entirely new behaviors.</li>
Design your processes.</li>
Design your tools.</li>
Automate ruthlessly.</li>
Set up feedback loops so you learn what’s working and what’s not.</li>
</ul>
Most importantly, you can</em> fix it. Every little bit you fix gives you more energy back so you can fix the next thing. It will</em> be worth the investment.</p>

My thanks to Chris Dickinson</a> for the lockout-tagout and pointing-and-calling examples! Also my thanks to David Zink for editing my prose into a tighter form.</p>



Tufte’s design principles, recapped because they are so good:</p>

Above all else show the data.</li>
Maximize the data-ink ratio.</li>
Erase non-data-ink.</li>
Erase redundant data-ink.</li>
Revise and edit.</li>
</ol>
He’s talking about visual design, but this works for writing as well. ↩</a></p>
</li>

To repeat myself: PRs are best used to socialize work that’s already in a good state, not to find bugs in work somebody has already decided is finished. In other words, the useful review and tightening should happen before</em> the PR process, in some earlier phase. Pairing is good. Strong testing is good. Team discussion about ways of solving a problem is good, so the approach taken in a PR doesn’t need to be debated. The PR is to say to a wider audience: hey, this thing happened. An exception to my own approach: small, uncontroversial bug fixes are perfect for review in PRs. ↩</a></p>
</li>

I kinda want to say “three times” here instead of five, but you know, use your judgement. Do a little basic arithmetic on how long a thing takes and how often it’ll need to happen. Think how important getting it done consistently is. Prioritize to match. ↩</a></p>
</li>
</ol>
</section>


Against dogmatism
Unknown — Sun, 29 May 2022 13:34:29 -0700
Sometimes I think that my next conference talk ought to be nothing more than a live read-through of Tef’s blog post, “Repeat yourself, do more than one thing, and rewrite everything”</a>. This is a bad idea because Tef should do that, in some post-pandemic future when international travel is safe again and I can attend and buy him a drink. So I’m going to let Tef’s blog post push me off into my own direction instead, and attempt to add something useful to his wisdom bombs.</p>
Tef’s main point—worked through via examples of common advice given to programmers that is sometimes bad advice—is that all advice has a context</em>.</p>

When you hear a piece of advice, you need to understand the structure and environment in place that made it true, because they can just as often make it false. Things like “Don’t Repeat Yourself” are about making a tradeoff, usually one that’s good in the small or for beginners to copy at first, but hazardous to invoke without question on larger systems. – Tef</p>
</blockquote>
“Don’t repeat yourself” is the advice I rail against in a recent post in this series</a> because I saw how application of the advice damaged a particular code base. Any two code paths that looked at all similar were collapsed into single methods with long parameter lists, with flags and null checks to determine mid-flow which one of the five different entry points was in use this time. This made the code difficult to understand, debug, and change, because any change had to be verified as appropriate to make for many different entry points. Every bug we worked on required careful documentation of the many ways a specific code path could be invoked and careful mental simulation of execution for each.</p>
Was the maintenance cost worth whatever was saved by not duplicating some smaller sections of code? No. But probably it didn’t start out that way: it started out with somebody adding an entry point and not</em> copying code, because, well, don’t repeat yourself. And then somebody did that a few more times, each time adding a parameter while scrupulously not repeating code, until the programmers who understood each path through were all gone.</p>
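To make the shape of that problem concrete, here is a toy sketch in Python. It is not the code base I’m describing; the function names, the pricing rules, and the flags are all invented for illustration. The first function is what you get after a few rounds of not repeating yourself. The two after it repeat a line or two and can each be read in one breath.</p>

```python
TAX_RATE = 1.1            # hypothetical flat tax multiplier
WHOLESALE_DISCOUNT = 0.8  # hypothetical wholesale price multiplier


# Collapsed: one function serves every entry point and has to work out
# mid-flow, from flags and None checks, which caller it is serving.
def order_total(subtotal, customer=None, is_wholesale=False, coupon=None, skip_tax=False):
    total = subtotal
    if is_wholesale and customer is not None:
        total *= WHOLESALE_DISCOUNT
    if coupon is not None and not is_wholesale:
        total -= coupon
    if not skip_tax:
        total *= TAX_RATE
    return round(total, 2)


# Separate: each entry point owns a short path that can change on its own.
def retail_total(subtotal, coupon=0.0):
    return round((subtotal - coupon) * TAX_RATE, 2)


def wholesale_total(subtotal):
    return round(subtotal * WHOLESALE_DISCOUNT * TAX_RATE, 2)


if __name__ == "__main__":
    print(order_total(100, coupon=10), retail_total(100, coupon=10))
    print(order_total(100, customer="acme", is_wholesale=True), wholesale_total(100))
    # Forget to pass the customer and the collapsed version silently charges full price.
    print(order_total(100, is_wholesale=True))
```

Both versions agree when the collapsed one is called correctly. The cost shows up in the last call, where a missing argument quietly selects a different path, and in every future change that has to be checked against all the flag combinations.</p>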
I grind an axe here, of course. My point is that following the DRY advice dogmatically was a bad idea</em>. It’s the dogmatism that gets you.</p>
Dogmatism says: Don’t repeat yourself means don’t repeat any code, ever.</p>
Dogmatism says: This one particular project management methodology is the one true methodology! Every team at this company will do agile/scrums and always-pair-program/never-pair-program while fibonacci-pointing/playing-planning-poker.</p>
Dogmatism says: Object orientation is the only way people should structure code and therefore this programming language only has classes.</p>
Dogmatism says: All software must follow one of the named design patterns in the Gang of Four book/some other book and if you can’t name the pattern you’re doing something wrong.</p>
Put that way it sounds silly, right? So why do we keep doing it?</p>
Because we don’t like the reality that we must always do the work</em> to find the right solution to the specific problem in front of us. It’s much easier to fall back on a set of rules that we don’t have to think about or make hard decisions about. But this compromises our solutions.</p>
There’s a blog post in me about how making tradeoffs well requires understanding clearly the values you’re using to select among possibilities. Every value is a razor you can use to make decisions. Dogmatism is a value! It makes decisions for you.</p>
I suspect dogmatism is a value we often hold without self-reflection. That is, we can hold it as a value without being aware that it’s a value and that it is influencing our decision-making. I think it makes bad decisions. Dogmatism doesn’t let you weigh tradeoffs. And friend, it’s tradeoffs all the way down.</p>


One year for a one-line fix
Unknown — Tue, 24 May 2022 14:59:59 +0000
This blog post is harder to write than you might think, because it goes right into a number of people problems and the behavioral patterns of toxic organizations. I wish to preface all of this by noting that toxic organizations warp the behavior of everyone in them. People who might behave in healthy ways in healthy orgs find themselves behaving badly inside toxic systems. The only thing to do is fix the organization first. So I have sympathy for everybody involved in this story, both my unknown predecessors and the people who were right next to the problem the whole time. I am most interested in what I will do differently</em> next time.</p>
With that preface in mind, let’s tell a story.</p>
Let me tell you a war story.</h2>
Once upon a time there was a dot-net monolith, one that had been poorly maintained for a long time, hacked upon by rushed people who were evaluated only by how fast they pumped out the next feature the CEO wanted. This dot-net monolith was in a poor state and everybody around it knew that. It was expensive to run (its AWS costs were enormous), expensive to work on (making changes was time-consuming and dangerous), expensive to deploy (deploys often broke and took hours to resolve), difficult to test (the test suite was a mechanical turk service that ran overnight), and difficult to understand (re-entrant side-effect-heavy functions and in-memory caches on top of external Redis caches made for some fun race condition factories). The team around it knew it had trouble.</p>
Enter me, somebody who didn’t know a dot-net from a dot-product. I was brought in to scale out the system, which had a lot of new code written in JavaScript around that dot-net thing. I was fairly confident in my ability to make node jump through hoops. I knew that C# was Microsoft’s proprietary version of Oracle’s proprietary Java, so at least I could read the code. Mostly. At my request, I started out fixing bugs on the team that touched the most varied parts of the system, so I could get my hands dirty first, learn how things fit together in reality, and earn credibility with the overall team before I had to start making changes.</p>
Two weeks into my new job, the entire system fell over on a regular weeknight, under regular load. And by “falling over”, I mean it became non-functional. All API endpoints began to fail to respond. The site was down. Nobody could purchase widgets and have them delivered.</p>
Why? Nobody could say. It looked like it was Redis. At least, the CPU on the Redis cache instance was hitting 100% and when it did, everything stopped.</p>
Now, like many of us, I was very familiar with Redis. I trusted Redis. It is often the most reliable piece of software in my stack. I’d pumped a lot of traffic through Redis at the world’s JavaScript registry, a lot more than this single-state retail outfit could possibly be sending through it. What was this system doing to Redis that was making it thrash so badly? Nobody knew. What were we putting into Redis? Nobody knew. How many objects were in it? Nobody knew. How big were they? Nobody knew.</p>
“Where are your metrics?” I asked. There was an expensive hosted Graphite service, but nobody was looking at it. There was an expensive APM product wired up to the monolith, but nobody knew how to interpret it. There were Cloudwatch graphs! Only infra had access to these or the ability to make dashboards.</p>
At this point I knew what my job needed to be first. I went on an observability tear over the next months, among other tears inspired by this outage.</p>
We got through that initial outage by upgrading to AWS’s largest Elasticache, which was ruinously expensive but seemed to hold up under the load. We then mitigated the problem around the edges by taming some problems with the website hitting endpoints more than it needed to, and at the core (most meaningfully) by splitting up the cache into several different cache instances. (Two very thoughtful engineers had already browbeaten their way into being given time to refactor the code enough to make this split happen, because they knew this was a problem area before the outage happened. The implications of this sentence are entirely intentional, and we’ll come back to them.)</p>
We limped through the weeks remaining until the day that was the big sales day for the industry, the one that was going to be the biggest day ever with $X of revenue, for some record-breaking value of X. The entire company prepped for months for this event, with marketing and incentives and ordering stock to be sold.</p>
Three hours after opening, the system went down. Adding more instances of the monolith brought the system down harder. In the end, we had 3 hours of downtime in the middle of the hottest business day of the year, the equivalent of Black Friday, and this downtime ruined the work of everybody at the company who’d prepared for that day. It was bad. Very bad. Company-harming bad.</p>
One year later, my colleague Chris and I identified the problem and fixed it with a one-liner.</p>
That’s a heck of a war story.</h2>
Right? The one-line mistake that nearly killed a company, and the one-line fix that saved it. Except, well, it’s more complicated than that.</p>
In the end, none of the observability instrumentation I added mattered.1</a></sup> It was the default Cloudwatch Redis graphs that identified the problem. When we stood up an idle cluster of the service using our new deploy system (deploy times down from 30 minutes minimum to less than 3 minutes tops)– ahem–</p>
Hold on, you rewrote the deploy system?</em></p>
Yeah. When the entire infra team was laid off we were finally able to fix probably the worst cause of daily development friction–</p>
Hold on, the entire infra team was laid off? And this let you fix things?</em></p>
Yes, as I was saying, we finally had access to everything and freedom to fix what had been obviously broken for a long time.</p>
Hold on.</em></p>
I know. There’s a lot to unpack here, and I’ve been struggling for some time to find a way to unpack it that remains kind to the people who were trapped in this toxic organization alongside me, doing bad things because that’s what the organization wanted them to do. For some of that</em> story, read “Dysfunction junction”</a> first.</p>
I’ll talk about those things in a minute.</p>
The punchline to the story.</h2>
Back to the bug. Redis. AWS’s largest Redis. CPU hitting 100%. That bug.</p>
As I was saying, testing our new deployment system was what made us look at this problem again with fresh eyes. We used the new deployment system to stand up an idle production cluster</em>, ready to be swapped in for the older prod cluster that used the old deploy system. This was something we’d done for every microservice in the system, so we had a lot of practice doing it, and at last we were doing the hard one, the dot-net one.</p>
The moment we brought the new, idle cluster into existence with terraform, we noticed the Redis instance’s CPU spike and cause trouble for the production system. That was surprising! The new prod cluster wasn’t live yet! We looked at the full set of Redis graphs and noticed an oddity. The count of new connections per minute was absurdly</em> high normally, and the idle cluster had just spiked it higher.</p>
First, no way should any process be generating new Redis connections except on restart. That graph should be sitting at zero. Second, idle clusters shouldn’t be creating any new load on Redis except at process start.</p>
“Huh,” we said. “Could pings from the load balancer be creating new connections? Because pings are the only traffic it’s taking. That would be fundamentally broken but it would explain this.”</p>
So we fired up a video chat, shared a screen with the code that injected Redis into the dot-net frammistans, and set about understanding what it was doing. We learned the word “lifestyle” and read the docs on the various kinds of lifestyles: transient, scoped, and singleton. Nearly all of the Redis connection managers were singleton lifestyle, which is for the lifespan of the application. Seems good! Then we noticed one line that didn’t look like the others, injecting connections for the general-use cache:</p>
container.Register<ICacheConnectionManager, RedisConnectionService>(Lifestyle.Scoped);</code></pre>
“Scoped” means to create an instance of the thingie once per request lifecycle</em>. Once per request.</p>
Every. Single. Endpoint. Invocation. Created. A. New. Redis. Connection. Pool.</p>
All requests, not just requests that needed to use Redis. Requests like the health check endpoint, the one that should be near-zero cost because load balancers hit it frequently, requests like that. This is why the Redis CPU graphs looked like some kind of exponential function on the “active thingies in the system” count, because it literally was. The system had been DoSing itself into downtime for years.</p>
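If lifestyles are new to you too, here is a toy model of the difference in Python. It is a sketch under loud assumptions: the container, the scope object, and the pool class are all invented stand-ins, not the dot-net library’s API, and the real container builds a whole object graph per request where this one resolves a single name. It only counts how many connection pools get opened.</p>

```python
class ConnectionPool:
    """Stand-in for a Redis connection pool; counts how many get opened."""
    opened = 0

    def __init__(self):
        ConnectionPool.opened += 1


class Container:
    """Toy dependency container supporting 'scoped' and 'singleton' lifestyles."""

    def __init__(self):
        self._registrations = {}
        self._singletons = {}

    def register(self, name, factory, lifestyle):
        if lifestyle not in ("scoped", "singleton"):
            raise ValueError(f"unknown lifestyle: {lifestyle}")
        self._registrations[name] = (factory, lifestyle)

    def begin_request(self):
        return RequestScope(self)


class RequestScope:
    """Lives for one request; scoped instances die with it."""

    def __init__(self, container):
        self._container = container
        self._scoped = {}

    def resolve(self, name):
        factory, lifestyle = self._container._registrations[name]
        # Singletons are cached on the container, scoped instances on the request.
        cache = self._container._singletons if lifestyle == "singleton" else self._scoped
        if name not in cache:
            cache[name] = factory()
        return cache[name]


def pools_opened(lifestyle, requests):
    """Simulate `requests` health checks and report how many pools were created."""
    ConnectionPool.opened = 0
    container = Container()
    container.register("cache", ConnectionPool, lifestyle)
    for _ in range(requests):
        container.begin_request().resolve("cache")
    return ConnectionPool.opened


if __name__ == "__main__":
    print("scoped:", pools_opened("scoped", 1000))
    print("singleton:", pools_opened("singleton", 1000))
```

With the scoped registration the pool count tracks the request count; with the singleton registration it stays at one no matter how many requests arrive. That is the entire bug, in miniature.</p>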
We changed that one line to give it a singleton lifestyle and deployed the change to our (new, shiny, one of many cattle) integration environment. We observed that the new connections graph began behaving as we expected, and everything kept working. So we deployed it to production.</p>
Really easy fix. It let us stop running the largest Elasticache AWS sells, collapse all the split-out caches into the new much more modest not-clustered cache, and made everything go faster. Scaling horizontally no longer caused the system to punch itself in the face. That plus the new fully-terraformed ALBs made dealing with big days completely routine, and engineering commenced a very quiet two years of rebuilding with an AWS bill that was a fraction of what it was before, and I mean holy heck we cut that bill down to something that was [Ceej’s editor has deleted a lot of ranting here].</p>
However. I remain unhappy about this fix.</p>
I should have spotted this a year before, during the original Redis-caused outages. If I had seen that graph– and I should have demanded to look at all those graphs– I would have known immediately that something was very wrong, because nothing should be creating new connections like that. But I didn’t. Why not?</p>
Why did it take a year?</h2>
Expertise, ownership, and trust. Each of these concepts is a two-edged sword and each cut me with its second edge.</p>
Expertise.</h3>
Expertise. I knew I did not have C# or dot-net expertise. I had to rely on the people who had it. I was also not familiar with the code</em> that did this work, especially at two weeks in. I had to trust the people who knew dot-net and knew the code to assure me that there were no obvious howling bugs in it.</p>
Where expertise is assumed but is not present, bad code goes unchecked. People get angry when you review their work and ask for changes, or even when you only ask questions about the work. Defensiveness can arise in low-trust environments, but it can also mask situations where people don’t have the expertise you need them to. Or situations where people have the expertise but are so pressured, stressed, and burned out that they’re not operating at full capacity.</p>
Here the second edge cut me because I assumed without pushing that the people with dot-net expertise had already investigated the obvious possibilities. But also! I lacked this expertise myself. When we finally hired people who were expert with dot-net and comfortable with it, they laughed at this bug, because it was familiar territory for them. They’d have looked for and found it immediately.</p>
Ownership.</h3>
When one human or a team owns something, I feel I need to let them own it and trust their expertise. Meddling in their work can destroy their self-confidence or make them feel undermined. A feeling of ownership is good! It means you feel responsibility for that thing, and know that the burden of maintaining it rests on you.</p>
The other edge of ownership is gatekeeping. The deploy system was obviously a block to all development by all teams. The team had a Slack channel where they negotiated who was going to merge which code for the single deploy window</em> available on four days a week, with no deploys allowed on Fridays. Deploys were flaky and could take up to three hours to resolve. A colleague with a technical leadership role was in fact working on a better deploy system, but the infra team manager instructed their team to ignore the work.2</a></sup></p>
The infra team also jealously guarded access to things they thought belonged to them, such as access to an Athena search setup for production logs. At one point one of them locked down commits to the main monolith’s repo, announcing to the team that they no longer got to merge into “my repo”. To be clear, this was a human being who’d been burned to an absolute crisp by overwork; the blame flows upward.</p>
Management can of course be the worst gatekeeper of all, and it was in this case. I mentioned briefly before that Redis had been identified as a problem area by some informed engineers, and they had to push hard to be allowed the time to work on it. They might have been more successful if supported</em> by management instead of being treated as if they were wasting time that would be better spent on cranking out this month’s pet feature for the CEO.</p>
From a distance, I can say with some confidence that gatekeeping was the worst block to diagnosing this Redis bug.</p>
Trust.</h3>
Trust. I said the word “trust” in each of the two preceding sections, because I had to extend trust to my colleagues. You earn trust by granting trust. People live up or down to your expectations of them, and I prefer to expect the best.</p>
You can see the downside of all of these. Where expertise is assumed but not present, bad things happen. Where ownership turns into gatekeeping, other people are blocked from fixing things even if they could help. When trust is not warranted, things get into bad states and stay that way.</p>

“Trust, but verify.” – a Russian proverb, popularized in English by Ronald Reagan</p>
</blockquote>
Then the layoffs happened.</h2>
The gatekeepers were all gone. The experts were also (mostly) gone. The ownership and responsibility were all on me and a much smaller but motivated team.</p>
When the ownership fell to me, I felt both responsibility and empowerment. I was no longer politely taking people at their word, because those people weren’t there any more. I was investigating and experimenting on my own, and ruthlessly testing all of my own hypotheses. I knew I didn’t have expertise, and even when I do</em> have expertise I have learned the hard way to double-check all my own work.</p>
I was also not bound by the past. I did not care if something had always been that way. I was okay with doing things differently. I don’t much trust myself, but I did trust the people working alongside me in that moment. And most especially, I trusted the work we did together, because we verified it together.</p>
The sad thing is that the ownership-turned-into-gatekeeping problem was the difficult one to surmount, the one that in retrospect I’m not sure I could have solved in any other manner than parting ways with the gatekeeping team. I am going to tentatively state a thesis: operations/infra teams as teams separate from engineering always turn into walled-off defensive gatekeepers. You cannot allow them to exist in healthy orgs. You must practice some variation on devops by embedding people with this expertise into project teams.</p>
Maybe there’s a way to do it if you frame their goal as developer experience</em> not as “operations” or making AWS go brrrrrrr. The goal has to be to keep people focused on their customers– the engineers building the project– and not on defense against their colleagues. The same goes for security teams: embed those experts where they can have sympathy for the problems their colleagues are trying to solve and improve their solutions early.</p>
But I digress. Expertise, ownership, and trust are a big part of this story, but they’re not everything.</p>
The context also mattered.</h2>
Years later I learned that one of the two engineers who’d started working on Redis before the outages had some suspicions that there was a lifecycle problem, but he was afraid to change code that had been that way the entire time he’d worked there. I had no such fears, because we had removed reasons to fear experimentation</em> by completely rewriting the infrastructure and deployment environment to make experimentation low-cost. We’d also invested in full tracing via Honeycomb</a>. We knew what was going on with the system in ways that we didn’t in that first outage.</p>
The highly-contended integration environment had become many environments</a>. (The link goes into detail about that project and</em> tells the story of this bug from another perspective.) Deploys had been made fast and reliable. Access to information and metrics was available to everybody. Full access to AWS was available to all engineers. If a change broke production, the fix was three minutes away.</p>
We weren’t scared to make changes any more.</p>
I also want to call out that the team had progressed past trusting the word of people in the past about how things worked and whether or not things were feasible. We had the space and the support to read code to see if it genuinely behaved as described or if it worked differently, and experiment with changing things. Management was no longer blocking people from investigating or fixing technical debt.</p>
What can we learn from this story?</h2>
This is what I’d like you and my future self to take away from this war story:</p>

Assume nothing. The people around you might be wrong! You might be wrong too!</li>
Test all hypotheses. Each test gives you more information.</li>
Eliminate gatekeeping. No team can afford to cope with the damage done by people who want to keep information or access away from their colleagues.</li>
Observability, even humble standard metrics, is invaluable.</li>
You (o fellow technical leader) own everything. You must always feel the responsibility of that ownership. You can share it, but it’s always partly yours.</li>
Trust but verify. Especially team superstitions.</li>
Ruthlessly eliminating developer friction pays unexpected dividends.</li>
</ul>
Also, it was totally not Redis’s fault.</p>



For this specific problem. It was and remains invaluable for other reasons. ↩</a></p>
</li>

Most toxic behavior is driven by toxic organizations, but some toxic behavior is individual and creates</em> that toxic organization. ↩</a></p>
</li>
</ol>
</section>


Legacy you hate
Unknown — Tue, 24 May 2022 13:17:00 +0000
What should you do with a pile of legacy code you hate?</p>
This was the central challenge of my last job. I was partially successful at solving it, and unsuccessful in ways that I want to share with you so you can do better than I did.</p>
Let’s start by clarifying the problem.</p>
“Hate” is a spongy word and we can be more descriptive about why you dislike the code base you’re presented with. Maybe it’s bad code: tangled, hard to maintain, failure-prone. Maybe it was written long ago by people who’ve long since left the company (burned out by having to maintain it) and nobody left understands it. Maybe only a few people are able to change how it behaves, and it takes those people far longer than anybody likes. Maybe it doesn’t scale in the ways you need it to. Maybe it’s also</em> written in a language you don’t like or don’t know, or maybe it’s written on top of a framework you don’t like or don’t know.1</a></sup></p>
Whatever the reason, you’re very done with this pile of code and so is everybody around you. It needs to be replaced and you all know it. And yet</em>, it’s the money engine for your company.</p>
What to do?</p>
Don’t rewrite immediately.</h2>
The temptation will be to rewrite the whole thing. You already know that you shouldn’t.</p>
We all know that second system syndrome</a> is a thing, and we all know that big-bang rewrites are notoriously difficult to pull off. As Gall famously said:</p>

“A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.” – John Gall</p>
</blockquote>
The other reality is that companies rarely have the time and resources to devote to rewrites, even if they have run themselves deep into tech debt. They never want to pay down that debt, and they never like the idea of giving up new feature work for a rewrite project that doesn’t move them forward.</p>
And yet, this code needs to end up being rewritten somehow</em>, because it’s a disaster that is costing the organization dearly, and perhaps even driving it to the brink of failure.2</a></sup> This is true at the same time that you can’t dive into a big bang rewrite.</p>
Reframe the problem and shift the goal: you want to be able to rewrite in useful pieces</em>. Small pieces allow you to make incremental progress that can be seen to be “delivering business value” or at least measurable progress toward the end goal. Small pieces are also</em> small systems on their own, each of which is simple enough to be kept working.</p>
Now you’ve shifted your task to identifying useful pieces and rewriting those, and that task is more achievable. How do you identify useful pieces to rewrite? Well, this is both bad news (because you hate the code) and good news (because understanding complex systems is fun): you need to spend a lot of time with the system you have. You need to invest in it.</p>
You need to understand it, even if you hate it.</h2>
It’s your money engine. It has to keep working.</p>
You cannot hope to replace what you do not understand.</p>
Understanding it deeply will allow you to find the cracks you can hammer a wedge into.3</a></sup></p>
Understanding it deeply will allow you to know when you’ve finished replacing it.</p>
So how do you understand it?</p>
This isn’t the same as the theory of the program</a>, which is about how the code is constructed. You do need to know what problems the code solves for the system around it, what “affair of the world” it exists to model. More important for this task is understanding the details of its current behavior</em> as part of a larger working system. This entire system is not just the code</em> in the thing you want to replace. It is all the systems around that code as well: the web site, the analytics pipeline downstream from it, the internal admin workflows, the profusion of microservices that we all persist in writing around everything.</p>
This whole system is an evolved, complex working system. It probably doesn’t have detailed specifications. It probably also does not have comprehensive tests. (If it did, you might not be in this mess.)</p>
Write tests.</h2>
If the system does not have comprehensive automated tests, invest in writing those tests before doing anything else. It’s hard to talk people into writing specs for features that have existed unspecified for years, but everybody understands why tests are useful. (If somebody doesn’t, then there are many excellent books you can drop on their head to enlighten them.)</p>
Not kicking off a testing project the moment I was in charge of this problem is my number one regret from my last job. We eventually did it and it was so valuable I was angry with myself. There were organizational reasons why it was difficult for the team to commence that work earlier, including work that was genuinely urgent, but we could have started writing tests sooner! My advice to you would be to prioritize testing higher than I did, and defer what work you can until afterward.</p>
Start by investing time in the test framework and tooling. Your goal is to make it easy for everyone on the team to write tests and to understand their results. People do what is easiest to do, so you must make the right thing easy. The importance of this work deserves a blog post all its own.4</a></sup> However, anything is better than nothing.</p>
Don’t negotiate on these tasks:</p>

Write integration tests. Test how the Hated Code™ calls out to everything around it. Test that the expectations of the code around the Hated Code™ are being met.</li>
Don’t accidentally pour glue over implementation details that should be hidden. If unit tests don’t exist at all, you might want some, but they’re not as important as integration tests that validate overall system behavior.</li>
Involve the whole organization in the test-writing effort. Prioritize this work alongside feature work and make on-going test-writing part of regular maintenance.</li>
Automate running the tests. Do not rely on humans doing anything by hand. Run them continuously against an integration environment, or in whatever context is sensible for your setup. The important thing is to have the tests run against every change intended to land in the production environment.</li>
</ul>
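<p>One way to start, sketched in Rust: record what the system does today as a table of observed inputs and outputs, and run that table against the boundary of the code. Everything in this sketch is hypothetical. <code>legacy_shipping_cents</code> stands in for a call into the Hated Code™ (in practice it would go over HTTP or a queue to a running instance), and the zero-items quirk is an invented example of behavior that downstream code might depend on.</p>

```rust
/// Hypothetical stand-in for a call into the legacy system.
fn legacy_shipping_cents(items: u32, express: bool) -> u32 {
    // Quirk: zero items still costs the base fee. Other systems may rely on it.
    let base = 500;
    let total = base + 125 * items;
    if express { total * 2 } else { total }
}

/// One input/output pair, recorded from the system as it behaves today.
struct Case {
    items: u32,
    express: bool,
    expected_cents: u32,
}

/// Behavior we have observed and want to keep stable while we change things.
const OBSERVED: &[Case] = &[
    Case { items: 0, express: false, expected_cents: 500 },
    Case { items: 1, express: false, expected_cents: 625 },
    Case { items: 4, express: true, expected_cents: 2000 },
];

/// Runs any implementation against the recorded cases and returns one
/// description per mismatch. An empty result means behavior is unchanged.
fn check_against_observed(implementation: impl Fn(u32, bool) -> u32) -> Vec<String> {
    OBSERVED
        .iter()
        .filter_map(|case| {
            let got = implementation(case.items, case.express);
            (got != case.expected_cents).then(|| {
                format!(
                    "items={} express={}: expected {}, got {}",
                    case.items, case.express, case.expected_cents, got
                )
            })
        })
        .collect()
}
```

<p>Because <code>check_against_observed</code> takes the implementation as a parameter, the same table can later be pointed at a rewrite to confirm that it matches the recorded behavior.</p>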
You’ll find bugs in the overall system while doing this. It’s a judgement call whether you should invest time in fixing them. Some bugs might be difficult to fix because of the problems that led you to want to replace the mess; don’t waste your time. Some bugs are load-bearing because the system will have grown around them, like a tree growing around a bicycle. You can cut the bike out, but at what cost to the tree? Fixing bugs that are easy to fix gives everybody dopamine cookies and shows people around the project that the investment in testing has started to pay off, so let yourself do some of that.</p>
If you and your team didn’t understand your system going into the testing effort, you will afterward. The tests will support any refactoring or replacement work by verifying that the entire system continues to work. They are the scaffolding around your new construction project.</p>
Identify and exploit wedge points.</h2>
Now you can start thinking about changing the system.</p>
Your goal here is to split up your monolithic code base by identifying good points where you can hammer in wedges and split off chunks.</p>
Where are you going to hammer in your wedge first? Have you identified a modular boundary you can exploit to split off a chunk of functionality for a rewrite? Look for clean lines of separation: data, access methods, business logic all must come out in one piece. The common approach is to put a proxy in front of the monolith-ish thing you want to start replacing and redirect traffic from it to your rewrite. One popular term for this is “the strangler fig pattern”</a>. I often call it “divide and conquer”.</p>
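<p>A minimal in-process sketch of that proxy idea, in Rust. The names here (<code>Backend</code>, <code>LegacyMonolith</code>, <code>Rewrite</code>, <code>Proxy</code>, and the <code>/invoices</code> route) are all made up for illustration; a real proxy would sit on the network in front of both systems, but the routing decision has the same shape.</p>

```rust
use std::collections::HashSet;

/// Anything that can answer a request for a path.
/// Both the old code and the new code sit behind this.
trait Backend {
    fn handle(&self, path: &str) -> String;
}

struct LegacyMonolith;
struct Rewrite;

impl Backend for LegacyMonolith {
    fn handle(&self, path: &str) -> String {
        format!("legacy:{path}")
    }
}

impl Backend for Rewrite {
    fn handle(&self, path: &str) -> String {
        format!("rewrite:{path}")
    }
}

/// The proxy in front of the monolith. Routes listed in `migrated` go to the
/// rewrite; everything else still goes to the legacy code.
struct Proxy<L: Backend, R: Backend> {
    legacy: L,
    rewrite: R,
    migrated: HashSet<&'static str>,
}

impl<L: Backend, R: Backend> Proxy<L, R> {
    fn handle(&self, path: &str) -> String {
        if self.migrated.contains(path) {
            self.rewrite.handle(path)
        } else {
            self.legacy.handle(path)
        }
    }
}

/// Builds a proxy with one hypothetical route already moved over.
fn example_proxy() -> Proxy<LegacyMonolith, Rewrite> {
    Proxy {
        legacy: LegacyMonolith,
        rewrite: Rewrite,
        migrated: HashSet::from(["/invoices"]),
    }
}
```

<p>Moving another chunk over means adding its route to <code>migrated</code>, and rolling back means removing it, so the system as a whole keeps working at every step.</p>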
The advantage of this approach is that it keeps the pressure on the system to remain working at all times, allowing you to pay full respects to Gall. The tests are your latch on this working state: they validate that your replacement is behaving properly in context. You might find yourself writing even more tests at this point to support the validation; this is fine!</p>
The disadvantage of this approach is that you need to have good split points, and you probably don’t.</em> Good division points indicate where good modularity already exists and if you had that you’d probably be less unhappy with the mess.</p>
Create split points if they don’t exist.</h2>
This is important: don’t rewrite anything yet.</p>
Don’t proceed until you can find a good location to drive that wedge in and split off a chunk. Don’t take half measures. Some of the worst tech debt I encountered recently was in functionality that was half implemented inside the Hated Code™ monolith and half outside. The implementation details were spewed out everywhere. Changing functionality was extra difficult because it needed to be changed in two places, and one of the places was a code base that was very hard to work within. Also, once we’d fixed the primary performance bottlenecks, the secondary ones were all in how the Hated Code™ treated these satellite services as databases that it owned. Important working data vital to the operation of the system was a mashup of data from other microservices plus the monolith.</p>
Don’t do this to yourself.</p>
No, really, modularity is important. Parnas’s 1972 paper, “On the Criteria To Be Used in Decomposing Systems into Modules”</a> points right at the important thing, which is that hiding information and implementation details allows you to change both. Modularity allows change.</p>
Premature modularity is a form of premature optimization, and it hurts, but I’ve more often seen no modularity at all. Gotta go fast and break things, right? Side effects everywhere, code that has been DRYed to disastrous levels, the details of specific data structures in one place used to make decisions somewhere else, extreme cleverness that relies on implementation details in distant locations in the system. Rushed people make short-term decisions, and their hacks pile up into tangles of code.</p>
Whatever the cause, you might have to start by refactoring internally to bring order and modularity to a ball of mud.</a>5</a></sup> Start hiding details behind interfaces.</p>
An aside: “Don’t Repeat Yourself” aka DRY has been misunderstood and misapplied to disaster so often I would like to stop saying it to newer programmers. Often much better advice is to repeat yourself to find patterns</a>.</p>
If you find yourself with a function or method that has an enormous parameter list to distinguish the six different ways it might be called, you have a case of DRY madness that has broken modularity. One technique that might help if you’re in this situation is to do the least DRY thing possible: refactor to expand each code flow into one large function for each, replacing each call out to an overused long-parameter list function with the same code, inline. Simplify as you write. Strive for branchless programming as an antidote!6</a></sup> The real patterns that support a better split-up of responsibility will emerge as you do this work.</p>
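<p>Here is a small Rust sketch of that expansion, using an invented notification function. The “before” function and its flags are hypothetical stand-ins for a much longer parameter list; the “after” functions are what you get by writing out each caller’s flow in full and deleting the branches that caller never takes.</p>

```rust
/// Before: one "shared" function whose flags select between unrelated flows.
fn render_notice(name: &str, is_admin: bool, is_trial: bool, urgent: bool) -> String {
    let mut text = if is_admin {
        format!("Admin {name}: audit log attached")
    } else if is_trial {
        format!("Hi {name}, your trial ends soon")
    } else {
        format!("Hi {name}, thanks for your order")
    };
    if urgent {
        text.push_str(" (action required)");
    }
    text
}

/// After: the order flow, written out with no branches.
fn order_confirmation(name: &str) -> String {
    format!("Hi {name}, thanks for your order")
}

/// After: the trial flow. It was the only caller that ever set `urgent`,
/// so the suffix lives here and nowhere else.
fn trial_ending_notice(name: &str) -> String {
    format!("Hi {name}, your trial ends soon (action required)")
}

/// After: the admin flow.
fn admin_audit_notice(name: &str) -> String {
    format!("Admin {name}: audit log attached")
}
```

<p>While both versions exist, a test can compare each new function against the old one called with that flow’s flags, which is a cheap way to confirm the expansion changed nothing.</p>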
Once again, your tests are going to have your back as you go. You’ll know if that flow stays working or not. You might find that your Hated Code™ is less hate-worthy after you’ve cleaned it up. Maybe you’re more in sympathy with it now? Or maybe not.</p>
Time to drive those wedges in with a sledgehammer.</h2>
Now you can strangler-fig/divide-and-conquer/split those rocks as you go. You’ll probably get the modularity boundaries closer to right than your predecessors, because you have a lot more information than they did: you have a far more developed system to study!</p>
If you’re tight on resources, you might choose to do nothing</em> about any specific modular chunk of code. Leave it where it is, and make incremental improvements opportunistically. If this segment is not performing well, or is doing the wrong thing, or is hard to maintain, or if the team is far more comfortable with working in some other language ecosystem, then replace it. Prioritize potential rewrites by how much you hate the current implementation; that is, how many ways they’re failing to do what good code does.</p>
Here’s where I remind you that modularity in your system does not require splitting its components into separate microservices. Microservice APIs are strong module boundaries; these API boundaries resist change unless you plan carefully. On the other hand, these boundaries do</em> resist attempts at clever end-runs around that modularity.7</a></sup> I like to bundle together data that is of roughly similar size and changes at similar rates or in a similar style. CRUD data that is infrequently destructively updated and all lives in the same kind of database might all belong together. Geographical data that all uses PostGIS belongs with other data like that. This is itself a gigantic topic, so I won’t go further other than to remind you that microservices have tradeoffs. The important goal is to leave a system that can be more easily rewritten</em> behind yourself.</p>
Plan to rewrite next time.</h2>
All code has a lifespan.</p>
Your designs make tradeoffs (always) that suit the context you’re working in:</p>

What language ecosystem is the current team comfortable using?</li>
Do you need to get this project done rapidly, so some shortcuts are okay?</li>
What performance characteristics are acceptable today?</li>
What task does this component have to perform today?</li>
</ul>
The context around</em> working code changes over time. The business context the code exists in is guaranteed to change. Product requirements change. The tools your team is happy with today might make the team unhappy three years from now. Other parts of the system will change around it.</p>
Make it easier for your future self or your successors to rewrite any given component of a system. If you know the lifespan of a decision, or when a scaling shift will make a component a good candidate for a rewrite, record that information right next to the code.</p>
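<p>As one possible shape for “right next to the code”, here is a Rust sketch where the decision record lives in the doc comment and the scaling assumption is also checked in debug builds. The function, the tier table, and the threshold of 200 are all invented for illustration.</p>

```rust
/// Hypothetical threshold past which the lookup below should be revisited.
const REVISIT_WHEN_TIERS_EXCEED: usize = 200;

/// Looks up a customer's discount percentage with a linear scan.
///
/// Decision record:
/// - Chosen because: the tier table has a handful of rows and rarely changes.
/// - Expected lifespan: until the table grows past `REVISIT_WHEN_TIERS_EXCEED`.
/// - Rewrite trigger: tiers move into a database, or this shows up in traces.
fn discount_percent(tiers: &[(u32, u8)], lifetime_spend: u32) -> u8 {
    // Checked in debug builds only, so a broken assumption fails loudly in tests.
    debug_assert!(
        tiers.len() <= REVISIT_WHEN_TIERS_EXCEED,
        "tier table has outgrown the linear scan; see the decision record"
    );
    tiers
        .iter()
        // Keep every tier whose minimum spend the customer has reached...
        .filter(|(min_spend, _)| lifetime_spend >= *min_spend)
        // ...and take the best discount among them.
        .map(|(_, percent)| *percent)
        .max()
        .unwrap_or(0)
}
```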
The tl;dr.</h2>
It’s okay to hate that code base. It is hate-able. It’s okay to want to replace it. You can replace it! But you have to put in the work first. The work I’ve had to do in this situation looks like this:</p>

Understand it even if you dislike it. ⬅️ treat it like a puzzle</em></li>
Write tests. For the system. Mostly integration. ⬅️ helps everything</em></li>
Identify or create wedge points. ⬅️ most of the time will go here</em></li>
Split off chunks and rewrite. ⬅️ the fun part</em></li>
Shrink the mess until it’s tolerable. ⬅️ satisfying!</em></li>
Plan so rewriting the new chunks is easier next time. ⬅️ pay it forward</em></li>
</ul>
Anyway, this is what I’ve learned from trying to do this work with limited resources. It’s best not to be in this situation: instead devote time to maintaining the system as a system and every bit of code in it. But most of us don’t have time machines to prevent past technical leaders from making these mistakes.</p>



Being written in a language ecosystem you don’t like is not enough of a reason to rewrite something all by itself. If you’ve landed in a team that doesn’t know the language ecosystem the company’s money engine is written in, your first task is to correct the hiring mistake of the past. You might have to become an expert in the thing you don’t know; you might (like me) discover that you dislike the thing you had to become an expert in. Probably the real takeaway is to do better due diligence than I did, and discover in advance what flavor of mess you’re expected to clean up. But sometimes, the ecosystem mismatch is the last misery on top of a pile of miseries. ↩</a></p>
</li>

This was literally true in my case. ↩</a></p>
</li>

If you have never seen rocks split by hand with the wedge and feather technique, check out this video showing somebody breaking up a big boulder</a>. ↩</a></p>
</li>

I am nudging Chris Dickinson into blogging about how he approached this project at our mutual former employer, but until he does here’s a link to a tweet about his approach to the work</a>. ↩</a></p>
</li>

This is the part of the rock-splitting video where you haul out the drill and bore a hole to stick a wedge into. The metaphor is now out of control because you can’t drill holes in big balls of mud, but, uh, let’s pretend the mud has been pressurized into rock over many thousands of years? ↩</a></p>
</li>

DRY is misunderstood, IMO. The original principle is “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.” This is a good principle! It does not mean that you need to collapse any two bits of code that look mostly the same. As with everything, advice has contexts. Everything in moderation. Tef is right</a>. ↩</a></p>
</li>

Though I have seen people manage to do that. E.g., replicating an entire db to get at a subset of its data rather than using the API that was put in front of the db specifically to hide the implementation details of the db schema. Sigh. But even this is a case of people doing what feels easiest: if the replication tools are right there and calling an API feels harder, they’ll reach for replication. The right solution is to make doing the right thing the easiest thing for everybody. This is more work for you, which you needed, right? ↩</a></p>
</li>
</ol>
</section>


Why Rust's postfix await syntax is good
Unknown — Fri, 13 May 2022 16:00:31 +0000
The other day on Twitter Kat Marchán said this:</p>

my strongest opinion on programming languages is that postfix .await is the single greatest innovation in the past 70+ years of programming language theory and history and you can’t convince me otherwise.
— Kat Marchán has permanently left this site (@zkat__) May 12, 2022</p>
</blockquote>
And Jan Lehnardt asked:</p>

I’ve seen a few folks say this. Do you know of a “here is how this compares to async/await keywords” for someone who barely rusts?
— Jan Lehnardt is on Mastodon: @janl@narrativ.es (@janl) May 12, 2022</p>
</blockquote>
I didn’t know of any, and a little searching didn’t turn one up. So here’s one? I hope? If you are not programming Rust a lot, and want to know why not-language-designers like me think that Rust’s await syntax</a> is good, this is the blog post for you.</p>
While looking around for somebody explaining why this is nice syntax, I found one of the discussions</a> about possibilities before it was selected. That’s a pretty long conversation, and I enjoyed skimming it. This comment</a> examining what Rust might look like with a number of the syntax possibilities was particularly neat. It immediately jumped out to me that the one they landed on (postfix field) and the close relation to it (postfix method) felt more Rust-y. But why? You might have to be a Rust user already to feel that.</p>
So in order to explain why Rust’s .await</code> is a nice bit of syntax, I will start by explaining two other things: how chaining calls is idiomatic Rust, and how error propagation with another nice bit of syntax, ?</code>, supports this.</p>
Hoo hah back on the chain gang</h2>
Chaining is a very common idiom for taking one collection and transforming it into another, perhaps even one of a different type. This snippet takes a collection of id-having-things (any collection type, so long as it is iterable), iterates through them, plucks out the ids, and re-collects them into a Vec:</p>
let</span> ids</span>:</span> Vec</span><</span>usize</span>></span> =</span> things_with_ids</span>.</span>iter</span>(</span>)</span>.</span>map</span>(</span>|</span>xs</span>|</span> xs</span>.</span>id</span>)</span>.</span>collect</span>(</span>)</span>;</span></span></code></pre>
Here’s a slightly-edited real-world example, which chains some stuff to end up with a string:</p>
let</span> malformed_kinds</span> =</span> requested_kinds</span></span>
    .</span>iter</span>(</span>)</span></span>
    .</span>filter</span>(</span>|</span>xs</span>|</span> !</span>is_valid</span>(</span>xs</span>)</span>)</span></span>
    .</span>cloned</span>(</span>)</span></span>
    .</span>collect</span>::</span><</span>Vec</span><</span>_</span>></span>></span>(</span>)</span></span>
    .</span>join</span>(</span>"</span>,</span>"</span>)</span>;</span></span></code></pre>
And the idiom is seen in other areas of API design. Here’s how my little Skyrim mod tool</a> sends posts (with some editing to make it a useful example):</p>
let</span> agent</span> =</span> ureq</span>::</span>AgentBuilder</span>::</span>new</span>(</span>)</span></span>
    .</span>timeout_read</span>(</span>Duration</span>::</span>from_secs</span>(</span>50</span>)</span>)</span></span>
    .</span>timeout_write</span>(</span>Duration</span>::</span>from_secs</span>(</span>5</span>)</span>)</span></span>
    .</span>build</span>(</span>)</span>;</span></span>
let</span> maybe_response</span> =</span> agent</span></span>
    .</span>post</span>(</span>uri</span>)</span></span>
    .</span>set</span>(</span>"</span>apikey</span>"</span>,</span> &</span>self</span>.</span>apikey</span>)</span></span>
    .</span>set</span>(</span>"</span>user-agent</span>"</span>,</span> "</span>modcache: github.com/ceejbot/modcache</span>"</span>)</span></span>
    .</span>send_form</span>(</span>body</span>)</span>;</span></span></code></pre>
All of this is to say: chaining like this is common in Rust.</p>
Don’t break the chain</h2>
Now, there isn’t any error handling visible in the above code. What does error handling look like with chaining? Does it break the chains? It used to! The ?</code> error propagation symbol is new to Rust since I first started using it, and its introduction has made writing error handling a lot nicer.</p>
Rust allows you to express that an operation might fail by returning a Result</code></a>. This is a sum type:</p>
enum</span> Result</span><</span>T</span>,</span> E</span>></span> {</span></span>
   Ok</span>(</span>T</span>)</span>,</span></span>
   Err</span>(</span>E</span>)</span>,</span></span>
}</span></span></code></pre>
If all went well, you get the Ok</code> variant with your data in it. If it did not, you get the Err</code> variant with your error type. The Rust compiler makes you handle both variations in your code.</p>
Here’s a faked example of getting some data from a function that might fail, and doing something with that if we can.</p>
//</span> Our fetch talks to a db so it might fail for reasons</span></span>
//</span> beyond our control, so we return a result type.</span></span>
fn</span> fetch_all_animals</span>(</span>)</span> -></span> Result</span><</span>Vec</span><</span>Animal</span>></span>,</span> SomeErrorType</span>></span> {</span></span>
    //</span> blocking call to a db here</span></span>
}</span></span>
</span>
//</span> We depend on a fallible function, so we are fallible too.</span></span>
fn</span> count_hedgehogs</span>(</span>)</span> -></span> Result</span><</span>usize</span>,</span> SomeErrorType</span>></span> {</span></span>
    //</span> this is a Result</span></span>
    let</span> maybe_animals</span> =</span> fetch_all_animals</span>(</span>)</span>;</span></span>
    //</span>... so we match on it to see if we succeeded or not</span></span>
    match</span> maybe_animals</span> {</span></span>
        Ok</span>(</span>animals</span>)</span> =></span> {</span></span>
            //</span> we got some animals! let's find the hedgies</span></span>
            let</span> count</span> =</span> animals</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>len</span>(</span>)</span>;</span></span>
            Ok</span>(</span>count</span>)</span></span>
        }</span></span>
        Err</span>(</span>e</span>)</span> =></span> {</span></span>
            //</span> We failed to get animals. We handle the error in whatever</span></span>
            //</span> way makes sense for the program. Here we just propagate</span></span>
            //</span> the error on up to the caller.</span></span>
            Err</span>(</span>e</span>)</span></span>
        }</span></span>
    }</span></span>
}</span></span></code></pre>
This error handling pattern was everywhere in my Rust code, being verbose all over the place. It’s also predictable! This makes it a good candidate for sugar. So the ?</code> syntax</a> for this was added in Rust v1.13 at the end of 2016</a>. If all you want to do is return immediately if you have an error and carry on if you got an OK result, use ?</code>.</p>
fn</span> count_hedgehogs</span>(</span>)</span> -></span> Result</span><</span>usize</span>,</span> SomeErrorType</span>></span></span>
{</span></span>
    let</span> animals</span> =</span> fetch_all_animals</span>(</span>)</span>?</span>;</span> //</span> <-- note the ?</span></span>
    //</span> if the fallible function failed, we have bopped that</span></span>
    //</span> error on out & can proceed</span></span>
    let</span> count</span> =</span> animals</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>len</span>(</span>)</span>;</span></span>
    Ok</span>(</span>count</span>)</span></span>
}</span></span></code></pre>
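<p>For anyone who wants to run the comparison, here is a self-contained sketch of both versions side by side. The <code>Animal</code> enum, the <code>SomeErrorType</code> struct, the <code>db_is_up</code> switch, and the <code>FilterForHedgehogs</code> trait are stand-ins I made up so the faked example compiles; they are not from any real library.</p>

```rust
#[derive(Debug, PartialEq)]
enum Animal {
    Hedgehog,
    Fox,
}

#[derive(Debug, PartialEq)]
struct SomeErrorType(String);

/// Hypothetical extension trait so the method-call style compiles on a Vec.
trait FilterForHedgehogs {
    fn filter_for_hedgehogs(self) -> Vec<Animal>;
}

impl FilterForHedgehogs for Vec<Animal> {
    fn filter_for_hedgehogs(self) -> Vec<Animal> {
        self.into_iter().filter(|a| *a == Animal::Hedgehog).collect()
    }
}

/// Stand-in for the db fetch; `db_is_up` lets us force a failure.
fn fetch_all_animals(db_is_up: bool) -> Result<Vec<Animal>, SomeErrorType> {
    if db_is_up {
        Ok(vec![Animal::Hedgehog, Animal::Fox, Animal::Hedgehog])
    } else {
        Err(SomeErrorType("db unreachable".to_string()))
    }
}

/// Explicit match: handle both variants by hand.
fn count_hedgehogs_match(db_is_up: bool) -> Result<usize, SomeErrorType> {
    match fetch_all_animals(db_is_up) {
        Ok(animals) => Ok(animals.filter_for_hedgehogs().len()),
        Err(e) => Err(e),
    }
}

/// Same behavior with `?`: return early on Err, carry on with the value on Ok.
fn count_hedgehogs_question(db_is_up: bool) -> Result<usize, SomeErrorType> {
    let animals = fetch_all_animals(db_is_up)?;
    Ok(animals.filter_for_hedgehogs().len())
}
```

<p>Strictly speaking, <code>?</code> also passes the error through <code>From::from</code> before returning it, which is what lets it convert between error types. Both functions here use the same error type, so the two versions behave identically.</p>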
You can see that error handling is a lot less verbose when it can fit into this pattern. In fact, the idiomatic Rust way to implement the above function is to chain it all together:</p>
let</span> count</span> =</span> fetch_all_animals</span>(</span>)</span>?</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>len</span>(</span>)</span>;</span></span></code></pre>
Which is super-compact and might not need its own function at all. This stays super-compact if our hedgehog filter is fallible as well, though I’m not sure why it would be fallible. It would look like this:</p>
let</span> count</span> =</span> fetch_all_animals</span>(</span>)</span>?</span>.</span>filter_for_hedgehogs</span>(</span>)</span>?</span>.</span>len</span>(</span>)</span>;</span></span></code></pre>
Finally we get to async</code> and await</code></h2>
Now! Let’s suppose we have moved to the magic land of async Rust programming and have a non-blocking db fetch for our animals.</p>
//</span> we must say the magic word</span></span>
async</span> fn</span> fetch_all_animals</span>(</span>)</span> -></span> Result</span><</span>Vec</span><</span>Animal</span>></span>,</span> SomeErrorType</span>></span> {</span></span>
    //</span> we do all the same work as before</span></span>
    //</span> and maybe call some async functions here too</span></span>
}</span></span></code></pre>
Now when we call that function, what we get back is actually a Future</code></a>. To use it, we have to call poll</code> on it, or more idiomatically, we await</code> it to resolve it to a value. (There’s a link in the further reading section if you want to learn more.) This is a lot like what happens in JavaScript when we get a promise back from an async function:</p>
const</span> animals</span> =</span> await</span> fetch_all_animals</span>(</span>)</span>;</span></span></code></pre>
But Rust’s chosen syntax uses a field-like postfix on a Future, and this is the specific thing I think is neat:</p>
let</span> animals</span> =</span> fetch_all_animals</span>(</span>)</span>.</span>await</span>;</span></span></code></pre>
Look at what happens if we’re calling fallible functions and want our error handling in-line! We stick ?</code> on the .await</code> to propagate any errors and unwrap a result in-line:</p>
let</span> count</span> =</span> fetch_all_animals</span>(</span>)</span>.</span>await</span>?</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>len</span>(</span>)</span>;</span></span>
//</span> and if our hedgehog filter were both async and fallible....</span></span>
let</span> count</span> =</span> fetch_all_animals</span>(</span>)</span>.</span>await</span>?</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>await</span>?</span>.</span>len</span>(</span>)</span>;</span></span></code></pre>
That is the use case that shows why I think this specific syntax choice is brilliant. Precedence is clear. We don’t have to wrap things in parens for human readability or to control precedence. If we read a chain, the operations are mentioned in the order that they happen. It works with the existing idioms rather than against them.1</a></sup></p>
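<p>To try the doubly-awaited chain without pulling in an async runtime, here is a sketch that uses only the standard library. Treat all of it as illustration under assumptions: <code>block_on</code> is a deliberately naive executor written for this example (it polls in a loop and only makes sense for futures that are ready right away), and the <code>Animals</code> wrapper exists only so that <code>filter_for_hedgehogs</code> can be an inherent async method.</p>

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

#[derive(Debug, PartialEq)]
enum Animal {
    Hedgehog,
    Fox,
}

#[derive(Debug, PartialEq)]
struct SomeErrorType;

/// Hypothetical wrapper so the filter can be called as a method.
struct Animals(Vec<Animal>);

impl Animals {
    /// A filter that is both async and fallible, as in the second chain.
    async fn filter_for_hedgehogs(self) -> Result<Vec<Animal>, SomeErrorType> {
        Ok(self.0.into_iter().filter(|a| *a == Animal::Hedgehog).collect())
    }
}

/// Stand-in async fetch. It is ready immediately; a real one would wait on I/O.
async fn fetch_all_animals() -> Result<Animals, SomeErrorType> {
    Ok(Animals(vec![Animal::Hedgehog, Animal::Fox, Animal::Hedgehog]))
}

/// The chain from the post: operations read left to right, in execution order.
async fn count_hedgehogs() -> Result<usize, SomeErrorType> {
    let count = fetch_all_animals().await?.filter_for_hedgehogs().await?.len();
    Ok(count)
}

/// A waker that does nothing, because this executor never sleeps.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Naive executor: poll the future until it reports Ready.
/// A real executor would park the task until its waker is called.
fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut context = Context::from_waker(&waker);
    let mut future = pin!(future);
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&mut context) {
            return value;
        }
    }
}
```

<p>Calling <code>block_on(count_hedgehogs())</code> drives the whole chain to completion. In real code you would use the executor from your async runtime instead of writing one.</p>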
Another thing that’s interesting to me here is that this choice is not</em> what most modern languages chose for their syntax. Lots of them use a prefixed await</code> keyword. In JavaScript, if we were chaining, it would look like:</p>
const</span> count</span> =</span> (</span>await</span> fetch_all_animals</span>(</span>)</span>)</span>.</span>filter_for_hedgehogs</span>(</span>)</span>.</span>length</span>;</span></span>
//</span> and errors will throw exceptions that we're letting bubble up</span></span>
//</span> and if we're chaining more than one async thing...</span></span>
const</span> count</span> =</span> (</span>await</span> (</span>await</span> fetch_all_animals</span>(</span>)</span>)</span>.</span>filter_for_hedgehogs</span>(</span>)</span>)</span>.</span>length</span>;</span></span></code></pre>
But I’d probably never write either of those and definitely never the second. I am far more likely to write:</p>
let count = 0;
try {
  const animals = await fetch_all_animals();
  count = animals.filter_for_hedgehogs().length;
} catch (ex) {
  // handle the error at this level
  // I'd omit the try/catch if I wanted the error to propagate
}</code></pre>
My aversion to chaining partly comes from the fact that I must use the parens to express my intent. The syntax of any programming language shapes what code feels idiomatic and most readable and what code feels like patting a cat tail to head. There’s nothing right or wrong about any of it, because all of them have found a way to express the concept.</p>
Is postfix await a small bit of syntax? Yes. Is it thoughtfully chosen out of many possibilities? Yes. Is it very much in tune with the Rust syntax around it? Also yes. This is what I appreciate most about the Rust project: its concern for the experience of the human beings using the language.</p>
Further reading</h2>
If you would really like to understand Rust futures, you should read @fasterthanlime</a>’s article “Understanding Rust futures by going way too deep”</a>.</p>
If you are comfortable reading Rust, and want to know more about async executors and how they work, check out whorl</a>. This repo walks you through the implementation of an async executor and shows you what await</code> desugars to. (Hat tip to Chris Dickinson for telling me about this!)</p>
If you have any pointers to other posts about why this syntax is neat, please send them to me and I will link! Also, if you have your own reasons about why this syntax is nice, please do write them up and I will link those too!</p>



You might say that the people who chose the await syntax had a strong hold on the theory of Rust</a>. ↩</a></p>
</li>
</ol>
</section>


Programming as Theory-Building
Unknown — Thu, 12 May 2022 10:00:00 +0000
A couple of years ago now I read Peter Naur’s “Programming as Theory-Building”</a> (alternative PDF link</a>) and it was a mind-blower. Yes, this Naur is the same Naur you know from Backus-Naur Form aka BNF</a>. Anyway, you should go and read this essay now. It’s not very long. I’ll wait while you read it.</p>
Back? Okay! Let’s do a bit of a crawl through some main points.</p>
Highlights of the essay</h2>
Naur opens with his thesis statement, the argument he’s about to make:</p>

[P]rogramming properly should be regarded as an activity by which the programmers form or achieve a certain kind of insight, a theory, of the matters at hand. This suggestion is in contrast to what appears to be a more common notion, that programming should be regarded as a production of a program and certain other texts.</p>
</blockquote>
Programming isn’t about writing the code; it’s about understanding the problem and expressing that understanding through code. That understanding is what allows us to modify the code without harming its design. Naur discusses three real-world cases of existing programs being modified over time: by a team closely connected to the original team, by a team that had only documentation to go on, and by the original team itself. The team with the closest connection was most successful at making additions that worked with the existing design, and not against it:</p>

The conclusion seems inescapable that at least with certain kinds of large programs, the continued adaption, modification, and correction of errors in them, is essentially dependent on a certain kind of knowledge possessed by a group of programmers who are closely and continuously connected with them.</p>
</blockquote>
Naur then goes into what “theory” means, philosophically and in practice. Here are his three points about what a programmer having the theory can do, lightly edited:</p>


Explain how the solution relates to the affairs of the world1</a></sup> that it helps to handle.</li>
Explain why each part of the program is what it is, in other words is able to support the actual program text with a justification of some sort.</li>
Respond constructively to any demand for a modification of the program so as to support the affairs of the world in a new manner.</li>
</ol>
</blockquote>
Naur is talking about “programs” here, and today we add on top of that the systems in which many programs interoperate to do complex things. So the demands are a little higher: we need to have theories of the systems we build and maintain, as well as theories of the pieces of that system.</p>
The term I’m more likely to use myself for what Naur calls “the theory of the program” would be “a mental model of the system”. Some accumulation of facts in my head has allowed me to build a map of the territory– where things are implemented or “happen” in the system– and to predict behaviors given inputs. My mental model might be shallow in places where I haven’t had to make changes and very deep and detailed where I have recently worked. I need to keep that model constantly refreshed through review and rehearsal. While I’m model-building I find myself making small changes to the code, like renaming variables for clarity once I understand them, tweaking log lines, or adding comments.</p>
The ability to fluidly make functional changes</em> to the system I’ve modeled requires even deeper understanding of how it’s implemented, because there are patterns and structures in the code that I have to work with rather than against. This gets closer to what Naur means when he talks about the theory of a program. It’s not just the what, but the how and the why. Why does having a theory of the program matter? Because this enables rapid and effective modification of the program to respond to changing requirements without piling up technical debt or hacks.</p>

It must be obvious that built–in program flexibility is no answer to the general demand for adapting programs to the changing circumstances of the world.</p>
</blockquote>
Alas, it is not obvious, we say, looking at all the premature generalization happening around us.</p>
I’m going to quote this entire paragraph because of how important it feels to me, with some commentary.</p>

On the basis of the Theory Building View the decay of a program text as a result of modifications made by programmers without a proper grasp of the underlying theory becomes understandable. As a matter of fact, if viewed merely as a change of the program text and of the external behaviour of the execution, a given desired modification may usually be realized in many different ways, all correct. At the same time, if viewed in relation to the theory of the program these ways may look very different, some of them perhaps conforming to that theory or extending it in a natural way, while others may be wholly inconsistent with that theory, perhaps having the character of unintegrated patches on the main part of the program.</p>
</blockquote>
We might call these “unintegrated patches” technical debt or hacks, but either way, we know it’s a problem when we see it. Somebody has worked against the grain of the wood when carving a new feature into the system, and it feels wrong. The hacks get in the way when you need to make the next change. They might not work well because they’re at odds with other design choices. They might be sitting right next to an already-existing affordance to add that new behavior!</p>
Continuing on in this paragraph:</p>

This difference of character of various changes is one that can only make sense to the programmer who possesses the theory of the program. At the same time the character of changes made in a program text is vital to the longer term viability of the program. For a program to retain its quality it is mandatory that each modification is firmly grounded in the theory of it. Indeed, the very notion of qualities such as simplicity and good structure can only be understood in terms of the theory of the program, since they characterize the actual program text in relation to such program texts that might have been written to achieve the same execution behaviour, but which exist only as possibilities in the programmer’s understanding.</p>
</blockquote>
People who do have the theory of the program can make changes that work with what’s there already. They know where the affordances are. Naur says that simplicity and quality only make sense in the context of that code to begin with, and this point is a good one. Let’s try another metaphor: Writing a program is like finding a domain-specific language to express the problem and its solution, a language that expresses your understanding of the domain. It’s a truism that code is communication. It is primarily communication with other humans, not with a compiler, with a set of verbs and nouns chosen by you as the best expression of your understanding of the problem. Other people reading your code must learn to read your new language, and to make changes they need to write it.</p>
How do programmers learn a theory of the system? Naur says documentation and the source code are not enough, and someone new to a system needs hands-on mentoring:</p>

What is required is that the new programmer has the opportunity to work in close contact with the programmers who already possess the theory, so as to be able to become familiar with the place of the program in the wider context of the relevant real world situations and so as to acquire the knowledge of how the program works and how unusual program reactions and program modifications are handled within the program theory.</p>
</blockquote>
Humans learn through guided practice with others. Spend time working with people who understand a system, and you’ll begin to understand it too.</p>
This is great if the people who have the theory of a system are still around to talk to.</p>
Nobody’s around any more</h2>
The real world is often more like Naur’s “group B” case, where further development on software happened without the benefit of close contact with theory-holders. I’ve described this a couple of times as being a software archaeologist, digging out bits of architecture and sorting through refuse pits to figure out how a past software team lived and what the heck they were thinking about. Given reality, let’s ask two practical questions in response to Naur’s article:</p>

How can you rebuild the theory of a program or system if its original authors aren’t around to teach you?</li>
How can you leave useful information behind yourself to help future maintainers rebuild the theory you have?</li>
</ol>
In my experience, I had to spend a lot of time reading code and building my own mental model of the software, how it was constructed, and how the pieces worked together– the archaeologist metaphor I mentioned above. I had to reconstruct the theory by looking at the textual artifact and what it was doing in practice.</p>
My colleague Chris Dickinson</a> does some interesting things while doing code spelunks. He generates other artifacts as he goes, such as textual call diagrams that he can turn into graphs with graphviz. He’ll do this as a vertical slice for a specific code path as well, ending up with a detailed call flow chart showing every network traversal or call made to build a single web page, for instance. There are cognitive reasons why making drawings or notes like this is helpful– you improve your understanding of a concept by expressing it in a different form than you’re receiving it. (This is related to why taking notes in a lecture is helpful. Hear -> write -> read.)</p>
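Here is one small way to produce that kind of artifact, as a sketch rather than a description of Chris’s actual tooling. It assumes you have jotted down caller and callee pairs by hand while reading; the function name `call_graph_to_dot` and the edge names are hypothetical. It emits Graphviz DOT text.

```rust
use std::fmt::Write;

/// Renders (caller, callee) pairs as a Graphviz DOT digraph.
/// Names are quoted so that labels with spaces or slashes stay valid.
fn call_graph_to_dot(name: &str, edges: &[(&str, &str)]) -> String {
    let mut dot = String::new();
    // Writing into a String cannot fail, so unwrap is safe here.
    writeln!(dot, "digraph {} {{", name).unwrap();
    writeln!(dot, "    rankdir=LR;").unwrap();
    for (caller, callee) in edges {
        writeln!(dot, "    \"{}\" -> \"{}\";", caller, callee).unwrap();
    }
    dot.push_str("}\n");
    dot
}

fn main() {
    // Hypothetical vertical slice: what happens to build one web page.
    let edges = [
        ("render_page", "load_session"),
        ("render_page", "fetch_animals"),
        ("fetch_animals", "GET /api/animals"),
    ];
    print!("{}", call_graph_to_dot("page_render", &edges));
}
```

Save the output to a file such as `calls.dot` and run `dot -Tsvg calls.dot -o calls.svg` to get a picture. Both the DOT file and the rendered graph are small enough to check into the repo next to the code they describe.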
I often tried to use commit logs to figure out why specific changes were made, but was more often frustrated than enlightened. The past generations of programmers at this company did not often write helpful commit messages.2</a></sup> One specific commit that introduced an incredibly expensive bug had a commit message like “scope fixes”. Why was the change made? What did the programmer intend? Nobody knows.</p>
I had to supplement by talking with people who weren’t familiar with the source code but could tell me what it did and why that was desirable or not. These people might not be the programmer-operators that Naur discusses, but they are expert user-operators. They have a theory of the program too! The one remaining programmer on staff who really understood a particular piece of software was priceless, and I’m grateful they were as amiable about explaining things as they were.</p>
An aside about retention</h2>
I wish now to point out what might be obvious to you about the cost of team turnover. When there’s one human being left on a team who understands how that pile of legacy code works, you’re in trouble. If there are none, you’re in worse trouble. You have to hire people who can walk into messes cold and figure them out without help, and those people don’t come cheap.</p>
Keep people around. Give them raises rather than making them leave to get more money. Keep them feeling good in their daily work.</p>
Since we haven’t prevented, let’s try curing</h2>
Given this experience, and given the lightbulb moment that reading Naur gave me, I changed my development practice. I started thinking about ways I could help other programmers– maybe programmers I’d never meet– build useful theories about the software they inherited to maintain. If I found good ways to do this, I could then socialize those approaches and turn them into team practices.</p>
Here are some of the things I’ve started doing.</p>
I deliberately distinguished maintainer documentation</em> from user documentation</em>. The people who need to consume a service’s API are a completely different audience from the people who need to maintain that service. They might have different skillsets and programming languages: a person working on a website is probably writing in Typescript, while the service might be in Rust, C#, or anything at all. They have completely different concerns as well. The maintainer needs to dip into the source and needs to read details about the internals. Making the consumer of an API dip into the internals of its implementation would be a waste of their time.</p>
Maintainers are the people who need the theory, so I invested time writing documentation for them. That documentation belongs as close to the source code as possible, and at least partly inside it in the form of comments. Don’t waste time documenting what can be seen through simple reading. Document why</em> that function exists and what purpose it serves in the software. When might I call it? Does it have side effects? Is there anything important about the inputs and outputs that I might not be able to deduce by reading the source of the function? All of those things are clues about the thinking of the original author of the function that can help their successor figure out what that author’s theory of the program was.</p>
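As a sketch of what that can look like in practice, here is a doc comment that answers those questions for a small function. Everything in it is hypothetical: the function `retry_delay`, the upstream service, and the reasons given are invented to show the shape of the comment, not taken from a real system.

```rust
use std::time::Duration;

/// Returns how long to wait before retry number `attempt` (counting from 0).
///
/// # Why this exists
/// (Hypothetical rationale.) The upstream inventory service rate-limits
/// bursts, and fixed-interval retries from many workers once knocked it
/// over. Doubling the wait each time spreads the retries out.
///
/// # When to call it
/// Only from the retry loop around idempotent requests. Requests that are
/// not idempotent must not be retried at all.
///
/// # Side effects
/// None. This only computes a duration; the caller does the sleeping.
///
/// # What you can't deduce from the code
/// The 30 second cap matches the (hypothetical) load balancer idle
/// timeout. Waiting longer than that just means reconnecting anyway.
fn retry_delay(attempt: u32) -> Duration {
    const BASE_MS: u64 = 250;
    const CAP_MS: u64 = 30_000;
    // Clamp the shift so a huge attempt count cannot overflow.
    let factor = 1u64 << attempt.min(16);
    Duration::from_millis((BASE_MS * factor).min(CAP_MS))
}

fn main() {
    for attempt in 0..4 {
        println!("attempt {attempt}: {:?}", retry_delay(attempt));
    }
}
```

Notice that none of the comment restates the arithmetic. The code already says what happens; the comment records the reasons, which is the part a successor cannot recover by reading.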
Chunk up a level: writing about that function might help a maintainer fix a bug with it, but that isn’t sufficient for getting across the theory of the program. There are structural choices you make as you put together the program, as well as major decisions you make that inform the design. More specifically, the program exists to solve a problem, some “affair of the world” that Naur refers to. What was that problem? Is there a concise statement of that problem anywhere? What approach did you take to solving that problem statement? What tradeoffs did you make and why? What values did you hold as you made those tradeoffs? Why did you organize the source code in that particular way? What belongs where?</p>
I started putting design documents, decision records, notes about spikes, and so on into the same repo as the source code. If you’re doing code deep-dives and generating artifacts about what you learn, check those artifacts into the source repo too! Put the videos somewhere durable and link to them. I duplicated some of these documents in the company’s documentation platform of choice, but the duplicates were not as important to me. I can’t guarantee a future maintainer will discover those artifacts.</em> In fact, I can’t predict that any documentation platform other</em> than the source code repo will exist in the future.</p>
Toward the end of our joint tenure at our most recent job, my colleague Chris and I recorded videos of us doing code-centric deep dives through specific interesting, important, or especially difficult-to-understand aspects of the system. My hope is that these more conversational artifacts substitute in some way for having human mentors present to give people personal guided tours.</p>
Oh yeah, and I continued writing tomes in my PR/commit messages, and being fussy about what actually lands into the mainline branch from my PRs.</p>
A theory of theory-building</h2>
Here are Naur’s points, rephrased:</p>

The original authors of a system develop a theory of that system as they work, which comes from their understanding of the problem they’re solving and the decisions they made designing the solution.</li>
Programmers need to share that theory in order to make changes to what the system does successfully, without degrading its quality.</li>
People learn how complex systems work by being taught about them by other humans.</li>
Documentation isn’t enough, even if it a) exists, b) is truthful, and c) is discovered and read.</li>
</ul>
And here are my reactions:</p>

Do as much teaching of other programmers as possible while you’re in a job.</li>
Since turnover is a fact of life, and you won’t stay at any one job forever, do your best to leave artifacts behind that help your successors theory-build.</li>
Bargain with Naur’s position about documentation not being enough by investing time into writing documentation aimed at describing the theory of the program</em>.</li>
Put that documentation as close to the source code as possible, because the source code is the only artifact guaranteed to survive.</li>
Write good commit messages.</li>
</ul>
The next people to come along will still have to put in the work, but you’ll have at least tried to make it easier on them.</p>



I love this “affairs of the world” phrasing for “the thing the program is supposed to do”. It points right at the fact that the requirements come from external context. ↩</a></p>
</li>

I think that git commit -m</code> is a culprit here. It encourages people to type very short messages. The GitHub concept of the “pull request” improves on this by giving people a big text box to type in, so they aren’t pushed to keep their messages to 50 characters tops. This isn’t enough, though, if teams don’t have a culture of encouraging each other to write good PR descriptions, or of squashing branches down into single commits with a good message. ↩</a></p>
</li>
</ol>
</section>


Dysfunction junction
Unknown — Tue, 10 May 2022 10:00:00 +0000
You know Conway’s Law</a>, of course. I’ll recap it here, just in case:</p>

Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.
— Melvin E. Conway</p>
</blockquote>
This law is a deep one, because communication drives everything that humans do. We are deeply social; we cooperate with each other on large projects by communicating with each other. Communication is a reflection of our social structures because it defines our social structures! Of course the things we make reflect those structures. (Put like that it sounds obvious, but Conway’s statement is so pithy.)</p>
Conway’s Law has some fun expressions, like:</p>

If two teams are communicating well with leaders who get along, their software will work well together. Conversely, if two managers don’t</em> get along, the software written by their teams won’t work together well.</li>
Bugs form along organizational fracture lines.</li>
Organizational silos become software silos.</li>
The engineers in the org right now don’t like the software written by the engineering team two generations ago. 1</a></sup></li>
Broken organizations produce broken software.</li>
</ul>
Broken human systems make broken software systems</h2>
The darkest expression of Conway’s Law is that organizational dysfunction is expressed as software dysfunction. If the leader of an organization is failing, the software it builds starts to fail as well. If you as a technical leader are tasked with fixing a broken software system, you might need to start by fixing the broken organization that produced it.</p>
Is the site falling down on the regular? Is shipping new features impossible? Is the tech debt at meme levels? Look for:</p>

A toxic culture.</li>
Teams that don’t communicate or cooperate.</li>
A rapid feature-shipping pace that devalues software quality.</li>
Managers who aren’t caring for their reports.</li>
Managers who don’t accept organizational priorities.</li>
Leadership that incentivizes bad behavior or unsustainable behavior.</li>
</ul>
And by “leadership”, I mean all the way up to the CEO.</p>
I’ve been struck by the influence of company leadership on overall work quality over and over again. Each company has a culture that it promotes, consciously or unconsciously, suffusing its personality through everybody there. The culture starts with the CEO and cascades downward. Intelligent, thoughtful CEOs inspire the people around them to be the same. Bullying CEOs drive good people away. Outright stupid CEOs (and I’ve seen several of these in recent years) make the teams around them stupid and inspire bad work.</p>
The management that’s right next to your engineering team has even more direct effect than an exec team. Management can turn an engineering team into a 0.1x team. I’ve seen leadership that inspired what people told me was the worst work of their careers. This leadership actively made their teams worse. Depressing, yes, but flip it: leadership can turn a team into a 10x team through inspiration, support, good incentives, emotional safety, and all the other things that good management can do. You</em> can provide this leadership.2</a></sup></p>
I maintain that you must</em> if you want to be effective as a technical leader.</p>
Systems are self-reinforcing</h2>
I resisted coming to the conclusion that I had to think at the organizational level to fix technical problems for a long time. Surely, I thought, surely</em> the organization will welcome solid, sensible suggestions aimed at fixing the trouble everybody agrees we’re in. The reality was far messier than this. Sensible suggestions might be nodded at as sensible, but their implementation will be resisted overtly and covertly.</p>
The reasons why are squishy and human. People respond to incentives, and they do more of what the organization around them rewards them for doing. They get practice at whatever that is. They unconsciously build structure around themselves to support whatever they’re being rewarded for doing. The human organization that produces the software is a system all on its own, with feedback loops and incentive structures. When the org is healthy, those feedback loops promote a virtuous cycle. When the org is unhealthy, the feedback loops promote a doom spiral.</p>
If the organization remains dysfunctional and you try to fix its output, it will resist your technical fixes, subvert improvement projects, and continue to drift back to the status quo. Managers will refuse to accept org-wide priorities, refuse to move staff to critical projects, refuse to cooperate with each other. Teams will attempt to maintain their silos. People will dig in rather than change their practices. Even more toxic things can happen in extremely broken organizations.</p>
Systems are self-sustaining and self-reinforcing. To repair a software system, you have to repair the human system that built it, and you do that by breaking those self-reinforcing loops3</a></sup>.</p>
People will tell you what’s wrong</h2>
The first step, therefore, is to discover the incentive loops. Investigate and diagnose organizational dysfunction methodically, the way you’d diagnose software.</p>
One exhausting but effective way to figure out what the worst problems are is to sit down in a one-on-one with every single member of the engineering team and</em> with adjacent teams. Your team is smart and they’ll tell you what’s wrong if they feel safe with you. Adjacent teams will have insights that people directly in the mess might not have, so include them too.</p>
Ask everybody the same questions. I let everybody know in advance that I am doing this and to expect a calendar invite. I give everybody the questions in advance, and message people with personal reminders that they aren’t in trouble and I am meeting with everybody. Not everybody will need that reminder, but the ones who do will appreciate it keenly.</p>
What questions should you ask? It depends on the circumstances, but you might try asking people if they feel they can do their jobs effectively, and then ask them to expand on their answer. Keep the questions open-ended and focused on the experiences of the person you’re talking to.</p>
Take notes as you talk with people. (I ask for permission as I do this.) Themes will emerge. After talking to everyone in your organization, you’ll know what’s not working and what needs to be repaired.</p>
Be kind to people</h2>
If you’re like me, you might find yourself angry about some of the things you learn in this exploration of the organization. Anger is an important response to learning that your values have been violated. It’s a signal you should pay attention to. That doesn’t mean you vent it at anybody else. Instead, use it to identify urgent work.</p>
Remember also that you’re not the only person who has figured out that things are broken. People in the org know it and have responded in their own ways. People will work very hard to fix things in their own specific corners, with what control they have. These efforts aren’t going to be coordinated and might end up at cross-purposes with each other, unintentionally making things worse. Honor the intent. You are here to coordinate.</p>
You must exercise power</h2>
You have diagnosed the problem. Now you want to fix it, because you are a software engineer and fixing things is what you do.</p>
If you have reporting authority and can start making organizational changes, this is the time to use that authority. If you’re a technical leader without that authority, it’s harder. You will need an ally who does</em> have it, and you will need to have the respect and trust of that person. Perhaps this person is an exec who agrees that the company isn’t getting what it needs from engineering. Maybe this person is a new leader brought in at your request.</p>
Spend time with this ally getting aligned with them on the values you’re bringing to the problem. What matters most? What does a healthy organization look like?</p>
You will also need to invest time to gain the respect and trust of the team around you. How do you gain trust? By behaving in a trustworthy way. Demonstrate that you know your trade. Show leadership in small ways. Handle incidents well (there might be lots of opportunity to do this). You might not have hard power, but you definitely have soft power. Influence. Persuasive skills. Moral authority. The ability to inspire people. Learn to use these as best you can.</p>
If all else fails, get that promotion to a pure leadership position you’ve been avoiding so you can stay technical. You can always escape it later (I say, as a person who escaped it).</p>
Change is about alignment</h2>
Let’s assume the happy path: you have an ally with organizational power, and you’re aligned with that person on values and goals. What happens next is alignment and reinvention for the broader organization. Now you make sure everyone in the org, starting with line managers, shares leadership’s values and understands the end goal. Now you clean up or change how the org accepts work and executes on it, giving all your processes a good shakeout.</p>
Entire books could be written about this part and have been, of course, because what you’re doing is setting up a healthy engineering organization.</p>
This is what happens next:</p>

The company around you has to understand that the engineering org is not going to ship features for a bit.</li>
Managers need to get into alignment first, or leave.</li>
If managers have burned their relationships with their reports too badly, they might have to leave anyway.</li>
People stuck in toxic patterns must leave.4</a></sup></li>
You need to design a new way for the org to accept work and execute on it.</li>
Leadership must rebuild trust by offering trust.</li>
Leadership must rebuild respect by offering respect.</li>
</ul>
Now give the team a project they can succeed with that will improve something noticeably. Give everybody a win. Reduce that technical mess just a tiny bit. The team will be behind you, helping. (Finally! You get to fix some of the technical problems you’ve been itching to fix!)</p>
Remember that lots of the people in that org want change too. They’re good engineers who are doing bad work, and they know it. They’ll be relieved to execute on a plan to make things better. They will shine if you let them.</p>
Looking back at doing this</h2>
Change is difficult to coax into being. You know the saying about how people have to want to change? This is true of organizations as well, as a collective decision from everyone in that org. The organization needs to feel that the cost of changing is less than the cost of continuing to operate as it has. Change is</em> costly and risky; local maxima are attractive even if they’re not very maximal. Also note that moving from a local maximum to a higher spot requires going downhill first. That’s something that looks very dangerous to an embattled exec who’s already under pressure because their org is underperforming.</p>
Doing this at an organization was the hardest thing I’ve done in my career. I feel I was only partially successful. I have notes from early in the process about things that needed to change that were still applicable two years later. You might not be willing to take this work on, which is okay. If you can’t gain traction, it’s okay to leave. Organizations deserve to fail sometimes.</p>
It was incredibly satisfying to hear from my colleagues that the changes made their work lives better, though, so the rewards are deep.</p>
Acknowledgements</h2>
Thanks to Chris Dickinson</a> who read early drafts of this post. Read his blog!</a></p>

I keep thinking about that insight @isntitvacant had about Conway’s Law being true over time as well, and the software that felt good in the hands of a past time being uncomfortable to a team that has entirely turned over.
— Ceej “oh well” Silverio (@ceejbot) February 8, 2021</p>
</blockquote>



This is Conway’s Law over time. Teams are immutable: adding or removing a person to a team produces a different team. After enough change, the team is different enough that it no longer recognizes itself in the software system it produces. The result is people being vaguely unhappy about software that might be working perfectly well. This probably deserves its own short blog post. ↩</a></p>
</li>

It’s always a leadership problem. ↩</a></p>
</li>

My thinking here is influenced by Donella Meadows</a>’s Thinking in Systems</em>. You should read the book, but this blog post has a great summary</a> if you’d like a teaser. ↩</a></p>
</li>

Orgs in failure modes can burn people so badly that they can’t get past that emotionally. Burned people will often be perfectly fine and healthy if they get to press a big reset button and go somewhere else. ↩</a></p>
</li>
</ol>
</section>


Problem statement
Unknown — Sun, 08 May 2022 10:09:21 -0700
The first problem with blogging is deciding where and how to do it. I’ve written static site generators myself in Python, Ruby, and JavaScript. I considered writing another one in Rust, my current language of choice, but I decided that this would be too much of a distraction. I have a month between jobs at best, so I need to focus.</p>
If you’re curious, this one is generated with Hugo</a>, which I chose because of how many themes are available without me having to fuss and write one. The theme is a lightly-modified Risotto</a>. The static files are hosted on AWS S3 behind Cloudflare’s free personal plan, because I don’t yet need any of their paid features. I manage the AWS and Cloudflare settings using terraform. I don’t yet have all my personal cloud infrastructure terraformed, but I might take the time to do that over my month+ between jobs. I have flinched from doing that in the past because it felt like too much work, but it was less work than I feared, and I now have reproducible results.</p>
It’s been a long time since I blogged regularly. I still have a presence on tumblr</a> but I’ve let it lag behind in the last couple of years. Work became intense and draining, and burnout meant I dropped a number of hobbies. I think it’s time to resume this one, though, with a focus on the kinds of topics I would have pitched as conference talks before the COVID pandemic.</p>
Here are some topics I’d like to visit:</p>

The connection between technical dysfunction and organizational dysfunction, and why you have to deal with the organization first.</li>
How reading Peter Naur’s “Programming as Theory-Building”</a> changed my priorities as a technical leader.</li>
Why technical correctness is the least useful kind of correctness in the real world (and how none of us ever make decisions based on this anyway).</li>
Problem statements + values statements: how to arrive at decent solutions to problems if technical correctness isn’t useful.</li>
What to do with a legacy monolith implemented with a language and framework you don’t know and/or dislike.</li>
Where “performance” comes from in the messy real world, and where it doesn’t come from.</li>
Why it took me a year to arrive at a one-line fix for a massive performance problem, and how I hope to shorten that time should I encounter a similar situation again.</li>
Why a relentless focus on reducing developer friction pays off in team productivity, and some ways to do this.</li>
The cost of breaking tech hiring so badly that productive, product-shipping engineers feel they have to “grind leetcode problems” to pass an interview.</li>
How dogmatism is an enemy: let’s not be dogmatic about methodologies, “best practices”, “design patterns”, or anything, really.</li>
When microservices are appropriate and when they’re not, and some varied approaches to slicing systems up.</li>
Data-centric analysis for systems design and how to coax people into doing it.</li>
</ul>
I am not a fan of single right answers for any of these topics. Most real-world tasks are complex enough that many approaches to them are viable, and the “right” approach will depend on the context. What’s your starting point? What does the system around you do right now? What are the people working on it comfortable with? What do they do well and what do they struggle with? Which tools fit their hands best? So in my theory, the best advice I can give anybody is about how to ask and answer questions about that context, and tell some stories about what worked and didn’t work for me.</p>
I also hope to learn enough that five years from now I’ll be convinced that the Ceej writing these posts in 2022 was a prime chump who got much of this wrong. I’ll write another set of blog posts, and keep the blogging economy chugging.</p>


About
Unknown — Sat, 07 May 2022 11:38:39 -0700
I’ve been on the internet since 1987, which places me in the early generation of people being idiots online. Every mistake it’s possible to make, I’ve made. Blogging is one of those mistakes, and it’s one I have repeated since back when they were called “web journals”.</p>
I decided to start blogging again during an interlude between jobs. I’d like to write about what I learned through my last decade-plus of work in a changed Silicon Valley, in a startup scene that’s quite different from the scene I entered in the early 90s. The technical problems have changed in the cloud era, which is fairly easy to observe, and many people have written about this.</p>
I’d like to try writing about the human factors that you need to consider when approaching technical problems. After thirty years in this profession, I seem to have accidentally learned some things, mostly through making mistakes. Maybe I can help you avoid those mistakes!</p>
Anyway, here’s another blog.</p>

Ceejbot's notes

x Why was this fun (while work was not)?

Private homebrew taps

Test it</h2> Get some people to test the whole process and make sure everything works. Verify that your docs are clear enough that even the people who break everything can manage to make it work. Test your release workflows. Test updates.</p> You’re done.</p>

Modern terminal environment

Understanding Software

This company writes software</strong></h1> Everyone here contributes to this work.</li>

Congratulations.</h1>

And that’s how we turn a napkin sketch</strong> into something that affects the physical world.</h1> Questions?</strong></h1> SN: Stop sharing screen now.</p>

Questions?</strong></h1> SN: Stop sharing screen now.</p>

Accepting Work

A systems analysis rubric

The process of exploration</h2> Step one: Research.</p> Investigate the background of the problem & document the current solutions, if they exist.</li> Document why the current solutions are inadequate, if relevant.</li>

Multi-factor panacea

Goodbye Cloudflare; hello Fastly!

Reduce Friction

Against dogmatism

One year for a one-line fix

That’s a heck of a war story.</h2> Right? The one-line mistake that nearly killed a company, and the one-line fix that saved it. Except, well, it’s more complicated than that.</p>

Legacy you hate

Identify and exploit wedge points.</h2> Now you can start thinking about changing the system.</p> Your goal here is to split up your monolithic code base by identifying good points to hammer in wedges to use to split off chunks.</p>

Why Rust's postfix await syntax is good

Programming as Theory-Building

Dysfunction junction
