Everywhere I’ve worked, documentation has had a different place in both the development and infrastructure lifecycles. Whether a team is Kanban, Agile, or something else entirely, the same question keeps coming up: why are these docs naff/outdated/too verbose/not verbose enough? Another near-constant is that all of those problems magically get fixed for a specific set of documentation right after an incident where it would have been useful. This is where my GCSE drama comes in…
Let’s Roleplay a Bit #
I’d like you to picture yourself on-call. You get a ping at 7pm. Some DB that you don’t quite recognize has an error rate 4x higher than normal. You don’t even really touch this DB, so you look through the docs. What qualities do you look for first? Chances are, those qualities are quite similar to what you’d look for during standard working hours. This is the standard that I see applied most often when assessing documentation. Some of those qualities might be:
- Searchability: If it can’t be found quickly, it’s not really there.
- Separating quick wins: The quick fixes should be easy to pick out, with the ability to dig deeper if needed.
- References/Sources: Saying “Microsoft docs say x” with no link is odd. You can just say “Microsoft docs say this” instead, and link to the source.
- Self-awareness: Know that this doc will likely suffer the same fate as the rest of them, and treat it accordingly. If you know that this doc won’t be reviewed again for a while, make sure that anything likely to become outdated is noted as such, with advisories to double check certain parts if this doc is X months old.
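That last point is easy to automate if your docs carry a "last reviewed" date. Here’s a minimal sketch of the idea. The function name `staleness_banner`, the six-month default, and the wording of the warning are all my own assumptions, not a feature of any particular docs tool:

```python
from datetime import date

def staleness_banner(last_reviewed: date, today: date, max_age_months: int = 6) -> str:
    """Return a warning line if the doc is at least max_age_months old, else ""."""
    # Count whole calendar months between the two dates.
    age_months = (today.year - last_reviewed.year) * 12 + (today.month - last_reviewed.month)
    if today.day < last_reviewed.day:
        age_months -= 1  # the current month is not complete yet
    if age_months >= max_age_months:
        return (
            f"WARNING: last reviewed {last_reviewed.isoformat()} "
            f"({age_months} months ago). Double check versions, hostnames, and commands."
        )
    return ""

# Hypothetical dates, purely for illustration.
print(staleness_banner(date(2024, 1, 15), date(2024, 10, 1)))
```

Stick the result at the top of the page and 3am you at least knows how much to trust what follows.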
After-Dark #
It’s 3am. You got the ping. Then you got the pings. Then you wake up as an adenosine-induced sprawling mess, stumbling to your machine. The fans seem to sound different this time. They don’t. That’s your senses coming back to you. How did you even make it to your chair in one piece? You did, but your partner acts as if they didn’t. Great, and now the dog thinks the house is about to be raided.
So now what do you look for? Well… aside from your bed and a better on-call agreement? A meta thought exercise of course!
Searchability: #
The docs are surely on the intranet home page, aren’t they? Literally everything is? The old docs site is, but not the new one.
Oh never mind, that message on Slack had a docs link, let’s just use that.
The main error I see is… wait, what is it again?
WTF is a 08006 error? Okay whatever, let me search that in the handy search box our documentation tool comes with.
Damn. What about “error rate”?
Lovely. Right let me just google it…
Oh, that was easy enough to find
How can that be done better?
In this example doc I’d envisaged, that exact error code doesn’t show up. That’s because the error on the on-call page and the error discussed in the doc are worded a bit differently: the doc refers to a connection_failure instead. The doc needs to ensure that every form of the error message relevant to this discussion is referenced.
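If you want to check this rather than hope for it, a rough sketch could look like the following. 08006 and connection_failure are the PostgreSQL SQLSTATE code and its condition name, and "error rate" is the phrase from the page above. The `ERROR_ALIASES` table and the `missing_aliases` helper are names I’ve made up for illustration:

```python
# Every name the same failure goes by in alerts, logs, and docs.
# This table is a hypothetical example; build yours from what your pager actually says.
ERROR_ALIASES = {
    "08006": ["08006", "connection_failure", "error rate"],
}

def missing_aliases(doc_text: str, code: str) -> list[str]:
    """Return the aliases for `code` that never appear in the doc (case-insensitive)."""
    text = doc_text.lower()
    return [alias for alias in ERROR_ALIASES[code] if alias.lower() not in text]

doc = "Postgres Cluster Troubleshooting: what to do on a connection_failure."
print(missing_aliases(doc, "08006"))  # ['08006', 'error rate']
```

Anything the check returns is a search term that would have left 3am you staring at zero results.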
Separating Quick Wins #
As you start to wake up enough to remember how the English language works, you realize that S has HTF, and that you need to start troubleshooting. You’ve found what the error code refers to, and finally dug up the couple of docs that refer to it. The first refers to a single DB server on WS 2008, so you quickly assume that’s part of the old DC that nobody talks about anymore.
The second seems more relevant: “Postgres Cluster Troubleshooting”. Even better, it was updated only a couple weeks ago!
The Inverse Vacuum of the GIN-Index Singularity
To resolve the phantom latency in your cluster, you must first re-calibrate the flux-capacitor within the shared_buffers to ensure the B-tree doesn’t accidentally undergo spontaneous combustion during a sequential scan. If the WAL logs begin to whisper in ancient Latin, it’s a clear sign that your primary key has achieved sentience and requires a blood sacrifice of three unoptimized subqueries to appease the query planner. Finally, ensure your max_connections is set to a prime number divisible by the current humidity in the server room, otherwise, the ACID compliance will liquefy and leak into the swap partition.
Your bed is looking much more attractive than it used to be all of a sudden.
How can that be done better?
Assume that whoever is reading your troubleshooting doc doesn’t want to be there. If they do want to be there, they’ll read an abstract or any links you put for more information.
At 3am, you’ll want commands to just paste into a console, with two possible outputs given in the doc: one meaning move on to the next step in the troubleshooting flow, and another indicating that you’ve found the issue.
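One way to keep yourself honest about that shape is to treat each troubleshooting step as a small record and render the doc from it. This is only a sketch under my own assumptions: the `Step` fields and `render` function are invented for illustration, the hostname is hypothetical, and the example assumes `pg_isready` is installed and reports "accepting connections" or "no response" as it normally does:

```python
from dataclasses import dataclass

@dataclass
class Step:
    command: str   # exact text to paste into a console
    carry_on: str  # output that means "not this, go to the next step"
    found_it: str  # output that means "this is the issue"
    then: str      # what to do once the issue is found

def render(number: int, step: Step) -> str:
    """Format one step so both outcomes sit right under the command."""
    return "\n".join([
        f"Step {number}",
        f"  Run: {step.command}",
        f"  If you see '{step.carry_on}': go to step {number + 1}.",
        f"  If you see '{step.found_it}': this is the issue. {step.then}",
    ])

# Hypothetical host and follow-up action, purely for illustration.
check_reachable = Step(
    command="pg_isready -h db-primary.internal -p 5432",
    carry_on="accepting connections",
    found_it="no response",
    then="Check whether the primary is up before touching anything else.",
)
print(render(1, check_reachable))
```

If a step can’t be written in this form, that’s usually a sign it belongs in the "dig deeper" part of the doc rather than at the top.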
The other points that I personally look for are pretty much already covered.
- References/Sources: None please, give me the fix.
- Self-awareness: Lots please, assume that it’s 3am and I’m staring at your doc trying to get sensation in my fingers again.
To Summarize and Conclude… #
In any doc you make, you can have all of the verbosity you want, but put the simple, quick wins at the very top. If there’s a place to click around, describe exactly where the thing to click is, and put an image there. If there’s a command to enter, put the exact command. Even more important, put exactly what the expected output is for scenario A, scenario B, and so on. Make sure that in any case where the document would be useful, it can be found easily through search. If that means you need a weird keywords section at the bottom, go for it. It looks weird, but people on-call will be so, so thankful.
I’ve got my own opinions about how often docs should be updated, but I acknowledge that in certain company cultures, you just can’t update them as much as you’d like. That’s another topic I’ll likely discuss at some point. If your documentation store supports comments, please do comment whenever you see something out of date as you’re reading docs.
Now, some of you may ask a very valid question: What about the orgs that almost never have to deal with stuff after hours, is there still any point to all of this? If you don’t have a hiring freeze, yes! Remember, while you might have great organizational knowledge, 3am you probably doesn’t. Neither does the person you just hired. Wouldn’t it be great if they could get up to speed without having to feel like they’re stupid asking questions that are treated as if they’re already answered in the docs but never actually are?