<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>SysAdmin1138 Explains</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/" />
    <link rel="self" type="application/atom+xml" href="https://sysadmin1138.net/mt/blog/atomic.xml" />
    <id>tag:sysadmin1138.net,2022-02-11:/mt/blog//5</id>
    <updated>2026-02-18T15:46:50Z</updated>
    <subtitle><![CDATA[Observations of a Staff Engineer in platform (Linux & AWS; formerly Windows and NetWare in hardware). This started as an academic blog and just kept going. Currently private sector and genderqueer. Not an official blog by any stretch. Really.]]></subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type 5.2.13</generator>

<entry>
    <title>Growth vs throughput mindsets</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2026/02/growth-vs-throughput-mindsets.shtml" />
    <id>tag:sysadmin1138.net,2026:/mt/blog//5.2888</id>

    <published>2026-02-18T15:00:00Z</published>
    <updated>2026-02-18T15:46:50Z</updated>

    <summary>I&apos;ve talked about this one to various managers over the years when the topic of work-scheduling humans comes up. There are two big frameworks for deciding who works which tickets: Throughput: Assign the person who will complete the task fastest...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="culture" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>I've talked about this one to various managers over the years when the topic of work-scheduling humans comes up. There are two big frameworks for deciding who works which tickets:</p>
<ul>
<li><strong>Throughput</strong>: Assign the person who will complete the task fastest to the ticket or story. Helpdesks often work on this model, since they're frequently quite under water.</li>
<li><strong>Growth</strong>: Assign the second most knowledgeable person to the ticket/story (or third, or fourth if you have that many people) so they can get experience solving the problem. This method solves the cross-training problem.</li>
</ul>
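<p>As a toy sketch (all names, tickets, and skill scores here are hypothetical), the two policies differ only in which rank they pick:</p>

```python
# Toy sketch of the two assignment policies; team data is hypothetical.
def assign(ticket: str, team: list[tuple[str, int]], mode: str) -> str:
    """team: (name, domain_skill) pairs; higher skill = faster on this ticket."""
    ranked = sorted(team, key=lambda member: member[1], reverse=True)
    if mode == "throughput":
        return ranked[0][0]  # most knowledgeable finishes fastest
    if mode == "growth":
        # second-most knowledgeable gets the learning opportunity
        return ranked[1][0] if len(ranked) > 1 else ranked[0][0]
    raise ValueError(f"unknown mode: {mode}")

team = [("alice", 9), ("bob", 6), ("carol", 4)]
print(assign("TICKET-1", team, "throughput"))  # alice
print(assign("TICKET-1", team, "growth"))      # bob
```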
<p>Both approaches are valid, and which approach a team takes is balanced against the incentives the team operates under. For a Helpdesk team, which is often under-staffed, throughput tends to be highly prioritized. For a Dev team with on-call responsibilities, growth tends to be the norm, both to make sure problems can be handled by anyone and to stop burning out your senior talent. Security teams tend to be a blend of both: ticket-based security reviews urge throughput mindsets, while incident response prioritizes growth. For Platform teams, which tend to be involved in <em>a lot</em> of incidents because they run the systems that problems happen on, even when the problem wasn't actually the platform itself, throughput mindsets are hard to avoid.</p>
<p>Now add coding LLMs to the mix.</p>
<p>The on-label reason to use agents such as Claude is throughput: spend less time working on tasks to improve your throughput. For Engineering Directors this is a great thing, because you can get throughput advances in your reporting teams without sacrificing engineering maturity, letting you lean on the product roadmap a little harder.</p>
<p>I'd argue that the throughput nature of LLM agents actually sacrifices some growth at population scale when used in currently typical tech companies. By moving from a purely generative mindset when building code to a revisionary mindset, going from writing code to reviewing and editing generated code, engineers spend less time learning problem-spaces in detail. Less time learning means taking more calendar time to build up true domain knowledge. Coding agents are a somewhat different thing from the decades of "just add another abstraction layer and forget about the lower level details" we've been doing since 1970. It is true that most engineers in SaaS companies aren't spending lots of time hand-tuning assembly for execution on their ISA, and doing so is actually very bad practice unless you're in extremely specific problem domains. The difference between <em>yet another abstraction</em> and <em>coding agent</em> is the difference between abstraction and synthesis.</p>
<p><strong>Abstraction</strong> is a distillation of a problem-space into an API, which can be represented by gRPC protobufs, function calls, or whatever. You have inputs, expected actions, and expected results; the abstraction handles the details so you don't have to, and often <em>someone else</em> is responsible for updates over time.</p>
<p><strong>Synthesis</strong> is building novel functionality through combining multiple things to create a new thing. Humans traditionally have been the synthesis engines of code-production, but coding agents are beginning to automate portions of this.</p>
<p>Writing code is <strong>synthesis</strong>. Until the advent of coding agents, your IDE would give you support through syntax highlighting and API short-cuts, reducing the cognitive load of bare syntax problems to let you focus on higher level problems like how a function should handle bad inputs. The IDE gives you support for <em>abstractions</em>, but leaves the synthesis to you.</p>
<p>Once coding agents get added in, you shift from generating code (synthesis) to reviewing automatically generated code whenever the agent creates something for you. This is when cognitive biases come into play. Remember the mindset thing I opened this post with? An engineer under throughput stress is going to be less diligent about checking generated code in detail, and will shift more of their domain learning to troubleshooting test-failures and incident-response. Less domain knowledge will be developed during the initial writing. An engineer with no throughput stress, perhaps doing some for-fun coding, is more likely to take a fine-toothed comb to generated code to learn what it is trying to do, why it works the way it does, and what best-practices it seems to be following; in this throughput-free case the coding agent is a driver of growth.</p>
<p>Coding agents end up <em>magnifying</em> the biases present in the team's environment, enhancing positive feedback loops. The advertised reason to adopt coding agents is to increase developer throughput! The throughput bias is baked into the technology and its marketing, which means organizations requiring coding-agent use need to take steps to enhance negative feedback loops to avoid large-scale degradation in coding quality and incident severity. Use of these agents makes an organization <em>more sensitive</em> to growth/throughput bias shifts prompted by org-chart and quarterly goal shifts.</p>
<p>Combine the bias-enhancing quality of coding agents with the industry-wide retraction in US-based jobs for economic reasons, and you should be seeing a general retraction in overall growth mindsets among US-based engineers.</p>]]>
        
    </content>
</entry>

<entry>
    <title>The tech community needs so much repair</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2025/11/the-tech-community-needs-so-much-repair.shtml" />
    <id>tag:sysadmin1138.net,2025:/mt/blog//5.2887</id>

    <published>2025-11-24T18:00:00Z</published>
    <updated>2025-11-24T18:36:41Z</updated>

    <summary>This BlueSky thread from Cat Hicks is on the money. There actually has not been enough reflection in the tech community about DOGE. All I see are &quot;they&apos;re not engineers&quot; &quot;that&apos;s not what engineering is&quot; I want a thick good really...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="culture" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p><a href="https://bsky.app/profile/grimalkina.bsky.social/post/3m6f7n4fnv22g">This BlueSky thread from Cat Hicks is on the money</a>. </p>
<blockquote>
<p>There actually has not been enough reflection in the tech community about DOGE<br />all I see are "they're not engineers" "that's not what engineering is" I want a thick good really real piece to sink my teeth in about all the parts of this that ARE "like engineering"<br />I would write it except I want to stay safe<br />Engineering can be so much better than this and more than this. It's absurd how much we're resigned to this entire area of human work being held hostage by this kind of culture.<br />I do not believe that the majority of the software people I have worked with all these years sit around thinking, "I want cancer research to be destroyed." We have mountains to climb no doubt, we are trapped in some bad cultures no doubt, but I do not believe this of you for one moment.<br /><em>Our tech community needs SO much more repair, processing, and collective reflection</em>. The people who worked to support government and science were so betrayed by other groups. The people in flashy big tech cos feeling like their values were betrayed. Just so much repair needed here</p>
</blockquote>
<p>I absolutely agree that our tech community needs quite a lot of collective reflection, processing, and work on repair. For a variety of reasons I've done a fair amount of casual research into recovering from trauma of various types. Some of this comes from having lived through the dot-com, great recession, and profitability-crunch contractions in the tech market, and some from good old fashioned life experience outside of the workplace. Trauma is trauma, and we have some pretty good ideas what chronic (ongoing, persistent) trauma does vs acute (single traumatic event) trauma.</p>
<p>Acute trauma: sudden-death layoff.</p>
<p>Chronic trauma: never being allowed to stay on one project long enough to get something good for your performance review, making you constantly fear the next layoff will have your name in the list.</p>
<p>Each of these affects the body and mind differently, and when entire populations are subjected to chronic traumas you get population-level reactions. When those chronic traumas are <em>structural</em>, as is the case with management culture in all of big US-tech, remediation becomes next to impossible. When individuals in this chronically traumatized population can't fix it, you get three big trauma-responses:</p>
<ul>
<li><strong>Cynicism</strong>. I can't fix it, that's just how it is. If you set your expectations low enough, you get to be happily surprised once in a while! It's great.</li>
<li><strong>Heroism</strong>. Rally your fellow workers to overthrow the corrupt system! Who is with me? If not, I'll do what I can alone.</li>
<li><strong>Trauma harder as a way of life.</strong> Obviously, we're in a cut-throat system which means I need to cut throats. QED.</li>
</ul>
<p>Big-tech management likes workers in the trauma harder category, is somewhat tolerant of cynics, and is designed from the org-chart out to prevent the heroes from getting anywhere <em>useful</em> and to redirect their energies in positive directions like burnishing the company's reputation among diversity hires. Do this for a few decades and you have an entire population of highly educated workers who have been trained their entire careers to look at the next sprint/month/quarter's deliverable for your team and kinda ignore what the rest of the company is doing. How your quarter's project to reduce stream latency by 15% at the p95 level relates to the efficiency of data-analysis in ICE isn't always obvious, don't look up or you might find out.</p>
<p>So take these workers who live in this pit of oppression every day for their day-jobs, and put them into an open-source community for their fun-time activity. What happens then?</p>
<ul>
<li>The <strong>cynics</strong> contribute as they're able, expecting corporate malfeasance to show up at any point, often seeing it when it isn't there.</li>
<li>The <strong>heroes</strong> go about building a community that actually is healthy for a change! Whew.</li>
<li>The <strong>trauma harder</strong> crew perpetuate corporate-style power structures because that's how tech works, accidentally reinforcing the cynics and frustrating the heroes.</li>
</ul>
<p>Fixing this sort of thing requires so much work.</p>
<ul>
<li>The <strong>cynics</strong> need to be taught that their defensive pessimism is not appropriate by repairing the structural injustices</li>
<li>The <strong>heroes</strong> need to understand they're not alone and are being listened to through an effort of collective reflection</li>
<li>The <strong>trauma harder</strong> crew needs to realize that alternate structures are viable, and understand the damage they've endured in the existing system through extensive processing</li>
</ul>
<p>You can't do this overnight, it will take a revolution of some kind. Some revolutions are slow, like the heroes getting somewhere with governmental support allowing unionization to creep in higher and higher numbers until union contracts dominate worker terms and conditions rather than Radford salary reports. Some revolutions come quick like whole industries getting nationalized after a socialist junta and remodeled away from oligarchic control. Some come generationally, like China overtaking the US for big-tech exports forcing the oligarchs to look elsewhere to stay fat.</p>
<p><em>Our tech community needs SO much more repair, processing, and collective reflection</em>.</p>]]>
        
    </content>
</entry>

<entry>
    <title>The desperation of Windows</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2025/11/the-desperation-of-windows.shtml" />
    <id>tag:sysadmin1138.net,2025:/mt/blog//5.2886</id>

    <published>2025-11-07T15:00:00Z</published>
    <updated>2025-11-06T21:54:25Z</updated>

    <summary>It is no secret I&apos;m a long time now-former Windows system administrator. The first Windows I professionally administered was Windows NT, and I was with it up through Windows 2008 (and a touch of 2012). I ran into an observation...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="microsoft" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>It is no secret I'm a long time now-former Windows system administrator. The first Windows I professionally administered was Windows NT, and I was with it up through Windows 2008 (and a touch of 2012). I ran into an observation today that made me go "Hmm," and that leads to blog posts.</p>
<blockquote>
<p>@xgranade The common complaint is that you need to be a professional software developer to use Linux. But to use Windows 11 (and have it be usable) you need to be a professional sysadmin who uses terms like "group policy".</p>
<cite><a href="https://aus.social/@natarasee/115501020317612539">https://aus.social/@natarasee/115501020317612539</a></cite></blockquote>
<p>Because this is spot on. I'd argue Linux is usable by non-devs these days, but you still need a tolerance for fiddling and non-standard UI. The Windows side is extremely true. Windows <em>in a corporate context</em> is way more tolerable than Windows in a home context because the corporate context has a group of grumpy Windows sysadmins setting new Group Policy every time a security or feature release comes out to turn down the suck. Those grumpy sysadmins are as grumpy at Microsoft pulling this consent-violating shit as you are, and Windows lets you centrally shut it off (in a corporate context).</p>
<p>Windows has been losing desktop market-share to Apple for years, and the old "Wintel" cash-cow they used to enjoy is not milking as much as it used to. When software makers see flagging revenue and soft user demand, it's time to do demand forcing! And demand forcing leads to shittier experiences as the <em>use more software!</em> message gets ever more aggressive.</p>
<p>The long time followers of this blog have seen enough of this industry to know the cycle when they see it.</p>
<ul>
<li>Darling product stops being darling for whatever reason. Competition, flagging significance, major incident spoiling user trust, private equity takeover, whatever.</li>
<li>The product's Product org has to <em>make number go up</em> in spite of all this so jacks renewal prices.</li>
<li>Renewal prices only jack so far before growth reverses, so Product has to ship new features to justify the price increases.</li>
<li>Bad uptake of new features means features get added to base plans to justify jacking the price.</li>
<li>Bad uptake of now baseline features means more aggressive prompting of those features to drive up Monthly Active Users metrics.</li>
<li>Repeat</li>
</ul>
<p>Do this for enough years and you get the sclerotic Windows 11, full of demand-forcing promptware that pisses your customers off.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Paul Cantrell&apos;s Treatise Against Efficiency</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2025/10/paul-cantrells-treatise-against-efficiency.shtml" />
    <id>tag:sysadmin1138.net,2025:/mt/blog//5.2885</id>

    <published>2025-10-02T14:33:00Z</published>
    <updated>2025-10-02T14:32:14Z</updated>

    <summary>On Fediverse, Paul Cantrell, CompSci professor at Macalester in St. Paul Minnesota, posted the following list: Here&#8217;s the lightning sketch of Paul&#8217;s Treatise Against Efficiency that I&#8217;ve never written:1. Efficiency is asymptotically inefficient: as costs approach zero, the cost of...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="sysadmin" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>On <a href="https://hachyderm.io/@inthehands/115296127359612980">Fediverse</a>, Paul Cantrell, <a href="https://www.macalester.edu/mscs/">CompSci professor at Macalester</a> in St. Paul Minnesota, posted the following list:</p>
<blockquote>
<p>Here&#8217;s the lightning sketch of Paul&#8217;s Treatise Against Efficiency that I&#8217;ve never written:<br /><br />1. Efficiency is asymptotically inefficient: as costs approach zero, the cost of further reducing them approaches infinity.<br />2. Efficiency prioritizes the measurable over the difficult-to-measure.<br />3. Efficiency prioritizes what those in power see (or imagine) over on-the-ground reality.<br />4. Following from 2 and 3, efficiency reduces the amount and quality of information flowing into a human system.<br />5. Efficiency foments institutional inflexibility.<br />6. By removing slack, efficiency causes small failures to cascade more readily and increases the risk of catastrophic failure.<br />7. Following from 4, 5, and 6, efficiency trades small costs for massive risks: from failures, from missed opportunities, and from inability to adjust.<br />8. Efficiency, when pushed, strangles the emergent phenomena that in the long term create all new things of value.<br />9. Thus, although it can be a by-product of evolution, efficiency as a goal in itself strangles evolution.<br />10. Efficiency as a goal strangles joy.</p>
</blockquote>
<p>It turns out I do have time to write an article about that, below the fold.</p>]]>
        <![CDATA[<p><strong>1. Efficiency is asymptotically inefficient: as costs approach zero, the cost of further reducing them approaches infinity.</strong></p>
<p>This is a long-standing axiom of "nines" thinking, where each "nine" of uptime costs rather more than the previous nine. I won't spend much time on this, because it is widely understood. Fighting for better uptime, no matter how good it already was, is a big part of why Site Reliability Engineering came into being: we needed a way to quantify this, and frameworks for deciding when <em>moar uptime!</em> was not actually called for given business realities.</p>
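<p>A quick back-of-the-envelope sketch of that axiom: each additional nine shrinks the yearly downtime budget by a factor of ten, while the engineering needed to stay inside the shrinking budget gets harder:</p>

```python
# Allowed downtime per year for N "nines" of uptime (99%, 99.9%, ...).
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_seconds(nines: int) -> float:
    """Yearly downtime budget at 'nines' nines, e.g. 3 -> 99.9% uptime."""
    return SECONDS_PER_YEAR * 10 ** -nines

for n in range(2, 6):
    print(f"{n} nines: {downtime_seconds(n) / 3600:.3f} hours/year")
```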
<p>In areas where efficiency is not a proxy for uptime, areas lacking an entire engineering practice created to manage the realities surrounding the topic, you get what we had before SRE was everywhere: managers attempting to score points by driving up efficiencies of operation and damn the costs! Improved efficiency will pay for itself! It wasn't a good time in the tech industry.</p>
<p><strong>2. Efficiency prioritizes the measurable over the difficult-to-measure.</strong></p>
<p>This is an axiom of management overall: you can't manage what you can't measure. Efficiency is a management problem, therefore efficiency is based on what you can measure. QED. Managing through vibes is not based in science.</p>
<p><strong>3. Efficiency prioritizes what those in power see (or imagine) over on-the-ground reality.</strong></p>
<p>This is another emergent axiom out of US tech industry management practices. In any organization with more than three levels on the org chart, the people making the big decisions on costs are at a great remove from the people doing the work. In theory, if your quantization is good enough (point 2) this shouldn't matter. Theory is not reality, though; which is why the big tech orgs often have Staff+ engineers (Staff, Principal, Distinguished Engineers, Architects) partnering with higher level managers to improve the odds of ground-truth being an input to the big decision. Improving the odds is not the same as 'shall.' The management context always wins in a conflict with ground-reality.</p>
<p><strong>4. Following from 2 and 3, efficiency reduces the amount and quality of information flowing into a human system.</strong></p>
<p>By restricting information solely to things that can be quantized (point 2) and with a tall enough org chart that the people who make the big decisions are heavily divorced from the people doing the work (point 3) you are making decisions on heavily summarized data based on a restricted subset of all the data, which makes deriving a high quality solution rather harder. This is a fundamental feature of modern management and data practices.</p>
<p><strong>5. Efficiency foments institutional inflexibility.</strong></p>
<p>Efficiency is a form of optimization.</p>
<p>Optimization requires building your organization and management structure to be fit for the given tasks, with no surplus.</p>
<p>Every time the task changes, you have to reoptimize. Reoptimization has a real cost, so don't change the task for little things.</p>
<p>This leads to inflexibility, and missed innovation opportunities; it simply costs too much to experiment with a new thing.</p>
<p><strong>6. By removing slack, efficiency causes small failures to cascade more readily and increases the risk of catastrophic failure.</strong></p>
<p>Management efficiency becomes brittle when one failure overloads other managers with work, even for routine 'failures' like parental leave. Slack lets you buffer these routine failures, and improves the odds of the whole thing avoiding a catastrophic failure when an unplanned failure happens.</p>
<p>Technical efficiency in distributed systems becomes brittle when the ability to scale up to meet demand is compromised, or a changed usage-pattern causes cascading failures due to lack of optimization in the new pattern. Slack, inefficiency, allows better buffering of changed usage patterns, allowing time to reoptimize for the new shape.</p>
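<p>One way to quantify the slack argument (a standard queueing-theory illustration, not something from the original list) is the classic M/M/1 result: time in the system blows up as utilization approaches 100%, so shaving the last bits of slack buys small savings at the cost of explosive sensitivity to load spikes:</p>

```python
# M/M/1 queue: mean time in system = service_time / (1 - utilization).
# As slack (1 - utilization) shrinks, small demand bumps cause huge delays.
def mean_time_in_system(service_time: float, utilization: float) -> float:
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1 - utilization)

for rho in (0.5, 0.8, 0.9, 0.99):
    multiple = mean_time_in_system(1.0, rho)
    print(f"utilization {rho:.2f}: {multiple:.0f}x the bare service time")
```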
<p><strong>7. Following from 4, 5, and 6, efficiency trades small costs for massive risks: from failures, from missed opportunities, and from inability to adjust.</strong></p>
<p>As efficiency increases, slack decreases.</p>
<p>As slack decreases, tolerance of failure decreases. Even some 'planned' failures can prompt greater incidence-rates of major failure. Unplanned failures are far more likely to result in major failure.</p>
<p>As slack decreases, the cost of innovation increases, leading to missed opportunities. Certain innovations become classed as failures as a result.</p>
<p><strong>8. Efficiency, when pushed, strangles the emergent phenomena that in the long term create all new things of value.</strong></p>
<p>Efficiency by necessity creates brittleness and reduced failure tolerance.</p>
<p>Lack of failure tolerance means lack of tolerance for experimentation, which is the foundation of innovation. Such systems are more likely to wall innovation off into a 'safe' part of the organization, and be chronically underfunded.</p>
<p><strong>9. Thus, although it can be a by-product of evolution, efficiency as a goal in itself strangles evolution.</strong></p>
<p>Efficiency is optimization. While evolution can produce some amazingly optimized builds, such builds do not tolerate changed circumstances. Take the oxygen out of the system, and most of life as we know it has a much harder time working. Efficiency increases change <em>intolerance</em>, which slows evolution.</p>
<p><strong>10. Efficiency as a goal strangles joy.</strong></p>
<p>The joy of making is introducing change. A new, more efficient process. A better way of handling changed usage patterns. A fundamental rethink of how features are planned. A better platform system that's less annoying to keep running.</p>
<p>Efficiency increases change intolerance. Change is joy. Efficiency decreases joy.</p>]]>
    </content>
</entry>

<entry>
    <title>ServerFault spam issues</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2025/09/serverfault-spam-issues.shtml" />
    <id>tag:sysadmin1138.net,2025:/mt/blog//5.2884</id>

    <published>2025-09-05T16:59:29Z</published>
    <updated>2025-09-05T17:49:36Z</updated>

    <summary>For the maybe three of you still there, the StackExchange sites have been experiencing a multi-month spam wave from a certain group of spammers who&apos;ve built automation for our sites. SuperUser gets it harder than ServerFault, but we both get...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="stats" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>For the maybe three of you still there, the StackExchange sites have been experiencing a multi-month spam wave from a certain group of spammers who've built automation for our sites. SuperUser gets it harder than ServerFault, but we both get hit. So I did some diving today to figure out just how bad the floods are.</p>
<p>For 2025:</p>
<ul>
<li>Non-deleted questions: 2512</li>
<li>Deleted questions (mostly spam): 9748</li>
<li>One specific spam type: 6863</li>
</ul>
<p>Yes, we're getting way more spam than legitimate questions right now, and have all year.</p>
<p>The Metasmoke people are doing a lot to make sure you rarely see it, but it's still a burden on the moderation staff due to the need to delete the spam-users. That deletion helps the StackExchange anti-spam systems identify bad IPs and other signals, which increases the burden on spam-campaigners like these.</p>
]]>
        
    </content>
</entry>

<entry>
    <title>Where &quot;software telemetry&quot; fits in the observability ecosystem</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2025/06/where-software-telemetry-fits-in-the-observability-ecosystem.shtml" />
    <id>tag:sysadmin1138.net,2025:/mt/blog//5.2883</id>

    <published>2025-06-24T14:00:00Z</published>
    <updated>2025-06-23T17:34:22Z</updated>

    <summary>I wrote a weird little book. I&apos;m still getting royalties, so thank you all for buying, but this book does not easily fit into 2025 concepts of &quot;observability engineering,&quot; so I want to talk about my goals and how it...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="telemetry" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p><a href="https://www.manning.com/books/software-telemetry?a_aid=sysadmin1138&amp;a_bid=c3fefe6b">I wrote a weird little book</a>. I'm still getting royalties, so thank you all for buying, but this book does not easily fit into 2025 concepts of "observability engineering," so I want to talk about my goals and how it still fits.</p>
<p>At the base, I ended up writing a book for Platform teams looking to deliver internally deployed observability systems. That's not quite what I had in mind when I started writing in 2020, but that's where it lives five years later. My actual goal was to write a book that was usable by people in the SaaS industry, but also in businesses where the main user of internally developed software was internal users. The non-SaaS population often gets ignored in book targeting, and I wanted something that would let these people feel <em>seen</em> in a way that reading yet another Observability for Cloud Systems book wouldn't. In 2025, this book is a Platform book.</p>
<p>In 2019 and early 2020, when I was working with Manning on the title and terms, the word "observability" came up. It seems hard to remember in 2025, but "observability" was still a vague term that didn't yet have industry consensus behind it. OpenTelemetry was a thing at the time, but the "metrics" leg of OTel was still in beta, and "logs" was merely roadmapped. In 2025 there are debates around whether the fourth pillar is <em>profiles</em>, performance traces, or <em>errors</em>, which could be stack-dumps or a category of logs. If we had decided to use "observability" instead of "telemetry" the book might have sold better, but the term "telemetry" works better <em>for me</em> because observability is a practice built on top of telemetry signals. I wasn't writing a book about practice, I was writing about herding signals.</p>
<p>Herding signals, not interpreting them.</p>
<p>In 2025, most of the herding is supposed to be done through OpenTelemetry. Or if it isn't OTel, the signals are being herded through other systems like Apache Spark. This is the industry consensus: instrument your code, add attributes in the emitters and collectors, change your vendors as you need to, build dashboards in your vendor's platform. A rewrite of <em>Software Telemetry</em> would reference OTel far more often than I did, but I would still make sure to mention non-OTel styles, because OTel isn't actually supported (or in some cases, a good fit) in certain environments like network telemetry.</p>
<p>Whatever the API format of the signals getting herded, platform engineers need to know the fundamentals of how telemetry systems operate and that's what I wrote about. <em>But also,</em> I wrote about storing those signals, which is something that OpenTelemetry deliberately leaves out as a detail for the implementer. As I extensively wrote about, storing signals and creating a reporting interface is a hard enough part of telemetry that you can build a business around it. In fact, the Observability Tools market in 2025 is valued at around $2.75 Billion, and they all would love for you to use OTel to send them data to store and present.</p>
<p>In the language of my book, OpenTelemetry is an early <em>shipping stage</em> technology. Early because it has no role in storage. OTel arguably has a role in the <em>emitting stage</em> through explicit markup in code itself. OpenTelemetry's impact to the <em>presentation stage</em> is mostly in tagging and attribute schemas and how they get represented in storage. Observability needs to consider every stage, but also the SRE Guide problems of figuring out what to instrument, to which markup standards, following which procedures to ensure reliability. Observability sits on top of telemetry.</p>
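<p>To make the stage language concrete, here is a minimal OpenTelemetry Collector pipeline sketch (the component names are standard Collector ones; the vendor endpoint is a placeholder): signals arrive from the emitting stage, attributes get marked up in flight, and the exporter hands off to whoever runs storage and presentation.</p>

```yaml
receivers:
  otlp:            # shipping stage: accept OTLP from instrumented services
    protocols:
      grpc:

processors:
  attributes:      # add/normalize attributes in flight
    actions:
      - key: deployment.environment
        value: production
        action: insert

exporters:
  otlphttp:        # shipping stage: hand off to a storage/presentation vendor
    endpoint: https://otel.example-vendor.invalid   # placeholder endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes]
      exporters: [otlphttp]
```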
<p>One of the consistent comments I got during the pre-publication reviews was: "I want to know what to track."</p>
<p>My answer was simple: that's not the book I'm writing.</p>
<p>This book is for you, the growth engineer tasked with taking a Kafka topic (or group of topics) of logging data, sent there by OTel, and transforming it in the big Databricks instance alongside all the other business data.</p>
<p>This book is for you, the network engineer tasked with extracting network metrics out of a proprietary system, so you can chart network things in the main engineering dashboarding platform.</p>
<p>This book is for you, the security engineer tasked with extracting security event data out of a cloud provider to put into the SIEM system.</p>
<p>This book is for you, the project manager who has just been given a <em>digital transformation</em> project to revitalize how all the internally developed apps will produce telemetry, and how engineers will observe the system.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Nostalgia over blogging vs the current social media hellscape</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2025/06/nostalgia-over-blogging-vs-the-current-social-media-hellscape.shtml" />
    <id>tag:sysadmin1138.net,2025:/mt/blog//5.2882</id>

    <published>2025-06-03T14:00:00Z</published>
    <updated>2025-06-04T13:09:24Z</updated>

    <summary>A sentiment just crossed Fediverse recently, which is in the vein of &quot;RSS was peak social media, change my mind&quot;. The original post was from https://hachyderm.io/@Daojoan@mastodon.social and is quoted below: RSS never tracked you.Email never throttled you.Blogs never begged for...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="blogger" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>A sentiment just crossed the Fediverse recently, in the vein of "RSS was peak social media, change my mind". The original post was from https://hachyderm.io/@Daojoan@mastodon.social and is quoted below:</p>
<blockquote>
<p>RSS never tracked you.<br />Email never throttled you.<br />Blogs never begged for dopamine.<br />The old web wasn&#8217;t perfect.<br />But it was yours.</p>
<cite>https://mastodon.social/@Daojoan/114587431688413845M</cite></blockquote>
<p>I was there for the rise and fall of blogging, so the rest of this post is me overthinking this particular post.</p>
<p></p>]]>
        <![CDATA[<blockquote>
<p>RSS never tracked you</p>
</blockquote>
<p>RSS never tracked you in the same way HTTP or HTML never tracked you. What tracks you with HTTP/HTML are the rendering engines for HTML and Javascript. I can say with absolute certainty that tracking tech during blogging's height was used to track audience, click-through rates, and other site-engagement metrics. RSS was the loss-leader for clicks on sites. This particular post uses one of those tricks: an opening paragraph promising more, which, if you're reading this through RSS, you will have to click through to read. Back in the day, tracking RSS feeds was done through tracking pixels, because you couldn't trust Javascript to run in RSS readers but HTML rendering generally worked.</p>
<blockquote>
<p>Email never throttled you.</p>
</blockquote>
<p>Email absolutely did. The problem with spinning up a self-hosted newsletter service in 2025 is getting your IP reputation good enough that the big mailbox vendors (Google, Microsoft, Yahoo/AOL) let you deliver. This issue was in its infancy in 2005, but it was already emerging as IP reputation proved to be a low-cost, first-pass anti-spam technique. Mailing list operators ran into it all the time back then as various subscribers moved behind security appliances doing IP reputation.</p>
<blockquote>
<p>Blogs never begged for dopamine.</p>
</blockquote>
<p>On a factual basis, this is false. Sole operators like myself lived for comments, that dopamine hit from people liking what I wrote. If I couldn't get that, I'd enjoy reshare statistics (see the first point about RSS tracking). Many commercial blogging platforms even created RSS feeds for <em>comments on specific articles</em> to make it easier to keep up on discussions. If that isn't dopamine, I don't know what is.</p>
<blockquote>
<p>The old web wasn&#8217;t perfect.<br />But it was yours.</p>
</blockquote>
<p>True, to a point. I tagged this article <em>blogger</em> because that's what I hosted this blog through for most of its first decade: a centralized blogging platform akin to LiveJournal or Dreamwidth back then, or Medium and Substack today. I didn't own the platform. Only after I moved to my own domain and Movable Type did this become true. And yeah, it wasn't perfect, but it was most definitely mine.</p>
<hr />
<p>There is another layer to this post beyond the simply factual, and that's a critique of platforms. Blogging in its heyday was perhaps the last major gasp of the Old Internet, where a bunch of hobbyists created something beautiful and widely adopted, which then got enclosed by commercial interests. Blogging was absolutely <em>not centralized</em>. That lack of centralization meant there was no unitary profit motive looking to drive engagement to increase sales; those motives were limited to individual sites.</p>
<p>Blogging's decline is traceable to two big trends:</p>
<ul>
<li>Google killed Google Reader, which was the dominant RSS reader by a long shot. This death forced a bunch of folks to look for alternate platforms. My stats show my readership plummeted over 50% on death-day.</li>
<li>Twitter and other early social media provided a much shorter dopamine feedback loop than the blog-publish-comments loop.</li>
</ul>
<p>The fact that Google Reader's death functionally killed RSS-based distribution is proof that blogging was already beginning to centralize. Google couldn't control production and distribution, only engagement; and the real money was in controlling all three. Google wanted what Twitter had: the entire production, distribution, and engagement framework on the same site with the same owners and tracking infrastructure. Once they had that, they could start engineering dopamine feedback loops to improve stickiness and engagement; and we get all the algorithmic and dark-pattern pathologies we know and loathe today.</p>
<p>Modern internet users have been trained for a decade and a half to expect social media to involve a single site, or a small number of sites, that do <em>everything</em>, with individual posts short enough to deal with while waiting for a bus or for the kids to walk from school to the car. The old blog+RSS model simply can't compete with this, and better fits as attention-competition for a given user's <em>news</em> media consumption. That Medium and Substack both offer paid subscription options for individual blogs is proof that news media is the competition to blogging, not social media.</p>
<p>The nostalgic exhortation to set up a blog and distribute through RSS is in the same vein as calls to return to IRC and dump Slack/Teams. Some people and groups can do that, but the platform features are different enough that the old tools don't feel feature-complete, and that lack nudges folks back into the commercial walled gardens.</p>
    </content>
</entry>

<entry>
    <title>Storage, DoGE, and cognitive biases against tape</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2025/04/storage-doge-and-cognitive-biases-against-tape.shtml" />
    <id>tag:sysadmin1138.net,2025:/mt/blog//5.2881</id>

    <published>2025-04-07T14:00:00Z</published>
    <updated>2025-06-23T16:23:42Z</updated>

    <summary>The Department of Government Efficiency, Musk&apos;s vehicle, made news by &quot;discovering&quot; the General Services Administration uses tapes, and plans to save $1M by switching to something else (disks, or cloud-based storage). Long time readers of this blog may remember I...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="storage" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>The Department of Government Efficiency, Musk's vehicle, made news by "discovering" the General Services Administration uses tapes, and plans to save $1M by switching to something else (disks, or cloud-based storage). Long-time readers of this blog may remember I used to talk a lot about storage and tape backup. Guess it's time to get my antique Storage Nerd hat out of the closet (this is my first storage post since 2013) to explain why tape is still relevant in an era of 400Gb backbone networks and 30TB SMR disks.</p>
<p>The SaaS revolution has utterly transformed the office automation space. The job I had in 2005, in the early years of this blog, only exists in small pockets anymore. So many office systems have been SaaSified that the old problems I used to blog about, around backups and storage tech, are much less pressing in the modern era. Where those problems remain are places with decades of old file data, dating back to the mid-to-late 1980s, that is still being hauled around. Even when I was still doing this in the late 2000s, the needle was shifting toward large arrays of cheap disks replacing tape arrays.</p>
<p>Where you still see tape being used are offices with policies for "off-site" or "offline" storage of key office data. A lot of that is <em>also</em> done on disk these days, but some offices kept their tape libraries. The InfoSec space is keen to point out that you can't crypto-locker an offline tape, so offline tape is a useful tool in recovering from a ransomware incident. I suspect a lot of what DoGE found was in this category of offices retaining tape infrastructure. Is disk cheaper here? Marginally; the true savings will be much less than the $1M headline rate.</p>
<p>But there is another area where tape continues to be the economical option, and it's another area DoGE is going to run into: large scientific datasets.</p>
<p>To explain why, I want to use a contrasting example: A vacation picture you took on an iPhone in 2011, put into Dropbox, shared twice, and haven't looked at in 14 years. That file has followed you to new laptops and phones, unseen, unloved, but <em>available</em>. A lot goes into making sure it's available.</p>
<p>All the big object-stores like S3, and file-sync-and-share services (like Dropbox, Box, MS Live, Google Drive, Proton Drive, etc.), use a common architecture, because this architecture has proven reliable at avoiding visible data-loss:</p>
<ul>
<li>Every uploaded file is split into 4KB blocks (the size is relevant to disk technology, which I'm not going into here)</li>
<li>Each block is written between 3 and 7 times to disk in a given datacenter or region; the exact replication factor changes based on service and internal realities</li>
<li>Each block is replicated to more than one geographic region as a disaster resilience move, generally at least 2, often 3 or more</li>
</ul>
<p>The end result of the above is that the 1MB vacation picture is written to disk 6 to 14 different times. The nice thing about the above is you can lose an entire rack-row of a datacenter and not lose data; you might lose 2 of your 5 copies of a given block, but you have 3 left to rebuild, and your other region still has full copies.</p>
<p>But I mentioned this 1MB file has been kept online for 14 years. Assuming an average disk lifespan of 5 years, each block has been migrated to new hardware 3 times in those years. That means each 4KB block of that file has been resident on between 24 and 42 hard drives; or more, if your provider replicates to more than 2 discrete geographic regions. Those drives have been spinning and using power (and therefore requiring cooling) the entire time.</p>
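<p>The copy-count arithmetic above can be sketched in a few lines of Python. The replication factor (3 to 7) and the two-region minimum are the figures from this post, not any particular vendor's published numbers:</p>

```python
def copies_on_disk(replication_factor: int, regions: int) -> int:
    """Simultaneous on-disk copies of each block: per-region
    replicas times the number of geographic regions."""
    return replication_factor * regions

# Low end: 3 replicas in each of 2 regions; high end: 7 replicas.
low = copies_on_disk(3, 2)
high = copies_on_disk(7, 2)
print(low, high)  # 6 14 -- matching the "6 to 14 different times" above
```

Every one of those copies sits on a powered, spinning drive for the file's whole lifetime, which is the carrying cost the rest of this post contrasts with tape.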
<p>These systems need to go to all of this effort because they need to be sure that all files are available all the time, when you need them, where you need them, as fast as possible. If a person in that vacation photo retires, and you suddenly need <em>that picture</em> for the Retirement Montage at their going-away party, you don't want to wait hours for it to come off tape. You want it now.</p>
<p>Contrast this to a scientific dataset. Once the data has stopped being used for <em>Science!</em> it can safely be archived until someone else needs to use it. This is the use-case behind AWS S3 Glacier: you pay a lot less for storing data, so long as you're willing to accept delays measurable in hours before you can access it. This is also the use-case where tape shines.</p>
<p>A lab gets done chewing on a dataset sized at 100TB, which is pretty chonky for 2011. They send it to cold storage. Their IT section dutifully copies the 100TB dataset onto LTO-5 tapes at 1.5TB per tape, for a stack of 67 tapes, and removes the dataset from their disk-based storage arrays.</p>
<p>Time passes, as with the Dropbox-style data. LTO drives can read tapes from one or two generations prior. Assuming the lab IT section keeps up on tape technology, it would be the advent of LTO-7 in 2015 that would prompt a great restore-and-rearchive effort for all LTO-5 and earlier media. LTO-7 can do 6TB per tape, for a much smaller stack of 17 tapes.</p>
<p>LTO-8 changed this, with only a one-generation lookback. So when LTO-8 comes out in 2017 with a 9TB capacity, the restore/rearchive effort runs again, shrinking our stack of tapes from 17 to 12. LTO-9 comes out in 2021 with 18TB per tape, and the stack reduces to 6 tapes to hold 100TB.</p>
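<p>The stack sizes above are just the dataset divided by per-cartridge capacity, rounded up. A quick sketch using the native capacities quoted in this post:</p>

```python
import math

DATASET_TB = 100  # the lab's 2011 dataset
# Native capacity per cartridge, in TB, as quoted above.
LTO_CAPACITY_TB = {"LTO-5": 1.5, "LTO-7": 6, "LTO-8": 9, "LTO-9": 18}

def stack_size(dataset_tb: float, tape_tb: float) -> int:
    """Tapes needed to hold the dataset: round up, since a
    partially filled cartridge is still a whole cartridge."""
    return math.ceil(dataset_tb / tape_tb)

for gen, cap in LTO_CAPACITY_TB.items():
    print(gen, stack_size(DATASET_TB, cap))
# LTO-5 67, LTO-7 17, LTO-8 12, LTO-9 6 -- the stacks described above
```

Each generation jump roughly halves or better the shelf space the dataset occupies, which is why the periodic restore/rearchive effort pays for itself.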
<p>All in all, our cold dataset had to relocate to new media three times, same as the disk-based stuff. However, keeping stacks of tape in a climate-controlled room is vastly cheaper than a room of powered, spinning disk. The actual reality is somewhat different: the few data-archive people I know mention they do great restore/rearchive runs about every 8 to 10 years, largely driven by changes in drive connectivity (SCSI, SATA, FibreChannel, Infiniband, SAS, etc.), OS and software support, and corporate purchasing cycles. Keeping old drives around for as long as possible is fiscally smart, so the true number of recopy events for our example data is likely one.</p>
<p>So another lab wants to use that dataset and puts in a request. A day later, the data is on a disk-array for usage. Done. Carrying costs for that data in the intervening 14 years are significantly lower than the always available model of S3 and Dropbox.</p>
<p>Tape: still quite useful in the right contexts.</p>
<p></p>]]>
        
    </content>
</entry>

<entry>
    <title>Applied risk management</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2025/03/applied-risk-management.shtml" />
    <id>tag:sysadmin1138.net,2025:/mt/blog//5.2880</id>

    <published>2025-03-31T14:00:00Z</published>
    <updated>2025-03-29T21:25:34Z</updated>

    <summary>I&apos;ve been in the tech industry for an uncomfortable amount of time, but I&apos;ve been doing resilience planning the whole time. You know, when and how often to take backups, segueing into worrying about power diversity, things like that. My...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="disasters" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>I've been in the tech industry for an uncomfortable amount of time, but I've been doing resilience planning the whole time. You know, when and how often to take backups, segueing into worrying about power diversity, things like that. My last two years at Dropbox gave me exposure to how that works when you have multiple datacenters. It gets complex, and there are enough moving parts you can actually build <em>models</em> around expected failure rates in a given year to better help you prioritize remediation and prep work.</p>
<p>Meanwhile, everyone in the tech-disaster industry peeps over the shoulders of environmental disaster recoveries like hurricanes and earthquakes. You can learn a lot by watching the pros. I've talked about some of what we learned, mostly it has been procedural in nature:</p>
<ul>
<li><a href="https://sysadmin1138.net/mt/blog/2024/11/incident-response-programs.shtml">Incident response programs</a>, discussing categorization of incidents</li>
<li><a href="https://sysadmin1138.net/mt/blog/2024/07/getting-people-to-think-about-disaster-recovery.shtml">Getting people to think about disaster recovery</a>, talking about how security-type disasters have slightly different recovery phases that requires changing your thinking</li>
<li><a href="https://sysadmin1138.net/mt/blog/2024/06/correcting-a-non-compliant-teams-reliability.shtml">Correcting a non-compliant team's reliability</a>, going into helping shift mindsets as a peer, not a manager</li>
</ul>
<p>Since then, the United States elected a guy who wants to be dictator, and a Congress who seems willing to let it happen. For those of us in the disliked minority of the moment, we're facing concerted efforts to roll back our ability to exist in public. That's risk. Below the fold I talk about using what I learned from IT risk management and how I apply those techniques to assess my own risks. It turns out building risks for "dictatorship in America" can't rely on prior art as much as risks for "datacenter going offline," which absolutely has prior art to draw on, and even incident rates to factor in.</p>]]>
        <![CDATA[<p>The risk analysis process in overview is:</p>
<ol>
<li>Generate a list of risks through brainstorming and fears</li>
<li>Generate a list of responses to risks</li>
<li>Investigate responses for dependencies, determining any planning and execution lead-times</li>
<li>Apply responses to the list of risks; you will absolutely change your mind about which risks are worth considering</li>
<li>For each risk with a response requiring a long lead time (such as "relocate out of state"), determine the dependent chain leading to that risk. This will eliminate many risks from the brainstorm list.</li>
<li>For the remaining risks, group together risks that share common dependencies, or that are outright duplicates in different wording</li>
<li>For grouped risks with long lead times, build dependent event chains for each, with an eye towards which events to watch out for as a herald that it is time to <em>plan</em> or <em>go</em>.</li>
<li>Build an aggregated list of events to keep an eye on before you start actively disaster-planning</li>
</ol>
<p>That's a lot of steps, but I'll walk you through some examples to help you figure out what I'm talking about. As computer professionals in the disaster recovery space, it's extremely easy to jump to the worst-case scenario ("organized murders of my minority in my area," also known as an organized genocide) and base all of your planning around that. Doing so will turn you into a giant ball of anxiety, so going through the above process will give your limbic system a break.</p>
<h2>1. Generate a list of risks</h2>
<p>You probably already have specific fears, like the "organized murders of my minority in my area" one I used above. Write that down, and write everything else that may push you out of "ignore/fight" and into "do something." This will be emotional work, just get things on the page. We'll qualify the risks later on, but we need a starting set to even begin the analysis.</p>
<h2>2. Generate a list of responses</h2>
<p>For the analysis I ran regarding when "flee" becomes a good idea, I came up with the following response menu:</p>
<ul>
<li>Endure it - It's a fight I can afford to lose</li>
<li>Fight it - It's a problem, and it's important that I stand up and take action to make it stop</li>
<li>Domestic move - My local area is unsafe, it's time to move somewhere else</li>
<li>International move - Nowhere in the country is safe, time to get out however possible</li>
<li>Overnight evacuation - Specific threats have been received, time to be somewhere else within 12 hours</li>
</ul>
<h2>3. Investigate response dependencies</h2>
<p>Most of this work is identifying which responses have lead-times, and determining how much time you need. This is a dependency analysis, so it's safe to get into the weeds a bit regarding procedures and lead-times. For international relocation there ended up being a huge menu of options, multiple for each country, and this one response took the majority of my time to qualify.</p>
<p>You're looking for two lead-times:</p>
<ol>
<li>Planning time, how long it takes to set the conditions up to do the response</li>
<li>Execution time, how long it takes to execute the response</li>
</ol>
<p>I also learned a few things. For the Domestic move option, there is a difference between the ASAP and Planned cases, and between same-state and different-state relocations. An ASAP move involves moving into short-term housing, such as a hotel, and then deciding whether house-hunting or apartment-hunting needs to happen next. For a planned move, you can do the apartment/house hunt before executing the move. If you're employed, your employer may place restrictions on interstate moves! Find that out, because it'll affect your ability to relocate to another state.</p>
<blockquote class="callout">
<p class="callout-header">Interstate moves as a remote employee</p>
<p>Due to how tech employers in the US do geobanding, each state and city is sorted into typically three geobands. The geobands are in broad agreement due to the common use of Radford services to assess market salary bands, and run from Tier 1 (SF/NYC) to Tier 3 (cheapest places). Employers typically want you to move within a band. Or, better for them, down a band, which lets them pay you less. You might even get relocation cost-support by moving down a band.</p>
<p>Many companies will require approval to move to a new state. This has to do with how employment taxes are computed in America. Your employer needs to have an entity in your state in order to pay you. Smaller employers may be willing to set up an entity just for you. Larger ones are unlikely to do so.</p>
</blockquote>
<p>International moves have a menu of possibilities, especially if you're a late career technical professional who has been reaping great big RSUs for a while and has significant taxable savings:</p>
<ul>
<li><strong>Employer sponsored</strong> - The easiest method, but requires you working for an employer with international operations who is willing to let you do it. This is also the most reliable method to get into places like Canada, Australia, or Ireland.</li>
<li><strong>Digital nomad visa</strong> - Designed to allow people to visit a country while working remotely for a few months, before moving on. This is needed because it is illegal to do work, even for a foreign employer, while in-country on a tourist visa. Most useful for people doing 1099 work, as that puts the tax burden on you. In theory you could hop to different countries when your visa expires, but it makes your tax situation extremely tricky.</li>
<li><strong>Investment visa</strong> - You drop a large amount of money, ranging from €250K (Latvia) to several million, and get a long-term residency permit. This lets you work in your target country. This is the same vehicle used by Russian and Chinese oligarchs, so there is absolutely stigma attached to using this option.</li>
<li><strong>Retirement visa</strong> - You prove enough passive income (rents-received, interest earned, pensions earned) and they give you a long term residency permit. This <em>does not</em> let you work in your target country. The goal is to prove to your target country that you earn enough to not require social supports. Provable amounts range from $15K/year (Panama) to €100K/year (Ireland) in passive income. Many of the countries that have investor visa programs also have retirement visa programs.</li>
</ul>
<p>Your financial situation will tell you how many of these are viable. This may require you to change employer in order to gain the possibility of an internal transfer to another country. If so, add this to your "prep" time. If the investment or retirement visas are viable, now is the time to look at countries and where you can go and how many of them will allow you to be you.</p>
<p>Of the above, the Employer Sponsored option is probably the fastest to execute once things get going. Your employer probably has professionals to work on obtaining the visa. Even so, it could take up to 6 months for you and your stuff to get to your target country. The Investment Visa option requires vetting by your target country's consulate, which can take up to a year. Or more, in the case of Portugal, which is seeing a huge influx of applicants. Retirement visas are easier to get, but they require enough passive income that fewer people under IRA mandatory-distribution age will qualify.</p>
<blockquote class="callout">
<p class="callout-header">The right-wing rise is everywhere that speaks English</p>
<p>Everywhere English-speaking is having a bad time of it right now. Everywhere. The UK fell to right wing extremism before the US did. New Zealand has a right wing government. India has had a right-wing government for years. Australia has major anti-immigrant sentiment. Canada was following the US until the US pissed them off enough they decided to fight, so there may be hope there; but they're also increasingly likely to cut off immigration from the US as a result of the hostilities.</p>
<p>If you're like me, and a member of the trans community, the US of March 2025 is still one of the best places around to be. That spot is actively degrading, but be aware that where you jump to may not be better. Your achievable hope may simply be "won't get as bad," and your destination may be actively worse until the US catches up.</p>
</blockquote>
<p>The most extreme response, Overnight Evacuation, is a special one because it tells you what the execution time needs to be: 12 hours. The big work for Overnight Evacuation is planning and prep, such as maintaining go-bags and having a plan for where you can run. Go-bag maintenance takes time, especially if you're middle-aged and have a lot of prescribed medications. Depending on your circumstances, you may have to run evacuation drills to refine your procedures. You'll also likely need to decide when to start actively maintaining the 12-hour evacuation capability, because sustaining that readiness for two-plus years will be a lot of work.</p>
<p>After assessment, my response table changed somewhat drastically.</p>
<table><caption>Revised risk-response table with planning and execution periods marked</caption>
<thead>
<tr><th>Response</th><th>Preparation time</th><th>Execution time</th></tr>
</thead>
<tbody>
<tr>
<td>Endure it</td>
<td>N/A</td>
<td>N/A</td>
</tr>
<tr>
<td>Fight it</td>
<td>N/A</td>
<td>N/A</td>
</tr>
<tr>
<td>ASAP domestic move</td>
<td>2 weeks</td>
<td>2 weeks</td>
</tr>
<tr>
<td>Planned domestic move </td>
<td>1 month</td>
<td>1 to 3 months</td>
</tr>
<tr>
<td>Work-visa move</td>
<td>1 year (job hunt)</td>
<td>2-6 months</td>
</tr>
<tr>
<td>Investor/retirement visa move</td>
<td>3 months</td>
<td>6-18 months</td>
</tr>
<tr>
<td>Evacuate overnight</td>
<td>1 month + regular refreshes</td>
<td>12 hours</td>
</tr>
</tbody>
</table>
<p>The international moves have incredibly long lead-times! In the absence of refugee lanes for Americans fleeing dictatorship, an international relocation simply can't be the response to an emergent problem. Knowing this tells us that any risk with "international move" as a response needs to be analyzed for herald events that will happen 6 to 18 months before the risk itself arrives.</p>
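<p>One way to see the lead-time problem is to total each response's preparation and worst-case execution time from the table above. The encoding below is mine (months, with the table's ranges reduced to their maximums), but the figures are straight from the table:</p>

```python
# Prep and worst-case execution times, in months, from the revised
# risk-response table. The dict encoding is my own illustration.
RESPONSES = {
    "ASAP domestic move":       {"prep": 0.5, "execute_max": 0.5},
    "Planned domestic move":    {"prep": 1,   "execute_max": 3},
    "Work-visa move":           {"prep": 12,  "execute_max": 6},
    "Investor/retirement visa": {"prep": 3,   "execute_max": 18},
}

def total_lead_months(name: str) -> float:
    """Worst-case months between deciding to respond and being done."""
    r = RESPONSES[name]
    return r["prep"] + r["execute_max"]

for name in RESPONSES:
    print(name, total_lead_months(name))
# The international options come to 18 and 21 months end-to-end,
# which is why their herald events must fire a year or more out.
```

This is exactly why the dependency-chain work in the next step matters: responses this slow can only ever be triggered by events far upstream of the risk itself.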
<h2>4. Apply responses to risks</h2>
<p>Now that we've gone through all the work of qualifying our responses and understanding lead times, we look at our list of risks and apply responses to them. This is a judgment call based on vibes. If you want to get scientific about each risk and diagram out dependency chains, do whatever makes your brain happy.</p>
<p>One thing to note: risks rated Evacuate Overnight are likely part of a rising <em>pattern</em> of risks. Step 5 is where we analyze these dependencies, and Step 6 is where we start grouping risks. Don't get too hung up on dependencies here; you want to get your brainstorm list completed. Feel free to evict risks that you no longer feel merit a response, or add new risks should thinking about others remind you of ones you missed.</p>
<h2>5. Determine dependent chain for long-lead risks</h2>
<p>For the risks with responses having a long lead time, we need to analyze those risks to identify what sort of events are upstream of the risk: things that need to happen before our risk happens. Such analysis tells us which prelude events to track before entering planning. For example, let's look at the extreme option I opened with, "organized murders in my area," which is an Evacuate Overnight risk. One dependent chain could be:</p>
<ol>
<li>President repeatedly calls for murders</li>
<li>Right-wing media echoes calls for murders</li>
<li>Right-wing militias begin harassment campaigns</li>
<li>Regular media reports calls for murders, without condemning murders</li>
<li>Organized murders in other areas</li>
<li>Organized murders in my area</li>
</ol>
<p>This list demonstrates an escalation from a call for action to action in my area. This also elides probabilities, which is analysis you will have to do yourself. America is a huge place. The areas most at risk for a rapid escalation from 1 to 6 are deeply red areas in Republican controlled states, where State authorities are less likely to investigate murders of disliked minorities. The areas least at risk are blue cities in Democrat held states, where State authorities are quite likely to investigate coordinated murders; making it harder for murder militias to safely operate.</p>
<p>For blue-state residents, violence will rise in neighboring red states before it starts leaking across borders. This affects timing. After judging the dependent chain, we can assign responses to higher levels of the chain.</p>
<ol>
<li>President repeatedly calls for murders</li>
<li>Right-wing media echoes calls for murders</li>
<li>Right-wing militias begin harassment campaigns</li>
<li>Regular media reports calls for murders, without condemning murders, <strong>plan domestic relocation to bluer state</strong></li>
<li>Organized murders in other areas</li>
<li>Organized murders in my area, <strong>evacuate overnight</strong></li>
</ol>
<p>We have a new risk: <em>regular media reports calls for murders, without condemning the murders</em>. This new risk is a herald of the much more severe <em>murders in my area</em> risk.</p>
<p>Repeat this exercise for each of your long-lead risks. Now is the time to get into the weeds.</p>
<p>One risk on my brainstorm list was "ACA is repealed." It was removed after I found discussions about how the right now widely views the ACA as settled (their electoral drubbing in 2018 was earned by being too aggressive about repealing the ACA), with attacks focusing instead on Medicaid block-grants. Such grants don't affect me, so "ACA is repealed" was dropped from the risk list. Taking its place was "Coverage of trans-related care banned for ACA plans," the court cases for which are already underway.</p>
<h2>6. Group risks with common dependencies</h2>
<p>Your dependency chain analysis likely revealed a few risks that share a dependency chain, or are precursors to risks demanding a more severe response. In my analysis I was able to condense the huge brainstorm list down to four risk-types, eliminating several more risks that had survived Step 5. My risks were pretty broad, but cover similar themes.</p>
<ul>
<li>Coordinated murders of trans people in my area</li>
<li>Militia-style groups disappearing trans people in my area</li>
<li>Mass secret-order detention of citizens</li>
<li>My state goes fascist in the 2026 election</li>
</ul>
<p>Murder, kidnapping for ransom/torture, state disappearance squads, and my state losing safe-haven status. This was a far shorter list than I started with. This list is also extremely dark. The good thing is we're nowhere near anything on this list as of March 2025. Now the work turns to figuring out the joined dependencies and coming up with prelude events.</p>
<h2>7. Build dependency chains for each grouped risk</h2>
<p>This repeats Step 5, but using the merged risks. I found diagramming flow-charts to be quite helpful. While doing your work, consider what events would constitute the trigger for the following stages of response:</p>
<ul>
<li><strong>Plan</strong> - Time to pick a destination and start working the details of what a relocation would require</li>
<li><strong>Prep</strong> - Time to start working through reversible steps of the plan, such as starting conversations with lawyers, consulates, and employers</li>
<li><strong>Go</strong> - Execute the move</li>
</ul>
<h2>8. Build the aggregated list of events to watch out for</h2>
<p>Now that you've done all of this work, it's time to build the summary. This aggregated list is the one to check back on once a week or so to see if any lines have been crossed. My list is highly personal, but has 10 different events on it. We are nowhere near any of the 10 events I identified as significant heralds of greater danger.</p>
<p>Then refresh your analysis once a month to square your analysis with what has happened in the previous four months. My March analysis involved a major rewrite because my February one failed to identify the <em>mass secret-order detention of citizens</em> risk, in spite of it being the number one thing dictators want above all else.</p>
<hr />
<p>Whenever you feel an anxiety spike watching the news, and the urge to <em>flee</em> begins to take root, review your list of events. It calms me down, and once in a while I make notes for the next monthly refresh. This methodical approach absolutely did get me out of a doom-spiral, all thanks to a career in planning for disasters.</p>
]]>
    </content>
</entry>

<entry>
    <title>Blog Questions Challenge 2025</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2025/01/blog-questions-challenge-2025.shtml" />
    <id>tag:sysadmin1138.net,2025:/mt/blog//5.2877</id>

    <published>2025-01-20T15:00:00Z</published>
    <updated>2025-01-17T15:33:17Z</updated>

    <summary>Thanks to Ben Cotton for sharing. Why did you start blogging in the first place? I covered a lot of that in 20 years of this nonsense from about a year ago. The quick version is I was charged with...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="opinion" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>Thanks to <a href="https://funnelfiasco.com/blog/2025/01/17/blog-question-challenge-2025/">Ben Cotton for sharing</a>.</p>
<p><strong>Why did you start blogging in the first place?</strong></p>
<p>I covered a lot of that in <a href="https://sysadmin1138.net/mt/blog/2024/01/20-years-of-this-nonsense.shtml">20 years of this nonsense</a> from about a year ago. The quick version is I was charged with creating a "Static pages from your NetWare home directory" project and needed something to test with, so here we are. That version was done with Blogger before the Google acquisition, when they still supported publish-by-ftp (which I also had to set up as part of the same project).</p>
<p><strong>What platform are you using to manage your blog, and why do you use it?</strong></p>
<p><span>When Blogger got rid of the publish-by-ftp method, I had to move. I came to my own domain and went looking for blogging software. On advice from an author I like, I kept the slashdot effect in mind: if an article got an order of magnitude more traffic, it shouldn't melt the server it was on. So I wanted something relatively lightweight, which at the time was Movable Type. Wordpress required database hits for every webpage, which didn't seem to scale. <br /></span></p>
<p><span>I stuck with it because Movable Type continues to do the job quite well, and remains ergonomic for me. I turned off comments a while ago, as they were an anti-spam nightmare I needed recency to solve. Movable Type now requires a thousand dollars a year for a subscription, which pencils out to about $125 per blog post at my current posting rate. Not worth it.<br /></span></p>
<p><strong>Have you blogged on other platforms before?</strong></p>
<p><span>Like just about everyone my age, I was on Livejournal. I don't remember if this blog or LJ came first, and I'm not going to go check. I had another blog on Blogger for a while, about local politics. It has been lost to time, though it is still visible on archive.org if you know where to look.<br /></span></p>
<p><span><strong>How do you write your posts?</strong><br /></span></p>
<p><span>Most are spur of the moment. I have a topic, and time, and remember I can be long-form about it. Once in a while I'll get into something on social media and realize I need actual wordcount to do it justice, so I do it here instead. The advent of twitter absolutely slowed down my posting rate here!<br /></span></p>
<p><span>Once I have the words in, I schedule a post for a few days hence.<br /></span></p>
<p><span><strong>When do you feel most inspired to write?</strong><br /></span></p>
<p><span>As with all writers, it comes when it comes. Sometimes I set out goals and I stick to them. But blogging hasn't been a focus of mine for a long time, so it's entirely whim. I do know I need an hour or so of mostly uninterrupted time to get my thoughts in order, which is hard to come by without arranging for it.<br /></span></p>
<p><strong>Do you normally publish immediately after writing, or do you let it simmer a bit?</strong></p>
<p>As mentioned above, I use scheduled-post, typically for 9am, unless I've got something spicy and don't care. That is rare; I've learned that spicy takes absolutely need a cooling-off period. I've pulled posts after writing them because I realized they didn't actually need to get posted, I merely needed to write them.</p>
<p><strong>What's your favorite post on your blog?</strong></p>
<p>That's changed a lot over the years as I've changed.</p>
<ul>
<li>For a long time, I was proud of my <a href="https://sysadmin1138.net/mt/blog/2010/03/know-your-io.shtml">Know your IO</a> series from 2010. That was prompted by a drop-by conversation from one of our student workers who had a question about storage technology. I infodumped for most of an hour, and realized I had a blog series. This is still linked from my sidebar on the right.</li>
<li>From recent history, the post <a href="https://sysadmin1138.net/mt/blog/2023/11/why-i-dont-like-markdown-docs-in-a-git-repo-as-documentation.shtml">Why I don't like Markdown in a git repo as documentation</a> is a still accurate distillation of why I seriously dislike this reflexive answer to workplace knowledge sharing.</li>
<li>This post about <a href="https://sysadmin1138.net/mt/blog/2020/01/lost-history-service-pack-1.shtml">the lost history of why you wait for the first service pack before deploying anything</a> is me bringing old-timer points of view to newer audiences. The experiences in this post are drawn directly from where I was working in 2014-2015. Yes Virginia, people still do ship shrink-wrap software to Enterprise distros. Some of you are painfully aware of this.</li>
</ul>
<p>I'm not stopping blogging any time soon. At some point the dependency chain for Movable Type will rot and I'll have to port to something else, probably a static site generator. I believe I'm spoiled for choice in that domain.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Incident response process for the big stuff</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2025/01/incident-response-process-for-the-big-stuff.shtml" />
    <id>tag:sysadmin1138.net,2025:/mt/blog//5.2876</id>

    <published>2025-01-09T15:00:00Z</published>
    <updated>2025-03-29T21:25:45Z</updated>

    <summary>Back in November I posted about how to categorize your incidents using the pick-one list common across incident automation platforms. In that post I said: A few organizations go so far as to have a fully separate process for the...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="disasters" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p><a href="https://sysadmin1138.net/mt/blog/2024/11/incident-response-programs.shtml">Back in November I posted about how to categorize your incidents using the pick-one list common across incident automation platforms</a>. In that post I said:</p>
<blockquote>A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different <em>process</em> overall. More on this in a later blog-post.</blockquote>
<p>This is the later blog post.</p>
<p>The SaaS industry as a whole has been referring to the California Fire Command (<a href="https://en.wikipedia.org/wiki/Incident_Command_System">now known as the Incident Command System</a>) model for inspiration on handling technical incidents. The basic structure is familiar to any SaaS engineer:</p>
<ul>
<li>There is an Incident Commander who is responsible for running the whole thing, including post-incident processes</li>
<li>There is a Technical Lead who is responsible for the technical response</li>
</ul>
<p>There may be additional roles available depending on organizational specifics:</p>
<ul>
<li>A Business Manager who is responsible for the customer-facing response</li>
<li>A Legal Manager who is responsible for anything to do with legal</li>
<li>A Security Lead who is responsible for heading security investigations</li>
</ul>
<p>Again, familiar. But truly large incidents put stress on this model. In a given year the vast majority of incidents experienced by an engineering organization will be the grass-fire variety that can be handled by a team of four people in under 30 minutes. What happens when a major event arrives?</p>
<p>The example I'm using here is a private information disclosure by a hostile party using a compromised credential. Someone not employed by the company dumped a database they shouldn't have had access to, and that database involved data that requires disclosure in the case of compromise. Given this, we already know some of the workstreams that incident response will be doing once this activity is discovered:</p>
<ul>
<li>Investigatory work to determine what else the attacker gained access to and fully define the scope of what leaked</li>
<li>Locking down the infrastructure to close the holes used by the attacker for the identified access</li>
<li>Cycling/retiring credentials possibly exposed to the attacker</li>
<li>Regulated notification generation and execution</li>
<li>Technical remediation work to lock down any exploited code vulnerabilities</li>
</ul>
<p>An antiseptic list, but a scary one. The moment the company officially notices a breach of private information, legislation world-wide starts timers on when privacy regulators or the public need to be informed. For a profit driven company, this is <em>admitting fault in public</em> which is something none of them do lightly due to the lawsuits that will result. For publicly traded companies, stockholder notification will also need to be generated. Incidents like this look very little like an availability SLA breach SEV of the kind that happens 2-3 times a month in different systems.</p>
<p>Based on the rubric I showed back in November, an incident of this type is of <strong>Critical</strong> urgency due to regulated timelines, and will require either <strong>Cross-Org</strong> or <strong>C-level</strong> response depending on the size of the company. What's more, the need to figure out where the attacker went blocks later stages of response, so this response process will actually be a 24 hour operation and likely run several days. No one person can safely stay awake for 4+ days straight.</p>
<p>The Incident Command Process defines three types of command structure:</p>
<ul>
<li><strong>Solitary command</strong> - where one person is running the whole show</li>
<li><strong>Unified command</strong> - where multiple jurisdictions are involved and they need to coordinate, and also to provide shift changes through rotating who is the Operations Chief (what SaaS folk call the Technical Lead)</li>
<li><strong>Area command</strong> - where multiple incidents are part of a larger complex, the Area Commander supports each Incident Command</li>
</ul>
<p>Incidents of the scale of our private information breach lean into the <strong>Area Command</strong> style for a few reasons. First and foremost, there are discrete workstreams that need to be executed by different groups, such as the security review to isolate scope, building regulated notifications, and cycling credentials. All those workstreams need people to run them, and those workstream leads need to report to incident command. That looks a lot like Area Command to me.</p>
<p>If your daily incident experience is 4-7 person team responses, how ready are you to be involved in an Area Command style response? Not at all.</p>
<p>If you've been there for years and have seen a few multi-org responses in your time, how ready are you to handle Area Command style response? Better, you might be a good workstream lead.</p>
<p>One thing the Incident Command Process makes clear is that Area Commanders <em>do not</em> have an operational role, meaning they're not involved in the technical remediation. Their job is coordination, logistics, and high-level decision making across response areas. For our pretend SaaS company, a good Area Commander will be:</p>
<ul>
<li>Someone who has experience with incidents involving legal response</li>
<li>Someone who has experience with large security response, because the most likely incidents of this size are security related</li>
<li>Someone who has experience with incidents involving multiple workstreams requiring workstream leaders</li>
<li>Someone who has experience communicating with C-Levels and has their respect</li>
<li>Two to four of these people in order to safely staff a 24 hour response for multiple days</li>
</ul>
<p>Is your company equipped to handle this scale of response?</p>
<p>In many cases, probably not. Companies handle incidents of this type a few different ways. As I mentioned in the earlier post, some categorize problems like this as a <em>disaster</em> instead of an <em>incident</em> and invoke a different process. This has the advantage of making it clear the response for these is different, at the cost of having far fewer people familiar with the response methods. You make up for the lack of in situ, learn-by-doing training by regularly re-certifying key leaders on the process.</p>
<p>Other companies extend the existing incident response process on the fly rather than risk having a separate process that will get stale. This works so long as you have some people around who kind of know what they're doing and can herd others into the right shape. Though, after the second disaster of this scale, people will start talking about how to formalize procedures.</p>
<p>Whichever way your company goes, start thinking about this. Unless you're working for the hyperscalers, incidents of this response scope are going to be <em>rare</em>. This means you need to schedule quarterly time to train, practice, and certify your Area Commanders and workstream leads. This will speed up response time overall, because less time will be spent arguing over command and feedback structures.</p>]]>
        
    </content>
</entry>

<entry>
    <title>The power of columnar databases for telemetry</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2024/11/the-power-of-columnar-databases-for-telemetry.shtml" />
    <id>tag:sysadmin1138.net,2024:/mt/blog//5.2875</id>

    <published>2024-11-22T15:00:00Z</published>
    <updated>2024-11-20T20:35:19Z</updated>

    <summary>&quot;Columnar databases store data in columns, not rows,&quot; says the definition. I made a passing reference to the technology in Software Telemetry, but didn&apos;t spend any time on what they are and how they can help telemetry and observability. Over...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="telemetry" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>"Columnar databases store data in columns, not rows," says the definition. I made a passing reference to the technology in <a href="https://www.manning.com/books/software-telemetry?a_aid=sysadmin1138&amp;a_bid=c3fefe6b"><em>Software Telemetry</em></a>, but didn't spend any time on what they are and how they can help telemetry and observability. Over the last six months I worked on converting a centralized logging flow based on Elasticsearch, Kibana, and Logstash, to one based on Logstash and Apache Spark. For this article, I'll be using examples from both methods to illustrate what columnar databases let you do in telemetry systems.</p>
<p><strong>How Elasticsearch is columnar (or not)</strong></p>
<p>First of all, Elasticsearch isn't exactly columnar, but it can fake it to a point. You use Elasticsearch when you need full indexing and tokenization of every field in order to accelerate query-time performance. Born as it was in the early part of the 2010s, Elasticsearch accepts ingestion-side complexity in order to optimize read-side performance. There is a <em>reason</em> that if you have a search field in your app, there is a good chance that Elasticsearch or OpenSearch is involved in the business logic. While Elasticsearch is "schema-less," schema still matters, and there are clear limits to how many fields you can add to an Elasticsearch index.</p>
<p>Each Elasticsearch index or datastream has defined <em>fields</em>. Fields can be defined at index/datastream creation, or configured to auto-create on first use. Both are quite handy in telemetry contexts. Each <em>document</em> in an index or datastream has a reference for every defined field, even if the contents of that field are null. If you have 30K fields, and one document has only 19 fields defined, the rest will still exist on the document but be nulled; which in turn makes that 19 defined-field document rather larger than the same document in an index/datastream with only 300 defined fields.</p>
<p>Larger average document size slows down search for everything in general, due to the number and size of field-indexes the system has to keep track of. This also balloons index/datastream size, which has operational impacts when it comes to routine operations like patching and maintenance. As I mentioned in Software Telemetry, Elasticsearch's cardinality problem manifests in number of fields, not in unique values in each field.</p>
<p>If you are willing to get complicated in your ingestion pipeline through careful crafting of telemetry shape, and ingestion into multiple index/datastreams to bucket types of telemetry into shards of similarly shaped telemetry, you can mitigate some of the above problems. Create an <em>alias</em> to use as your search endpoint, and populate the alias with the index/datastreams of your various shards. Elasticsearch is smart enough to know where to search, which lets you bucket your field-count cardinality problems in ways that will perform faster and save space. However, this is clearly adding complexity that you have to manage yourself.</p>
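<p>To make the alias trick concrete: it amounts to one bulk-alias call. Below is a minimal Python sketch that builds the request body; the index names are hypothetical, and the "actions" shape follows Elasticsearch's documented _aliases API.</p>

```python
# Sketch of the alias-over-shards idea. Index patterns here are made up;
# the {"actions": [{"add": ...}]} shape is the Elasticsearch _aliases API.

def build_alias_request(alias, index_patterns):
    """Build a POST /_aliases body pointing one search alias at
    several similarly-shaped telemetry indices."""
    return {
        "actions": [
            {"add": {"index": pattern, "alias": alias}}
            for pattern in index_patterns
        ]
    }

body = build_alias_request(
    "logs-search",
    ["logs-app-*", "logs-audit-*", "logs-net-*"],
)
```

<p>Queries then target "logs-search" and Elasticsearch fans the search out to only the matching indices, keeping each shard's field-count (and thus document overhead) small.</p>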
<p><strong>How Apache Spark is columnar</strong></p>
<p>Spark is pretty clearly columnar, which is why it's the de facto platform of choice for <em>Business Intelligence</em> operations. You know, telemetry for business ops rather than engineering ops. A table defined in Spark (and most of its backing stores, like Parquet or Hive) can have arbitrary columns defined in it. Data for each column is stored in separate files, which means a query that builds a histogram of log-entries per hour, like "SELECT hour(timestamp), COUNT(*) FROM logs GROUP BY hour(timestamp)", is extremely efficient: the system only needs to read one column's files out of thousands.</p>
<p>Columnar databases have to do quite a bit of read-time and ingestion-time optimization to truly perform fast, which demonstrates some of the tradeoffs of the style. Where Elasticsearch was trading ingestion-time complexity to speed up read-time performance, columnar databases are tilting the needle more towards increasing read-time complexity in order to optimize overall resource usage. In short columnar databases have better scaling profiles than something like Elasticsearch, but they don't query as fast as a result of the changed priorities. This is a far easier trade-off to make in 2024 than it was in 2014!</p>
<p>Columnar databases also don't <em>tokenize</em> the way Elasticsearch does. Have a free-text field that you want to do sub-string searches on? Elasticsearch is built from the bolts out to make that search as fast as possible. Columnar databases, on the other hand, do all of the string walking and searching at query-time instead of pulling the values out of some b-trees.</p>
<p>Where Elasticsearch suffers performance issues when field-count rises, Spark only encounters these problems if the query is designed to encounter them through use of "select *" or similar constructs. The files hit by the query will only be the ones for columns referenced in the query! Have a table with 30K columns in it? So long as you query right, it should perform quite well; the 19-defined-fields-in-a-row problem shouldn't be a problem so long as you're only referencing one of those 19 fields/columns.</p>
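<p>The column-pruning behavior described above can be sketched without Spark at all. Here is a toy pure-Python model, with each column held separately (on disk each would be its own file, as with Parquet); the data and field names are made up for illustration.</p>

```python
from collections import Counter
from datetime import datetime

# Toy columnar store: each column lives in its own structure. On disk each
# would be its own file, which is what makes column pruning cheap.
columns = {
    "timestamp": [
        datetime(2024, 11, 22, 9, 15),
        datetime(2024, 11, 22, 9, 45),
        datetime(2024, 11, 22, 10, 5),
    ],
    "message": ["started", "warning", "stopped"],
    # ...potentially thousands more columns, none of which are read below
}

def histogram_per_hour(cols):
    """COUNT(*) GROUP BY hour(timestamp): touches only the timestamp column."""
    return Counter(ts.hour for ts in cols["timestamp"])

hist = histogram_per_hour(columns)  # Counter({9: 2, 10: 1})
```

<p>However many other columns the table carries, the histogram only ever reads the one column it groups by, which is the whole scaling argument.</p>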
<p><strong>Why columnar is neat</strong></p>
<p>A <em>good</em> centralized logging system can stand in for both metrics and traces, and in large part can do so because the backing databases for centralized logging are often columnar or columnar-like. There is nothing stopping you from creating metric_name and metric_value fields in your logging system, and building a bunch of metrics-type queries using those rows.</p>
<p>As for emulating tracing, this isn't done through OpenTelemetry, this is done old-school through hacking. Chapter 5 in <em>Software Telemetry</em> covered how the Presentation Stage uses <em>correlation identifiers:</em></p>
<blockquote cite="https://livebook.manning.com/book/software-telemetry/chapter-5/section-5-4?origin=product-toc">"A correlation identifier is a string or number that uniquely identifies that specific execution or workflow."</blockquote>
<p>Correlation identifiers allow you to build the charts that tracing systems like Jaeger, Tempo, and Honeycomb are known for. There is nothing stopping you from creating an array-of-strings type field named "span_id" where you dump the span-stack for each log-line. Want to see all the logs for a given Span? Here you are. Given a sophisticated enough visualization engine, you can even emulate the waterfall diagrams in dedicated tracing platforms.</p>
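<p>A sketch of that span-stack hack, with hypothetical field names: each log row carries the stack of span identifiers it was emitted under, and pulling all the logs for a span is a simple membership test.</p>

```python
# Hypothetical log rows with a "span_ids" array field holding the span
# stack for each line, as described above. Names are illustrative only.
logs = [
    {"msg": "request received", "span_ids": ["root"]},
    {"msg": "cache miss",       "span_ids": ["root", "lookup"]},
    {"msg": "db query",         "span_ids": ["root", "lookup", "sql"]},
    {"msg": "response sent",    "span_ids": ["root"]},
]

def logs_for_span(rows, span_id):
    """All log lines emitted anywhere under a given span."""
    return [r["msg"] for r in rows if span_id in r["span_ids"]]

lookup_logs = logs_for_span(logs, "lookup")  # ["cache miss", "db query"]
```

<p>Querying for "root" returns the whole request; querying for a child span returns just its subtree, which is exactly the grouping a waterfall diagram needs.</p>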
<p>The reason we haven't used columnar databases for metrics systems has to do with cost. If you're willing to accept cardinality limits, you can store a far greater number of metrics for the same amount of money as doing it in a columnar database. However, the biggest companies <em>already are</em> using columnar datastores for <em>engineering</em> metrics, and nearly all companies are using columnar for <em>business</em> metrics.</p>
<p>But if you're willing to spend the extra resources to use a columnar-like datasource for metrics, you can start answering questions like "how many 5xx response-codes did accounts with the Iridium subscription encounter on October 19th." Traditional metrics systems would consider subscription-type to be too highly cardinal, where columnar databases shrug and move on.</p>
<p><strong>What this means for the future of telemetry and observability</strong></p>
<p>Telemetry over the last 60 years of computing has gone from digging through the SYSLOG printout from one of your two servers, to digging through /var/log/syslog, to the creation of dedicated metrics systems, to the creation of tracing techniques. Every decade's evolution of telemetry has been constrained by the compute and storage performance envelope available to the average system operator.</p>
<ul>
<li>The 1980s saw the proliferation of multi-server architectures as the old mainframe style went out of fashion, so centralized logging had to involve the network. NFS shares for Syslog.</li>
<li>The 1990s saw the first big scale systems recognizable as such by people in 2024, and the beginnings of analytics on engineering data. People started sending their web-logs direct to relational databases, getting out of the "tail and grep" realm and into something that kinda looks like metrics if you squint. Distributed processing got its start here, though hardly recognizable today.</li>
<li>The 2000s saw the first bespoke metrics systems and protocols, such as statsd and graphite. This era also saw the SaaS revolution begin, with Splunk being a big name in centralized logging, and NewRelic gaining traction for web-based metrics. Distributed processing got more involved, and at the end of the decade the big companies like Google and Microsoft lived and breathed these systems. Storage was still spinning disk, with some limited SSD usage in niche markets.</li>
<li>The 2010s saw the first tracing systems and the SaaS revolution ate a good chunk of the telemetry/observability space. The word <em>observability</em> entered wide usage. Distributed processing ended the decade as the default stance for everything, including storage. Storage bifurcated into bulk (spinning disk) and performance (SSD) tiers greatly reducing cost.</li>
</ul>
<p>We're part way through the 2020s, and it's already clear to me that columnar databases are probably where telemetry systems are going to end up by the end of the decade. Business intelligence is already using them, so most of our companies have them in our infrastructure already. Barriers to adoption are going to be finding ways to handle the different retention and granularity requirements of what we now call the three pillars of observability:</p>
<ul>
<li>Metrics need visibility going back years, and are <em>aggregated</em> not <em>sampled</em>. Observability systems doing metrics will need to allow multi-year retention <em>somehow</em>.</li>
<li>Tracing retention is 100% based on cost and sample-rate, which should improve over the decade.</li>
<li>Centralized logging is like tracing in that retention is 100% based on cost. True columnar stores scale more economically than Elasticsearch-style databases, which increases retention. How sample rate affects retention is less clear, and would have to involve some measure of aggregation to remain viable over time.</li>
</ul>
<p>Having columnar databases at the core allows a <em>convergence</em> of the pillars of observability. How far we get in convergence over the next five years remains to be seen, and I look forward to finding out.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Incident response programs</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2024/11/incident-response-programs.shtml" />
    <id>tag:sysadmin1138.net,2024:/mt/blog//5.2874</id>

    <published>2024-11-19T15:00:00Z</published>
    <updated>2025-03-29T21:25:58Z</updated>

    <summary>Honeycomb had a nice post where they describe dropping a priority list of incident severities in favor of an attribute list. Their list is still a pick-one list; but instead of using a 1-4 SEV scale, they&apos;re using a list...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="disasters" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p><a href="https://www.honeycomb.io/blog/against-incident-severities-favor-incident-types">Honeycomb had a nice post where they describe dropping a priority list of incident severities in favor of an attribute list</a>. Their list is still a pick-one list; but instead of using a 1-4 SEV scale, they're using a list of types like "ambiguous," "security," and "internal." The post goes into some detail about the problems with a unified list across a large organization, and the different response-level  needs of different types of incidents. All very true.</p>
<p>A good incident response program needs to be approachable by anyone in the company, meaning anyone looking to open one should have reasonable success in picking incident attributes right. The incident automation industry, tools such as PagerDuty's Jeli and the Rootly platform, has settled on a pick-one list for severity, sometimes with support for additional fields. Unless a company is looking to home-build their own incident automation for creating slack channels, managing the post-incident review process, and tracking remediation action items, these de facto conventions constrain the options available to an incident response program.</p>
<p>As Honeycomb pointed out, there are two axes that need to be captured by "severity": urgency and level of response. I propose the following pair of attributes:</p>
<p><strong>Urgency</strong></p>
<ol>
<li><strong>Planning</strong>: the problem can be addressed through normal sprint or quarterly planning processes.</li>
<li><strong>Low:</strong> the problem has long lead times to either develop or validate the solution, where higher urgency would result in a lot of human resources stuck in wait loops.</li>
<li><strong>Medium:</strong> the problem can be addressed in regular business hours operations; waiting overnight or over a weekend won't make things worse. Can preempt sprint-level deliverable targets without question.</li>
<li><strong>High:</strong> the problem needs around-the-clock response and can preempt quarterly deliverable targets without question.</li>
<li><strong>Critical:</strong> the problem requires investor notification or other regulated public disclosure, and likely affects annual planning. Rare by definition.</li>
</ol>
<p><strong>Level of response</strong></p>
<ol>
<li><strong>Individual:</strong> The person who broke it can revert/fix it without much effort, and impact blast-radius is limited to one team. Post-incident review may not be needed beyond the team level.</li>
<li><strong>Team:</strong> A single team can manage the full response, such as an issue with a single service. Impact blast radius is likely one team. Post-incident review at the peer-team level.</li>
<li><strong>Peer team:</strong> A group of teams in the same department are involved in response due to interdependencies or the nature of the event. Impact blast-radius is clearly multi-team. Post-incident review at the peer-team level, and higher up the org-chart if the management chain is deep enough for it.</li>
<li><strong>Cross-org:</strong> Major incident territory, where the issue cuts across more than one functional group. These are rare. Impact blast-radius may be whole-company, but likely whole-product. Post-incident review will be global.</li>
<li><strong>C-level:</strong> High executive needs to run it because response is whole company in scope. Will involve multiple post-incident reviews.</li>
</ol>
<p><strong>Is Private?</strong> Yes/No. If yes, only the people involved in the response are notified of the incident and its updates. Useful for Security and Compliance type incidents, where discoverability is actually bad. Some incidents qualify as Material Non-Public Information, which <em>matters</em> to companies with stocks being traded.</p>
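<p>The two attributes plus the privacy flag can be sketched as a small data model. This is a hypothetical Python sketch; the type and field names are mine, not taken from any incident tool:</p>

```python
from dataclasses import dataclass
from enum import Enum

class Urgency(Enum):
    PLANNING = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    CRITICAL = 5

class ResponseLevel(Enum):
    INDIVIDUAL = 1
    TEAM = 2
    PEER_TEAM = 3
    CROSS_ORG = 4
    C_LEVEL = 5

@dataclass
class Incident:
    title: str
    urgency: Urgency
    response: ResponseLevel
    # Hide notifications for Security/Compliance incidents where
    # discoverability is actively harmful (e.g. MNPI).
    is_private: bool = False

# A Medium + Team incident, the "highway verge grass fire" case:
inc = Incident("checkout latency spike", Urgency.MEDIUM, ResponseLevel.TEAM)
```

<p>Keeping the two axes as separate fields, rather than collapsing them into one SEV number, is the whole point: automation can route on either axis independently.</p>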
<p>The combinatorics give 5*5=25 pairs, or 50 if you include Is Private, which makes for an unwieldy pick-one list. However, like stellar types, there is a kind of main sequence of pairs that are more common, with problematic outliers that make simple solutions a troublesome fit. Let's look at a few pairs that are on the main sequence of event types:</p>
<ul>
<li><strong>Planning + Individual</strong>: Probably a feature-flag had to be rolled back real quick. Spend some time digging into the case. Incidents like this sometimes get classified "bug" instead of "incident."</li>
<li><strong>Low + Team</strong>: Such as a Business Intelligence failure, where revenue attribution was discovered to be incorrect for a new feature, and time is needed to back-correct issues and validate against expectations. May also be classified as "bug" instead of "incident."</li>
<li><strong>Medium + Team:</strong> Probably the most common incident type that doesn't get classified as a "bug," these are the highway verge grass fires of the incident world; small in scope, over quick, one team can deal with it.</li>
<li><strong>Medium + Peer Team:</strong> Much like the previous but involving more systems in scope. Likely requires coordinated response between multiple teams to reach a solution. These teams work together a lot, by definition, so it would be a professional and quick response.</li>
<li><strong>High + Cross-org:</strong> A platform system had a failure that affected how application code responds to platform outages, leading to a complex, multi-org response. Response would possibly include renegotiating SLAs between platform and customer-facing systems. Also, remediating the Log4J vulnerability, which requires touching every usage of Java in the company inclusive of vendored usage, counts as this kind of incident.</li>
<li><strong>Critical + Cross-org:</strong> An event like the Log4J vulnerability, <em>and the Security org has evidence that security probes found something</em>. The same remediation response as the previous, but with added "reestablish trust in the system" work on top of it, and working on regulated customer notices.</li>
</ul>
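<p>One way to tame the combinatorics is to expose only the main-sequence pairs to reporters and let automation expand the choice into the underlying pair. The labels and mappings below are invented for illustration:</p>

```python
# Hypothetical curated "main sequence": reporters pick one of a handful
# of labels instead of one of 25 (urgency, response) combinations.
MAIN_SEQUENCE = {
    "Bug-ish rollback":     ("planning", "individual"),
    "Data back-correction": ("low", "team"),
    "Grass fire":           ("medium", "team"),
    "Coordinated fix":      ("medium", "peer_team"),
    "Major incident":       ("high", "cross_org"),
    "Company-critical":     ("critical", "cross_org"),
}

def options_for_reporter():
    """The short pick-one list the intake form actually shows."""
    return sorted(MAIN_SEQUENCE)

print(len(options_for_reporter()), "choices instead of 25")
```

<p>The outlier pairs in the next section are exactly what a scheme like this mishandles; they need an escape hatch such as a free-form "other" option.</p>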
<p>Six of 25 combinations. But some of the others are still viable, even if they don't look plausible on the surface. Let's look at a few:</p>
<ul>
<li><strong>Critical + Team</strong>: A bug is found in SOX reporting that suggests incorrect data was reported to stock-holders. While the C-levels are interested, they're not in the response loop beyond the 'stakeholder' role and being the signature that stock-holder communications will be issued under.</li>
<li><strong>Low + Cross-org:</strong> Rapid retirement of a deprecated platform system, forcing the teams still using the old system to crash-migrate to the new one.</li>
<li><strong>Planning + Cross-org</strong>: The decision to retire a platform system is made as part of an incident, and migrations are inserted into regular planning.</li>
</ul>
<p>How is an organization supposed to build a usable pick-one list from this mess? This is hard work!</p>
<p>Some organizations solve this by bucketing incidents using another field, and allowing the pick-one list to mean different things based on what that other field says. A Security SEV1 gets a different scale of response than a Revenue SEV1, which in turn gets a different type of response than an Availability SEV1. Systems like this have problems with incidents that cross buckets, such as a Security issue that also affects Availability. It's for this reason that Honeycomb has an 'ambiguous' bucket.</p>
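<p>Bucket-dependent severity can be sketched as a lookup keyed on the pair of fields, with the cross-bucket weakness showing up as the fallback case. Bucket names and response descriptions here are illustrative, not from any real runbook:</p>

```python
# Sketch of bucket-dependent severity: the same "SEV1" label maps to a
# different response depending on the value of the bucket field.
RESPONSE_MATRIX = {
    ("security", 1):     "evict attacker, private channel, legal on standby",
    ("revenue", 1):      "finance + eng war room, customer comms",
    ("availability", 1): "page on-call, public status page",
}

def response_for(bucket: str, sev: int) -> str:
    # Cross-bucket incidents are the weak point of this scheme; fall
    # back to an "ambiguous" treatment like Honeycomb's catch-all type.
    return RESPONSE_MATRIX.get((bucket, sev), "ambiguous: triage manually")

print(response_for("security", 1))
print(response_for("security-and-availability", 1))
```

<p>The fallback line is the interesting part: any scheme that routes on a single bucket field needs a deliberate answer for incidents that straddle buckets.</p>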
<p>A few organizations go so far as to have a fully separate process for the 'High' and 'Critical' urgencies of events, maybe calling them Disaster Recovery events instead of Incidents. DR events need to be rare, which means that process isn't as well exercised as Incident response. However, a separate process makes it abundantly clear that certain urgencies and scopes require different <em>process</em> overall. More on this in a later blog-post.</p>
<p>Other orgs handle the outlier problem differently, taking those incidents out of the incident process and into another one altogether. Longer-flow problems, <strong>low</strong> urgency above, get called something like a <a href="https://rosslazer.com/posts/code-yellow/">Code Yellow</a> after a Google practice, with Code Red reserved for the Critical + C-level end of the scale to handle big, long-running problems.</p>
<p>Honeycomb took the bucketing idea one step further and dropped urgency and level of response entirely, focusing instead on incident type. A process like this still needs ways to manage urgency and response-scope differences, but this is being handled at a layer below incident automation. In my opinion, a setup like this works best when Engineering is around <a href="https://en.wikipedia.org/wiki/Dunbar's_number">Dunbar's Number</a> or less in size, allowing informal relationships to carry a lot of weight. Companies with deeper management chains, and thus more engineers, will need more formalism to determine cross-org interaction and prioritization.</p>
<p>Another approach is to go <em>super broad</em> with your pick-one list, and make it apply to everyone. While this approach disambiguates pretty well between SEV 1 highest-urgency problems and SEV 2 urgent-but-not-<em>pants-on-fire</em> problems, it is less good at disambiguating SEV 3 and SEV 4 incidents. Those incidents tend to have only local scope, so local definitions will prevail, meaning only locals will know how to correctly categorize issues.</p>
<hr />
<p>There are several simple answers for this problem, but each simplification has its own problem. Your job is to pick the problems your org will put up with.</p>
<ul>
<li>How much informal structure can you rely on? The smaller the org, the more one size is likely to fit all.</li>
<li>Do you need to interoperate with a separate incident response process, perhaps an acquisition or a parent company?</li>
<li>How often do product-local vs global incidents happen? For one-product companies, these are the same thing. For companies that are truly multi-product, this distinction matters. The answer here influences how items on your pick-one list are dealt with, and whether incident <em>reporters</em> are likely to file cross-product reports.</li>
<li>Does your incident automation platform allow decision support in its reporting workflow? Think of a next, next, next, done wizard where each screen asks clarifying questions. Helpful for folks who are not sure how a given area wants their incidents marked up, less helpful for old hands who know exactly what needs to go in each field.</li>
</ul>
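<p>That wizard idea can be sketched as a chain of clarifying questions that resolves to the attribute pair, so reporters never see the raw taxonomy. The questions and the mapping below are made up for illustration:</p>

```python
# Minimal sketch of a "next, next, done" intake wizard: each yes/no
# answer narrows the incident attributes for the reporter.
def intake_wizard(answers: dict) -> dict:
    attrs = {}
    # Security/Compliance involvement flips the privacy flag.
    attrs["is_private"] = answers.get("involves_security_or_compliance", False)
    # Live customer impact drives urgency; breadth drives it further.
    if answers.get("customers_affected_now", False):
        attrs["urgency"] = "high" if answers.get("multiple_products", False) else "medium"
    else:
        attrs["urgency"] = "low"
    attrs["response"] = "cross_org" if answers.get("multiple_products") else "team"
    return attrs

# A reporter who only knows "customers are affected, one product":
print(intake_wizard({"customers_affected_now": True}))
```

<p>Old hands could bypass the wizard and set the fields directly; the two entry paths serve the two audiences described above.</p>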
]]>
        
    </content>
</entry>

<entry>
    <title>Rust and the Linux kernel</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2024/09/rust-and-the-linux-kernel.shtml" />
    <id>tag:sysadmin1138.net,2024:/mt/blog//5.2873</id>

    <published>2024-09-03T14:00:00Z</published>
    <updated>2024-09-03T13:42:14Z</updated>

    <summary>One of the kernel maintainers made social waves by bad mouthing Rust and the project to rebuild the Linux kernel in Rust. The idea of rebuilding the kernel in &quot;Rust: the memory-safe language&quot; not &quot;The C in CVE stands for...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="linux" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>One of the kernel maintainers made social waves by bad-mouthing Rust and the project to rebuild the Linux kernel in Rust. The idea of rebuilding the kernel in "Rust, the memory-safe language" rather than "the C in CVE stands for C/C++" makes a whole lot of sense. However, there is more to a language than how memory-safe it is and whether a well-known engineer calls it a "toy" language.</p>
<p>One of the products offered by my employer is written in Elixir, which is built on top of Erlang. Elixir had an eight-or-so-month period of fame, which is when the decision to write that product was made. We picked Elixir because the Erlang engine gives you a lot of concurrency and async processing relatively easily. And it worked! That product was a beast on relatively little CPU. We had a few cases of 10x usage from customers, and it just scaled up, no muss no fuss.</p>
<p>The problems with the product came not in the writing, but in the <em>maintaining and productionizing</em>. Some of the issues we've had over the years, many of which got better as Elixir as an <em>ecosystem</em> matured:</p>
<ul>
<li>The ability to make a repeatable build, needed for CI systems</li>
<li>Dependency management in modules</li>
<li>Observability ecosystem support, such as <a href="https://opentelemetry.io/docs/languages/erlang/">OpenTelemetry SDKs</a></li>
<li>Build tooling support usable by our CI systems</li>
<li>Maturity of the module ecosystem, meaning we had to DIY certain tasks that our other main product never had to bother with. Or the modules that exist only covered 80% of the use-cases.</li>
<li>Managing Erlang VM startup during deploys</li>
</ul>
<p>My opinion is that the dismissiveness from this particular Linux Kernel Maintainer had to do with this list. The Linux kernel and module ecosystem is massive, with highly complex build processes spanning many organizations, and regression testing frameworks to match. <em>Ecosystem</em> maturity matters way more for CI, regression, and repeatable build problems than <em>language</em> maturity.</p>
<p>Rust has something Elixir never had: durable mindshare. Yeah, the kernel rebuild process has taken many years, and has many years to go. Durable mindshare means that engineers are sticking with it, instead of chasing the next hot new memory safe language.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Getting people to think about disaster recovery</title>
    <link rel="alternate" type="text/html" href="https://sysadmin1138.net/mt/blog/2024/07/getting-people-to-think-about-disaster-recovery.shtml" />
    <id>tag:sysadmin1138.net,2024:/mt/blog//5.2872</id>

    <published>2024-07-12T14:00:00Z</published>
    <updated>2025-03-29T21:26:26Z</updated>

    <summary>SysAdmins have no trouble making big lists of what can go wrong and what we&apos;re doing to stave that off a little longer. The tricky problem is pushing large organizations to take a harder look at systemic risks and taking...</summary>
    <author>
        <name>SysAdmin1138</name>
        <uri>https://sysadmin1138.net/mt/blog</uri>
    </author>
    
        <category term="backup" scheme="http://www.sixapart.com/ns/types#category" />
    
        <category term="disasters" scheme="http://www.sixapart.com/ns/types#category" />
    
    
    <content type="html" xml:lang="en-us" xml:base="https://sysadmin1138.net/mt/blog/">
        <![CDATA[<p>SysAdmins have no trouble making big lists of what can go wrong and what we're doing to stave that off a little longer. The tricky problem is pushing large organizations to take a harder look at systemic risks and taking them seriously. I mean, the big companies <em>have</em> to have disaster recovery (DR) plans for compliance reasons; but there are a lot of differences between box-ticking DR plans and comprehensive DR plans.</p>
<p>Any company big enough to get past the <em>running out of money is the biggest disaster</em> phase has probably spent some time thinking about what to do if things go wrong. But how do you, the engineer in the room, get the deciders to think about disasters in productive ways?</p>
<p>The really big disasters are obvious:</p>
<ul>
<li>The datacenter catches fire after a hurricane</li>
<li>The Region goes dark due to a major earthquake</li>
<li>Pandemic flu means 60% of the office is offline at the same time</li>
<li>An engineer or automation accidentally:<br />
<ul>
<li>Drops all the tables in the database</li>
<li>Deletes all the objects out of the object store</li>
<li>Destroys all the clusters/servlets/pods</li>
<li>Deconfigures the VPN</li>
</ul>
</li>
<li>The above happens and you find your backups haven't worked in months</li>
</ul>
<p>All obvious stuff, and building to deal with them will let you tick the box for compliance DR. Cool.</p>
<p>But there are other disasters, the sneaky ones that make you think and take a hard look at process and procedures in a way that the "oops we lost everything of [x] type" disasters generally don't.</p>
<ul>
<li>An attacker subverts your laptop management software (JAMF, InTune, etc) and pushes a cryptolocker to all employee laptops</li>
<li>30% of your application secrets got exposed through a server side request forgery (SSRF) attack</li>
<li>Nefarious personages get access to your continuous integration environment and inject trojans into your dependency chains</li>
<li>A key third party, such as your payment processor, gets ransomwared and goes offline for three weeks</li>
<li>A Slack/Teams bot got subverted and has been feeding internal data to unauthorized third parties for months</li>
</ul>
<p>The above are all kinda "security" disasters, and that's my point. SysAdmins sometimes think of these, but even we are guilty of not having the right mental models to rattle these off the top of our head when asked. Asking about disasters like this list <em>should</em> start conversations that generally don't happen. Or you get the bad case: people shrug and say "that's Security's problem, not ours," which is a sign you have a toxic reliability culture.</p>
<p>Security-type disasters have a phase that merely technical disasters lack: <em>how do we restore trust in production systems?</em> In technical disasters, you can start recovery as soon as you've detected the disaster. For security disasters, recovery has to wait until the attacker has been evicted, which can take a while. This security delay means key recovery concepts like Recovery Time and Recovery Point Objectives (RTO/RPO) will be subtly different.</p>
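<p>The difference can be sketched as simple arithmetic: the restore work is the same either way, but the eviction phase pushes out the effective recovery time. The numbers below are invented purely for illustration:</p>

```python
# Hypothetical breakdown of time-to-recovery, in hours. For a purely
# technical disaster the eviction phase is zero; for a security disaster
# it can dominate, and any RTO promise has to account for it.
def effective_recovery_hours(detect: float, evict: float, restore: float) -> float:
    return detect + evict + restore

technical = effective_recovery_hours(detect=1, evict=0, restore=4)
security = effective_recovery_hours(detect=1, evict=36, restore=4)
print(technical, security)  # → 5 41
```

<p>Same backups, same restore runbook, wildly different effective RTO; that gap is what the "restore trust" phase costs.</p>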
<p>If you're trying to knock loose some ossified DR thinking, these security type disasters can crack open new opportunities to make your job safer.</p>]]>
        
    </content>
</entry>

</feed>
