<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
	<channel>
		<title>Mikhail Panchenko</title>
		<description>Burrito Enthusiast, Naked Neighbor</description>
		<link>http://blog.mihasya.com</link>
		
			<item>
				<title>Thoughts Evoked By CircleCI's July 2015 Outage</title>
				<description>&lt;p&gt;After a bit of downtime, CircleCI’s team was kind enough to post &lt;a href=&quot;http://status.circleci.com/incidents/hr0mm9xmm3x6&quot;&gt;a
very detailed Post Mortem&lt;/a&gt;.
I’m a post mortem junkie, so I always appreciate when companies are honest
enough to openly discuss what went wrong.&lt;/p&gt;

&lt;p&gt;I also greatly enjoy analyzing these things, especially through the complex
systems lens. Each one of these posts is an opportunity to learn and to
reinforce otherwise abstract concepts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: This post is NOT about what the CircleCI team should or shouldn’t
have done - hindsight is always 20/20, complex systems are difficult, and
hidden interactions actually are hidden. Everyone’s infrastructure is full of
traps like the one that ensnared them, and some days, you just land on the
wrong square.  Basically, that PM made me think of stuff, so here is that
stuff. Nothing more.&lt;/p&gt;

&lt;h3 id=&quot;database-as-a-queue&quot;&gt;Database As A Queue&lt;/h3&gt;

&lt;p&gt;The post mortem states:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our build queue is not a simple queue, but must take into account customer
plans and container allocation in a complex platform. As such, it’s built on
top of our main database.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As soon as I read that, I knew exactly what happened. I’d lived this exact
problem before, so here’s that story:&lt;/p&gt;

&lt;p&gt;At Flickr, &lt;a href=&quot;http://code.flickr.net/2010/02/08/using-abusing-and-scaling-mysql-at-flickr/&quot;&gt;we would put everything into MySQL until it didn’t work
anymore&lt;/a&gt;.
This included the Offline Tasks queue (aside: good grief, &lt;a href=&quot;http://code.flickr.net/2008/09/26/flickr-engineers-do-it-offline/&quot;&gt;this
post&lt;/a&gt; was
written in 2008). One day, we had an issue that slowed down the processing of
tasks. The queue filled up like it was supposed to, but when we finished fixing
the original problem, we noticed that the queue was not draining. In fact, it
was still filling up at almost the same rate as during the outage.&lt;/p&gt;

&lt;p&gt;When you put tasks into MySQL, you have to index them, presumably by some
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;date&lt;/code&gt; field, to be able to fetch the oldest tasks efficiently. If you have
additional ways you want to slice your queues, which both CircleCI and Flickr
did, that index probably contains several columns. Inserting data into RDBMS indexes
is relatively expensive, and usually involves at least some locking. Note that
dequeueing jobs also involves an index update, so even marking jobs as in
progress or deleting on completion runs into the same locks. So now you have
contention from a bunch of producers on a single resource, the updates to which
are getting more and more expensive and time consuming. Before long, you’re
spending more time updating the job index than you are actually performing the
jobs. The “queue” essentially fails to perform one of its very basic functions.&lt;/p&gt;
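To make the mechanics concrete, here’s a minimal sketch of the pattern using SQLite in Python. The schema, queue names, and helper functions are made up for illustration (neither Flickr’s nor CircleCI’s actual design), but every enqueue and every claim touches the composite index in the same way:

```python
import sqlite3

# Illustrative "database as a queue" schema; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tasks (
        id INTEGER PRIMARY KEY,
        queue TEXT NOT NULL,        -- which logical queue the job belongs to
        status TEXT NOT NULL,       -- 'pending' | 'running'
        created_at REAL NOT NULL
    )
""")
# The composite index that makes "fetch the oldest pending job" efficient.
# Every INSERT, every claim (status flip), and every completion updates it.
conn.execute("CREATE INDEX idx_tasks_fetch ON tasks (queue, status, created_at)")

def enqueue(queue, ts):
    conn.execute(
        "INSERT INTO tasks (queue, status, created_at) VALUES (?, 'pending', ?)",
        (queue, ts),
    )

def claim_oldest(queue):
    # Find the oldest pending job, then mark it running: the claim is a second
    # index update on top of the insert, which is where the contention
    # described above comes from under load.
    row = conn.execute(
        "SELECT id FROM tasks WHERE queue = ? AND status = 'pending' "
        "ORDER BY created_at LIMIT 1",
        (queue,),
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
    return row[0]

enqueue("builds", 1.0)
enqueue("builds", 2.0)
first = claim_oldest("builds")   # claims the job enqueued at t=1.0
```

In a real RDBMS with many concurrent producers and consumers, the `SELECT ... LIMIT 1` plus `UPDATE` pair would also need locking (`FOR UPDATE` or similar), adding yet more contention on the same rows and index pages.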

&lt;p&gt;Maybe my reading is not quite right on the CircleCI issue, but I’d bet it
was something very similar.&lt;/p&gt;

&lt;p&gt;In the aftermath of that event at Flickr, we swapped the MySQL table out for a
bunch of lists in Redis. There were pros and cons involved, of course, and we
had to replace the job processing logic completely. Redis came with its own set
of challenges (failover and data durability being the big ones), but it
was a much better tool for the job. In 2015, Redis almost certainly isn’t the
first thing I’d reach for, but options are plentiful for all sorts of use cases.&lt;/p&gt;

&lt;h3 id=&quot;coupling-at-the-load-balancer&quot;&gt;Coupling at the Load Balancer&lt;/h3&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/circlepm/tmi.gif&quot; alt=&quot;Three Mile Island&quot; class=&quot;constrained&quot; /&gt;&lt;br /&gt;
    &lt;small&gt;From &lt;a href=&quot;http://www.nrc.gov/reading-rm/doc-collections/fact-sheets/3mile-isle.html&quot;&gt;nrc.gov&lt;/a&gt;&lt;/small&gt;
&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;First we tried to stop new builds from joining the queue, and we tried it from
an unusual place: the load balancer. Theoretically, if the hooks could not
reach us, they couldn’t join the queue. A quick attempt at this proved
ill-advised: when we reduced capacity to throttle the hooks naturally they
significantly outnumbered our customer traffic, making it impossible for our
customers to reach us and effectively shutting down our site.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I don’t actually think that’s an “unusual” place to start at all. If one of the
problems is that updates to the queue are becoming too expensive and every
additional update is exacerbating the problem, start eliminating updates!&lt;/p&gt;

&lt;p&gt;The rest of that paragraph is also not unusual at all. It hints at some
details about the CircleCI infrastructure that you would find in an
overwhelming majority of infrastructures.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The public site and the GitHub hooks endpoint share a load balancer&lt;/li&gt;
  &lt;li&gt;The processes serving the site and the GitHub hooks run on the same hardware
(likely in the same process, as they’re probably just endpoints in the same
app)&lt;/li&gt;
  &lt;li&gt;There is no way to turn off one without turning off the other&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everyone that knows me knows I &lt;strong&gt;LOVE&lt;/strong&gt; to talk about “unnecessary coupling” in
complex systems. This is a really good example.&lt;/p&gt;

&lt;p&gt;The two functions have key differences - for one, their audience. Let’s focus
on that. The hooks serve an army of robots residing somewhere in GitHub’s
datacenter. The site serves humans. As a general rule, robots can always wait,
but making humans on the internet wait for anything is a big no-no. To me, this
is a natural place to split things up, all the way through. You can still use
the same physical load balancer or ELB instance, but you could make two paths
through it - one for the human oriented stuff, another for the robots. Sure,
there’ll probably be some coupling farther down the line, like when both
processes query the same databases. But at least now the site will only go down
if the database is actually inaccessible, not when it has a single contended
resource that has nothing to do with serving the site.&lt;/p&gt;

&lt;h4 id=&quot;a-long-aside-traffic-segregation-at-opsmatic&quot;&gt;A Long Aside: Traffic Segregation At Opsmatic&lt;/h4&gt;

&lt;p&gt;I do obsess over this stuff, and we’ve already had our fair share of outages
with very similar causes. I want to talk a bit about how traffic is currently
handled at Opsmatic. This section admits to flavors of the same issues described
above, to drive home the point that no one’s infrastructure
is perfect, certainly not ours. It’s also meant to demonstrate that following
some very high level guidelines built on prior learning can go a long way
towards improving an infrastructure’s &lt;strong&gt;posture&lt;/strong&gt; in the event of unexpected
issues, especially surges.&lt;/p&gt;

&lt;p&gt;There are three entry points into Opsmatic:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opsmatic.com&lt;/code&gt; is our company’s website and the actual product app&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;api.opsmatic.com&lt;/code&gt; is our REST API, which has historically been used mostly by
the app (that’s changing quickly)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ingest.opsmatic.com&lt;/code&gt; is the API to which our collection agents talk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s an ugly drawing to help you along:&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/circlepm/archdoodle.png&quot; alt=&quot;architecture doodle&quot; class=&quot;constrained&quot; /&gt;&lt;br /&gt;
&lt;/p&gt;

&lt;p&gt;The first two are configured to talk to the same AWS Elastic Load Balancer (ELB).
The ELB forwards the traffic on ports 80 and 443 to a pool of
instances where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nginx&lt;/code&gt; is listening. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nginx&lt;/code&gt; in turn directs the requests.
Traffic to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(www.)opsmatic.com&lt;/code&gt; goes to one process (a Django app run under
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gunicorn&lt;/code&gt;), traffic to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;api.opsmatic.com&lt;/code&gt; goes through a completely different
pipeline where it’s teed off to the appropriate backend depending on the URL
pattern. Currently, most of the API traffic is actually coming from humans
using the app. As we flesh out, expand, and document &lt;a href=&quot;https://opsmatic.com/app/docs/rest-api&quot;&gt;our REST
api&lt;/a&gt;, that’s bound to change, at which
point we may put even more buffer between the two traffic streams - separate
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nginx&lt;/code&gt; processes with appropriate tuning, possibly even separate hardware.&lt;/p&gt;
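The host-based split described above might look roughly like this in nginx terms. This is a sketch only; the ports, upstream names, and backend layout are assumptions for illustration, not Opsmatic’s actual configuration:

```nginx
# Human-facing app traffic (e.g. the Django app under gunicorn).
upstream app_backend { server 127.0.0.1:8000; }
# API traffic, routed through its own path so it can be tuned,
# throttled, or shut off without touching the site.
upstream api_backend { server 127.0.0.1:9000; }

server {
    listen 80;
    server_name opsmatic.com www.opsmatic.com;
    location / { proxy_pass http://app_backend; }
}

server {
    listen 80;
    server_name api.opsmatic.com;
    location / { proxy_pass http://api_backend; }
}
```

The point is that even behind one physical load balancer and one nginx, the two traffic streams remain separately addressable, which is what makes selective degradation possible later.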

&lt;p&gt;The third &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ingest.opsmatic.com&lt;/code&gt; subdomain is pointed at a completely different
ELB. That’s our equivalent of the GitHub hooks - the agents are always
running, always sending heartbeats, always sending updates. An unexpected surge
in traffic - for example, an enormous new customer spinning up agents on their
whole fleet of servers all at once without warning us - could certainly
overwhelm the currently provisioned hardware. At the moment, this would take
the app down as well - while the Opsmatic backend is extremely modular, we
currently run all those pieces on the same machines. This limits the operational
overhead at the expense of introducing unnecessary coupling.&lt;/p&gt;

&lt;p&gt;However, just having the separate ELB gives us recourse in the event of a sudden
surge in robot traffic: we can just blackhole THAT traffic at the ELB and
continue serving site and API read traffic. The robots would be mad, and the
data you were browsing would gradually get more and more stale, but it
beats the big ugly &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;500&lt;/code&gt; page.&lt;/p&gt;

&lt;p&gt;The Opsmatic agent is also built to accumulate data locally if it can’t
phone home, so the robots would build up a local version of the change history
without losing any data or timestamp accuracy. When we were back up, they’d
eventually backfill all that data. This event itself could cause a stampede,
but we’ve found it to be a real nice luxury to have.&lt;/p&gt;
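The buffering behavior might be sketched like this - hypothetical names and a toy transport, not the actual Opsmatic agent code:

```python
import time
from collections import deque

class Agent:
    """Toy sketch of an agent that buffers locally during an outage."""

    def __init__(self, transport):
        self.transport = transport      # callable(ts, event); raises on failure
        self.buffer = deque()           # locally accumulated events

    def record(self, event):
        # Timestamp at capture time, so timestamp accuracy survives an outage.
        self.buffer.append((time.time(), event))
        self.flush()

    def flush(self):
        # Drain oldest-first; on the first failure, stop and keep buffering.
        while self.buffer:
            ts, event = self.buffer[0]
            try:
                self.transport(ts, event)
            except ConnectionError:
                return False            # ingest unreachable; retry later
            self.buffer.popleft()       # only drop the event once it's sent
        return True
```

While the endpoint is blackholed, `record` keeps appending and `flush` keeps failing; once the endpoint is back, the next flush backfills the whole buffer in order.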

&lt;p&gt;The modularity combined with reasonably healthy automation allows us to regain
our balance quickly. If a certain service is overloading a shared database, we
can kill just that service while we work out what’s going on or scramble to
add capacity.&lt;/p&gt;

&lt;h4 id=&quot;every-incident-is-a-push-towards-self-improvement&quot;&gt;Every Incident Is A Push Towards Self Improvement&lt;/h4&gt;

&lt;p&gt;The next time this sort of event does happen, we’d likely follow up with a few
more steps that have been put off solely due to resource constraints:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Split up &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stack-role&lt;/code&gt; into smaller pieces, likely along the lines of
“human-facing services” and “robot-facing services”. That is, physically
separate services that deal with agent traffic from services that deal with
human traffic. Possibly we’d go a step further and split up web services from
background job processors that pull work from queues.&lt;/li&gt;
  &lt;li&gt;Split the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opsmatic.com&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;api.opsmatic.com&lt;/code&gt; load balancers up&lt;/li&gt;
  &lt;li&gt;A bunch of auxiliary work on various internal tools to better
accommodate the fragmentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The upshot - we currently have a bit of coupling and resource sharing
going on for things that really shouldn’t be coupled, but it’s only because
we’ve postponed actually splitting everything up in favor of other projects. We
are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Seconds away from being able to blackhole automation traffic in favor of
preserving the app, as well as turning off any background processing that might
be causing issues - we can just let that queue grow, turn the service on and off
as we try different fixes, etc.&lt;/li&gt;
  &lt;li&gt;A few minutes of fast typing away from adding capacity while most of our
customers likely don’t even know anything is amiss&lt;/li&gt;
  &lt;li&gt;A few more minutes of fast typing away from completely decoupling robot traffic
from human traffic so that the next surge doesn’t affect the app at all&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hey, that’s pretty good! If we have to fight a fire, at least we can fight it
mostly calmly. That, in and of itself, is huge. Being able to isolate the
problem and say “OK, this is the problem, it is not the whole infrastructure, it
is contained to a particular set of actions and now we’re going to work on it”
is huge for morale during an outage. I do not envy the feeling the CircleCI
team must have had when attempts to bring back the queue took down the main
site.&lt;/p&gt;

&lt;p&gt;I used the word “&lt;strong&gt;posture&lt;/strong&gt;” earlier - I have in mind a very specific property
when I use that word. It’s not so much about “how resilient to failures is our
infrastructure?” but rather “how operator-friendly is our infrastructure during
an incident?” Things like well-labeled kill switches, well-segmented traffic, and
well-behaved background and batch processing systems that operate independently from
the transactional part of the app go a long way towards decreasing stress levels
during incidents.&lt;/p&gt;

&lt;h3 id=&quot;conclusion-what-is-this-post-even&quot;&gt;Conclusion?.. What is this post, even..&lt;/h3&gt;

&lt;p&gt;This turned into a bit of a rambling piece. Hope you found it interesting.
Here are my key takeaways:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You can use a database as a queue, but you should keep a close eye on the
timing data for the “work about work” your database is doing just to get jobs
in and out. One day, you’re going to have a bad time. That is ok. It’ll make
you stronger.&lt;/li&gt;
  &lt;li&gt;It pays to think about the sources of traffic to your infrastructure and how
they interact with each other. Over time, it pays even more to have parallel,
as-decoupled-as-time-allows paths through your system, any of which can be shut
off in isolation.&lt;/li&gt;
  &lt;li&gt;Every infrastructure is a work in progress; computers are hard, and
distributed systems are even harder.&lt;/li&gt;
&lt;/ul&gt;
</description>
				<published>2015-07-19 00:00:00 +0000</published>
				<link>http://blog.mihasya.com/2015/07/19/thoughts-evoked-by-circleci-outage.html</link>
			</item>
		
			<item>
				<title>A Story and Some Tips For Sustainable OSS Projects</title>
				<description>&lt;p&gt;This past week Kyle Kingsbury
&lt;a href=&quot;https://twitter.com/aphyr/status/618880016991059968&quot;&gt;tweeted&lt;/a&gt; about being
flooded with pull requests caused by changes to the InfluxDB API. Coincidentally,
I had just spent several hours over the July 4th weekend dealing with the same
problem in &lt;a href=&quot;https://github.com/rcrowley/go-metrics&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go-metrics&lt;/code&gt;&lt;/a&gt;, albeit on a
smaller scale. I think these are symptoms of a very, very common problem with OSS
projects.&lt;/p&gt;

&lt;h3 id=&quot;a-bit-of-history&quot;&gt;A bit of history&lt;/h3&gt;

&lt;p&gt;The Metrics library has a very simple core API made up of various
metrics-related interfaces - you can create metrics, push in new values, and
read the metrics’ current values and aggregates. Simple and beautiful.&lt;/p&gt;

&lt;p&gt;The library was originally put together by the epic Richard Crowley while he was working
at Betable. He was starting to experiment with using Go for services, and needed
a way to keep track of them. Finding no satisfactory equivalent to &lt;a href=&quot;https://github.com/dropwizard/metrics&quot;&gt;Coda Hale’s
metrics library for Java&lt;/a&gt;, Richard made his
own. Folks quickly wrote adapters to push metrics into their time series system
of choice - I wrote one for Librato. Richard happily merged the PRs.&lt;/p&gt;

&lt;p&gt;The core features were built, everything worked reasonably well, and Richard
moved on to a job that doesn’t use Go nearly as heavily. Several months later, I
noticed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go-metrics&lt;/code&gt; had 20+ open pull requests. I pinged Richard and offered to
help maintain the project. We were using it heavily, and were happy to pay our
dues. Richard immediately made me and &lt;a href=&quot;https://github.com/wadey&quot;&gt;Wade&lt;/a&gt;, a
Betable employee, collaborators on the repository. I started looking over the
PRs.&lt;/p&gt;

&lt;h3 id=&quot;the-paralysis&quot;&gt;The Paralysis&lt;/h3&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/gometrics/papers.jpg&quot; alt=&quot;too many papers&quot; class=&quot;constrained&quot; /&gt;&lt;br /&gt;
    &lt;small&gt;Cropped from photo by &lt;a href=&quot;https://www.flickr.com/photos/wheatfields/4774087006&quot;&gt;wheatfields&lt;/a&gt;&lt;/small&gt;
&lt;/p&gt;

&lt;p&gt;I quickly realized that I was not qualified to review a good chunk of the PRs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Update for InfluxDB 0.9&lt;/li&gt;
  &lt;li&gt;Fallback to old influxdb client snapshot&lt;/li&gt;
  &lt;li&gt;Update influxdb client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;“I don’t know &lt;em&gt;jack&lt;/em&gt; about InfluxDB,” I thought. “How am I supposed to decide
what gets merged and what doesn’t?”&lt;/p&gt;

&lt;p&gt;I had also observed that the InfluxDB API was still changing quite a bit. I
remembered that there had previously been a wave of PRs about InfluxDB. &lt;em&gt;Wait,
was this the same wave?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Another issue that gave me pause was that I had no idea how many people were
already using this library with Influx, expecting the current client to continue
working. How many builds would break? Go’s notoriously loosey-goosey dependency
management made it likely that as soon as I merged any API-changing PR, I would
get another PR changing it back the next day.&lt;/p&gt;

&lt;p&gt;There was also a PR about adding a Riemann client. &lt;em&gt;Welp, I don’t use that
regularly either..&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;clarity&quot;&gt;Clarity&lt;/h3&gt;

&lt;p&gt;In the summer of 2012, I did a brief contracting stint with Librato. Among other things, I
helped build a Java client library. They also asked me to tie that client to
Coda’s library, so I obliged and &lt;a href=&quot;https://github.com/dropwizard/metrics/pull/258&quot;&gt;submitted a PR&lt;/a&gt;.
Coda replied fairly tersely:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Really cool functionality, but I’ve been declining further modules for the
main Metrics distribution. I suggest you run this as your own project. I’ll be
adding a section in the Metrics documentation with links to related libraries,
and this should definitely be in it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At the time, I thought “Well that kinda sucks. I want my code up there, with the
cool kids’ code in the really popular library.” Now, literally 3 years later, I
understood exactly why Coda made that move. He didn’t use Librato. He had no
idea what would make a good or bad Librato client. It was just more surface area
to support. He had enough to worry about with core Metrics and DropWizard
features, keeping up with JVM changes and compatibility issues, etc., never mind
other projects.&lt;/p&gt;

&lt;h3 id=&quot;the-path-forward&quot;&gt;The Path Forward&lt;/h3&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/gometrics/wood_joints.jpg&quot; alt=&quot;well fitted pieces&quot; class=&quot;constrained&quot; /&gt;&lt;br /&gt;
    &lt;small&gt;Cropped from photo by &lt;a href=&quot;https://www.flickr.com/photos/matthewbyrne/3802556684&quot;&gt;matthewbyrne&lt;/a&gt;&lt;/small&gt;
&lt;/p&gt;

&lt;p&gt;Though Kyle points out that &lt;a href=&quot;https://twitter.com/aphyr/status/618905828846866432&quot;&gt;this may not be the best approach for every
project&lt;/a&gt;,
it seemed very clear to me that the only way the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go-metrics&lt;/code&gt; lib could continue
to be maintained, at least by myself and Wade, was to modularize and move
any external dependencies out to their own libraries - with their own
maintainers, and hopefully their own communities. It’s not going to make the
“moving target API” problem any easier, but it’ll put the
solution into the hands of the people who are actually interacting with the
problem and have a vested interest in achieving and maintaining a palatable
solution. It removes me, Richard, and Wade, completely uninterested and
uninitiated bystanders, from the critical path to a solution.&lt;/p&gt;

&lt;p&gt;At the end of the day, it’s just Separation of Concerns. It’s just good
organization. The task is broken up into small semi-independent pieces with
responsibility for each piece given to the person with the most interest in that
piece. There’s a corresponding and very palpable feeling of psychological
relief.  “Review the PRs for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go-metrics&lt;/code&gt;” is no longer this huge nebulous task
that will require a huge amount of context and deep understanding of some
additional system. I know the core APIs. I can evaluate changes to that fairly
quickly.&lt;/p&gt;

&lt;h3 id=&quot;practical-tips-for-maintainers&quot;&gt;Practical Tips For Maintainers&lt;/h3&gt;

&lt;p&gt;If you find yourself maintaining a small OSS project with a fairly well defined
scope and API, here are some tips to keep yourself sane (some of these are more
general, not specific to the above story):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Always have a buddy.&lt;/strong&gt; If your project gets any traction and you start
seeing community adoption, find one or more particularly enthusiastic users and
convince them to help carry the load. We all want to take care of our baby
projects, but real life is what it is. People change jobs, have health issues,
go on lengthy vacations, start families, become vampires. Some combination of
those things will likely make your interest in any given project oscillate, and
you should have a framework in place for making sure you don’t create another
zombie on GitHub.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resist dependencies.&lt;/strong&gt; If someone creates a PR which brings in a new library,
especially code that talks to something over the network - a server or SaaS
of some kind - strongly consider pushing the author towards starting their own
library. If this is not possible due to a lack of APIs, invest the time in
adding hooks instead. It’ll be worth it.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Have a concise contribution policy.&lt;/strong&gt; This will greatly reduce the burden of
having to reply to PRs that suffer from obvious code quality issues. It is an
absolute MUST to have a pre-written set of rules to appeal to instead of having
to post seemingly arbitrary responses to individual PR authors.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Enforce guidelines automatically whenever possible.&lt;/strong&gt; We are living in a
remarkable age. The tools available to maintainers are simply amazing. With the
help of services like GitHub, TravisCI, CodeClimate, etc., there’s no need to
maintain a mailing list, apply patches by hand, set up some jury-rigged systems
for running tests. It’s all free, and it’s all great. Use it. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go-metrics&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go-tigertonic&lt;/code&gt; do not take advantage of the OSS ecosystem, and I am about to
fix that. One other small note here: you should make it very easy to replicate
the exact process that the build is going to perform locally.  There should be a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Makefile&lt;/code&gt; or something similar containing the one command that the build tool
is going to run so that folks can validate their branches easily without having
to wait on the CI tool to run against their PR.&lt;/li&gt;
&lt;/ul&gt;
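That last tip can be as small as a few lines of Make for a Go project. This is an illustrative sketch, not go-metrics’ actual build setup - the point is just that CI runs exactly `make check`, and so can anyone validating a branch locally:

```make
# One entry point that both CI and contributors run, so a PR can be
# validated locally exactly as the build tool will run it.
.PHONY: check vet test

check: vet test

vet:
	go vet ./...

test:
	go test ./...
```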

&lt;p&gt;Hopefully you find our experience with maintaining and reviving &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;go-metrics&lt;/code&gt;
helpful, and this story helps you avoid similar pitfalls. Happy hacking.&lt;/p&gt;
</description>
				<published>2015-07-12 00:00:00 +0000</published>
				<link>http://blog.mihasya.com/2015/07/12/sustainable-oss-projects.html</link>
			</item>
		
			<item>
				<title>A failure months in the making</title>
				<description>&lt;p&gt;&lt;em&gt;This is the story of an outage that occurred on September 25th, 2014, and has
previously been discussed in the context of blameless post mortems on the
&lt;a href=&quot;http://blog.pagerduty.com/2014/10/blameless-post-mortems-strategies-for-success/&quot;&gt;PagerDuty blog&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you attended Surge 2014, you may have noticed something strange: a man was
sitting on one of the cube-shaped stools in the Fastly expo area hunched over
his laptop almost the entire day, and well into the evening hours. Even if you
didn’t notice, and even if you weren’t even AT the conference, you may be
curious about this man. The security guard certainly was, as he made his rounds
after dark, long after everyone had left the expo area..&lt;/p&gt;

&lt;p&gt;That man was yours truly; I was fixin’ stuff. This is the story of what
happened.&lt;/p&gt;

&lt;h3 id=&quot;the-outage&quot;&gt;The Outage&lt;/h3&gt;

&lt;p&gt;On September 24th Opsmatic was one of the many AWS customers to receive
one of these emails:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;One or more of your Amazon EC2 instances are scheduled to be rebooted for
required host maintenance. The maintenance will occur sometime during the
window provided for each instance. Each instance will experience a clean
reboot and will be unavailable while the updates are applied to the underlying
host. This generally takes no more than a few minutes to complete.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The EC2 Event Console confirmed that quite a few instances in our infrastructure
would be affected:&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/rebootorama/reboot_schedule.png&quot; alt=&quot;reboot schedule&quot; class=&quot;constrained&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;All the servers would be rebooted early Friday or Saturday morning SF time..
while I was at the conference. There was not much certainty in the exact timing
or order of the reboots (the windows were 4 hours long), but we did eventually
discover some good news:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Any instances using EBS for their root volume could be put through a
stop/start cycle in advance of the window to avoid the reboot.&lt;/em&gt; When you “stop”
an instance, you’re essentially destroying it, but the EBS volume survives. When
you “start” it back up, you get no guarantees about which “host” will receive
the instance that will then boot that volume. This is where “ephemeral” drives
get their names - they are attached to the “host” and do not survive a
stop/start.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Any instances provisioned after the notifications went out would not need to
be rebooted&lt;/em&gt;. As we later learned, the reboots were necessary for Amazon to
roll out a patch to Xen which fixed &lt;a href=&quot;http://xenbits.xen.org/xsa/advisory-108.html&quot;&gt;XSA 108&lt;/a&gt;.
Many hypervisor “hosts” were already running patched code, so Amazon would
simply put new instances on already-patched hosts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since every single piece of Opsmatic’s infrastructure is redundant at least at
the instance level, we quickly concluded that this was actually not that big of
a deal:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;All of our nodes used EBS root volumes, so they could be stop/started&lt;/li&gt;
  &lt;li&gt;Most of our nodes did not use ephemeral storage for anything important&lt;/li&gt;
  &lt;li&gt;The affected nodes that DID use ephemeral storage were Cassandra nodes. Since
we use a replication factor of 3, we could afford to have at least one of those
rebooted at any time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We briefly debated pre-emptively re-provisioning the Cassandra nodes anyway, but
decided that it was better to just let the reboot happen. Copying data is time
consuming, and the reboots were hours away. We would get up just before the
maintenance window started and, out of an abundance of caution, gracefully stop
Cassandra on the node about to be rebooted.&lt;/p&gt;

&lt;p&gt;To minimize the amount of odd-hours activity, we decided to stop/start all the
stateless nodes that were scheduled to be rebooted on our own terms, during
business hours. Since I was already at a conference, I’d take care of it in
order to minimize disruption to the rest of the team back home, cranking away.&lt;/p&gt;

&lt;p&gt;At around 13:50 PDT I started the process. I stop/started one of our NAT nodes
without incident. Then things got a little murky.&lt;/p&gt;

&lt;p&gt;For some reason, I decided to actually replace one of the nodes, but I don’t
remember why. I did not make any record of my reasoning. It is entirely possible
that I got distracted between the last node and the next one and went to
reprovision it instead of just doing a stop/start cycle. It’s also possible
there was some other issue with the node, and I simply failed to document it.&lt;/p&gt;

&lt;p&gt;At about 14:15 PDT, I terminated one of our “stack” nodes (they run all the
services that power the Opsmatic app) and then went to replace it.&lt;/p&gt;

&lt;p&gt;We had provisioned our AWS infrastructure using &lt;a href=&quot;https://github.com/opscode/chef-provisioning&quot;&gt;Chef Metal&lt;/a&gt;
so replacing the node should have been as simple as terminating it and then
“converging” the infrastructure - a single, global command that does not take
any parameters other than the declaration of what your infrastructure should
look like (number of nodes in each cluster, etc). Chef, in theory, would detect
that the “stack” cluster was missing a node and provision a new one to replace
it.&lt;/p&gt;

&lt;p&gt;So that is what I did. Replacing a node in our infrastructure is a routine
operation that we had practiced several times without incident.&lt;/p&gt;

&lt;p&gt;At 14:20 PDT Opsmatic went down in flames. The Chef run restarted &lt;em&gt;every single
instance in our infrastructure.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Talk about a “Game Day”…&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/rebootorama/pagerduty_report.png&quot; alt=&quot;pages galore&quot; class=&quot;constrained&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;As soon as the instances came back up, we scrambled to make sure that all the
services were back to normal. We were down for a total of about 30 minutes, in
part because there were certain parts of the recovery process that were not as
smoothly automated as we had thought; these defects became very apparent during
the previously un-tested “restart the entire infrastructure” scenario.&lt;/p&gt;

&lt;h3 id=&quot;the-causes&quot;&gt;The Causes&lt;/h3&gt;

&lt;p&gt;Once service was restored, we started trying to figure out what the hell had
happened. Meanwhile, the delightful Surge lightning talks were drawing
uproarious laughter in the main ballroom behind me.&lt;/p&gt;

&lt;p&gt;As I scrolled frantically through the log from my fateful Chef run, I saw a bunch
of lines like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[2014-09-25T21:18:39+00:00] WARN: Machine ******.opsmatic.com (i-*******
on fog:AWS:************:*********) was started but SSH did not come up.
Rebooting machine in an attempt to unstick it ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One per server. We quickly confirmed in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#chef&lt;/code&gt; IRC channel that this was a
bug - because Chef could not establish an SSH connection to these nodes,
it decided to reboot them. That, apparently, should not have happened.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[2014-09-25T18:30:13-0400]
&amp;lt;johnewart&amp;gt; Ah, well -- you managed to uncover a bug by doing that
&amp;lt;johnewart&amp;gt; we should only reboot it if it's within the first 10 minute window
&amp;lt;johnewart&amp;gt; like, you create, and then try to run again 5 minutes later and it can't connect
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After a bit more digging, we sorted out that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chef-metal&lt;/code&gt; had been relying on
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ubuntu&lt;/code&gt; user being present on all our machines along with a specific
private key. Something had caused the home directory for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ubuntu&lt;/code&gt; user to be
deleted.&lt;/p&gt;

&lt;p&gt;At this point I remembered something: a LONG time ago, before Opsmatic even had
a name, I had done some experiments with AWS. As part of that, I had a
bootstrapping scheme which relied on the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ubuntu&lt;/code&gt; user (standard practice
when provisioning Ubuntu AMIs), but also included a recipe called
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;remove_default_users&lt;/code&gt; which nuked the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ubuntu&lt;/code&gt; user once bootstrap was
complete.&lt;/p&gt;

&lt;p&gt;This bootstrap process was never used for anything serious - the initial
iteration of Opsmatic’s infrastructure was one big server at an MSP; from there,
we moved straight to the Chef-driven AWS setup. However, that small bit
of cruft persevered in our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chef-repo&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;My hunch was correct. Although &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;remove_default_users&lt;/code&gt; was never part of any
roles or run lists in the new infrastructure, we were able to confirm that it
was applied on all the nodes on August 31st (just a couple of days after the
last time we had practiced replacing a node) by performing a search in Opsmatic
itself:&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/rebootorama/opsmatic_chef.png&quot; alt=&quot;chef report&quot; class=&quot;constrained&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;However, by the time of the outage it was once again absent from all run lists.
So how did it get there on August 31st and how was it ultimately removed? That
would take another couple of weeks to figure out.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;remove_default_users&lt;/code&gt; recipe was clearly dead weight; we had gotten a
little sloppy and let a bit of invisible technical debt accumulate. In order to
prevent the same thing from happening again, we immediately deleted the recipe. This
had another nice side-effect: the next time the recipe appeared in a run list,
Chef would fail. We have good visibility into those failures in Opsmatic, so we
would be able to react and debug “in the moment.”&lt;/p&gt;

&lt;p&gt;That exact thing happened on October 14th: as I was doing some
refactoring in our cookbooks and roles, I found chef failing because it could
not find &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;remove_default_users&lt;/code&gt;. I knew I was about to find something important
- something slippery, elusive, confusing, and damaging. Indeed.&lt;/p&gt;

&lt;p&gt;The recipe was originally part of a cookbook called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base&lt;/code&gt; - a collection of
resources that needed to be applied to all nodes. As we moved to a
“more-than-one-node” setup, we started using Chef roles to define run lists. The
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base&lt;/code&gt; cookbook was pulled apart and reconstituted as a role to be included in
other roles. There was a step in the refactor where “parity” was achieved - the
role was made to replicate the previous behavior exactly. At that point, the
role was copied into another file called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base-original.json&lt;/code&gt; to be used as a
reference as pieces of it were pulled into other cookbooks etc. Many edits were
then made to the role in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base.json&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base-original.json&lt;/code&gt; file stuck around in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;roles&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;But here’s the thing about a role file: unlike cookbooks, the name of the role
doesn’t just come from the filename; it comes from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;name&lt;/code&gt; field defined
inside.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ head roles/base.json 
{
  &quot;name&quot;: &quot;base&quot;,
  &quot;description&quot;: &quot;base role configures all the defaults every host should have&quot;,
  &quot;json_class&quot;: &quot;Chef::Role&quot;,
...
$ head roles/base-original.json 
{
  &quot;name&quot;: &quot;base&quot;,
  &quot;description&quot;: &quot;base role configures all the defaults every host should have&quot;,
  &quot;json_class&quot;: &quot;Chef::Role&quot;,
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The majority of time spent working on Chef is spent working on cookbooks, so
it’s easy to forget the subtle differences in behavior with roles.&lt;/p&gt;

&lt;p&gt;So what had happened was this: while modifying something else about the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base&lt;/code&gt;
role, I had assumed that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base-original&lt;/code&gt; were different roles that
were both in use. I had modified both files and uploaded them both to the Chef
server, first &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base&lt;/code&gt;, then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base-original&lt;/code&gt;. In reality, they both updated the
same role, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base-original&lt;/code&gt; content won out because it was uploaded
second. Chef ran at least once with this configuration, deleting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ubuntu&lt;/code&gt;
user. Some time later, someone who DID know that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base-original&lt;/code&gt; was not to be
uploaded made yet more changes and only uploaded &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base&lt;/code&gt;, wiping
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;remove_default_users&lt;/code&gt; out once more. By the time the epic reboot happened, it
was gone from the run list again, leaving us to scratch our heads.&lt;/p&gt;

&lt;p&gt;Because the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ubuntu&lt;/code&gt; user was created by the provisioning process and not
explicitly managed by Chef, it was not re-created.&lt;/p&gt;
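
&lt;p&gt;One way to close that particular gap - a sketch, not something we had in place
at the time - is to declare the user explicitly in a recipe that runs on every
node, so that a converge re-creates it if anything deletes it. The attribute
values here are illustrative:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# hypothetical recipe: keep the provisioning user under Chef's control
user 'ubuntu' do
  home '/home/ubuntu'
  shell '/bin/bash'
  manage_home true   # re-create the home directory if it goes missing
  action :create
end
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The SSH key that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chef-metal&lt;/code&gt; relies on would still need to be placed
into that home directory by another resource.&lt;/p&gt;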

&lt;p&gt;&lt;em&gt;Whoever ran &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chef-metal&lt;/code&gt; next was going to cause a global reboot.&lt;/em&gt; It just so
happened that I did it from a conference and ended up spending my evening
plugged into an expo booth’s outlet.&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/rebootorama/selfie.png&quot; alt=&quot;outage selfie&quot; class=&quot;constrained&quot; /&gt;
&lt;/p&gt;

&lt;h3 id=&quot;remediations-and-learnings&quot;&gt;Remediations and Learnings&lt;/h3&gt;

&lt;h4 id=&quot;computers-are-hard&quot;&gt;Computers are Hard&lt;/h4&gt;

&lt;p&gt;Managing even a small infrastructure requires discipline, precision, and
thoroughness. The smallest bit of cruft can combine with other bits of cruft to
form a cruft snowball (cruftball?) of considerable heft over a relatively short
time period.&lt;/p&gt;

&lt;h4 id=&quot;cookbooks-vs-roles&quot;&gt;Cookbooks vs Roles&lt;/h4&gt;

&lt;p&gt;This sort of failure is exactly the cause of the trend towards “role cookbooks”
replacing the role primitive. Having a recipe that is simply a collection of
other recipes is functionally identical to a role, but has a few advantages -
namely versioning (enough said) and consistent behavior with resource cookbooks.
Having a recipe named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base-original.rb&lt;/code&gt; would have had no effect on a recipe
named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;base.rb&lt;/code&gt;.&lt;/p&gt;
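
&lt;p&gt;A role cookbook is nothing more than a recipe made up of
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;include_recipe&lt;/code&gt; calls. As a sketch (the cookbook and recipe names here are
made up):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# cookbooks/base/recipes/default.rb
# functionally a role, but versioned and resolved like any other cookbook
include_recipe 'base::users'
include_recipe 'base::packages'
include_recipe 'base::monitoring'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;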

&lt;h4 id=&quot;chef-metal&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chef-metal&lt;/code&gt;&lt;/h4&gt;

&lt;p&gt;While the theory behind &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chef-metal&lt;/code&gt; sounds good, we have started switching away
from it. Bugs and immaturity are the immediate problems, but it would be foolish
to act like those don’t exist in all software, including whatever other scheme
we end up using. This single bug is not why we’re migrating away.&lt;/p&gt;

&lt;p&gt;The theory behind &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chef-metal&lt;/code&gt; itself sounds good, and it’s the
“right” sort of automation, i.e. it’s not &lt;a href=&quot;http://www.kitchensoap.com/2012/09/21/a-mature-role-for-automation-part-i/&quot;&gt;just scripting steps normally
performed by a human&lt;/a&gt;.
However, it was very alarming how easily a very localized, routine change which
had been successfully executed fairly recently turned into a global disaster.
This is a big red flag for any system. It is an indicator of &lt;em&gt;unnecessary
coupling&lt;/em&gt;. Every time we wanted to add any node to our infrastructure, however minor
and auxiliary, we’d have to perform an operation that touches &lt;em&gt;everything&lt;/em&gt;.
Having witnessed the potential for disaster, we would feel a healthy dose of
The Fear each time. In the long run, if we’re afraid to perform simple tasks
with the provisioning system, we’re not going to provision and replace nodes
as frequently. Whenever you stop doing something regularly, you become bad at
it. Routine operations should have routine consequences.&lt;/p&gt;

&lt;p&gt;There are also more tactical concerns: “can’t SSH to this server, better reboot
it” sounds EXACTLY like automating a manual ops process, and a bad one at that.
Then there’s the security angle: even with the bug fixed, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chef-metal&lt;/code&gt; still
requires SSH access to the servers it manages with elevated credentials. In
other words, you have to keep the provisioning user (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ubuntu&lt;/code&gt; in our case)
around on your instances forever. We strongly dislike that - it adds another
little bit to the surface area.  Sure, you need to be on a private network in
order to get to SSH in the first place, but it’s another hidden back door that’s
easy to neglect. We’d rather not have it.&lt;/p&gt;

&lt;p&gt;We haven’t had much time to think about it, but this approach may work much
better when applied at the container level, one step removed from the actual
infrastructure. We may investigate it in the future. For now, our infrastructure
is small, homogeneous, and simple enough that we will simply be switching to a
more “transactional” provisioning process.&lt;/p&gt;

&lt;h4 id=&quot;documenting-and-finishing-big-migrations-quickly&quot;&gt;Documenting and Finishing Big Migrations Quickly&lt;/h4&gt;

&lt;p&gt;A huge part of this was just technical debt - recipes, cookbooks, and roles left
over through consecutive refactors. Even in a “simple” infrastructure, success
and safety depend on a vast set of shared assumptions about how things work. As
individuals change the systems’ behavior, the change has to be explicit, easy to
understand, and easy to remember. Pieces being left around from “the old way”
make it easy to make a no-longer-valid assumption.&lt;/p&gt;

&lt;h4 id=&quot;things-we-should-add-to-opsmatic&quot;&gt;Things We Should Add To Opsmatic&lt;/h4&gt;

&lt;p&gt;We’re constantly improving teams’ visibility into changes and important events
in their infrastructure. That we were able to find when a particular recipe was
applied was great, but the experience also illuminated some gaps in our view of CM
(e.g. role/run list changes, and some “meta” features to surface such changes). We’re
hard at work converting what we learned into real improvements in the product.&lt;/p&gt;

&lt;h3 id=&quot;parting-thoughts&quot;&gt;Parting Thoughts&lt;/h3&gt;

&lt;p&gt;As soon as we recovered from this outage, I thought “I’m going to have to write
about this.” It is a great example of a complex system failure, “like the ones
you read about.” It served as a great, rapid refresher course on complex system
theory; it reminded us that we have to minimize coupling and interactions within
our systems constantly and ruthlessly.&lt;/p&gt;

&lt;p&gt;If you enjoyed this story (you sadist), you’ll probably like the following posts
and books in the broader literature.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265/&quot;&gt;&lt;strong&gt;The Field Guide to Understanding Human Error&lt;/strong&gt;&lt;/a&gt;
by Sidney Dekker, and pretty much anything else by Dekker on the subject of
human error and human factors.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.amazon.com/Normal-Accidents-Living-High-Risk-Technologies/dp/0691004129&quot;&gt;&lt;strong&gt;Normal Accidents&lt;/strong&gt;&lt;/a&gt;
by Charles Perrow - a great introduction to complex systems, complete with great
anecdotes from a number of different fields.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://whilefalse.blogspot.com/2012/12/make-it-easy.html&quot;&gt;&lt;strong&gt;Make It Easy&lt;/strong&gt;&lt;/a&gt;
by Camille Fournier is a great concise post on the importance of designing
systems and processes with the operator in mind.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.kitchensoap.com/&quot;&gt;&lt;strong&gt;Kitchen Soap Blog&lt;/strong&gt;&lt;/a&gt; by John Allspaw is a
great source for keeping abreast of developments in complex system failure, as
well as ops and ops management in general.&lt;/li&gt;
  &lt;li&gt;Amazon’s &lt;a href=&quot;http://aws.amazon.com/message/65648/&quot;&gt;&lt;strong&gt;Epic 2011 Post Mortem&lt;/strong&gt;&lt;/a&gt; - I
mentioned this post in my &lt;a href=&quot;http://surge.omniti.com/2011/speakers/mike-panchenko&quot;&gt;Surge 2011
talk&lt;/a&gt; because it read so
much like parts of the Three Mile Island nuclear accident’s description in
&lt;em&gt;Normal Accidents&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
</description>
				<published>2014-11-08 00:00:00 +0000</published>
				<link>http://blog.mihasya.com/2014/11/08/A-failure-months-in-the-making.html</link>
			</item>
		
			<item>
				<title>Two Factor Auth: Allow AWS IAM users to manage their own MFA devices</title>
				<description>&lt;p&gt;&lt;em&gt;(all info and screenshots are from 09/02/2014)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In light of all the recent incidents involving attackers taking control of a
company’s root AWS account, most everyone I know who is managing any sort of
infrastructure (myself included) has been re-auditing accounts and stepping up
efforts to get everyone on their teams to turn on MFA (multi-factor authentication). MFA
makes it impossible for someone to log in as you with just a username/password
combo. An additional “factor” is required to confirm the user’s identity -
typically a code from a synchronized number sequence. This has been standard
practice in larger companies and capital-E Enterprise for many years, and is now
starting to be taken seriously by folks operating at a smaller scale and in the
cloud. No one wants to be the &lt;a href=&quot;http://it.slashdot.org/story/14/06/18/1513252/code-spaces-hosting-shutting-down-after-attacker-deletes-all-data&quot;&gt;next
tragedy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;MFA (or 2-factor auth) has traditionally been embodied by RSA tokens
attached to a keychain or a badge lanyard. These days, your phone can act as an
adequate substitute.&lt;/p&gt;

&lt;p&gt;Turning on MFA for your root AWS account is fairly easy:&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/awsmfa/root_mfa.png&quot; alt=&quot;mfa device for root acct&quot; class=&quot;constrained&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;However, it took me an unfortunate amount of time to figure out how to allow
users created as IAM accounts to manage their own MFA devices. Setting people’s
devices up by hand through the root account was simply not an acceptable
solution. Even at our size it was going to be a major headache, especially
for our remote employee.&lt;/p&gt;

&lt;p&gt;In the end, it’s all documented in AWS docs, but it’s a bit buried, and multiple
steps are involved. Hopefully this post saves you some time.&lt;/p&gt;

&lt;h3 id=&quot;just-the-right-amount&quot;&gt;Just The Right Amount&lt;/h3&gt;

&lt;p&gt;The critical thing is to give everyone JUST what they need and no more. Since
you’ve already secured your root account, you can likely curtail the breach of
an IAM account reasonably quickly, but it’s best if the account can wreak minimal
havoc in the first place. For example, if a compromised account was able to 
fiddle with the credentials of other users, the exposure and cleanup effort
would increase greatly.&lt;/p&gt;

&lt;p&gt;Unfortunately, the IAM permissions policy system is rather arcane. That is an
undesirable property for a security-related system to have (easy to get wrong),
but alas, it’s the one we’ve got.&lt;/p&gt;

&lt;p&gt;IAM Policies are made up of combinations of JSON blobs (“stanzas”), each containing a
unique identifier, an effect (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Allow&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Deny&lt;/code&gt;), an action, and a resource to
which the effect/action combo should be applied. There’s a whole bunch of
documentation on the subject
&lt;a href=&quot;http://docs.aws.amazon.com/IAM/latest/UserGuide/PermissionsOverview.html&quot;&gt;here&lt;/a&gt;
so I won’t spend too much time elucidating it. Let’s cut straight to what we
need.&lt;/p&gt;

&lt;h3 id=&quot;mfa-device-permissions&quot;&gt;MFA Device Permissions&lt;/h3&gt;

&lt;p&gt;When you create an IAM user, by default they are unable to do literally
anything. When you pull up the IAM dashboard (where you have to go in order to
set up your MFA device), you just see permission errors everywhere:&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/awsmfa/no_perms.png&quot; alt=&quot;no permissions by default&quot; class=&quot;constrained&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;“Well that sucks,” I thought, looking over a co-worker’s shoulder. Googling
“allow IAM user to manage own mfa device,” we find this lovely page:
&lt;a href=&quot;http://docs.aws.amazon.com/IAM/latest/UserGuide/Credentials-Permissions-examples.html&quot;&gt;Example Policies for Administering IAM Resources&lt;/a&gt;.
Under the heading “Allow Users to Manage Their Own Virtual MFA Devices (AWS
Management Console)”, we find an example policy that should do the trick.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Sid&quot;: &quot;AllowUsersToCreateDeleteTheirOwnVirtualMFADevices&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: [&quot;iam:*VirtualMFADevice&quot;],
      &quot;Resource&quot;: [&quot;arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:mfa/${aws:username}&quot;]
    },
    {
      &quot;Sid&quot;: &quot;AllowUsersToEnableSyncDisableTheirOwnMFADevices&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: [
        &quot;iam:DeactivateMFADevice&quot;,
        &quot;iam:EnableMFADevice&quot;,
        &quot;iam:ListMFADevices&quot;,
        &quot;iam:ResyncMFADevice&quot;
      ],
      &quot;Resource&quot;: [&quot;arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:user/${aws:username}&quot;]
    },
    {
      &quot;Sid&quot;: &quot;AllowUsersToListVirtualMFADevices&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: [&quot;iam:ListVirtualMFADevices&quot;],
      &quot;Resource&quot;: [&quot;arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:mfa/*&quot;]
    },
    {
      &quot;Sid&quot;: &quot;AllowUsersToListUsersInConsole&quot;,
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Action&quot;: [&quot;iam:ListUsers&quot;],
      &quot;Resource&quot;: [&quot;arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:user/*&quot;]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since this is in no way obvious, I will also note that the account ID is found
on the “Security Credentials” page of the root AWS account.&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/awsmfa/account_id.png&quot; alt=&quot;aws account ids&quot; class=&quot;constrained&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;This appears to be sufficient to let users find themselves in the “Users” menu,
click the “Manage MFA Device” button, and go through the rest of the process.&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/awsmfa/iamtestuser_mfa.png&quot; alt=&quot;test user's mfa button&quot; class=&quot;constrained&quot; /&gt;
&lt;/p&gt;

&lt;h3 id=&quot;passwords-etc&quot;&gt;Passwords etc&lt;/h3&gt;

&lt;p&gt;I also found it useful to give our users the ability to manage the rest of their
own credentials. The relevant policy stanzas can be found
&lt;a href=&quot;http://docs.aws.amazon.com/IAM/latest/UserGuide/Credentials-Permissions-examples.html#creds-policies-credentials&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Surprisingly, the default “Password Policy” on our AWS account was set to
allow passwords as short as 6 characters with no additional requirements. Even
with MFA enabled, you’ll want to crank that up to something quite a bit more
robust.&lt;/p&gt;

&lt;h3 id=&quot;keeping-the-robots-at-bay&quot;&gt;Keeping the robots at bay&lt;/h3&gt;

&lt;p&gt;One other important aspect of our setup is that only humanoid users are
able to manage their own credentials. We have a number of automation-related
“bot” accounts with security policies tailored specifically to their
purpose - the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backup&lt;/code&gt; user only has access to a specific S3 bucket, the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dnsupdater&lt;/code&gt; user only has access to a specific Route53 zone, etc. Even with
this limited set of permissions, it’s important to make it difficult for an
attacker to gain control of these users. They do not have passwords, and they
are never granted permissions to manage their own credentials. This is
accomplished by attaching the policies described above to a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;humans&lt;/code&gt; group and
only adding users with a verified heartbeat to that group.&lt;/p&gt;

&lt;h3 id=&quot;enforcing-a-policy&quot;&gt;Enforcing a Policy&lt;/h3&gt;

&lt;p&gt;We have a policy of not allowing access to any AWS resources without an MFA
device enabled. However, a policy is only as good as its enforcement. A brief
google didn’t turn up any automated tools to do the job, though I did not
try very hard. I did find that the &lt;a href=&quot;http://aws.amazon.com/cli/&quot;&gt;AWS CLI
tool&lt;/a&gt; has an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws iam get-credential-report&lt;/code&gt;
command, which returns a base64-encoded CSV file containing information about
all the IAM users’ credentials. One of the columns is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mfa_active&lt;/code&gt;, so the data
is all there to automatically enforce an MFA policy.&lt;/p&gt;

&lt;p&gt;(&lt;strong&gt;NB:&lt;/strong&gt; you have to run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws iam generate-credential-report&lt;/code&gt; beforehand. Full docs are &lt;a href=&quot;http://docs.aws.amazon.com/IAM/latest/UserGuide/credential-reports.html&quot;&gt;here&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;For example, the following python snippet (available as a gist
&lt;a href=&quot;https://gist.github.com/mihasya/a1fd1c4bbef04495a12b&quot;&gt;here&lt;/a&gt;) will parse the
contents of the report and tell you who doesn’t have MFA enabled. All you have
to do is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chmod +x&lt;/code&gt; the file to make it executable, then pipe the report into it
like so: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws iam get-credential-report | ./scripts/parse_credential_report.py&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/usr/bin/env python
from sys import stdin
import json
import base64

report = json.loads(stdin.read())
table = base64.b64decode(report[&quot;Content&quot;]).splitlines()
head = table[0].split(&quot;,&quot;)
table = table[1:]

for row in iter(table):
    user = dict(zip(head, row.split(&quot;,&quot;)))
    # you now have a dictionary with keys like `user`, `mfa_active`,
    # and `password_last_changed`
    print &quot;%s %s&quot; % (user[&quot;user&quot;], user[&quot;mfa_active&quot;])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For our current team size, growth rate, and compliance needs, this is
sufficient. I did come across an example of what a fully-fleshed out tool would
look like in the excellent &lt;a href=&quot;http://devopsweekly.com/&quot;&gt;DevOps Weekly&lt;/a&gt;: The
Guardian’s &lt;a href=&quot;https://github.com/guardian/gu-who&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gu-who&lt;/code&gt;&lt;/a&gt; for performing
account audits on GitHub accounts.&lt;/p&gt;
</description>
				<published>2014-09-02 00:00:00 +0000</published>
				<link>http://blog.mihasya.com/2014/09/02/Allow-aws-iam-users-to-manage-their-own-mfa-device.html</link>
			</item>
		
			<item>
				<title>Low-hassle HTTP metrics with Tigertonic and Go-metrics</title>
				<description>&lt;h3 id=&quot;first-things-first-what-the-shit-is-tigertonic&quot;&gt;First things first: What the shit is tigertonic?&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;http://gunshowcomic.com/338&quot;&gt;&lt;img src=&quot;/imgs/posts/tt-metrics/tigertonic.png&quot; class=&quot;right small&quot; /&gt;&lt;/a&gt;
Tigertonic is a framework for making webservices in Go written by Richard
Crowley (I have contributed a bug fix or a feature here and there). Its defining
characteristic is that it allows you to translate functions which take and
return specific Go types into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http.Handler&lt;/code&gt; implementations that understand and
return JSON payloads. Define your signature, pass it into the correct Tigertonic
wrapper, and out comes a web service that takes in JSON, unmarshals it to the
input type, passes it to your handler, then takes the return value from your
handler and marshals it into JSON for the response.&lt;/p&gt;

&lt;p&gt;It’s similar to JAX-RS/Jersey annotations, but with much less code, and with
most of the ugly bits hidden from the framework’s user.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href=&quot;https://github.com/rcrowley/go-tigertonic#usage&quot;&gt;the README&lt;/a&gt; for
more info. Richard has also &lt;a href=&quot;http://rcrowley.org/articles/tiger-tonic.html&quot;&gt;written&lt;/a&gt;
and &lt;a href=&quot;http://rcrowley.org/talks/gosf-2014-01-15.html#1&quot;&gt;spoken&lt;/a&gt; about
Tigertonic on various occasions. It’s all well worth reading.&lt;/p&gt;

&lt;p&gt;Here’s an example of a very simple tigertonic service:&lt;/p&gt;

&lt;pre&gt;
type Book struct {
        Author, Title string
}

// this takes a Book object and returns an empty body
func PutBook(u *url.URL, h http.Header, book *Book) (status int, responseHeaders http.Header, _ interface{}, err error){ ... } 
// this takes an empty body and returns a Book object
func GetBook(u *url.URL, h http.Header, _ interface{}) (status int, responseHeaders http.Header, book *Book, err error) {}

func main() {
        mux := tigertonic.NewTrieServeMux()
        mux.Handle(&quot;GET&quot;, &quot;/books/{book_id}&quot;, tigertonic.Marshaled(GetBook))
        mux.Handle(&quot;PUT&quot;, &quot;/books/{book_id}&quot;, tigertonic.Marshaled(PutBook))

        server := tigertonic.NewServer(&quot;localhost:34334&quot;, mux)
        log.Fatal(server.ListenAndServe())
}
&lt;/pre&gt;

&lt;p&gt;(full code is &lt;a href=&quot;https://github.com/mihasya/ttmetricsexample/blob/master/basic/main.go&quot;&gt;here&lt;/a&gt;)&lt;/p&gt;

&lt;h3 id=&quot;so-you-want-some-metrics&quot;&gt;So You Want Some Metrics&lt;/h3&gt;

&lt;p&gt;At &lt;a href=&quot;http://opsmatic.com&quot;&gt;Opsmatic&lt;/a&gt; we strive to be a “learning organization” -
we want to learn something from every release, every change, every customer
interaction. An important component of that philosophy is an obsession with
measuring things. Jim, our CEO, wants “If you can’t measure it, don’t ship it”
written on his headstone when the time is right. No joke.&lt;/p&gt;

&lt;p&gt;One of the things we wanted to measure was the number of requests served by our
API. While we were at it, we thought we’d grab the timing data too for
operational purposes.&lt;/p&gt;

&lt;h3 id=&quot;go-metrics-and-tigertonic&quot;&gt;go-metrics and Tigertonic&lt;/h3&gt;

&lt;p&gt;Richard is adamant about everything in Tigertonic reducing to an implementation
of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http.Handler&lt;/code&gt;, and with good reason: doing so enables the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Handler&lt;/code&gt; that
actually performs the business logic to be wrapped in any number of completely
orthogonal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Handlers&lt;/code&gt; that handle all sorts of other concerns - logging, CORS rules,
authentication.. &lt;strong&gt;and metrics!&lt;/strong&gt; (the
&lt;a href=&quot;https://github.com/rcrowley/go-tigertonic/blob/master/README.md&quot;&gt;README&lt;/a&gt; lists
the available handlers.) The separation of concerns afforded by this approach is
truly refreshing.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/rcrowley/go-metrics&quot;&gt;Go-metrics&lt;/a&gt; is a library, also
maintained by Richard, that provides similar capabilities to Coda Hale’s great
&lt;a href=&quot;http://metrics.codahale.com/&quot;&gt;Java metrics library&lt;/a&gt;. It makes it very easy to
time and count things, as well as to extract the data from the timers and
counters.&lt;/p&gt;

&lt;p&gt;Tigertonic comes with a few wrappers that hook up our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Handlers&lt;/code&gt; directly
to these metrics. We’re going to look at a couple in particular: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Timed&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CountedByStatusXX&lt;/code&gt;. The former is a very thin wrapper around the functionality
of a go-metrics &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Timer&lt;/code&gt; - it just times the request and records the reading:&lt;/p&gt;

&lt;pre&gt;
func (t *Timer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        defer t.UpdateSince(time.Now())
        t.handler.ServeHTTP(w, r)
}
&lt;/pre&gt;

&lt;p&gt;The latter is a bit more involved, but is also ultimately a thin wrapper around
some go-metrics primitives; it counts the number of requests that result in a
given class of response codes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2XX&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;5XX&lt;/code&gt;, etc.). You can look at the code
&lt;a href=&quot;https://github.com/rcrowley/go-tigertonic/blob/abfd9c347631ef79c0b0d04e702c376efd5985fb/metrics.go#L155&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Adding a counter is done by calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tigertonic.Counted(yourHandlerHere, ...)&lt;/code&gt;.
Since the return value is also an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http.Handler&lt;/code&gt;, you can pass that to
tigertonic’s multiplexer or really anything that operates on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http.Handler&lt;/code&gt; -
including the stdlib http server.&lt;/p&gt;

&lt;h3 id=&quot;putting-it-all-together&quot;&gt;Putting it all together&lt;/h3&gt;

&lt;p&gt;The goal at the outset was to easily capture metrics on all our endpoints. How are we doing on that?&lt;/p&gt;

&lt;p&gt;Quite well, it turns out. All we have to do to achieve that goal is some wrapping:&lt;/p&gt;

&lt;pre&gt;
func wrapHandler(name string, h http.Handler) http.Handler {
        return tigertonic.CountedByStatusXX(
                tigertonic.Timed(
                        tigertonic.ApacheLogged(h),
                        name,
                        metrics.DefaultRegistry,
                ),
                name,
                metrics.DefaultRegistry,
        )
}
&lt;/pre&gt;

&lt;p&gt;Then we invoke this wrapper before registering our handlers:&lt;/p&gt;

&lt;pre&gt;
mux.Handle(&quot;GET&quot;, &quot;/books/{book_id}&quot;, wrapHandler(&quot;get-book&quot;, tigertonic.Marshaled(GetBook)))
mux.Handle(&quot;PUT&quot;, &quot;/books/{book_id}&quot;, wrapHandler(&quot;put-book&quot;, tigertonic.Marshaled(PutBook)))
&lt;/pre&gt;

&lt;p&gt;ET VOILA. We need to give our handlers some names for the purposes of metrics
collection, so we create a little wrapper function that takes that name and a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Handler&lt;/code&gt; and wraps it in all the properly named metrics collectors. When we
need to add more handlers, we wrap those too and the data shows up for
free. In the &lt;a href=&quot;https://github.com/mihasya/ttmetricsexample/blob/master/instrumented/main.go&quot;&gt;instrumented version of the code&lt;/a&gt;
you can see that I’ve also made a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;metrics.Log&lt;/code&gt; which spawns a 
reporter goroutine off into the background, printing out the stats every 10
seconds. There are a number of more useful reporters available - for example,
I’ve contributed a &lt;a href=&quot;https://github.com/rcrowley/go-metrics/blob/master/librato/librato.go&quot;&gt;Librato reporter&lt;/a&gt;
which posts the metrics to the &lt;a href=&quot;http://support.metrics.librato.com/knowledgebase/articles/66171-correlate-create-an-instrument-&quot;&gt;Librato API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/imgs/posts/tt-metrics/graphs.png&quot; class=&quot;constrained&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;slightly-more-advanced&quot;&gt;Slightly More Advanced&lt;/h3&gt;

&lt;p&gt;The full Opsmatic version of the above code is included below for additional
illustration. It is expanded to include the name of the service, some CORS
defaults, and two versions of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wrap&lt;/code&gt; method - one that includes a call to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tigertonic.Marshal&lt;/code&gt; and one that does not; we need the latter to accommodate a
couple of endpoints we have that do not return JSON.&lt;/p&gt;

&lt;pre&gt;
type OpsmaticService struct {
        serviceName    string
        allowedOrigins []string
        allowedHeaders []string
}

func NewOpsmaticService(name string, origins []string, headers []string) *OpsmaticService {
        return &amp;amp;OpsmaticService{name, origins, headers}
}

func NewDefaultOpsmaticService(name string) *OpsmaticService {
        return NewOpsmaticService(name, []string{&quot;[redacted]&quot;}, []string{&quot;Authorization&quot;})
}

func (self *OpsmaticService) WrapHandler(name string, h http.Handler) http.Handler {
        cors := tigertonic.NewCORSBuilder().AddAllowedOrigins(self.allowedOrigins...).AddAllowedHeaders(self.allowedHeaders...)

        return cors.Build(
                tigertonic.CountedByStatusXX(
                        tigertonic.Timed(
                                tigertonic.ApacheLogged(h),
                                fmt.Sprintf(&quot;%s-%s&quot;, self.serviceName, name),
                                metrics.DefaultRegistry,
                        ),
                        fmt.Sprintf(&quot;%s-%s&quot;, self.serviceName, name),
                        metrics.DefaultRegistry,
                ),
        )
}

func (self *OpsmaticService) MarshalAndWrapHandler(name string, f interface{}) http.Handler {
        return self.WrapHandler(name, tigertonic.Marshaled(f))
}
&lt;/pre&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;Using this little bit of boilerplate code, we can readily instrument new
endpoints as they come online without cluttering the code with counters and
timers. Using the aforementioned Librato reporter, we get graphs for new
endpoints that we deploy instantly and with zero additional wrangling. It’s
quite a nice setup that required a fairly modest amount of code and requires
very minimal marginal effort on new endpoints. We hope that you enjoy it as
well.&lt;/p&gt;
</description>
				<published>2014-02-07 00:00:00 +0000</published>
				<link>http://blog.mihasya.com/2014/02/07/tt-metrics.html</link>
			</item>
		
			<item>
				<title>The Myth of the Uninterrupted Programmer</title>
				<description>&lt;p&gt;&lt;img src=&quot;/imgs/posts/uninterrupted/warcraft.jpg&quot; class=&quot;right small&quot; /&gt;
This &lt;a href=&quot;http://blog.42floors.com/our-office-is-too-loud/#.UkSLhhb3BZK&quot;&gt;post about office noise level&lt;/a&gt;
and distractions came through my inbox, and a particular voice in the comments
section caught my eye.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Show me an office with caves and I’ll show you my resume”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Plenty of comments followed echoing this sentiment.&lt;/p&gt;

&lt;p&gt;While I agree that stretches of concentration are important for figuring out a
specific task, I think this chorus is at the heart of a serious misunderstanding
many engineers have about their value as members of an organization - one that
results in a tremendous amount of waste.&lt;/p&gt;

&lt;p&gt;Sure, constant interruptions and context switches are exhausting and difficult.
I’m not suggesting that we should spend all day turning from one conversation to
another. It’s easy to overdo meetings and office shenanigans. However, a
healthy amount of interaction and socialization has some very important
benefits.&lt;/p&gt;

&lt;h2 id=&quot;interruptions-cause-you-to-retrace-your-steps---this-is-often-good&quot;&gt;Interruptions cause you to retrace your steps - this is often good&lt;/h2&gt;

&lt;p&gt;There is a much less edifying real-life counterpoint to the widely romanticized
deeply concentrated programmer. It’s that of a programmer spending 4 hours
trying to track down a confusing, elusive bug, only to figure it all out 5
minutes after walking away from it.  I’ve done it, I’ve seen it, and I continue doing
it and seeing it.&lt;/p&gt;

&lt;p&gt;There’s a very simple explanation for this phenomenon: in order to be able to
reason about an algorithm, especially a complex one, we have to assume and take
a whole load of things for granted. The stack, the configuration, the interfaces
on top of which we’re working.&lt;/p&gt;

&lt;p&gt;An incorrect assumption is a common source of confusion and infuriating
debugging. If you’re lucky, the false assumption will be illuminated by a
debugger or a log line. However, the longer you’ve been staring at the same
problem, the more likely you are to miss something much simpler. That helper
function you stubbed out earlier while testing something else? Yeah, that’s
still there. You’ll feel real dumb when you remember.&lt;/p&gt;

&lt;p&gt;Interruptions - planned or unplanned - cause you to “resurface” and to have to
re-engage the problem almost from scratch. Part of that process is rebuilding
that chain of assumptions. Stepping back from a problem and seeing the bigger
picture is often much more productive than spinning down in the bowels of your
code.&lt;/p&gt;

&lt;p&gt;(Here’s a great &lt;a href=&quot;http://vimeo.com/44984049&quot;&gt;talk&lt;/a&gt; by Joe Damato with a pretty good
discussion of discovering violations of your basic assumptions.)&lt;/p&gt;

&lt;h3 id=&quot;re-reading-your-own-code-is-the-best-way-to-write-readable-code&quot;&gt;Re-reading your own code is the best way to write readable code&lt;/h3&gt;

&lt;p&gt;If you’re writing a bunch of code in a hurry, and especially if you’re doing so
while fighting through bugs, you’re likely leaving a disaster zone in your wake.
Even if you think you’re writing “clean code” and writing tests to go along with
it, there are probably sections in your code that barely make any sense by the
time you’ve gotten them to do what you want.&lt;/p&gt;

&lt;p&gt;Pair programming is one way of solving this - your passenger will point at the
screen and call you out for getting too fancy or too casual with your
single-letter variables. I’m still torn on pair programming, but I do think
it’s a great idea to re-read your own code regularly for reasons related to the
first section.&lt;/p&gt;

&lt;p&gt;While an interruption causing you to lose context can be annoying, the forced
re-construction of context can point out flaws in your reasoning and force you
to recognize sections of code that are hard to read - because you’ll have
trouble reading them too.&lt;/p&gt;

&lt;h2 id=&quot;your-peer-has-likely-seen-the-same-problem-before&quot;&gt;Your peer has likely seen the same problem before&lt;/h2&gt;

&lt;p&gt;We spend a lot of time talking about sharing code and know-how in the OSS
community. We’ve also been putting lots of emphasis on DRY - “Don’t Repeat
Yourself.” Well, it’s more like DRO - “Don’t repeat others.” This broader
message applies to your peers as well. When you’re dealing with OSS code and you
find a bug you can’t sort out, you ask the internet and see if anyone else has
had the same problem. For whatever reason, we find this easy, but we find
turning to our neighbor and asking the same thing difficult - PROBABLY because
we’re afraid of the stigma of interrupting them. So we spin our wheels. Awesome.&lt;/p&gt;

&lt;p&gt;Don’t forget that someone in the room is very likely to have used the same
software and tools you’re using, seen similar problems in the same or similar
systems, or, if you’re really lucky, wrote the damn thing in the first place.&lt;/p&gt;

&lt;p&gt;Interruptions often come with an opportunity to ask your colleagues - they
may well be interrupted too.&lt;/p&gt;

&lt;h2 id=&quot;are-you-even-solving-the-correct-problem&quot;&gt;Are you even solving the correct problem?&lt;/h2&gt;

&lt;p&gt;Many conversations between engineers about productivity make it sound like the
goal of programming is to write as many lines of code as possible. This has been
reinforced by stories of companies like Google which were “run by the
engineers.” I believe this has caused people to imagine the original Google
employees all furiously writing code for 16 hours a day without uttering a word
to each other or anyone else, inevitably producing the world’s best search
engine.&lt;/p&gt;

&lt;div class=&quot;left&quot;&gt;
&lt;img src=&quot;http://farm3.staticflickr.com/2600/3998279762_ae2c6ede06_n.jpg&quot; class=&quot;small&quot; /&gt;&lt;br /&gt;
&lt;small&gt;Photo by &lt;a href=&quot;http://www.flickr.com/photos/10422465@N00/3998279762&quot;&gt;
Paul Simpson&lt;/a&gt;&lt;/small&gt;
&lt;/div&gt;
&lt;p&gt;This is pure professional hubris. Hubris is all I hear when engineers bitch
about product and project managers interrupting them with all their “process.”
Sure, it’s easy to overdo, but it brings us back to that whole
&lt;a href=&quot;http://blog.mihasya.com/2013/06/11/how-do-i-devops.html&quot;&gt;“know your business”&lt;/a&gt;
thing.&lt;/p&gt;

&lt;p&gt;Sure, if you sit in your little cave for 16 hours, you’re going to write a whole
bunch of code. But… what did you just produce? Sure, it’s “correct” in the
strict engineering sense of the word - the right inputs produce the right
outputs, etc.. But is it correct in the context of a product? Did you actually
build something people will want? Does it work, as in, does it behave the way a
customer would expect?  Chances are it does not, because it’s hard to build
things for humans without talking to them.&lt;/p&gt;

&lt;p&gt;The reality of the matter is that Google’s early engineers were successful
because they were good at all those other things as well, not because they
ignored everything around them and ground code.&lt;/p&gt;

&lt;h2 id=&quot;how-hard-are-you-concentrating-anyway&quot;&gt;How hard are you concentrating, anyway?&lt;/h2&gt;

&lt;p&gt;You can tell engineers don’t REALLY mind being interrupted by just looking at
the constant shitpile of activity on HackerNews, Twitter, Google Plus, IRC, etc.
It’s not about interruptions. It’s just flat out whining. We don’t like getting
out of our comfort zone and thinking about things we’re not that good at
thinking about. Stop coming up with excuses and get better at it.&lt;/p&gt;

&lt;h2 id=&quot;interruptions-force-you-to-ship&quot;&gt;Interruptions force you to ship.&lt;/h2&gt;

&lt;p&gt;There’s no disputing that interruptions and context switches are painful and
difficult, but knowing that they’re coming can have a positive impact - if you
anticipate only having a couple of hours before you’re interrupted, you will
work in more incremental chunks, which lend themselves better to testing,
documentation, abstraction etc. These are all good things.&lt;/p&gt;

&lt;p&gt;For example - there are guests coming over for dinner shortly, so I’m just
going to wrap this up and post it. It’s too long as is.&lt;/p&gt;

&lt;h2 id=&quot;tldr&quot;&gt;tl;dr&lt;/h2&gt;

&lt;p&gt;Sitting in a dark basement in silence is great for leveling-up your World of
Warcraft character. It’s no way to build good, usable software. There’s no
substitute for good communication.&lt;/p&gt;
</description>
				<published>2013-11-17 00:00:00 +0000</published>
				<link>http://blog.mihasya.com/2013/11/17/myth-of-uninterrupted-productivity.html</link>
			</item>
		
			<item>
				<title>A Reliable, simple way to get a PDF out of Showoff</title>
				<description>&lt;p&gt;Perpetually agonized by actually using Keynote or Powerpoint to make slides, I
continue to use &lt;a href=&quot;http://github.com/schacon/showoff&quot;&gt;Showoff&lt;/a&gt; to make my slide
decks. Unfortunately, the codebase appears a bit neglected, and certain features
have stopped working very well over the course of re-installs. I have neither
the Ruby-fu nor the time nor the patience to figure out why PDF generation has
stopped working (I actually don’t think that particular feature ever worked for
me at all), so I’ve had to resort to trickery.&lt;/p&gt;

&lt;p&gt;I am posting this here because I keep forgetting how to do this and having to
blindly figure it out each time. Hopefully my own blog will be an obvious enough
place to look. &lt;strong&gt;This has only been tested on a Mac using Chrome&lt;/strong&gt;, but it looks
like Safari will work too with a bit of tweaking.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Add the following to a css file that is included in your preso&lt;/li&gt;
&lt;/ol&gt;

&lt;pre&gt;
    #preso {
        width: 11in;
        height: 8in;  /* this may need to be lowered slightly for Safari */
    }
    .slide {
        width: 11in;
        height: 8in; /* this may need to be lowered slightly for Safari */
    }
&lt;/pre&gt;

&lt;ol start=&quot;2&quot;&gt;
  &lt;li&gt;run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;showoff serve&lt;/code&gt; from your repo&lt;/li&gt;
  &lt;li&gt;Go to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://localhost:9090/singlepage&lt;/code&gt; (obviously port may vary if you used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-p&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;Use your browser’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Print&lt;/code&gt; function to generate a PDF&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;DONE. Happy PDFin.&lt;/p&gt;
&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/showoff/print_showoff.png&quot; alt=&quot;print dialog&quot; class=&quot;constrained&quot; /&gt;
&lt;/p&gt;
</description>
				<published>2013-11-17 00:00:00 +0000</published>
				<link>http://blog.mihasya.com/2013/11/17/PDF-from-showoff.html</link>
			</item>
		
			<item>
				<title>How Do I DevOps?</title>
				<description>&lt;p&gt;There is lots of talk about what DevOps is and means, even a &lt;a href=&quot;http://en.wikipedia.org/wiki/DevOps&quot;&gt;Wikipedia
page&lt;/a&gt;, to which I may soon give some much
needed love. However, a friend recently asked if I knew anyone worth hiring for
a “devops” role, and I found myself asking clarifying questions about the sort
of &lt;em&gt;person&lt;/em&gt; he had in mind. Seemed worth writing down.&lt;/p&gt;

&lt;p&gt;The friend was looking for engineers. So what does it mean for an engineer to be
devops-y?&lt;/p&gt;

&lt;h2 id=&quot;tldr&quot;&gt;TL;DR&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;Understand the Whole Company as a System&lt;/li&gt;
  &lt;li&gt;Respect Other Functions Within The Organization Profoundly&lt;/li&gt;
  &lt;li&gt;Have a Strong Sense of Personal Accountability&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Build your software like you give a shit about the people whose jobs and lives
are affected by it.&lt;/p&gt;

&lt;h2 id=&quot;1-understand-the-whole-company-as-a-system&quot;&gt;1. Understand the Whole Company as a System&lt;/h2&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/devops/beer.jpg&quot; alt=&quot;bottles!!&quot; /&gt;&lt;br /&gt;
    &lt;small&gt;Photo by &lt;a href=&quot;http://www.flickr.com/photos/verifex/4840711173&quot;&gt;verifex&lt;/a&gt;&lt;/small&gt;
&lt;/p&gt;

&lt;p&gt;Your company has inputs (money, labor, etc) and outputs (product, money, etc).
I’ve grown to loathe the phrase “above my pay grade” because it tends to betray
a complete lack of interest in the big picture. Hanging around my new colleague
Jim, aka Mr Manager, I’ve recently started to identify things as “tactical” vs
“strategic.” Strategic is the big picture - where is the company going; what are
the company’s goals; what will make or break our success. Tactical is the every
day - what features are left on the current project and which one should I work
on next; how much time should I spend on this bug, what with the massive
deadline looming; hell, should I even be looking at bugs? If you don’t have a
good grip on how you and your project fit into the bigger picture of the
company, you are always tactical. Tactical can quickly become boring,
repetitive, and un-rewarding. It’s also a nice way to never grow as an
individual. In the DevOps picture, it means you probably don’t make judgment
calls well with regards to what is and isn’t important, distributing your time
poorly. Your colleagues probably notice; they probably don’t like it.&lt;/p&gt;

&lt;p&gt;This is a great segue to:&lt;/p&gt;

&lt;h2 id=&quot;2-respect-other-functions-within-the-organization-profoundly&quot;&gt;2. Respect Other Functions Within The Organization Profoundly&lt;/h2&gt;

&lt;p&gt;For our immediate purposes, we can focus on just the ops team, but it applies
well beyond. Understanding and respecting the priorities and needs of
non-technical teams and taking them seriously helps greatly reduce the number of
surprises on both sides. Also, if you’re really living number 1 above, you
probably won’t be surprised that your goals are very closely related.&lt;/p&gt;

&lt;p&gt;But back to your relationship with the ops team (or, if you’re living in devops
dream land, your colleagues, since you’re all part of the combined devops
utopia, right?). What makes them tick? What wakes or keeps them up at night? What
makes their job harder? Easier? I like to make it personal: how have I made
their lives better or worse?&lt;/p&gt;

&lt;p&gt;Let’s look inwards for a moment: what if someone is asking these questions about
me? Well, I’m a software engineer. I grind code for a living. I get some
requirements (new product spec, a bug, something I think up in my free time and
don’t tell anyone about, etc), figure out how to meet those requirements, write
some code, and push it to production.&lt;/p&gt;

&lt;p&gt;What are the things that make me happy while performing these functions? Well,
there’s a whole bunch of them, but they can all be summed up very easily: &lt;em&gt;lack
of friction&lt;/em&gt;. A relatively low number of things I have to do beyond my core
activities in order to get to the end; a limited number of context switches. A
clean, consistent, reproducible dev environment. A responsive, intelligible
build system. A mostly-automated way of moving my code through various
environments.&lt;/p&gt;

&lt;h3 id=&quot;what-has-ops-done-for-me&quot;&gt;What has ops done for me?&lt;/h3&gt;

&lt;p&gt;Well, shit, I’m actually mad spoiled. Flickr was a PHP site with a &lt;a href=&quot;http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr&quot;&gt;well oiled
deploy
machine&lt;/a&gt;
that we’ve all heard about - since you didn’t need to restart anything to get
your code out (an under-appreciated side effect of the way PHP is traditionally
served), we’d literally just push a button and the new code got rsynced to
the boxes while also keeping a nice, visible record of the what, when, and why
(a version of this is now available to the masses as Etsy’s
&lt;a href=&quot;https://github.com/etsy/deployinator&quot;&gt;Deployinator&lt;/a&gt;). SimpleGeo and Urban
Airship use(d) Puppet and Chef respectively to great success, and there was an
ever-improving set of tools available to make it easier to start working
on a project and to test it as I went along. When I was done, it got reviewed,
merged, built and sent off to a package repo, then deployed to production using
automation. I spent most of my time actually debugging or writing code, not
shepherding it around environments or struggling to get it to run in the first
place. It’s also easy to forget the little things that helped keep computers
out of my way - federated logins etc.&lt;/p&gt;

&lt;p&gt;These are just the more salient examples - &lt;em&gt;specific things ops has done to make
my life easier&lt;/em&gt;; it is by no means an exhaustive list of what I see as the core
strength of my prior ops teams.&lt;/p&gt;

&lt;h3 id=&quot;what-have-i-done-for-ops&quot;&gt;What have I done for ops?&lt;/h3&gt;

&lt;p class=&quot;right&quot;&gt;
&lt;img src=&quot;/imgs/posts/devops/derek-smith-ops.jpg&quot; alt=&quot;a opsian, elbow deep in
'it'&quot; class=&quot;small&quot; /&gt;&lt;br /&gt;
&lt;small&gt;Photo by &lt;a href=&quot;http://www.businessinsider.com/simplegeo-office-tour-2011-6?op=1&quot;&gt;Business
Insider&lt;/a&gt;&lt;/small&gt;
&lt;/p&gt;

&lt;p&gt;Let’s look at what my teams at each of these orgs did that I think was helpful
to and appreciated by the ops teams. This is in no particular order, and I’m
going to forego the names of the organizations because there’s a ton of overlap.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Painstakingly instrumented our services so that their state could be more
easily examined in the wild&lt;/li&gt;
  &lt;li&gt;Pumped as much data as we could into the monitoring tools kindly provided us&lt;/li&gt;
  &lt;li&gt;Thoughtfully considered what metrics and properties were helpful in
determining the health of each particular system being worked on. Business
people might call this a KPI; Mathias Meyer called it a “Soul Metric” in his
&lt;a href=&quot;http://vimeo.com/67160106&quot;&gt;monitorama talk&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Carefully set up alerts that interpreted the above to try to minimize noise
and non-actionable alerts.&lt;/li&gt;
  &lt;li&gt;Learned at least enough about the configuration management tools to be able to
submit pull requests for desired changes in production without personal
involvement and hand holding from someone on the ops team.&lt;/li&gt;
  &lt;li&gt;Considered and tested how the software being written behaved itself before
an emergency - how is failover handled? how are configuration changes handled?&lt;/li&gt;
  &lt;li&gt;Automated or helped automate parts of the process that were difficult to
remember or tedious.&lt;/li&gt;
  &lt;li&gt;Worked on tools in our spare time that made any of the above easier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Broadly, we tried to be sensitive to how the operators interacted with the thing
in production and how reasonable the experience was - during changes, during
outages and failures, etc. We focused on &lt;em&gt;operability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Why did we do all this?&lt;/p&gt;

&lt;h2 id=&quot;3-have-a-strong-sense-of-personal-accountability&quot;&gt;3. Have a Strong Sense of Personal Accountability&lt;/h2&gt;

&lt;p&gt;Because it felt like the right thing to do. When people got woken up at three in the
morning because something I had deployed broke in a confusing,
difficult-to-debug way, &lt;strong&gt;it felt bad&lt;/strong&gt;. I wanted it to be less confusing the
next time. If we’re being honest with ourselves, it probably helped with the
motivation that I woke up too and was just as frustrated and annoyed.&lt;/p&gt;

&lt;p&gt;Go back to #2 and think “Do people in the other organizations have the right
tools to perform their jobs?” The better the tools, the less friction there is,
the more quickly people can perform their reactive tasks (ops responding to
pages; marketing compiling a traffic report that the CEO suddenly needs for a
board meeting; support dealing with a massive DDoS or spam influx). The less time
people spend reacting, the better - &lt;em&gt;reacting is by definition tactical&lt;/em&gt;, and
spending all your time in tactical mode, as we’ve covered, is not great. The
list in section 2 was focused on ops, but a lot of the same stuff, especially the tools
bit, applies to other teams as well.&lt;/p&gt;

&lt;p&gt;It’ll never be perfect, but often the smallest change makes the biggest
difference. Re-arranging a dashboard ever-so-slightly could be the difference
between someone getting RSI while trying to track down spammers until late at
night and them going home in time for dinner. A good DevOps engineer in my
mind is one that feels personally responsible and accountable for the parts of
his or her job that have an effect on colleagues’ happiness and success.
Remember, everyone likes going home for dinner.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Coming back to what this all means for a software engineer: it’s all about the
big picture. In an organization whose primary output is software, everybody
depends on how well that software is equipped to help them succeed in their
particular job. Understanding your effect on these needs and striving to meet
them - that’s what DevOps means to me.&lt;/p&gt;

&lt;h3 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.agileweboperations.com/devops-these-soft-parts&quot;&gt;DevOps: These Soft Parts&lt;/a&gt;
A post by John Allspaw about the soft skills involved in making DevOps-style
cooperation work&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://rcrowley.org/2012/02/25/superconf.html&quot;&gt;Developing Operability&lt;/a&gt;
(&lt;a href=&quot;http://rcrowley.org/talks/superconf-2012/#1&quot;&gt;slides&lt;/a&gt;) A talk by Richard
Crowley with specific advice for smoothing the journey of code to production
for both devs and operators; more on the meaning of “DevOps” (warning: a wall of
text)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://blog.lusis.org/blog/2013/06/04/devops-the-title-match/&quot;&gt;DevOps - The Title Match&lt;/a&gt; A post
by John Vincent on a common misconception about the organizational meaning of
DevOps&lt;/li&gt;
&lt;/ul&gt;
</description>
				<published>2013-06-11 00:00:00 +0000</published>
				<link>http://blog.mihasya.com/2013/06/11/how-do-i-devops.html</link>
			</item>
		
			<item>
				<title>It's a train.. no, it's a computer.. can't it be both??</title>
				<description>&lt;p&gt;I am delighted to let you spread the word about an amazing innovation from Lian
Li, the acclaimed maker of computer cases. They have thrown caution to the wind
and finally introduced the thing we’ve all been waiting for - the &lt;a href=&quot;http://www.newegg.com/Product/Product.aspx?Item=N82E16811112393&quot;&gt;Choo Choo
Train Computer
Case&lt;/a&gt;.&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/computer-train/computer-train.jpg&quot; alt=&quot;COMPUTER TRAIN!&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Yes. Yes. Let that sink in. It’s a computer case shaped like an old
steam-powered locomotive. It has a 300 watt power supply in the front section,
and the cart can fit a Mini ITX motherboard, a slim optical drive, and a single
internal hard drive. One might point out that these are somewhat weak specs as
far as cases go, but hey, IT’S A FUCKING TRAIN.&lt;/p&gt;

&lt;p&gt;But wait. There’s more. No, seriously, there’s more.&lt;/p&gt;

&lt;p&gt;I saw that the case had 5-star reviews, so I clicked to see what proud owners had
to say about it.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This SKU, which ends in an S, does NOT move compared to the more expensive SKU
that ends in an L. It has no motor, it’s just a case that looks like a train.
The more expensive model 
[&lt;a href=&quot;http://www.newegg.com/Product/Product.aspx?Item=N82E16811112392&quot;&gt;…&lt;/a&gt;] actually has a
motor and a transmission, and comes with extra rails, so it will roll back and
forth when the computer is turned on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yup. Lian Li’s product page for this puppy is
&lt;a href=&quot;http://www.lian-li.com/v2/en/product/product06.php?pr_index=625&amp;amp;cl_index=1&amp;amp;sc_index=25&amp;amp;ss_index=62&amp;amp;g=spec&quot;&gt;epic&lt;/a&gt;.
Not only is there a more expensive version that moves (and comes with “Rail x6”
instead of “Rail x1”), there’s a limited edition one that &lt;em&gt;has an atomizer&lt;/em&gt;.
That’s right. It makes steam!&lt;/p&gt;

&lt;p class=&quot;center&quot;&gt;
    &lt;img src=&quot;/imgs/posts/computer-train/powertrain.jpg&quot; alt=&quot;power train&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;Basically, I’m spent just thinking about this. The amount of space accommodated
by the case isn’t ideal for the plans I have for an HTPC (I was on Newegg for a
reason, after all), and I definitely couldn’t handle “Rail x6” and a computer
case scooting back and forth, but y’all know what to get me for my birthday now.
I’ll make it work.&lt;/p&gt;

&lt;p&gt;Unfortunately, it’s out of stock on Newegg and I can’t find it for sale
anywhere, so I fear that the opportunity may have passed. Who knows when Lian Li
will elect to share their genius with us again? I’ll probably end up having to
purchase one of these guys on eBay for thousands of dollars as a collector’s item
years from now.&lt;/p&gt;
</description>
				<published>2013-05-02 00:00:00 +0000</published>
				<link>http://blog.mihasya.com/2013/05/02/computer-train.html</link>
			</item>
		
			<item>
				<title>Some Love For Ishmael</title>
				<description>&lt;p&gt;Back in the days of fire fighting and database optimizing at Flickr, when I
could debate the merits of different MVCC options comfortably, I built a little
tool called &lt;a href=&quot;https://github.com/mihasya/ishmael&quot;&gt;Ishmael&lt;/a&gt; to help us make sense
of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mk-query-digest&lt;/code&gt; data more easily (apparently, the project has been moved to
the “Percona Toolkit” and renamed
&lt;a href=&quot;http://www.percona.com/doc/percona-toolkit/2.2/pt-query-digest.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pt-query-digest&lt;/code&gt;&lt;/a&gt;).
Tim Denike made some improvements during his remaining time at Flickr after I
had left, and then Asher Feldman took the project with him to The Wikimedia
Foundation. Eventually, he sent in a large enough pull request that I simply did
not have the capacity to test it - I, after all, have not used MySQL in anger in
ages. So I did the natural thing and made Asher a collaborator on the repo.&lt;/p&gt;

&lt;p&gt;This past week, during a moment of vanity, I noticed that there were quite a few
more stars on the repo than there had been. I wondered what might have caused
it, and shrugged. Then on Sunday the &lt;a href=&quot;http://devopsweekly.com/&quot;&gt;DevOps Weekly&lt;/a&gt;
email provided the answer: Asher had written &lt;a href=&quot;https://blog.wikimedia.org/2013/04/22/wikipedia-adopts-mariadb/&quot;&gt;a post about MariaDB on Wikimedia’s
blog&lt;/a&gt;, in which he mentions their use of Ishmael in comparing performance between
old and new database versions. It is a good read for anyone interested in
database migrations and upgrades, especially “doing it live!”&lt;/p&gt;

&lt;p&gt;Everyone, look, this is my “proud open source moment” face.&lt;/p&gt;
</description>
				<published>2013-04-29 00:00:00 +0000</published>
				<link>http://blog.mihasya.com/2013/04/29/some-love-for-ishmael.html</link>
			</item>
		
	</channel>
</rss>