<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0">
  <channel>
    <title>paperplanes</title>
    <link>http://www.paperplanes.de</link>
    <language>en</language>
    <webMaster>meyer@paperplanes.de (Mathias Meyer)</webMaster>
    <pubDate>Thu, 28 Mar 2013 18:34:43 +0000</pubDate>
    <copyright>Copyright 2007-2009</copyright>
    <ttl>60</ttl>
    <description>software development that flies</description>
    
    <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/Paperplanes" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="paperplanes" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
      <title>Monitoring for Humans</title>
      <link>http://www.paperplanes.de//2013/3/28/monitoring-for-humans.html</link>
      <pubDate>Thu, 28 Mar 2013 00:00:00 +0000</pubDate>
      <guid>http://www.paperplanes.de//2013/3/28/monitoring-for-humans.html</guid>
      <description>&lt;p&gt;Hi, I'm Mathias, and I'm a developer. Unlike a lot of you at this
conference, I'm far from being a monitoring expert. If anything, I'm a user, a
tinkerer with all the great tools we're hearing about at this conference.&lt;/p&gt;

&lt;p&gt;I help run a little continuous integration service called Travis CI. For that
purpose I built several home-baked things that help us collect metrics and
trigger alerts.&lt;/p&gt;

&lt;p&gt;I want to start with a little story. I spend quality time at coffee shops and I
enjoy peeking over the shoulders of the guy who's roasting coffee beans. Next to
the big roasting machine they commonly have a laptop with pretty graphs showing
how the temperature in the roaster changes over time. On two occasions I found
myself telling them: "Hey cool, I like graphs too!"&lt;/p&gt;

&lt;p&gt;On the first occasion I looked at the graph and noticed that it'd update itself
every 2-3 seconds. I mentioned that to the roaster and he said: "Yeah, I'd
really love it if it could update every second." In just two seconds the
temperature in the roaster can already drop by almost a degree (Celsius), so he
was lacking the granularity to get the best insight into his system.&lt;/p&gt;

&lt;p&gt;The second roaster did have one second resolution, and I swooned. But I noticed
that every minute or so, he wrote down the current temperature on a sheet of
paper. The first guy had done that too. I was curious why they'd do that. He
told me that he took it as his reference sheet for the next roasting batch. I
asked why he didn't have the data stored in the system. He replied that he
didn't trust it enough, because if it lost the information he wouldn't have a
reference for his next roasting sheet.&lt;/p&gt;

&lt;p&gt;He also keeps a set of coffee bean samples around from previous roasts, roasts
that are known to have turned out great. Even
coffee roasters have confirmation bias, though to be fully fair, when you're new
to the job, any sort of reference can help you move forward.&lt;/p&gt;

&lt;p&gt;This was quite curious. They had the technology yet they didn't trust it enough
with their data. But heck, they had one-second resolution and they had the
technology to measure data from live sensors in real time.&lt;/p&gt;

&lt;p&gt;During my first jobs as a developer touching infrastructure, five minute
collection intervals and RRDtool graphs were still very much en vogue. My alerts
basically came from Monit throwing unhelpful emails at me stating that some
process just changed from one state to another.&lt;/p&gt;

&lt;p&gt;Since my days with Munin a lot has changed. We went through the era of
#monitoringsucks, which, fortunately, quickly turned into the era of
#monitoringlove. It's been pretty incredible watching this progress as someone
who loves tinkering with new and shiny tools and visualization possibilities.
We've seen the emergence of crazy new visualization ideas, like the horizon
chart, and we've seen the steady rise of using modern web technologies to render
charts, while seeing RRDtool being taken to the next level to visualize time
series data.&lt;/p&gt;

&lt;p&gt;New approaches providing incredibly detailed insight into network traffic and
providing stream analysis of time series data have emerged.&lt;/p&gt;

&lt;p&gt;One second resolution is what we're all craving, looking at beautiful and
constantly updating charts of 95th percentile values.&lt;/p&gt;

&lt;p&gt;And yet, how many of you are still using Nagios?&lt;/p&gt;

&lt;p&gt;There are great advances in monitoring at the moment, and I enjoy watching
them as someone who greatly benefits from them.&lt;/p&gt;

&lt;p&gt;Yet, I'm worried that all these advances still don't focus enough on the single
thing that's supposed to use them: humans.&lt;/p&gt;

&lt;p&gt;There's lots of work going on to solve problems to make monitoring technology
more accessible, yet I feel like we haven't solved the first problem at hand: to
make monitoring something that's easy to get into for people new to the field.&lt;/p&gt;

&lt;p&gt;Monitoring still involves a lot of looking at graphs, correlating several
different time series after the fact, and figuring out and checking for
thresholds to trigger alerts. In the end, you still find yourself looking at one
or more graphs trying to figure out what the hell it means.&lt;/p&gt;

&lt;p&gt;Tracking metrics has become very popular, thanks to Coda Hale's metrics library,
which inspired a whole slew of libraries for all kinds of languages, and tools
like StatsD, which made it very easy to throw any kind of metric at them and
have it pop up in a system like Graphite, Librato Metrics, Ganglia, etc.&lt;/p&gt;

&lt;p&gt;Yet the biggest question that I get every time I talk to someone about
monitoring, in particular people new to the idea, is: "what should I even
monitor?"&lt;/p&gt;

&lt;p&gt;With all the tools we have at hand, helping people to find the data that matters
for their systems is still among the biggest hurdles that must be conquered to
actually make sense of metrics.&lt;/p&gt;

&lt;p&gt;Can we do a better job of educating people about what they should track, what they
could track, and how they can figure out the most important metrics for their
system? It took us six months to find the single metric that best reflects the
current state of our system. I called it the soul metric, the one metric that
matters most to our users and customers.&lt;/p&gt;

&lt;p&gt;We started tracking the time since the last build was started and since the last
build was finished.&lt;/p&gt;

&lt;p&gt;On our commercial platform, where customers run builds for their own products
and customer projects, the weekend is very quiet. We only run one tenth of the
number of builds on a Sunday compared to a normal weekday. Sometimes we don't
run any build in 60 minutes. Suddenly checking when a build was last triggered
makes a lot less sense.&lt;/p&gt;

&lt;p&gt;Suddenly we're confronted with the issue that we need to look at multiple
metrics in the same context to see if a build should even have been started, as
that depends solely on a customer pushing code. We're suddenly
looking at measuring the absence of data (no new commits) and correlating it with
data derived from several attributes of the system, like no running builds and
no build request being processed.&lt;/p&gt;

&lt;p&gt;The only reasonable solution I could come up with, and it's mostly thanks to
talking to Eric from Papertrail, is this: if you need to measure something that
requires the existence of an activity, you have to make sure this activity is
generated on a regular basis.&lt;/p&gt;

&lt;p&gt;In hindsight, it's so obvious, though it brings up a question: if the thing that
generates the activity fails, does that mean the system isn't working? Is this
worth an alert, is this worth waking someone up for? Certainly not.&lt;/p&gt;

&lt;p&gt;This leads to another interesting question: if I need to create activity to
measure it, and if my monitoring system requires me to generate this activity to
be able to put a graph and an alert on it, isn't my monitoring system wrong? Are
all the monitoring systems wrong?&lt;/p&gt;

&lt;p&gt;If a coffee roaster doesn't trust his tools enough to give him a consistent
insight into the current, past and future roasting batches, isn't that a weird
mismatch between humans and the system that's supposed to give them the
assurance that they're on the right path?&lt;/p&gt;

&lt;p&gt;A roaster still trusts his instincts more than he trusts the data presented to
him. After all, it's all about the resulting coffee bean.&lt;/p&gt;

&lt;p&gt;Where does that take us and the current state of monitoring?&lt;/p&gt;

&lt;p&gt;We spend an eternity looking at graphs, right after an alert was triggered
because a certain threshold was crossed. Does that alert even mean anything, is
it important right now? It's where a human operator still has to decide if it's
worth the trouble or if they should just ignore the alert.&lt;/p&gt;

&lt;p&gt;As much as I enjoy staring at graphs, I'd much rather do something more
important than that.&lt;/p&gt;

&lt;p&gt;I'd love for my monitoring system to be able to tell me that something out of
the ordinary is currently happening. It has all the information at hand to make
that decision at least with a reasonable probability.&lt;/p&gt;

&lt;p&gt;But much more than that, I'd like our monitoring system to be built for humans,
lowering the barrier to entry for adding monitoring and metrics to an
application and to infrastructure. How will we get there?&lt;/p&gt;

&lt;p&gt;Looking at the current state of monitoring, there's a strong focus on
technology, which is great, because it helps solve bigger issues like data
storage, visualization and presentation, and stream analysis. I'd love to see
this all converge on the single thing that has to make the call in the end: a
human. Helping them make a good decision and getting there should be very high
on our list.&lt;/p&gt;

&lt;p&gt;There is a fallacy in this wish though. With more automation comes a cognitive
bias to trust what the system is telling me. Can the data presented to me be
fully trusted? Did the system actually make the right call in sending me an
alert? This is something only a human can figure out, just as a coffee roaster needs
to trust his instincts even though the variables for every roast are slightly
different.&lt;/p&gt;

&lt;p&gt;We want to spare our users from having to keep a piece of paper around that tells
them exactly what happened the last time this alert was triggered. We want to
make sure they don't have to look at samples of beans at different stages to
find confirmation for the problem at hand. If the end user always looks at
previous samples of data to compare it to the most recent one, the only thing
they'll look for is confirmation.&lt;/p&gt;

&lt;p&gt;Lastly, the interfaces of the monitoring tools we work with every day are
designed to be efficient, they're designed to dazzle with visualization, yet
they're still far from being easy to use. If we want everyone in our company to
be able to participate in running a system in production, we have to make sure
the systems we provide them with have interfaces that treat them as what they are:
people.&lt;/p&gt;

&lt;p&gt;But most importantly, I'd like to see us spread the word on monitoring and metrics,
make our user interfaces more accessible, and tell the tale of how we monitor
our systems and how other people can monitor theirs. There's a lot to learn
from each other, and I love things like &lt;a href="http://hangops.com"&gt;hangops&lt;/a&gt; and
&lt;a href="http://opsschool.org"&gt;OpsSchool&lt;/a&gt;. They're great starts at getting the word out.&lt;/p&gt;

&lt;p&gt;Because it's easier to write things down to realize where you are, to figure out
where you want to be.&lt;/p&gt;</description>
      
      <category domain="http://www.paperplanes.de/tags/operations">operations</category>
      
      <category domain="http://www.paperplanes.de/tags/monitoring">monitoring</category>
      
    </item>
    
    <item>
      <title>Failure is Always an Option</title>
      <link>http://www.paperplanes.de//2013/1/21/failure-is-always-an-option.html</link>
      <pubDate>Mon, 21 Jan 2013 00:00:00 +0000</pubDate>
      <guid>http://www.paperplanes.de//2013/1/21/failure-is-always-an-option.html</guid>
      <description>&lt;p&gt;Failure is still one of the most undervalued things in our business, in most
businesses really. We still tend to point fingers elsewhere, blame the other
department, or try anything to cover our asses.&lt;/p&gt;

&lt;p&gt;How about we do something else instead? How about we embrace failure openly, make it
part of our company's culture and do everything we can to make sure every failure is
turned into a learning experience, into an opportunity?&lt;/p&gt;

&lt;p&gt;Let me start with some illustrative examples.&lt;/p&gt;

&lt;h3&gt;Wings of Fury&lt;/h3&gt;

&lt;p&gt;In 2010, &lt;a href="http://www.wired.com/autopia/2010/03/boeing-787-passes-incredible-wing-flex-test/"&gt;Boeing tested the wings of a brand new 787
Dreamliner&lt;/a&gt;.
In a giant hangar, they set up a contraption that'd pull the wings of a 787 up,
with so much pull that the wings were bound to break.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://s3itch.paperplanes.de/787-20130111-114538.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;Eventually, after they'd been flexed upwards of 25 feet, &lt;a href="http://www.youtube.com/watch?v=WRf395ioJRY"&gt;the wings broke
spectacularly.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The amazing bit: all the engineers watching it happen started to cheer and
applaud.&lt;/p&gt;

&lt;p&gt;Why? Because they had anticipated the failure under exactly the circumstances in which
the wings broke, at about 150% of the load wings handle in normal operation.&lt;/p&gt;

&lt;p&gt;They can break things loud and proud, they can predict when their engineering
work falls apart. Can we do the same?&lt;/p&gt;

&lt;h3&gt;Safety first&lt;/h3&gt;

&lt;p&gt;I've been reading a great book, &lt;a href="http://amzn.to/Vkcn76"&gt;"The Power of Habit"&lt;/a&gt;,
and it outlines another story of failure and how tackling that was turned into
an opportunity to improve company culture.&lt;/p&gt;

&lt;p&gt;When Paul O'Neill, later to become Secretary of the Treasury, took over
management of Alcoa, one of the United States' largest aluminum production
companies, he made it his first and foremost priority to tackle the safety issues in the
company's production plants.&lt;/p&gt;

&lt;p&gt;He put rules in place that any accident had to be reported to him within just a few
hours, including a plan for how this kind of accident would be prevented in the
future.&lt;/p&gt;

&lt;p&gt;While his main focus was to prevent failures, because they would harm or even
kill workers, what he eventually managed to do was to implement a company culture
where even the smallest suggestions to improve safety or to improve efficiency
from any worker would be considered and would be handed up the chain of
management.&lt;/p&gt;

&lt;p&gt;This fostered a culture of highly increased communication between production
plants, between managers, between workers.&lt;/p&gt;

&lt;p&gt;Failures and accidents still happened, but were in sharp decline, as every
single one was taken as an opportunity to learn and improve the situation to
prevent them from happening again.&lt;/p&gt;

&lt;p&gt;It was a chain of post-mortems if you will. O'Neill's interest was to make
everyone part of improving the overall situation without having to fear blame.
Everyone was made to feel like an important part of the company. By then,
15,000 people worked at Alcoa.&lt;/p&gt;

&lt;p&gt;This had an interesting effect on the company. In twelve years, O'Neill
managed to increase Alcoa's revenues from $1.5 billion to $23 billion.&lt;/p&gt;

&lt;p&gt;His policies became an integral part of the company's culture and ensured that
everyone working for it felt like an integral part of the production chain.&lt;/p&gt;

&lt;p&gt;Floor workers were given permission to shut down the production chain if they
deemed it necessary and were encouraged to blow the whistle when they noticed even the
slightest risk in any activity in the company's facilities.&lt;/p&gt;

&lt;p&gt;To be quite fair, competitors were pretty much in the dark about these
practices, which gave Alcoa a great advantage on the market.&lt;/p&gt;

&lt;p&gt;But within a decade of running the company, he had transformed its culture into one
that sounds strikingly similar to the ideas of DevOps. He managed to make
everyone feel responsible for delivering a great product and enabled everyone to
take charge should something go wrong.&lt;/p&gt;

&lt;p&gt;All that is based on the premise of trust. Trust that when someone speaks up,
they will be taken seriously.&lt;/p&gt;

&lt;h3&gt;Three Habits of Failure&lt;/h3&gt;

&lt;p&gt;If you look at the examples above, some patterns come up. There are companies
outside of our field that have mastered or at least taken on an attitude of
accepting that failure is inevitable, anticipating failure and dealing with and
learning from failure.&lt;/p&gt;

&lt;p&gt;Looking at some more examples it occurred to me that even doing one of these
things will improve your company's culture significantly.&lt;/p&gt;

&lt;h3&gt;How do we fare?&lt;/h3&gt;

&lt;p&gt;We fail, a lot. It's in the nature of the hardware we use and the software we
build. Networks partition, hard drives fail, software bugs creep into systems
and can lead to cascading failures.&lt;/p&gt;

&lt;p&gt;But do we, as a community, take enough advantage of what we learn from each
outage?&lt;/p&gt;

&lt;p&gt;Does your company hold post-mortem meetings after a production outage? Do you
write public post-mortems for your customers?&lt;/p&gt;

&lt;p&gt;If you don't, what's keeping you from doing so? Is it fear of giving your
competitors an advantage? Is it fear of giving away too many internal details?
Fear of admitting fault in public?&lt;/p&gt;

&lt;p&gt;There's a great advantage in making this information public. Usually, your
customers don't really care about every detail of what happened. What they do care
about is knowing that you're in control of the situation.&lt;/p&gt;

&lt;p&gt;A post-mortem follows three Rs: regret, reason and remedy.&lt;/p&gt;

&lt;p&gt;They're a means to say sorry to your customers, to tell them that you know what
caused the issues and how you're going to fix them.&lt;/p&gt;

&lt;p&gt;Beyond that, post-mortems are a great learning opportunity for your peers in
ops and development.&lt;/p&gt;

&lt;h3&gt;Web Operations&lt;/h3&gt;

&lt;p&gt;This learning is an important part of improving the awareness of web operations,
especially during development. There's a great deal to be learned from other
people's experiences.&lt;/p&gt;

&lt;p&gt;Web operations is a field that is mostly learning by doing right now. Which is
an important part of the profession, without a doubt.&lt;/p&gt;

&lt;p&gt;If you look at the available books, there are currently three books that give
insight into what it means to build and run reliable and scalable systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://amzn.to/pwoDun"&gt;"Release It!"&lt;/a&gt;, &lt;a href="http://amzn.to/rgI1J5"&gt;"Web
Operations"&lt;/a&gt; and &lt;a href="http://amzn.to/KAog1y"&gt;"Scalable Internet
Architectures"&lt;/a&gt; are the ones that come to mind.&lt;/p&gt;

&lt;p&gt;My personal favorite is "Release It!", because it raises developer awareness on
how to handle and prevent production issues in code.&lt;/p&gt;

&lt;p&gt;It's great to see the &lt;a href="https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern"&gt;circuit
breaker&lt;/a&gt; and the
&lt;a href="http://johnragan.wordpress.com/2009/12/08/release-it-stability-patterns-and-best-practices/"&gt;bulkhead
pattern&lt;/a&gt;
introduced in this book now being popularized by Netflix, who &lt;a href="http://techblog.netflix.com/2012/11/hystrix.html"&gt;openly write
about their experiences implementing
them&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Netflix is a great example here. They're very open about what they do, and they
write detailed post-mortems when there's an outage. You should read their
&lt;a href="http://techblog.netflix.com"&gt;engineering blog&lt;/a&gt;, same for
&lt;a href="http://codeascraft.etsy.com"&gt;Etsy's&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Why? Because it attracts engineering talent.&lt;/p&gt;

&lt;p&gt;If you're looking for a job, which company would you rather work for? One that
encourages taking risks while also taking responsibility for fixing issues when
failure does come up, one that enables a culture of fixing and improving
issues as a whole, or one that assigns blame?&lt;/p&gt;

&lt;p&gt;I'd certainly choose the former.&lt;/p&gt;

&lt;p&gt;Over the last two years, Amazon has also realized how important this is. Their
post-mortems have become very valuable for anyone interested in things that can
happen in multi-tenant, distributed systems.&lt;/p&gt;

&lt;p&gt;If you remember the most recent outage on Christmas Eve, they even had the guts
to come out and say that production data was deleted by accident.&lt;/p&gt;

&lt;p&gt;Can you imagine the shame these developers must feel? But can you imagine a
culture where the issue itself is considered an opportunity to learn instead of
blaming or firing you? If only to learn that accessing production data needs
stricter policies.&lt;/p&gt;

&lt;p&gt;It's a culture I'd love to see fostered in every company.&lt;/p&gt;

&lt;p&gt;Regarding ops education, some great things got started last year that are
worth mentioning. &lt;a href="http://hangops.com"&gt;hangops&lt;/a&gt; is a nice little circle,
streamed live (mostly) every Friday, and available for anyone to watch on
YouTube afterwards.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.opsschool.org"&gt;Ops School&lt;/a&gt; has started a great collection of
introductory material on operations topics. It's still very young, but it's a
great start, and you can help move it forward.&lt;/p&gt;

&lt;h3&gt;Travis CI&lt;/h3&gt;

&lt;p&gt;At &lt;a href="https://travis-ci.org"&gt;Travis CI&lt;/a&gt;, we're learning from failure, a lot. As a continuous integration
platform, it started out as a hobby project and was built with a lot of positive
assumptions.&lt;/p&gt;

&lt;p&gt;It used to be a distributed system that always assumed everything would work
correctly all the time.&lt;/p&gt;

&lt;p&gt;As we grew and added more languages and more projects, this ideal fell apart
pretty quickly.&lt;/p&gt;

&lt;p&gt;It is a symptom of a lot of projects that are developer-driven, because there's
just so little public information on how to do it right, on how distributed
systems are built and run at other companies so that they work reliably.&lt;/p&gt;

&lt;p&gt;We decided to turn every failure into an opportunity to share our learnings.
We're an open source project, so it only makes sense to be open about our
problems too.&lt;/p&gt;

&lt;p&gt;Our audience and customers, who are mostly developers themselves, seem to
appreciate that. I for one am convinced that we owe it to them.&lt;/p&gt;

&lt;p&gt;I encourage you to do the same, to share details on your development, on how you
run your systems. You may be surprised how much these changes can affect
the way you work together as a team.&lt;/p&gt;

&lt;h3&gt;Cultural evolution&lt;/h3&gt;

&lt;p&gt;This insight didn't come easy. We're a small team, and we were all on board with
the general idea of openness about our operational work and about the failures
in our system.&lt;/p&gt;

&lt;p&gt;That openness brings with it the need to own your systems, to own your failures.
It took a while for us to get used to working together as a team to get these
issues out of the way as quickly as possible and to find a path for a fix.&lt;/p&gt;

&lt;p&gt;In the beginning, it was still too easy to look elsewhere for the cause of the
problem. Blame is one side of the story, hindsight bias is the other. It's too
easy to point out that the issue has been brought up in the past, but that
doesn't contribute anything to fixing it.&lt;/p&gt;

&lt;p&gt;The more helpful attitude than saying "I've been saying this has been broken for
months" is to say "Here's how I'll fix it." You own your failures.&lt;/p&gt;

&lt;p&gt;The only thing that matters is delivering value to the customer. Putting aside
blame and admitting fault while doing everything you can to make sure the issue
is under control is, in my opinion, the only way you can do that, with
everyone in your company on board.&lt;/p&gt;

&lt;p&gt;Accepting this might just help transform your company's culture significantly.&lt;/p&gt;</description>
      
      <category domain="http://www.paperplanes.de/tags/operations">operations</category>
      
      <category domain="http://www.paperplanes.de/tags/culture">culture</category>
      
    </item>
    
    <item>
      <title>Coffee and the Art of Customer Happiness</title>
      <link>http://www.paperplanes.de//2013/1/16/coffee-and-the-art-of-customer-happiness.html</link>
      <pubDate>Wed, 16 Jan 2013 00:00:00 +0000</pubDate>
      <guid>http://www.paperplanes.de//2013/1/16/coffee-and-the-art-of-customer-happiness.html</guid>
      <description>&lt;p&gt;This essay is an extended version of a talk I gave at &lt;a href="http://www.paperlesspost.com"&gt;Paperless
Post&lt;/a&gt; about coffee and customer happiness. While
the talk was originally titled "Coffee and the Art of Software Maintenance", I
figured that customer happiness is overall a much more fitting title for the topic.&lt;/p&gt;

&lt;p&gt;For coffee, maintaining and improving your craft and making customers happy are
two means to the same end: to have loyal customers who tell their friends about
you.&lt;/p&gt;

&lt;h3&gt;Geeks everywhere!&lt;/h3&gt;

&lt;p&gt;I'm a coffee geek, and I spend a lot of time in coffee shops. But rather than
spend it on my laptop, writing code, I spend the time watching and talking to
the fine people making my coffee, the baristas.&lt;/p&gt;

&lt;p&gt;Baristas are geeks, just like we are. They love talking about the latest toys,
about which espresso machine is better than the other, they compare paper
filters with cloth, and they take detailed notes on the different aromas of
coffee when they’re cupping it.&lt;/p&gt;

&lt;p&gt;The craft of coffee making is quite fascinating, both from the perspective of
precision and customer care. But let's start with a little story.&lt;/p&gt;

&lt;p&gt;In June 2010 I had the pleasure of visiting a rather special coffee shop. The
London roaster &lt;a href="http://shop.squaremilecoffee.com"&gt;Square Mile&lt;/a&gt; had opened a
popup shop that only served filter coffee. No milk beverages, not even espresso.
Just filter coffee.&lt;/p&gt;

&lt;p&gt;It was called &lt;a href="http://www.pennyuni.com"&gt;Penny University&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.flickr.com/photos/ipom/4763878828/" title="Penny University by Mathias*, on Flickr"&gt;&lt;img src="http://farm5.staticflickr.com/4102/4763878828_c5b749d7fa.jpg" width="500" height="330" alt="Penny University"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;The greatest coffee shop in the world&lt;/h3&gt;

&lt;p&gt;The shop consisted of a bar and six stools. It offered a very simple menu, with
three different kinds of coffee served at any time. Every coffee was brewed
using a different technique and served with a piece of chocolate matching the
taste of the coffee.&lt;/p&gt;

&lt;p&gt;For instance, the Yirgacheffe from Ethiopia was brewed with a Hario V60, which
so happens to bring out its delicate and sometimes lemony flavours. It was
served with a piece of chocolate that also had a lemon flavor.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.flickr.com/photos/ipom/4763879600/" title="Penny University by Mathias*, on Flickr"&gt;&lt;img src="http://farm5.staticflickr.com/4075/4763879600_0fe2bc15d1.jpg" width="494" height="500" alt="Penny University"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You could either choose to have just a single brew or to try all three varieties
in a three course menu. The latter would require you to sit in for 30 minutes
with the barista giving you his full attention, explaining flavors, origin and
the brewing technique.&lt;/p&gt;

&lt;p&gt;It was one of the greatest coffee experiences I've had so far. The setting, the
barista, the attention to detail, the barista's focus on delivering the best
possible value, it all added up to something very special and unique.&lt;/p&gt;

&lt;p&gt;As I later found out, I was served by the owner of Square Mile, 2007 World
Barista Champion James Hoffmann.&lt;/p&gt;

&lt;p&gt;Sadly, the shop closed after three months.&lt;/p&gt;

&lt;h3&gt;Meanwhile, in Berlin&lt;/h3&gt;

&lt;p&gt;As if by coincidence, after that the coffee scene in Berlin started to take off.
Since then, I've had the pleasure of hanging out with a lot of fine baristas
from all over the world chatting about coffee, all in the comfort of my
hometown. Especially at &lt;a href="http://thebarn.de"&gt;The Barn&lt;/a&gt;, a shop that opened around
the same time, I learned to appreciate the precise finesse of making coffee. It's
a downward spiral.&lt;/p&gt;

&lt;p&gt;At some point what I've learned started to affect what I do for a
living: building and running software, and making customers happy by providing them
with the best possible value.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.flickr.com/photos/ipom/5333463392/" title="Coffee time at The Barn by Mathias*, on Flickr"&gt;&lt;img src="http://farm6.staticflickr.com/5164/5333463392_7dc0c1f5e1.jpg" width="495" height="500" alt="Coffee time at The Barn"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Each necessary, but only jointly sufficient&lt;/h3&gt;

&lt;p&gt;Let's look at precision and what makes a good cup of coffee.&lt;/p&gt;

&lt;p&gt;While a good cup of coffee is a subjective experience, a barista strives for one
thing: to make every cup of coffee as great as the next.&lt;/p&gt;

&lt;p&gt;To achieve that goal, every variance must be removed. Every step of the brew
process must be subject to the same conditions.&lt;/p&gt;

&lt;p&gt;This is truly an art, though it sounds surprisingly boring, as the ultimate goal
is to have a process that's repeatable every single time. Consistency is a
barista’s prime directive.&lt;/p&gt;

&lt;p&gt;The variables start with the hardness of the water, continue with finding the right
coffee grind setting, which varies from coffee to coffee, and extend to making sure the
temperature of the water is always the same.&lt;/p&gt;

&lt;p&gt;Add to that water flow, circulation and agitation of coffee grounds during the
brew, measuring the water used to brew (the same volume of water weighs less when it's
hot than when it's cold), weighing the coffee beans and timing the whole
brew.&lt;/p&gt;

&lt;p&gt;Of course every variable can be different depending on what brew method is used
for the coffee.&lt;/p&gt;

&lt;p&gt;A barista has to make sure he can measure every single variable to make sure
the brewing conditions are the same every time. This is true both for espresso
and filter coffee. Plus, every variable can vary depending on the coffee bean,
the roast, and its origin.&lt;/p&gt;

&lt;p&gt;If he needs to change something, he can only change one variable at a time to
make an informed decision on whether the change had a positive or a negative
impact on the resulting brew.&lt;/p&gt;

&lt;p&gt;Changing even one variable can have terrible results, leading to a less
enjoyable cup. Grind the coffee beans too coarse, and the coffee will have
less taste because it's under-extracted.&lt;/p&gt;

&lt;p&gt;Use too little water, and the coffee will be over-extracted. Choose a
temperature that's too hot, and the coffee will be less enjoyable, and the
customer will have to wait for it to cool down. Use boiling water and you might
kill some of the flavors that make the coffee at hand so unique.&lt;/p&gt;

&lt;p&gt;You'll find these conditions mostly in the really good coffee shops out there,
where people care about their craft. The Starbucks around the corner will make
you a latte that burns your tongue, which is unacceptable to what I'd consider a
professional barista.&lt;/p&gt;

&lt;p&gt;Does all that sound familiar?&lt;/p&gt;

&lt;h3&gt;Metrics, metrics everywhere!&lt;/h3&gt;

&lt;p&gt;Over the last two years or so we've seen the operational trend to measure
everything. Every variable that can change when code is running in production is
measured over time.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://s3itch.paperplanes.de/graph-4-20130116-072659.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;Only one variable changing at runtime can have catastrophic results on the whole
software, possibly leading to cascading failures or triggering other bugs in the
code that have remained undetected so far. Metrics and measuring give you the
insurance that if something goes wrong, if something goes off the normal flow,
you will notice it immediately.&lt;/p&gt;

&lt;p&gt;The same is true for changing code. I find it particularly hard to change code
without knowing how it currently behaves in production. Just like with brewing
coffee, changing multiple parts of a certain feature at once can lead to
behavior that’s hard to reason about.&lt;/p&gt;

&lt;p&gt;I prefer doing single changes at a time to see how they behave in isolation.
Rather than seeing this as a restriction because of fear of breaking things, I
see that as a culture of introducing a single seam at a time to see if it breaks
or not. Breaking one thing at a time is much preferable to breaking many.&lt;/p&gt;

&lt;p&gt;The important bit is that a company's culture needs to ensure that teams can
iterate around these smaller changes quickly, continuously monitoring how they
behave in production.&lt;/p&gt;

&lt;h3&gt;Continuous Coffee Delivery&lt;/h3&gt;

&lt;p&gt;It's the equivalent of a barista shipping dozens if not hundreds of cups of coffee
per day. It’s &lt;a href="http://continuousdelivery.com"&gt;continuous delivery&lt;/a&gt;, a culture
fully embraced by the barista at your favorite coffee shop. There can be tiny
variances in every single cup, but the barista focuses on keeping them as small
as possible and on changing only one thing at a time to be able to get
measurements on its effects quickly.&lt;/p&gt;

&lt;p&gt;I’ve seen baristas taste my brew before serving it, always ready to chuck it and
make a fresh one from scratch, should the end result not satisfy their own
quality standards. A smoke test, if you will. It's a great little detail that
looks odd at first but makes a lot of sense when you know how many variables are
involved.&lt;/p&gt;

&lt;p&gt;To round things off, a good barista practices every day. A few dry runs before
opening shop and after make sure that variations in the coffee bean are
continuously evened out by adapting the brewing process. As coffee beans
deteriorate over time (usually a few days to a few weeks) they get drier, and
they need a different grind setting.&lt;/p&gt;

&lt;p&gt;Of course, this also involves learning new tools, new brewing techniques,
choosing the one best applied for a particular brewing method.&lt;/p&gt;

&lt;p&gt;I've been surprised many times by how similar all this is to our own work, to
writing, shipping and running code.&lt;/p&gt;

&lt;h3&gt;Talk that talk&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://www.flickr.com/photos/ipom/4489269938/" title="Blue Bottle by Mathias*, on Flickr"&gt;&lt;img src="http://farm3.staticflickr.com/2712/4489269938_c4ec1cdcdd.jpg" width="491" height="500" alt="Blue Bottle"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s fun and interesting to talk to baristas about their work. I've found a lot
of them to be happy to share details about what they're doing and why, and they
seem to be just as happy to know that there are people who are not just
interested in a good cup of joe, but also in how it came to be. They're
passionate about their work, just as you are about your code.&lt;/p&gt;

&lt;p&gt;Talk to them long enough and they'll think you're working in coffee too. It's
pretty fun, it's the equivalent of your customer talking to you about the nuances
of concurrency in different programming languages.&lt;/p&gt;

&lt;p&gt;It's something that's easy to forget when you spend most of your time with
people doing similar work as you do. Compared to a barista, you're just brewing
code instead of coffee.&lt;/p&gt;

&lt;p&gt;It’s great to talk to other people who are passionate about their work and
providing the best value for their customers. It's reaffirming that you're on
the right track when you realize that other professions follow similar
philosophies.&lt;/p&gt;

&lt;p&gt;There's another variable that I have yet to mention: the coffee bean itself. A
lot of coffee shops, unsatisfied with the coffee they got from other sources,
start looking into roasting their own. They want to take that last variable out
of the equation that's under someone else's control.&lt;/p&gt;

&lt;h3&gt;Plan to throw one (hundred kilos) away&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://www.flickr.com/photos/ipom/6189441531/" title="Copenhagen II by Mathias*, on Flickr"&gt;&lt;img src="http://farm7.staticflickr.com/6152/6189441531_ecea81fa28.jpg" width="500" height="330" alt="Copenhagen II"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, roasting coffee opens a whole new can of worms. Just like it
takes time to find the right values for brewing coffee, you need to find the
right temperature and roasting time for every single coffee bean.&lt;/p&gt;

&lt;p&gt;To get there, lots of coffee gets thrown away. A coffee shop in Berlin recently
started roasting, and they went through several hundred kilos of green beans
before they came up with a satisfying end result. Let me tell you that the end
result is pretty spectacular.&lt;/p&gt;

&lt;p&gt;What they basically apply here is rapid prototyping. They iterate around several
bags of coffee to find the right conditions to extract the best possible aroma
from the bean.&lt;/p&gt;

&lt;p&gt;It sounds insane to throw away all that coffee, but it has to be done to make sure
the customer gets the best possible value when buying it.&lt;/p&gt;

&lt;p&gt;This is why specialty coffee is more expensive than your bag of Starbucks or the
coffee you buy at the supermarket. The value for the person enjoying it is a lot
higher as there's a lot more to be experienced than just black coffee.&lt;/p&gt;

&lt;p&gt;Unsurprisingly, even bad coffee is these days sold for a premium. When you
extrapolate K-cups to the volume of a single bag of Cafe Grumpy beans, you end
up paying the same or even more.&lt;/p&gt;

&lt;p&gt;The value proposition is convenience. The overall experience is worse than when
controlling all the brewing steps yourself, but at least you can be sure to get
a cup of coffee quickly.&lt;/p&gt;

&lt;p&gt;The craft of coffee has a lot of similarities to software development and
maintenance. It's a gradual process, with lots of learning and experience
involved.&lt;/p&gt;

&lt;p&gt;When you run a coffee shop, there comes the time when roasting yourself is the
only option, because you want to have control over everything or because the
coffee you buy elsewhere is below your quality standards. Or simply because
it's more convenient to do everything in-house.&lt;/p&gt;

&lt;p&gt;That's like eventually writing your own custom software components or starting
to own your infrastructure more and more over time. You need the control to
ensure the best possible service to your customers. It means more work on your
end, but if it can ensure that your customers are happy, it's well worth the
effort.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.flickr.com/photos/ipom/8385652133/" title="Four Barrel by Mathias*, on Flickr"&gt;&lt;img src="http://farm9.staticflickr.com/8365/8385652133_f47b9c7480.jpg" width="500" height="332" alt="Four Barrel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Coffee is a personal experience&lt;/h3&gt;

&lt;p&gt;The one thing that I admire the most about baristas is that they're close to the
customer all the time. The customer can follow along every step her coffee takes
to get into her hands.&lt;/p&gt;

&lt;p&gt;The customer is free to talk to the barista along the process, and most baristas
are more than willing to share their insight, what the coffee tastes like and
where it came from.&lt;/p&gt;

&lt;p&gt;At some &lt;a href="http://www.intelligentsiacoffee.com"&gt;Intelligentsia&lt;/a&gt; shops, you're even
assigned your personal barista who takes you through the entire process of making
your coffee. I'm very much in love with that idea. If you stretch that idea to
running an internet business it's similar to having a single support person
that's taking you through the lifetime of a ticket. As a customer you know that
the person on the other end will know all the details about the issue at hand.
It makes the whole experience of customer service a lot more personal.&lt;/p&gt;

&lt;p&gt;I went to a coffee shop in Toronto and asked the barista about their favorite
coffee, which I commonly do when I'm presented with a lot of choices I haven't
tried before. I ended up with a rather dark Sumatran brew from the Clover, one
of the greatest technical coffee inventions of all time (sadly, its maker was
bought by Starbucks), and it was a bit too dark for my taste.&lt;/p&gt;

&lt;p&gt;As a courtesy, she offered me another brew, on the house of course. She
took charge of her recommendation not meeting my taste and offered me something
else for free.&lt;/p&gt;

&lt;p&gt;This face-to-face communication also makes it harder to be angry about
something. It's still possible, but it's also a lot easier to react to an angry
customer when he's right in front of you. If it happens, you offer a free
beverage.&lt;/p&gt;

&lt;h3&gt;Customer experience trumps everything else&lt;/h3&gt;

&lt;p&gt;That's one of my biggest learnings of the last year, and I have my favorite
coffee shops to thank for the inspiration. Personal customer experience trumps
everything else, even for a business that's solely accessed through the
internet.&lt;/p&gt;

&lt;p&gt;You could think that a barista telling you all about their secrets or how to
brew excellent coffee will make you stay at home and start making your own
coffee all the time.&lt;/p&gt;

&lt;p&gt;And so you will. But you will keep coming back because the barista knows you by
name, because they learn your taste in coffee, because they give you free
samples, because they let you try new coffees first.&lt;/p&gt;

&lt;p&gt;That kind of experience is priceless.&lt;/p&gt;

&lt;p&gt;A lot of coffee shops have customer loyalty cards. You get a stamp for every
coffee and the next coffee is free. I think those loyalty cards are great, and
I'm contemplating how they could be applied to internet businesses.&lt;/p&gt;

&lt;p&gt;But consider this: instead of knowing that your next coffee will be free, a
barista randomly gives you free drinks, new coffee blends, an extra shot of
espresso.&lt;/p&gt;

&lt;p&gt;Without expecting that next coffee to be free, your happiness levels will be
infinitely higher. It's something that I found to make for even more loyal
customers and to give them an overall much more personal experience. The
surprise trumps every single stamp on your loyalty card.&lt;/p&gt;

&lt;p&gt;It's one of the reasons why we send each of our customers a bag of coffee beans.
It seems so unrelated to our business, but all of us care about good coffee. And
what makes it for the customer is the surprise, them not expecting anything like
that from an internet business.&lt;/p&gt;

&lt;p&gt;It's also why &lt;a href="http://mailchimp.com/2012/"&gt;MailChimp&lt;/a&gt; sent out almost 30000
t-shirts last year. After you've successfully launched your first campaign, they
send an email to congratulate you and offer to send you a t-shirt. A great and
unexpected gesture of customer love. It's worth noting that the shirts are of a
great quality, which definitely adds to the surprise.&lt;/p&gt;

&lt;p&gt;The similarities of running a coffee shop to running an online business and
maintaining software are pretty striking, and you'd think that's only natural,
as lots of crafts and running a business are very similar.&lt;/p&gt;

&lt;p&gt;Yet the subtleties are what makes every single one of them special, and it's
worth looking at them in more detail to see if you can improve your own skills
based on the gained knowledge or if you can improve your business' customer
relationship efforts.&lt;/p&gt;

&lt;p&gt;Both the precision and the customer experience of a good barista and a great
coffee shop serve one thing: the best possible value for a customer, a great
cup of coffee. If you can get one cup of coffee right and make a customer
happy, they'll come again, and again, and again.&lt;/p&gt;

&lt;p&gt;Getting a customer to stick around, turning them into your most loyal customer,
that's the best thing any business, any developer building a customer-facing
product can ask for.&lt;/p&gt;</description>
      
      <category domain="http://www.paperplanes.de/tags/coffee">coffee</category>
      
      <category domain="http://www.paperplanes.de/tags/customers">customers</category>
      
    </item>
    
    <item>
      <title>The Virtues of Monitoring, Redux</title>
      <link>http://www.paperplanes.de//2013/1/10/virtues-of-monitoring-redux.html</link>
      <pubDate>Thu Jan 10 00:00:00 +0000 2013</pubDate>
      <guid>http://www.paperplanes.de//2013/1/10/virtues-of-monitoring-redux.html</guid>
      <description>&lt;p&gt;Two years ago, I wrote about &lt;a href="/2011/1/5/the_virtues_of_monitoring.html"&gt;the virtues of
monitoring&lt;/a&gt;. A lot has changed, a lot
has improved, and I've certainly learned a lot since I wrote that initial
overview on monitoring as a whole.&lt;/p&gt;

&lt;p&gt;There have been a lot of improvements to existing tools, and new players entered
the market of monitoring. Infrastructure as a whole got more and more
interesting for service businesses built around it.&lt;/p&gt;

&lt;p&gt;On the other hand, awareness for monitoring, good metrics, logging and the like
has been rising significantly.&lt;/p&gt;

&lt;p&gt;At the same time
&lt;a href="http://lusislog.blogspot.de/2011/06/why-monitoring-sucks.html"&gt;#monitoringsucks&lt;/a&gt;
raised awareness that a lot of monitoring tools are still stuck in the late
nineties when it comes to user interface and the way they work.&lt;/p&gt;

&lt;p&gt;Independent of new and old tools, I've had the pleasure of learning a lot more
about the real virtues of monitoring, about how it affects daily work and how it
evolves over time. This post is about discussing some of these insights.&lt;/p&gt;

&lt;h3&gt;Monitoring all the way down&lt;/h3&gt;

&lt;p&gt;When you start monitoring even just small parts of an application, the need for
more detail and for information about what's going on in a system arises
quickly. You start with an innocent number of application level metrics, add
metrics for database and external API latencies, start tracking system level and
business metrics.&lt;/p&gt;

&lt;p&gt;As you add monitoring to one layer of the system, the need to get more insight
into the layer below comes up sooner or later.&lt;/p&gt;

&lt;p&gt;One layer has just been tackled recently in a way that's accessible for anyone:
communication between services on the network. &lt;a href="http://boundary.com"&gt;Boundary&lt;/a&gt;
has built some pretty cool monitoring stuff that gives you incredibly detailed
insight into how services talk to each other, by way of their protocol, how
network traffic from inside and outside a network develops over time, and all
that down to the second.&lt;/p&gt;

&lt;p&gt;The real time view is pretty spectacular to behold.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://s3itch.paperplanes.de/boundary-20130110-111149.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;If you go down even further on a single host, you get to the level where you can
&lt;a href="http://queue.acm.org/detail.cfm?id=1809426"&gt;monitor disk latencies&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Or you could measure the effect of &lt;a href="http://www.youtube.com/watch?v=tDacjrSCeq4"&gt;screaming at a disk array of a running
system&lt;/a&gt;.
&lt;a href="http://hub.opensolaris.org/bin/view/Community+Group+dtrace/WebHome"&gt;dtrace&lt;/a&gt; is
a pretty incredible tool, and I hope to see it spread and become widely
available on Linux systems. It allows you to inject instrumentation into
arbitrary parts of the host system, making it possible to measure any system call
without a lot of overhead.&lt;/p&gt;

&lt;p&gt;Heck, even our customer support tool allows us to track metrics for response
times, how many tickets each staff member handled, and for how long.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://s3itch.paperplanes.de/helpscout-20130110-111643.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;It's easy to start obsessing about monitoring and metrics, but there comes a
time when you either realize that you've obsessed for all the right reasons, or
you add more monitoring.&lt;/p&gt;

&lt;h3&gt;Mo' monitoring, mo' problems&lt;/h3&gt;

&lt;p&gt;The crux of monitoring more layers of a system is that with more monitoring, you
can and will detect more issues.&lt;/p&gt;

&lt;p&gt;Consider Boundary, for example. It gives you insight into a layer you haven't
had insight into before, at least not at that granular level. For example, round
trip times of liveness traffic in a RabbitMQ cluster.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://s3itch.paperplanes.de/appvis-20130110-111851.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;This gives you a whole new pile of data to obsess about. It's good because that
insight is very valuable. But it requires more attention, and more issues require
investigation.&lt;/p&gt;

&lt;p&gt;You also need to learn how a system behaving normally is reflected in those new
systems, and what constitutes unusual behaviour. It takes time to learn and to
interpret the data correctly.&lt;/p&gt;

&lt;p&gt;In the long run though, that investment is well worth it.&lt;/p&gt;

&lt;h3&gt;Monitoring is an ongoing process&lt;/h3&gt;

&lt;p&gt;When we started adding monitoring to &lt;a href="http://travis-ci.org"&gt;Travis CI&lt;/a&gt;, we
started small. But we quickly realized which metrics really matter and which parts
of the application and the infrastructure around it need more insight, more
metrics, more logging.&lt;/p&gt;

&lt;p&gt;With every new component deployed to production, new metrics need to be
maintained, more logging and new alerting need to be put in place.&lt;/p&gt;

&lt;p&gt;The same is true for new parts of the infrastructure. With every new system or
service added, new data needs to be collected to ensure the service is running
smoothly.&lt;/p&gt;

&lt;p&gt;Knowing which metrics are important and which aren't is something that
develops over time. Metrics can come and go, the requirements for metrics are
subject to change, just as they are for code.&lt;/p&gt;

&lt;p&gt;As you add new metrics, old metrics might become less useful, or you need more
metrics in other parts of the setup to make sense of the new ones.&lt;/p&gt;

&lt;p&gt;It's a constant process of refining the data you need to have the best possible
insight into a running system.&lt;/p&gt;

&lt;h3&gt;Monitoring can affect production systems&lt;/h3&gt;

&lt;p&gt;The more data you collect, with higher and higher resolution, the more you run
the risk of affecting a running system. Business metrics regularly pulled from
the database can become a burden on the database that's supposed to serve your
customers.&lt;/p&gt;

&lt;p&gt;Pulling data out of running systems is a traditional approach to monitoring, one
that's unlikely to go away any time soon. However, it's an approach that's less
and less feasible as you increase resolution of your data.&lt;/p&gt;

&lt;p&gt;Guaranteeing that this collection process is low on resources is hard. It's even
harder to get a system up and running that can handle high-resolution data from
a lot of services sent concurrently.&lt;/p&gt;

&lt;p&gt;So new approaches have started to pop up to tackle this problem. Instead of
pulling data from running processes, the processes themselves collect data and
regularly push it to aggregation services which in turn send the data to a
system for further aggregation, graphing, and the like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/etsy/statsd"&gt;StatsD&lt;/a&gt; is without a doubt the most popular
one, and it has sparked a ton of forks in different languages.&lt;/p&gt;

&lt;p&gt;Instead of relying on TCP with its long connection handshakes and timeouts,
StatsD uses UDP. The processes sending data to it stuff short messages into a
UDP socket without worrying about whether or not the data arrives.&lt;/p&gt;

&lt;p&gt;If some data doesn't make it because of network issues, that only leaves a small
dent. It's more important for the system to serve customers than for it to wait
around for the aggregation service to become available again.&lt;/p&gt;
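
&lt;p&gt;To make the fire-and-forget idea concrete, here's a minimal Python sketch of a
process pushing a metric over UDP. It assumes the plain-text StatsD line format
(name, value and a type such as &lt;code&gt;c&lt;/code&gt; for counters) and a daemon listening on
port 8125 on localhost. The helper functions and metric names are made up for
illustration and aren't part of any official client.&lt;/p&gt;

```python
import socket


def format_metric(name, value, kind):
    """Build a StatsD-style datagram payload, e.g. b'builds.started:1|c'."""
    return f"{name}:{value}|{kind}".encode("ascii")


def send_metric(name, value, kind="c", host="127.0.0.1", port=8125):
    """Fire-and-forget: send one datagram and never wait for a reply."""
    payload = format_metric(name, value, kind)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No handshake and no acknowledgement: the call returns immediately.
        sock.sendto(payload, (host, port))
    except OSError:
        # Losing a data point is acceptable; slowing down the app is not.
        pass
    finally:
        sock.close()
    return payload
```

&lt;p&gt;Calling &lt;code&gt;send_metric("builds.started", 1)&lt;/code&gt; returns right away, whether or
not anything is listening on the other end.&lt;/p&gt;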

&lt;p&gt;While StatsD solves the problem of easily collecting and aggregating data
without affecting production systems, there's now the problem of being able to
inspect the high-resolution data in meaningful ways. Historical analysis and
alerting on high-resolution data becomes a whole new challenge.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://riemann.io"&gt;Riemann&lt;/a&gt; has popularized looking at monitoring data as a
stream, to which you can apply queries, and form reactions based on those
queries. You can move the data window inside the stream back and forth, so you
can compare data in a historical context before deciding on whether it's worth
an alert or not.&lt;/p&gt;
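
&lt;p&gt;The windowing idea can be sketched in a few lines of Python. This is not
Riemann's API (Riemann is configured in Clojure); it's a hypothetical
&lt;code&gt;WindowedAlert&lt;/code&gt; class that keeps the last few values of a stream and only
reacts when the average over the whole window crosses a threshold, so a single
spike doesn't page anyone. The window size, threshold and sample latencies are
made-up values.&lt;/p&gt;

```python
from collections import deque


class WindowedAlert:
    """Keep the last `size` values of a metric stream and flag sustained highs."""

    def __init__(self, size, threshold):
        self.window = deque(maxlen=size)
        self.threshold = threshold

    def observe(self, value):
        """Add one event; return True when the full window's mean exceeds the threshold."""
        self.window.append(value)
        if len(self.window) != self.window.maxlen:
            return False  # not enough history yet to judge
        mean = sum(self.window) / len(self.window)
        return mean > self.threshold


# Made-up request latencies in milliseconds: one spike, then sustained trouble.
latencies = [120, 130, 900, 125, 950, 980, 990]
check = WindowedAlert(size=3, threshold=500)
alerts = [check.observe(value) for value in latencies]
```

&lt;p&gt;With these numbers the lone spike of 900 stays quiet, and the alert only
fires once the last three values average above 500.&lt;/p&gt;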

&lt;p&gt;Systems like StatsD and Riemann make it a lot easier for systems to aggregate
data without having to rely on polling. Services can just transmit their data
without worrying much about how and where they're used for other purposes like
log aggregation, graphing or alerting.&lt;/p&gt;

&lt;p&gt;The important realization is that with increasing need for scalability and
distributed systems, &lt;em&gt;software needs to be built with monitoring in mind&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Imagine a RabbitMQ that, instead of you having to poll the data from it, sends
its metrics as a message at a configurable interval to a configurable fanout. You
can choose to consume the data and submit it to a system like StatsD or Riemann,
or you can ignore it and the broker will just discard the data.&lt;/p&gt;

&lt;h3&gt;Who's monitoring the monitoring?&lt;/h3&gt;

&lt;p&gt;Another fallacy of monitoring is that it needs to be reliable. For it to be
fully reliable it needs to be monitored. Wait, what?&lt;/p&gt;

&lt;p&gt;Every process that is required to aggregate metrics, to trigger alerts, to
analyze logs needs to be running for the system to work properly.&lt;/p&gt;

&lt;p&gt;So monitoring in turn needs its own supervision to make sure it's working at
all times. As monitoring grows it requires maintenance and operations to take
care of it.&lt;/p&gt;
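
&lt;p&gt;One common way to supervise the supervisors is a heartbeat: every monitoring
process checks in regularly, and something else complains when a check-in goes
missing. Here's a small Python sketch of that idea. The
&lt;code&gt;HeartbeatWatch&lt;/code&gt; class, the process names and the 60 second limit are
assumptions for illustration, not a description of any particular tool.&lt;/p&gt;

```python
import time


class HeartbeatWatch:
    """Track the last check-in time of each monitoring process."""

    def __init__(self, max_silence_seconds):
        self.max_silence = max_silence_seconds
        self.last_seen = {}

    def beat(self, name, now=None):
        """Record that `name` reported in (uses the current time by default)."""
        self.last_seen[name] = time.time() if now is None else now

    def silent(self, now=None):
        """Return the sorted names that have been quiet for longer than allowed."""
        now = time.time() if now is None else now
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )
```

&lt;p&gt;The watcher itself still has to run somewhere, of course, which is part of
why this turns into real operational work.&lt;/p&gt;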

&lt;p&gt;Which makes it a bit of a burden for small teams.&lt;/p&gt;

&lt;p&gt;Lots of new companies have sprung into life serving this need. Instead of having
to worry about running services for logs, metrics and alerting by themselves, it
can be left to companies who are more experienced in running them.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://metrics.librato.com"&gt;Librato Metrics&lt;/a&gt;,
&lt;a href="http://papertrailapp.com"&gt;Papertrail&lt;/a&gt;, &lt;a href="http://opsgenie.com"&gt;OpsGenie&lt;/a&gt;,
&lt;a href="http://logentries.com"&gt;LogEntries&lt;/a&gt;, &lt;a href="http://instrumentalapp.com"&gt;Instrumental&lt;/a&gt;,
&lt;a href="http://newrelic.com"&gt;NewRelic&lt;/a&gt;, &lt;a href="http://www.datadoghq.com"&gt;DataDog&lt;/a&gt;, to name a
few. Other companies take the burden of having to run your own
&lt;a href="http://hostedgraphite.com"&gt;Graphite&lt;/a&gt; system away from you.&lt;/p&gt;

&lt;p&gt;It's been interesting to see new companies pop up in this field, and I'm looking
forward to seeing this space develop. The competition from the commercial space
is bound to trigger innovation and improvements on the open source front as
well.&lt;/p&gt;

&lt;p&gt;We're heavy users of external services for log aggregation, collecting metrics
and alerting. Simply put, they know better how to run that platform than we do,
and it allows us to focus on delivering the best possible customer value.&lt;/p&gt;

&lt;h3&gt;Monitoring is getting better&lt;/h3&gt;

&lt;p&gt;&lt;img src="http://s3itch.paperplanes.de/cubism-1-20130110-111954.png" alt=""/&gt;&lt;/p&gt;

&lt;p&gt;Lots of new tools have sprung up in the last two years. While development on
them started earlier than that, the most prominent tools are probably
&lt;a href="http://graphite.wikidot.com"&gt;Graphite&lt;/a&gt; and &lt;a href="http://logstash.net"&gt;Logstash&lt;/a&gt;.
&lt;a href="http://square.github.com/cubism/"&gt;Cubism&lt;/a&gt; brings new ideas on how to visualize
time series data. It's one of the several dozen dashboards sparked by Graphite's
existence and the flexibility its API offers.
&lt;a href="https://github.com/obfuscurity/tasseo"&gt;Tasseo&lt;/a&gt; is another one of them, a
successful experiment of having an at-a-glance dashboard with the most important
metrics in one convenient overview.&lt;/p&gt;

&lt;p&gt;It'll still be a while until we see the ancient tools like Nagios, Icinga and
others improve, but the competition is ramping up.
&lt;a href="https://github.com/sensu/sensu"&gt;Sensu&lt;/a&gt; is one open source alternative to keep
an eye on.&lt;/p&gt;

&lt;p&gt;I'm looking forward to seeing how the monitoring space evolves over the next two
years.&lt;/p&gt;</description>
      
      <category domain="http://www.paperplanes.de/tags/monitoring">monitoring</category>
      
      <category domain="http://www.paperplanes.de/tags/operations">operations</category>
      
    </item>
    
    <item>
      <title>On Pager Duty</title>
      <link>http://www.paperplanes.de//2013/1/2/on-pager-duty.html</link>
      <pubDate>Wed Jan 02 00:00:00 +0000 2013</pubDate>
      <guid>http://www.paperplanes.de//2013/1/2/on-pager-duty.html</guid>
      <description>&lt;p&gt;Over the last year, as we started turning &lt;a href="http://travis-ci.org"&gt;Travis CI&lt;/a&gt; into
a &lt;a href="http://travis-ci.com"&gt;hosted product&lt;/a&gt;, we added a ton of metrics and
monitoring.  While we started out slow, we soon figured out which metrics are
key and which are necessary to monitor the overall behavior of the system.&lt;/p&gt;

&lt;p&gt;I built us a custom collector that rakes in metrics from our database and from
the API exposed by RabbitMQ. It soon dawned on me that these are our core
metrics, and that graphs alone aren't enough for them: we need to be alerted
when they cross thresholds.&lt;/p&gt;

&lt;p&gt;The first iteration of that dumped alerts into Campfire. Given that we're a
small team and the room might be empty at times, that was just not sufficient
for an infrastructure platform that's used by customers and open source projects
around the world, at any time of the day.&lt;/p&gt;

&lt;p&gt;So we added alerting, by way of &lt;a href="http://opsgenie.com"&gt;OpsGenie&lt;/a&gt;. It's set up to
trigger alerts via iPhone push notifications and escalations via SMS, should an
alert not have been acknowledged or closed within 10 minutes. Eventually,
escalation needs to be done via voice calls so that someone really picks up.
It's easy to miss a vibrating iPhone when you're sound asleep, but much harder
when it keeps on vibrating until someone picks up.&lt;/p&gt;
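
&lt;p&gt;The escalation rule is simple enough to write down as a function. This is a
sketch in Python, not how OpsGenie is actually configured: push first, SMS once
an alert has been open and unacknowledged for 10 minutes, and voice calls after
that. The function name and the 30 minute voice threshold are assumptions, since
we haven't settled on when voice calls should kick in.&lt;/p&gt;

```python
def channels_due(minutes_open, acknowledged, sms_after=10, voice_after=30):
    """Return the channels that should be notifying for an alert right now."""
    if acknowledged:
        return []  # someone is on it, so stop escalating
    due = ["push"]  # the first notification goes out immediately
    if minutes_open >= sms_after:
        due.append("sms")
    if minutes_open >= voice_after:  # hypothetical threshold
        due.append("voice")
    return due
```

&lt;p&gt;For example, &lt;code&gt;channels_due(12, False)&lt;/code&gt; gives push and SMS, while any
acknowledged alert gives an empty list.&lt;/p&gt;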

&lt;h3&gt;A Pager for every Developer&lt;/h3&gt;

&lt;p&gt;Just recently I read an &lt;a href="http://queue.acm.org/detail.cfm?id=1142065"&gt;interview with Werner
Vogels&lt;/a&gt; on architecture and
operations at Amazon. He said something that stuck with me: "You build it, you
run it."&lt;/p&gt;

&lt;p&gt;That got me thinking. Should developers of platforms be fully involved in the
operations side of things?&lt;/p&gt;

&lt;p&gt;A quick survey on Twitter showed that there are some companies where developers
are paged when there are production issues, others fully rely on their
operations team.&lt;/p&gt;

&lt;p&gt;There's merit to both, but I could think of a few reasons why developers should
be carrying a pager just like operations does.&lt;/p&gt;

&lt;p&gt;You stay connected to what your code does in production. When code is developed,
the common tool to manage expectations is to write tests. Unfortunately, no unit
test, no integration test will be fully able to reproduce circumstances of what
your code is doing in production.&lt;/p&gt;

&lt;p&gt;You start thinking about your code running. Reasoning about what a particular
piece of code is doing under specific production circumstances is hard, but not
entirely impossible. When you're the one responsible for having it run smoothly
and serve your customers, this goes up to a whole new level.&lt;/p&gt;

&lt;p&gt;Metrics, instrumentation, alerting, logging and error handling suddenly become a
natural part of your coding workflow. You start &lt;a href="http://omniti.com/seeds/instrumentation-and-observability"&gt;making your software more
operable&lt;/a&gt;, because you're the one who has to run it. While software should be
easy to operate in any circumstances, it commonly isn't. When you're the one
having to deal with production issues, that suddenly has a very different
appeal.&lt;/p&gt;

&lt;p&gt;Code beauty is suddenly a bit less important than making sure your code can
treat errors, timeouts, increased latencies. Kind of an ironic twist like that.
Code that's resilient to production issues might not have a pretty DSL, it might
not be the most beautiful code, but it may be able to sustain whatever issue is
thrown at it.&lt;/p&gt;

&lt;p&gt;Last, when you're responsible for running things in production, you're forced to
learn about the entire stack of an application, not just the code bits, but its
runtime, the host system, hardware, network. All that turns into something that
feels a lot more natural over time.&lt;/p&gt;

&lt;p&gt;I consider that a good thing.&lt;/p&gt;

&lt;p&gt;There'll always be situations where something needs to be escalated to the
operations team, with deeper knowledge of the hardware, network and the like.
But if code breaks in production, and it affects customers, developers should be
at the forefront of fixing it, just like the operations team.&lt;/p&gt;

&lt;p&gt;Even more so for teams that don't have any operations people on board. At some
point, a simple exception tracker just doesn't cut it anymore, especially when
no one gets paged on critical errors.&lt;/p&gt;

&lt;h3&gt;Being On Call&lt;/h3&gt;

&lt;p&gt;For small teams in particular, there's a pickle that needs to be solved: who
gets up in the middle of the night when an alert goes off?&lt;/p&gt;

&lt;p&gt;When you have just a few people on the team, like your average bootstrapping
startup, does an on call schedule make sense? This is something I haven't fully
figured out yet.&lt;/p&gt;

&lt;p&gt;We're currently in the fortunate position that one of our team members is in New
Zealand, but we have yet to find a good way to assign on call when he's out or
for when he's back on this side of the world.&lt;/p&gt;

&lt;p&gt;The folks at dotCloud have &lt;a href="http://blog.dotcloud.com/organizing-a-24x7-bullet-proof-on-call-rotati"&gt;written about their
schedule&lt;/a&gt;,
thank you! Hey, you should share your pager and on-call experiences too!&lt;/p&gt;

&lt;p&gt;Currently we have a first come, first served setup. When an alert comes in and
someone sees it, it gets acknowledged and looked into. If that involves everyone
coming online, that's okay for now.&lt;/p&gt;

&lt;p&gt;However, it's not an ideal setup, because being able to handle an alert means
being able to log into remote systems, restart apps, inspect the database, look
at the monitoring charts. Thanks to iPhone and iPad most of that is already
possible today.&lt;/p&gt;

&lt;p&gt;But to be fully equipped to handle any situation, it's good to have a laptop at
hand.&lt;/p&gt;

&lt;p&gt;This brings up the question: who's carrying a laptop and when? Which in turn
means that some sort of on-call schedule is still required.&lt;/p&gt;

&lt;p&gt;We're still struggling with this, so I'd love to read more about how other
companies and teams handle that.&lt;/p&gt;

&lt;h3&gt;Playbooks&lt;/h3&gt;

&lt;p&gt;During a recent &lt;a href="http://hangops.com"&gt;hangops&lt;/a&gt; discussion, there was a chat about
developers being on call.  It brought up an interesting idea, a playbook on how
to handle specific alerts.&lt;/p&gt;

&lt;p&gt;It's a document explaining things to look into when an alert comes up. Ideally,
an alert already includes a link to the relevant section in the book. This is
something operations and developers should work on together to make sure all
fronts are covered.&lt;/p&gt;

&lt;p&gt;It takes away some of the scare of being on call, as you can be sure there's
some guidance when an issue comes up.&lt;/p&gt;

&lt;p&gt;It also helps refine monitoring and alerts and make sure there are appropriate
measures available to handle any of them. If there are not, that part needs
improving.&lt;/p&gt;

&lt;p&gt;I'm planning on building a playbook for Travis as we go along and refine our
monitoring and alerts. It's a neat idea.&lt;/p&gt;

&lt;h3&gt;Sleepless in Seattle&lt;/h3&gt;

&lt;p&gt;There's a psychological side to being on-call that needs a lot of getting used
to: the thought that an alert could go off at any time. While that's a natural
thing, as failures do happen all the time, it can easily mess with your head. It
certainly did that for me.&lt;/p&gt;

&lt;p&gt;Lying in bed, unable to sleep because your mind is waiting for an
alert, is not a great feeling. It takes getting used to. It's also why having
an on-call schedule is preferable over an all hands scenario. When only one
person is on call, team mates can at least be sure to get a good night's sleep.
As the schedule should be rotating, everyone gets to have that luxury on a
regular basis.&lt;/p&gt;

&lt;p&gt;It does one thing though: it pushes you to make sure alerts only go off for
relevant issues. Not everything needs to be fixed right away, some issues could
be taken care of by improving the code, others are only temporary fluxes because
of increased network latency and will resolve themselves after just a few
minutes. Alerting someone on every other exception raised doesn't cut it
anymore, alerts need to be concise and only be triggered when the error is
severe enough and affects customers directly. Getting this right is the hard
part, and it takes time.&lt;/p&gt;

&lt;p&gt;All that urges you to constantly improve your monitoring setup, to increase
relevance of alerts, and to make sure that everyone on the team is aware of the
issues, how they can come up and how they can be fixed.&lt;/p&gt;

&lt;p&gt;It's a good thing.&lt;/p&gt;</description>
      
      <category domain="http://www.paperplanes.de/tags/operations">operations</category>
      
    </item>
    
    <item>
      <title>A Plea for Client Library Instrumentation</title>
      <link>http://www.paperplanes.de//2012/12/27/a-plea-for-client-library-instrumentation.html</link>
      <pubDate>Thu Dec 27 00:00:00 +0000 2012</pubDate>
      <guid>http://www.paperplanes.de//2012/12/27/a-plea-for-client-library-instrumentation.html</guid>
      <description>&lt;p&gt;The need to measure everything that moves in a distributed system or even simple
web apps is becoming the basis for thorough monitoring of an application.&lt;/p&gt;

&lt;p&gt;However, there is one thing that's starting to get in the way of getting good
measurements of all layers in a system: client libraries used to talk to network
services, be it the database, an API, a message bus, anything that's bound to
the intricate latency variances of the network stack.&lt;/p&gt;

&lt;p&gt;Without full instrumentation of all parts of the application's stack, it's going
to be very hard to figure out what exactly a problem boils down to. Measuring
client access to a network service in addition to collecting data on the other
end, e.g. the slow query log, allows you to pinpoint issues to the network, to
increased latency, or to parsing responses.&lt;/p&gt;

&lt;p&gt;If the other end is not under your control, it's just as important to have this
data available. Having good metrics on request latencies to an external service,
even a database hosted by a third party, gives you a minimum amount of
confidence that while you maybe can't fix the underlying problem, you at least
have the data to show where the problem is most likely to be. Useful data to
have when approaching the third party vendor or hosting company about the issue.&lt;/p&gt;

&lt;p&gt;Rails has set a surprisingly good example, by way of
&lt;a href="http://api.rubyonrails.org/classes/ActiveSupport/Notifications.html"&gt;ActiveSupport::Notifications&lt;/a&gt;.
Controller requests are instrumented, and so are database queries of any kind.&lt;/p&gt;

&lt;p&gt;You can subscribe to the notifications and start collecting them in your own
metrics tool. &lt;a href="https://github.com/etsy/statsd"&gt;StatsD&lt;/a&gt;,
&lt;a href="http://graphite.wikidot.com"&gt;Graphite&lt;/a&gt; and &lt;a href="http://metrics.librato.com"&gt;Librato
Metrics&lt;/a&gt; are pretty great tools for this purpose.&lt;/p&gt;

&lt;p&gt;There's not much a client library needs to do to emit measurements of network
requests. The ones for Ruby could start by adding optional instrumentation based
on AS::Notifications. That'd ensure that ActiveSupport itself doesn't turn into
a direct dependency. I'd love to see the notifications bit being extracted into
a separate library that's easier to integrate than pulling in the entire
ActiveSupport ball of mud.&lt;/p&gt;
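
&lt;p&gt;To give a rough idea of how little is needed, here's a plain-Ruby sketch of the
subscribe and instrument pattern. The &lt;code&gt;Instrumentation&lt;/code&gt; module, the
&lt;code&gt;FakeClient&lt;/code&gt; class and the &lt;code&gt;request.fake_client&lt;/code&gt; event name are
all made up for illustration. This is not the AS::Notifications API, just the
same idea without the dependency.&lt;/p&gt;

```ruby
# Hypothetical, dependency-free stand-in for a notifications hub.
module Instrumentation
  @subscribers = Hash.new { |hash, name| hash[name] = [] }

  # Register a callable that receives (name, duration_in_seconds, payload).
  def self.subscribe(name, callable)
    @subscribers[name].push(callable)
  end

  # Time the block and hand the measurement to every subscriber.
  def self.instrument(name, payload = {})
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @subscribers[name].each { |callable| callable.call(name, duration, payload) }
    result
  end
end

# Made-up client library that wraps its network call in an instrumented block.
class FakeClient
  def get(key)
    Instrumentation.instrument("request.fake_client", { key: key }) do
      "value-for-#{key}" # a real client would talk to the network here
    end
  end
end

# The application decides what to do with the measurements.
measurements = []
Instrumentation.subscribe("request.fake_client", lambda { |name, duration, payload|
  measurements.push({ name: name, duration: duration, key: payload[:key] })
})

value = FakeClient.new.get("user:1")
```

&lt;p&gt;The library only emits events. Whether they end up in StatsD, a log file or
nowhere at all is up to whoever subscribes.&lt;/p&gt;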

&lt;p&gt;Node.js has
&lt;a href="http://nodejs.org/api/events.html#events_class_events_eventemitter"&gt;EventEmitters&lt;/a&gt;,
which are similar to AS::Notifications, and they lend themselves quite nicely
to this purpose.&lt;/p&gt;

&lt;p&gt;I've dabbled with this for &lt;a href="https://github.com/mostlyserious/riak-js"&gt;riak-js&lt;/a&gt;,
the Node.js library for Riak. &lt;a href="https://github.com/mostlyserious/riak-js/blob/master/examples/metrics.js"&gt;There's an
example&lt;/a&gt;
that shows how to register and collect the metrics from the events emitted. The
library itself just emits the events at the right spot and adds some timestamps so
that event listeners can reconstruct the trail of a request.&lt;/p&gt;

&lt;p&gt;It worked out pretty well and is just as easy to plug into a metrics library or
to report measurements directly to StatsD.&lt;/p&gt;

&lt;p&gt;The thing that matters is that any library for a network service you write or
maintain should have some sort of instrumentation built in. Your users and I
will be forever grateful.&lt;/p&gt;

&lt;p&gt;This goes both ways, too. Network servers need to be just as diligent in
collecting and exposing data as the client libraries talking to them.
Historically, though, a lot of servers already expose a lot of data, not always
in a convenient format, but at least it's there.&lt;/p&gt;

&lt;p&gt;Build every layer of your application and library with instrumentation in mind.
Next time you have to tackle an issue in any part of the stack, you'll be glad
you did.&lt;/p&gt;

&lt;p&gt;Now go and &lt;a href="http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/"&gt;measure
everything&lt;/a&gt;!&lt;/p&gt;</description>
      
      <category domain="http://www.paperplanes.de/tags/monitoring">monitoring</category>
      
    </item>
    
    <item>
      <title>Form Objects with ActiveModel</title>
      <link>http://www.paperplanes.de//2012/12/6/form-objects-with-activemodel.html</link>
      <pubDate>Thu Dec 06 00:00:00 +0000 2012</pubDate>
      <guid>http://www.paperplanes.de//2012/12/6/form-objects-with-activemodel.html</guid>
      <description>&lt;p&gt;When I built the billing process for &lt;a href="http://travis-ci.com"&gt;Travis CI&lt;/a&gt;'s
commercial offering, I decided to try out some new things to avoid callbacks in
ActiveRecord models, including validations.&lt;/p&gt;

&lt;p&gt;In 2010 I wrote about why callbacks and validations scattered about the
persistence layer bother me. I recommend &lt;a href="/2010/5/7/activerecord_callbacks_ruined_my_life.html"&gt;reading
it&lt;/a&gt; to get the full background
on this.&lt;/p&gt;

&lt;p&gt;What I went for this time was a mix of a service layer that handles all the
business logic and a layer of form objects that handle communications between
the controller and the services, including handling validations.&lt;/p&gt;

&lt;p&gt;The goal was to have simple Ruby objects to take care of these things. No
special frameworks required. Inspiration in part stemmed from &lt;a href="https://docs.djangoproject.com/en/1.5/topics/forms/"&gt;Django's form
objects&lt;/a&gt;, though my
implementation lacks the part that talks directly to the model, for instance to
save data to the database. Quite intentionally so, as that part is up to the
services layer.&lt;/p&gt;

&lt;p&gt;The last thing I wanted to avoid is having to use &lt;code&gt;attr_accessible&lt;/code&gt; in the
persistence layer. In my view, that part is not something persistence should be
concerned with. It's a contract between the controller and the services it calls
into to make sure parameters are properly narrowed down to the set required for
any operation.&lt;/p&gt;

&lt;h3&gt;Form Objects&lt;/h3&gt;

&lt;p&gt;For form objects, I looked at &lt;a href="http://soveran.github.com/scrivener/"&gt;Scrivener&lt;/a&gt;,
which was a great start. It's a very simple framework for form objects, &lt;a href="https://github.com/soveran/scrivener/blob/master/lib/scrivener.rb"&gt;the
code could barely be
simpler&lt;/a&gt;, but
it lacks some validations, as it implements its own set.&lt;/p&gt;

&lt;p&gt;On top of that, it doesn't tie in with Rails' form handling that well, which
requires some parts of ActiveModel to work properly. Scrivener is great when you
integrate it with e.g. Sinatra and your own simple set of forms.&lt;/p&gt;

&lt;p&gt;It's so simple that I decided to take its simple parts and merge them with
&lt;a href="http://guides.rubyonrails.org/active_record_validations_callbacks.html#validation-helpers"&gt;ActiveModel's
validations&lt;/a&gt;
support. Thanks to Rails 3, that part has been extracted out of the ActiveRecord
code and &lt;a href="http://yehudakatz.com/2010/01/10/activemodel-make-any-ruby-object-feel-like-activerecord/"&gt;can be used for
anything&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The beauty of form objects is that they allow you to specify different views on
the same data. Every database record wrapped by ActiveRecord can have multiple
representations depending on which data is required by a specific form.&lt;/p&gt;

&lt;h3&gt;ActiveModelSimpleForms&lt;/h3&gt;

&lt;p&gt;Here's the base code for the forms, which doesn't have a name, it's just a
snippet of code that's part of our Rails project:&lt;/p&gt;

&lt;script src="https://gist.github.com/4223741.js?file=activemodelsimpleform.rb"&gt;&lt;/script&gt;

&lt;p&gt;It defines a few things that are required by Rails' &lt;code&gt;form_for&lt;/code&gt;, but other than
that it's straight-forward. It can populate form attributes based on the model
handed in, which makes it suitable for re-use, for instance when editing an
existing object or when validations failed on update.&lt;/p&gt;

&lt;p&gt;Here's a sample form object:&lt;/p&gt;

&lt;script src="https://gist.github.com/4223741.js?file=edit_person_form.rb"&gt;&lt;/script&gt;

&lt;p&gt;It declares a few attributes and some validations. Thanks to ActiveModel you
could use anything provided by its validations package in a form object.&lt;/p&gt;

&lt;p&gt;By declaring the attributes, a form object brings a simple means of implementing
mass assignment protection without requiring any sort of sanitization and
without poisoning the model with &lt;code&gt;attr_accessible&lt;/code&gt; and jumping through hoops in
tests to create valid objects to work with.&lt;/p&gt;

&lt;p&gt;If an attribute assigned to the form doesn't exist, the assignment will fail.&lt;/p&gt;
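
&lt;p&gt;As a framework-free illustration of that last point, here's a minimal sketch.
It is not the code from the gist above. The &lt;code&gt;PersonForm&lt;/code&gt; class and its
attributes are hypothetical, and it leaves out ActiveModel entirely. It only
shows how declaring attributes up front makes unknown parameters fail.&lt;/p&gt;

```ruby
# Hypothetical form object in plain Ruby, without ActiveModel.
class PersonForm
  # Only these attributes get a writer, so only they can be assigned.
  attr_accessor :name, :email

  def initialize(params = {})
    params.each do |key, value|
      # Undeclared attributes have no writer, so this raises NoMethodError.
      public_send("#{key}=", value)
    end
  end
end

form = PersonForm.new({ name: "Ada", email: "ada@example.com" })

# An attribute the form doesn't know about is rejected.
rejected = begin
  PersonForm.new({ name: "Ada", admin: true })
  false
rescue NoMethodError
  true
end
```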

&lt;h3&gt;In the controller...&lt;/h3&gt;

&lt;p&gt;The interaction with the controller is rather simple, no added complexity:&lt;/p&gt;

&lt;script src="https://gist.github.com/4223741.js?file=people_controller.rb"&gt;&lt;/script&gt;

&lt;p&gt;I'm liking this approach a lot, and it's been in use for a few months. There'll
be some refinements, but the simplicity of it all is what I find to be the best
part of it.&lt;/p&gt;

&lt;p&gt;It's all just plain Ruby objects with some additional behaviours. Add a simple
service layer to this, and cluttered code in the ActiveRecord model is nicely
split up into lots of smaller chunks that deal with very specific concerns.&lt;/p&gt;</description>
      
      <category domain="http://www.paperplanes.de/tags/rails">rails</category>
      
      <category domain="http://www.paperplanes.de/tags/web">web</category>
      
    </item>
    
    <item>
      <title>A Story About Queues in Four Acts</title>
      <link>http://www.paperplanes.de//2012/10/30/a-story-about-queues.html</link>
      <pubDate>Tue Oct 30 00:00:00 +0000 2012</pubDate>
      <guid>http://www.paperplanes.de//2012/10/30/a-story-about-queues.html</guid>
      <description>&lt;p&gt;There are queues everywhere. This is the story of a few of them. The names of the
queues are made up, but their story is real nonetheless.&lt;/p&gt;

&lt;h3&gt;First Act&lt;/h3&gt;

&lt;p&gt;The first queue, we'll call it Unicorn, handles requests for information,
rendering the result in a beautiful markup language that's easy to read. It sits
in front of the public library building, waits for people to come in and ask for
information.&lt;/p&gt;

&lt;p&gt;Unicorn has a fixed number of peasants at its disposal to do work for it. When a
request comes in, it sends one of them into the library to fetch the
information. Peasants have access to a pretty big amount of data to choose from,
but they have to be quick.&lt;/p&gt;

&lt;p&gt;If one of them takes too long to fetch the information, Unicorn denies the
request for information and strips the peasant of its duties on the spot,
putting a new one in its place.&lt;/p&gt;

&lt;p&gt;Unicorn is not very fail-safe though. It trades off not being able to deliver
information in time for being swarmed by requests and not being able to handle
them.&lt;/p&gt;

&lt;p&gt;It also isn't very good at noticing that every new peasant takes too long, or
at stopping the processing of requests when that happens. It just keeps
accepting them even if all of them time out.&lt;/p&gt;

&lt;p&gt;Maybe it'd be smarter if Unicorn could be more aware of an increased number of
information requests not returning the data in time and slow down processing or
halt it altogether to figure out what the problem is?&lt;/p&gt;

&lt;h3&gt;Second Act&lt;/h3&gt;

&lt;p&gt;The second queue, we'll call it Octocat, handles requests from people to build
something, say, a house, or a shack or a shelter, sometimes even a blue
bikeshed.&lt;/p&gt;

&lt;p&gt;To figure out what needs to be done, Octocat looks at the request's details, to
determine what materials are required and which builder needs to be allocated to
get the job done.&lt;/p&gt;

&lt;p&gt;In some cases, Octocat sends a request to the warehouse to see if they have the
required material in stock. Because the warehouse is a fully automated system
with no means to send a messenger back, Octocat calls a hotline to check the
status. It listens to Rick Astley while it's on hold, waiting for the
system to get back to it.&lt;/p&gt;

&lt;p&gt;Sometimes, there's a problem in the warehouse and Octocat is stuck for a long
time, and it can't process any other requests in the meantime. It doesn't want
to miss the system getting back to it, so all its focus is on this one build
request.&lt;/p&gt;

&lt;p&gt;To speed things up a little, the Octocat hired a second person. But now both of
them are stuck in a waiting loop with the warehouse, not being able to process
more build requests. No matter how many people Octocat's company hires,
at some point all of them will be stuck on hold, all of them listening to Rick
Astley.&lt;/p&gt;

&lt;p&gt;Wouldn't it be better if, when the Octocat is waiting for the warehouse, it
pressed # to cancel the request, hung up the phone and processed another request
in the meantime? It could just retry five minutes later to see if the system is
now able to process the request.&lt;/p&gt;

&lt;p&gt;As time passes, it can just increase the waiting time between calls, as it gets
less and less likely that the warehouse will be able to process the request this
time around.&lt;/p&gt;
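
&lt;p&gt;In code, hanging up and calling back with increasing waits could look roughly
like this. It's a sketch under assumptions: &lt;code&gt;with_backoff&lt;/code&gt; and
&lt;code&gt;WarehouseBusy&lt;/code&gt; are made-up names, the base delay of 300 seconds stands
in for the five minutes above, and the waiting is injected so the example runs
without actually sleeping.&lt;/p&gt;

```ruby
# Hypothetical error raised when the warehouse hotline doesn't answer.
WarehouseBusy = Class.new(StandardError)

# Retry the block, doubling the wait after every failed attempt.
# `sleeper` is injectable so the sketch can run without actually waiting.
def with_backoff(max_attempts: 5, base_delay: 300, sleeper: method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue WarehouseBusy
    # Give up and let the caller deal with it once we're out of attempts.
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Simulated warehouse that only answers on the fourth call.
waits = []
answer = with_backoff(sleeper: ->(seconds) { waits.push(seconds) }) do |attempt|
  raise WarehouseBusy, "still on hold" if attempt != 4
  "material in stock"
end
```

&lt;p&gt;Here the waits come out as 300, 600 and 1200 seconds before the fourth call
succeeds. In between calls, Octocat would be free to work on other requests.&lt;/p&gt;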

&lt;p&gt;Or it could put the current request to the end of the queue, and come back to it
later, trying to go through the process again at a later point in the day. Maybe
the warehouse just has problems finding information on this particular
material, and other requests that don't require it will work out just fine.&lt;/p&gt;

&lt;p&gt;If the warehouse is unable to process an increased number of requests, maybe
Octocat should just cease calls altogether to give the warehouse's employees
time to clear things up and to process what has piled up in their inbox.&lt;/p&gt;
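
&lt;p&gt;Ceasing calls for a while is commonly known as a circuit breaker. The following
is a minimal sketch of the idea, not a production implementation. The
&lt;code&gt;CircuitBreaker&lt;/code&gt; class, its threshold and its cool-off period are
assumptions, and the clock is injected so the example can pretend that time
passes.&lt;/p&gt;

```ruby
# Hypothetical circuit breaker: after `threshold` consecutive failures it
# stops calling the warehouse until `cool_off` seconds have passed.
class CircuitBreaker
  CircuitOpen = Class.new(StandardError)

  def initialize(threshold: 3, cool_off: 60, clock: -> { Time.now.to_f })
    @threshold = threshold
    @cool_off = cool_off
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise CircuitOpen, "giving the warehouse a break" if open?
    result = yield
    # A successful call closes the circuit again.
    @failures = 0
    @opened_at = nil
    result
  rescue CircuitOpen
    raise # skipped calls don't count as failures
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @threshold
    raise
  end

  private

  # Open means: we tripped, and the cool-off period hasn't passed yet.
  def open?
    return false if @opened_at.nil?
    @cool_off > (@clock.call - @opened_at)
  end
end

now = 0.0
breaker = CircuitBreaker.new(threshold: 2, cool_off: 60, clock: -> { now })
outcomes = []
3.times do
  begin
    breaker.call { raise "warehouse is down" }
  rescue CircuitBreaker::CircuitOpen
    outcomes.push(:skipped)
  rescue RuntimeError
    outcomes.push(:failed)
  end
end

now = 61.0 # pretend the cool-off period has passed
outcomes.push(breaker.call { :answered })
```

&lt;p&gt;The third call never reaches the warehouse. After the cool-off, one call is
let through again to see whether things have recovered.&lt;/p&gt;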

&lt;h3&gt;Third Act&lt;/h3&gt;

&lt;p&gt;The third queue processes long texts that were, for efficiency reasons, split up
into smaller chunks. They're usually sent in Morse code for bandwidth
efficiency, ready to be turned back into texts.&lt;/p&gt;

&lt;p&gt;We'll call it Logger. Logger has strict requirements to process the chunks. He
needs to put them back together very quickly, otherwise the readers on the other
side will be unhappy, waiting for new text to appear. They're fast readers, so
Logger has to make sure he delivers in a timely fashion.&lt;/p&gt;

&lt;p&gt;The queue has to go through a lot of text, and it has to make sure that it
processes it in the correct order. Otherwise the text wouldn't make sense
anymore, with things being put out of context.&lt;/p&gt;

&lt;p&gt;Logger relies on strict ordering of the messages it processes. It relies on
several minions to put the texts back together after they were processed. To
make sure ordering is properly applied, one minion always processes chunks from
a specific text.&lt;/p&gt;

&lt;p&gt;Logger uses the text's title to figure out which minion is responsible. This has
the advantage that Logger can call in more minions as more texts are coming in.
As titles vary pretty wildly, Logger can just assume that work will be
distributed efficiently enough. Of course there's still the chance that one
minion gets a lot of longer texts, compared to the others, but overall, it
should be fine.&lt;/p&gt;

&lt;p&gt;There is one downside to this system. Logger has to know the exact number of
minions upfront. If one of them calls in sick, he has to find a replacement
quickly, so that work on this minion's desk doesn't pile up.&lt;/p&gt;

&lt;p&gt;If he can't find a replacement quickly, he has to reassign all the numbers and
redistribute the work on their desks, which is a very dreadful process.&lt;/p&gt;
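
&lt;p&gt;The story doesn't spell out how Logger maps titles to minions. Assuming it's a
hash of the title modulo the number of minions, the downside can be sketched
like this. &lt;code&gt;minion_for&lt;/code&gt; and the titles are made up for illustration.&lt;/p&gt;

```ruby
require "zlib"

# Pick a minion for a text by hashing its title. Zlib.crc32 is used because
# it returns the same number on every run, unlike String#hash.
def minion_for(title, minion_count)
  Zlib.crc32(title) % minion_count
end

titles = %w[build-1 build-2 build-3 build-4 build-5 build-6 build-7 build-8]

# Assignments with four minions, and again after one called in sick.
with_four = titles.map { |title| minion_for(title, 4) }
with_three = titles.map { |title| minion_for(title, 3) }

# Count how many texts land on a different desk once a minion is gone.
moved = titles.each_index.count { |i| with_four[i] != with_three[i] }
```

&lt;p&gt;Changing the divisor changes the result for most titles, not just for the ones
that belonged to the missing minion. That's the dreadful redistribution.&lt;/p&gt;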

&lt;p&gt;What if Logger could group minions so that they form subdivisions, each
controlled by a supervisor of their own, who in turn distributes the work on his
team of minions?&lt;/p&gt;

&lt;p&gt;With little groups, he can rely on the supervisor to increase and decrease the
number of minions as needed. Logger would be oblivious to their shift schedule.&lt;/p&gt;

&lt;p&gt;To split up the work more efficiently, Logger could also rely on the first
letter of the title, splitting the alphabet into smaller sub-alphabets, e.g.
A-E, F-M, and so on. He assigns the ranges directly to groups, and he can, as
groups come and go for their shifts, quickly reallocate ranges of letters to new
groups. That still means that work has to be distributed, but Logger adds a
group of messengers to the process that can shift stacks of texts quickly from
one group to the other.&lt;/p&gt;
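
&lt;p&gt;A sketch of the range idea, with a hypothetical &lt;code&gt;RANGES&lt;/code&gt; table and
&lt;code&gt;group_for&lt;/code&gt; helper. The group names are invented. The point is that
handing a range to another group only changes the table.&lt;/p&gt;

```ruby
# Hypothetical assignment of letter ranges to groups.
RANGES = {
  ("A".."E") => :group_one,
  ("F".."M") => :group_two,
  ("N".."Z") => :group_three
}

# Find the group responsible for a title, based on its first letter.
# Returns nil when no range covers the title.
def group_for(title, ranges)
  return nil if title.empty?
  letter = title[0].upcase
  match = ranges.find { |range, _group| range.cover?(letter) }
  match ? match.last : nil
end

before = group_for("Moby Dick", RANGES)

# Group two goes off shift, so its range is handed to group three.
reassigned = RANGES.merge({ ("F".."M") => :group_three })
after = group_for("Moby Dick", reassigned)
```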

&lt;p&gt;If one group for some reason becomes unavailable, Logger could just adapt the
way he schedules work and burden another team with its range. That might overall
be a bit slower, but work would still be spread out evenly across the remaining
groups.&lt;/p&gt;

&lt;p&gt;Logger still has to make sure that all groups are on the same floor though, so
that the messengers don't have to climb stairs, which would lengthen the latency
of redistributing the texts.&lt;/p&gt;

&lt;p&gt;If Logger wasn't bound to having to process texts with very low latency, he
could even consider placing groups in different buildings. If a fire breaks out
in one of them, the other groups could still continue processing.&lt;/p&gt;

&lt;h3&gt;Fourth Act&lt;/h3&gt;

&lt;p&gt;The fourth queue is a builder, we'll call it Bob. Bob builds garages, houses and
lots of other things.&lt;/p&gt;

&lt;p&gt;Bob is a sloppy builder though. He breaks things a lot, leaving windows broken,
plaster with holes and floors uncleaned. Sometimes he even forgets to put a tile
in, so that it leaves an empty area on the wall. Or he drops one of his tools on
the floor, leaving a dent in the wood.&lt;/p&gt;

&lt;p&gt;He tends to not be too careful and just assumes that everything he does turns
out right. He pours concrete when it's raining, he leaves&lt;/p&gt;

&lt;p&gt;Bob needs to get a grip and make sure his tasks are processed correctly. How
could he do that?&lt;/p&gt;

&lt;p&gt;Instead of ignoring mistakes, he could learn to accept them and take the
appropriate measures to make sure he cleans up. If he notices that he breaks
things too often, he could slow down his work and make sure he gets it right. Or
he could go out for a coffee and come back when he's a bit more confident that
he'll get the job right.&lt;/p&gt;

&lt;p&gt;If things are really bad, he can even start from scratch, to make sure the end
result is good. That might mean that processing can slow down, but that Bob is
aware of his own failures. His mindset would change to making sure he gets the
task right instead of leaving a mess everywhere he goes.&lt;/p&gt;

&lt;p&gt;Bob's customers would be a lot happier if he did. It'd cost him more resources
but he'd make a lot of people much happier, leaving every place he's worked on
clean.&lt;/p&gt;

&lt;h3&gt;Queues, queues everywhere!&lt;/h3&gt;

&lt;p&gt;What do all the queues in this story have in common? They fail to exponentially back
off when they encounter errors in processing requests. They fail to make sure to
not lose messages when processing them failed. They fail to retry when
delivering a message has failed. They fail to make sure their processing is
idempotent. They assume that the resources required for processing the messages
are always available.&lt;/p&gt;
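
&lt;p&gt;Idempotent processing in particular is easier to show than to describe. This is
a minimal sketch with a made-up &lt;code&gt;IdempotentConsumer&lt;/code&gt; class. It assumes
every message carries a unique id and keeps the ids in memory, where a real
system would need something durable.&lt;/p&gt;

```ruby
require "set"

# Hypothetical consumer: remembers which message ids it has handled so that
# a redelivered message doesn't do the work twice.
class IdempotentConsumer
  def initialize
    @seen = Set.new
  end

  # Returns true when the message was processed, false when it was a duplicate.
  def process(message)
    return false if @seen.include?(message[:id])
    yield message
    # Only mark the message as seen once the work succeeded, so a failure
    # leaves it eligible for a retry.
    @seen.add(message[:id])
    true
  end
end

consumer = IdempotentConsumer.new
built = []

# The first message is delivered twice, as queues tend to do after a hiccup.
deliveries = [
  { id: 1, job: "shed" },
  { id: 2, job: "garage" },
  { id: 1, job: "shed" }
]

results = deliveries.map do |message|
  consumer.process(message) { |m| built.push(m[:job]) }
end
```

&lt;p&gt;The duplicate delivery is acknowledged but doesn't build a second shed.&lt;/p&gt;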

&lt;p&gt;There are queues everywhere. They have a tendency to cause problems when being
used. We just assume they work all the time, and we just assume that we're able
to process everything they throw at us in a timely fashion.&lt;/p&gt;

&lt;p&gt;We do have the best of intentions, but they usually turn out wrong. When a queue
starts to become the central backbone of a system, careful steps need to be
taken that the system can handle backpressure, increased failure rates, and the
queue itself being unavailable.&lt;/p&gt;

&lt;p&gt;Maybe we should start building our queues and the processes around them with the
worst in mind and adjust our thinking accordingly? It's not queues, queues
everywhere. It's failures, failures everywhere! Queues have a tendency to
intensify failures by adding a less predictable element to our infrastructure.
As &lt;a href="https://twitter.com/rbranson/statuses/261139185694568449"&gt;Rick Branson put
it&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
    &lt;p&gt;"Keeping distributed systems running smoothly seems to be mostly about
    figuring out ways to not DDoS yourself."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A queue is a lot of fun until you're unable to keep up with what it's throwing
at you, until your database's capacity doesn't match that of the queue, until
you drop messages on the floor just because something broke in the backend, or
until it floods your system with so many messages it can't process anything else
in the meantime.&lt;/p&gt;

&lt;p&gt;Maybe you already knew all of that, but I sure as heck had to learn all of these
lessons above &lt;a href="http://about.travis-ci.org/blog/2012-09-05-on-yesterdays-log-outage/"&gt;the hard
way&lt;/a&gt;, in a
&lt;a href="http://about.travis-ci.org/blog/2012-09-24-post-mortem-pull-request-unavailability/"&gt;very small amount of
time&lt;/a&gt;,
within a &lt;a href="http://about.travis-ci.org/blog/2012-09-13-an-update-on-the-sites-availability/"&gt;matter of
weeks&lt;/a&gt;,
to be exact.&lt;/p&gt;

&lt;p&gt;We're still working on picking up the pieces and cleaning up. There'll be fewer
queues in the future, just as there will be a lot more of them. More on this
soon!&lt;/p&gt;

&lt;p&gt;The queue is dead, long live the queue!&lt;/p&gt;</description>
      
      <category domain="http://www.paperplanes.de/tags/messaging">messaging</category>
      
      <category domain="http://www.paperplanes.de/tags/queues">queues</category>
      
    </item>
    
    <item>
      <title>September Reading List</title>
      <link>http://www.paperplanes.de//2012/9/11/september-reading-list.html</link>
      <pubDate>Tue Sep 11 00:00:00 +0000 2012</pubDate>
      <guid>http://www.paperplanes.de//2012/9/11/september-reading-list.html</guid>
      <description>&lt;p&gt;Been a while since the last reading list (&lt;a href="/2012/6/29/june-reading-list.html"&gt;here's a handy
link&lt;/a&gt;, in case you're looking for more to
read). Time to remedy that. Disclaimer: All links below are Amazon affiliate
links. You'll be feeding my reading habit. Thank you in advance!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="http://amzn.to/QxutQq"&gt;Pricing With Confidence&lt;/a&gt;&lt;/strong&gt; by Reed Holden&lt;/p&gt;

&lt;p&gt;I know I already mentioned this on the previous list, but it's just so good. A
must read for pricing products or even your time as a freelancer. Must. Read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="http://amzn.to/PfCJCC"&gt;Poke the Box&lt;/a&gt;&lt;/strong&gt; by Seth Godin&lt;/p&gt;

&lt;p&gt;A nice and short manifesto about starting (and finishing) things. If you don't
finish, technically you never really started, right? Pretty delightful read and
a nice kick in the pants about starting something, anything, about making things
happen. Because if you don't, who else is there?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="http://amzn.to/QD6IaL"&gt;Fool's Gold&lt;/a&gt;&lt;/strong&gt; by Gillian Tett&lt;/p&gt;

&lt;p&gt;An excellent rundown of how the 2008 financial crisis came about and where
derivatives and collateralized debt obligations came from. The interesting bit
is that they were created with good intentions originally, but as with a lot of
things, the short-sightedness and greed of investors and banks turned it into a
mind-boggling web that was bound to end up as a cataclysmic and cascading
failure across the entire financial system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="http://amzn.to/QhEXCA"&gt;Start Small, Stay Small&lt;/a&gt;&lt;/strong&gt; by Rob Walling&lt;/p&gt;

&lt;p&gt;If you're interested in running a small business, built around profitable
products, marketing and building them yourself, this is a great little
introduction on everything you need to know. I got quite a few ideas from this
book for my next ventures.&lt;/p&gt;

&lt;p&gt;After you're done with it, and you want to keep going, &lt;a href="http://unicornfree.com/30x500/"&gt;Amy Hoy's 30x500
class&lt;/a&gt; is highly recommended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="http://amzn.to/PfF8xg"&gt;Architecture of Open Source Applications Vol. 2&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The second volume of this great compilation is upon us, and it's great. I loved
the chapter on &lt;a href="http://www.aosabook.org/en/zeromq.html"&gt;ZeroMQ&lt;/a&gt; in particular,
but there's still a lot I need to read, e.g. the chapter on
&lt;a href="http://www.aosabook.org/en/nginx.html"&gt;nginx&lt;/a&gt; or the one on
&lt;a href="http://www.aosabook.org/en/pypy.html"&gt;PyPy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="http://amzn.to/Q06fNg"&gt;How To Win Friends and Influence People&lt;/a&gt;&lt;/strong&gt; by Dale Carnegie&lt;/p&gt;

&lt;p&gt;This book is now almost 80 years old, yet its content is pretty much timeless. The title
might be a bit misleading about what it's really about. If you're interested in
improving your people skills, in making people want something you have to
offer and in winning them over to your side, this book is for you. If
you're running a business of any kind, this is a must-read. The single most
revealing book I've read in a while.&lt;/p&gt;

&lt;p&gt;It turns out, people and how we interact have barely changed at all. Still so
much to learn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="http://amzn.to/OEgK94"&gt;Predictably Irrational&lt;/a&gt;&lt;/strong&gt; by Dan Ariely&lt;/p&gt;

&lt;p&gt;A delightful and pretty revealing book about how irrational yet predictable
human behaviour is. Grounded in scientific experiments, it's also rather
insightful when it comes to marketing products, for example. I'd call this
another must-read if you run a business of sorts or sell something for a living.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="http://amzn.to/OEgZAS"&gt;It Will Be Exhilarating&lt;/a&gt;&lt;/strong&gt; by Dan Provost&lt;/p&gt;

&lt;p&gt;A very short but nice read about how Studio Neat, makers of the Glif and the
Cosmonaut, came about. It talks a bit about successfully running a Kickstarter
campaign, but also about running a small business in general. There are a few bits and
pieces to pick up in this one. Most importantly, it's another inspiration to
start something.&lt;/p&gt;

&lt;p&gt;Happy reading!&lt;/p&gt;</description>
      
      <category domain="http://www.paperplanes.de/tags/reading">reading</category>
      
      <category domain="http://www.paperplanes.de/tags/books">books</category>
      
    </item>
    
    <item>
      <title>A Culture of Failure</title>
      <link>http://www.paperplanes.de//2012/8/23/a-culture-of-failure.html</link>
      <pubDate>Thu Aug 23 00:00:00 +0000 2012</pubDate>
      <guid>http://www.paperplanes.de//2012/8/23/a-culture-of-failure.html</guid>
      <description>&lt;p&gt;&lt;a href="http://www.flickr.com/photos/nnova/2970063644/in/photostream/"&gt;&lt;img src="http://farm4.staticflickr.com/3272/2970063644_d70d643711_d.jpg" width="550"/&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently, I've been thinking a lot about failure, my daughter, risk and
punishment, and the whole culture that has evolved around trying to avoid
failure, trying to point fingers or putting blame elsewhere.&lt;/p&gt;

&lt;p&gt;Simplest example: my daughter spills something on the table. What's the first
reaction? Scolding or punishment of sorts. I'm guilty as charged. I read
something pretty simple and wonderful recently, a very short read titled &lt;a href="http://www.instapaper.com/text?u=http%3A%2F%2Fwww.csua.berkeley.edu%2F~chrislw%2Fdadforget.html"&gt;"Father
Forgets"&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That read got me thinking: why do we tend to punish failure immediately? It's
not just something we do with our kids, it's human nature. We tend to put the blame
elsewhere, and we tend to get defensive when people turn to us to fix a problem,
for example when something is broken in production.&lt;/p&gt;

&lt;p&gt;Why can't we instead make failure a part of our culture? Not just at home, with
our kids, but in our workplace?&lt;/p&gt;

&lt;p&gt;As soon as people feel like they need to get defensive, or they're blamed for a
problem that occurred due to a recent change of theirs, negativity hits everyone
on the team. It's hard to stay calm, it's hard to stay focused on what really
matters: that something is broken in production, affecting your customers.&lt;/p&gt;

&lt;p&gt;As soon as people feel threatened or pressured, they get defensive or they feel
down because some of their own code broke something. Their vision is clouded.
Finding the problem's cause and implementing a solution is suddenly just a blur,
something that's hard to focus on, even though that's what really matters.&lt;/p&gt;

&lt;p&gt;When people feel like failure is not an option, they'll stop taking risks. When
people stop taking risks, your team and your company are doomed, and innovation comes
to a grinding halt. Most of us are in the lucky position that lives don't depend
on our work. We can try new things, iterate quickly, discard or improve them.&lt;/p&gt;

&lt;p&gt;If I keep punishing or scolding my daughter for her mistakes, she might stop
taking risks and just stop trying altogether. The analogy is an odd one, but there's a
striking similarity.&lt;/p&gt;

&lt;p&gt;If a problem comes up, you fix it, you learn your lesson, you make sure it
doesn't happen again, you move on. It can be that simple. When everyone on the
team feels like failure is an accepted part of running an application, fixing
problems together as they occur becomes a lot easier.&lt;/p&gt;

&lt;p&gt;In the end, it's not a question of &lt;em&gt;if&lt;/em&gt; something breaks, it's rather about
&lt;em&gt;when&lt;/em&gt; it breaks. And the answer is: all the time. Great teams focus on what
matters in these situations: being ready when something breaks and resolving
it as well as they can.&lt;/p&gt;

&lt;p&gt;Embrace outages, the most common failure of our craft. Take a deep breath, tune
out distractions (including managers) and try to find joy in digging through
data and finding what's causing a problem. Turn it from a seemingly frustrating
experience into a personal challenge. You find the problem, you fix it, you make
customers happy again. Rinse, repeat.&lt;/p&gt;

&lt;p&gt;Failure is cool.&lt;/p&gt;</description>
      
      <category domain="http://www.paperplanes.de/tags/operations">operations</category>
      
      <category domain="http://www.paperplanes.de/tags/humans">humans</category>
      
      <category domain="http://www.paperplanes.de/tags/failure">failure</category>
      
    </item>
    
  </channel>
</rss>
