<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">

 <title>Lindsay Holmwood - auxesis' musings</title>
 
 <link href="http://holmwood.id.au/~lindsay/" />
 <updated>2013-05-20T20:18:29+10:00</updated>
 <id>http://holmwood.id.au/~lindsay</id>
 <author>
   <name>Lindsay Holmwood</name>
   <email>lindsay@holmwood.id.au</email>
 </author>

 
 <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/AuxesisMusings" /><feedburner:info uri="auxesismusings" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry>
   <title>How we do Kanban</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/gT8lGETyMTw/" />
   <updated>2013-05-20T00:00:00+10:00</updated>
   <id>http://holmwood.id.au/~lindsay/2013/05/20/how-we-do-kanban</id>
   <content type="html">&lt;p&gt;At my &lt;a href="http://bulletproof.net/"&gt;day job&lt;/a&gt;, I run a &lt;a href="http://bob.mcwhirter.org/blog/2010/09/13/remote-worker-distributed-team/"&gt;distributed team&lt;/a&gt; of infrastructure coders spread across Australia + one in Vietnam. Our team is called the Software team, but we're more analogous to a product focused &lt;a href="http://en.wikipedia.org/wiki/Research_and_development"&gt;Research &amp;amp; Development&lt;/a&gt; team.&lt;/p&gt;

&lt;p&gt;Other teams at Bulletproof are a mix of office and remote workers, but our team is a little unique in that we're fully distributed. We do daily standups using Google Hangouts, and try to do face to face meetups every few months at Bulletproof's offices in Sydney.&lt;/p&gt;

&lt;p&gt;Intra-team communication is something we're good at, but I've been putting a lot of effort lately into improving how our team communicates with others in the business.&lt;/p&gt;

&lt;p&gt;This is a post I wrote on our internal company blog explaining how we schedule work, and why we work this way.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;img src="http://farm3.staticflickr.com/2819/8757261526_b02aa4d973_c.jpg" alt="our physical wallboard in the office" /&gt;&lt;/p&gt;

&lt;h3&gt;What on earth is this?&lt;/h3&gt;

&lt;p&gt;This is a &lt;a href="http://en.wikipedia.org/wiki/Kanban_board"&gt;Kanban board&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A Kanban board is a tool for implementing Kanban. &lt;a href="http://en.wikipedia.org/wiki/Kanban"&gt;Kanban&lt;/a&gt; is a scheduling system developed at Toyota from the late 1940s onwards as part of the broader &lt;a href="http://en.wikipedia.org/wiki/Toyota_Production_System"&gt;Toyota Production System&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Applied to &lt;a href="http://en.wikipedia.org/wiki/Kanban_(development)"&gt;software development&lt;/a&gt;, the top three things Kanban aims to achieve are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Visualise&lt;/strong&gt; the flow of work&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limit&lt;/strong&gt; the Work-In-Progress (WIP)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manage&lt;/strong&gt; and optimise the flow of work&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;How does Kanban work for the Software team?&lt;/h3&gt;

&lt;p&gt;In practical terms, work tends to be tracked in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="http://bestpractical.com/rt/"&gt;RT tickets&lt;/a&gt;&lt;/strong&gt;, as created using the standard request process, or escalated from other teams&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://github.com/features/projects/issues"&gt;GitHub issues&lt;/a&gt;&lt;/strong&gt;, for product improvements, and work discovered while doing other work&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ad-hoc requests&lt;/strong&gt;, through informal communication channels (IM, email)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Because Software deals with requests from many audiences, we use a Kanban board to visualise work from request to completion across all these systems.&lt;/p&gt;

&lt;h3&gt;Managing flow&lt;/h3&gt;

&lt;p&gt;As of writing, we have 5 stages a task progresses through:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm9.staticflickr.com/8118/8757262454_ffddc8d41e_c.jpg" alt="the board" /&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;To Do&lt;/strong&gt; - tasks &lt;a href="http://en.wikipedia.org/wiki/Triage"&gt;triaged&lt;/a&gt;, and scheduled to be worked on next&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Doing&lt;/strong&gt; - tasks being worked on right now&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deployable&lt;/strong&gt; - completed tasks that need to be released to production in the near future (generally during change windows)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Done&lt;/strong&gt; - completed tasks&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;That's only 4 - the fifth stage is called the Icebox. This is for tasks we're aware of that haven't been triaged and aren't scheduled to be worked on yet.&lt;/p&gt;

&lt;p&gt;Done tasks are cleaned out once a week on Mondays, after the morning standup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Triage&lt;/strong&gt; is the process of taking a request and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determining the business priority&lt;/li&gt;
&lt;li&gt;Breaking it up into smaller tasks&lt;/li&gt;
&lt;li&gt;(Tentatively) allocating it to someone&lt;/li&gt;
&lt;li&gt;Classifying the type of work (Internal, Customer, &lt;a href="http://en.wikipedia.org/wiki/Business_as_usual_(business)"&gt;BAU&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Estimating a task completion time&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;We use the board exclusively to visualise the tasks - we don't communicate with the stakeholder through the board.&lt;/p&gt;

&lt;p&gt;Each task has a pointer to the system the request originated from:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm6.staticflickr.com/5454/8756135267_1000189eca_c.jpg" alt="detailed view" /&gt;&lt;/p&gt;

&lt;p&gt;…and a little bit of metadata about the overall progress.&lt;/p&gt;

&lt;p&gt;Communication with the stakeholder is done through the RT ticket / GitHub issue / email.&lt;/p&gt;

&lt;h3&gt;Limiting WIP&lt;/h3&gt;

&lt;p&gt;The &lt;a href="http://en.wikipedia.org/wiki/Work_in_process"&gt;WIP&lt;/a&gt; Limit is an artificial limit on the number of tasks the whole team can work on simultaneously. We currently calculate the WIP as:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;(Number of people in Software) x 2&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;The goal here is to ensure no one person is ever working on more than 2 tasks at once.&lt;/p&gt;

&lt;p&gt;I can hear you thinking &lt;em&gt;"That's crazy and will never work for me! I'm always dealing with multiple requests simultaneously"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The key to making the WIP Limit work is that &lt;strong&gt;tasks are never pushed&lt;/strong&gt; through the system - &lt;strong&gt;they are pulled&lt;/strong&gt; by the people doing the work. Once you finish your current task, you pull across the next highest priority task from the To Do column.&lt;/p&gt;

&lt;p&gt;The WIP Limit is particularly useful when coupled with visualising flow because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If people need to work on more than 2 things at once, it's indicative of a bigger scheduling contention problem that needs to be solved. We are likely context switching rapidly, which sharply reduces our delivery throughput.&lt;/li&gt;
&lt;li&gt;If the team is constantly working at the WIP limit, we need more people. We always aim to have at least 20% slack in the system to deal with ad-hoc tasks that bubble up throughout the day. If we're operating at 100% capacity, we have no room to breathe, and this severely reduces our operational effectiveness.&lt;/li&gt;
&lt;/ul&gt;
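
&lt;p&gt;As a rough sketch, the WIP arithmetic above looks something like this in Python. The function names and the team size of 5 are made up for illustration, and treating the 20% slack as a share of the team-wide WIP limit is an assumption rather than a rule our Kanban tool enforces.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
def wip_limit(team_size, tasks_per_person=2):
    """Team-wide WIP limit: (number of people) x 2."""
    return team_size * tasks_per_person


def has_enough_slack(tasks_in_progress, team_size, slack_percent=20):
    """True when at least `slack_percent` of the WIP limit is still free.

    Assumption: slack is measured against the team-wide WIP limit.
    Integer arithmetic avoids floating point rounding surprises.
    """
    limit = wip_limit(team_size)
    usable = limit * (100 - slack_percent) // 100
    return usable >= tasks_in_progress


# A hypothetical team of 5: the limit is 10, so 8 tasks leaves 20% free.
print(wip_limit(5))            # 10
print(has_enough_slack(8, 5))  # True
print(has_enough_slack(9, 5))  # False
```
&lt;/code&gt;&lt;/pre&gt;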


&lt;h3&gt;Visualising flow&lt;/h3&gt;

&lt;p&gt;Work makes its way from left to right across the board.&lt;/p&gt;

&lt;p&gt;This is valuable for communicating to people where their requests sit in the overall queue of work, but also in identifying bottlenecks where work isn't getting completed.&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://kanbanery.com/"&gt;Kanban tool&lt;/a&gt; we use colour codes tasks based on how long they have been sitting in the same column:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm3.staticflickr.com/2857/8756137525_ff6a337ca7.jpg" alt="colour coding of tasks" /&gt;&lt;/p&gt;

&lt;p&gt;This is vital for identifying work that people are blocked on completing, and tends to be indicative of one of two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work that is too large and needs to be broken down into smaller tasks&lt;/li&gt;
&lt;li&gt;Work that is more complex or challenging than originally anticipated&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The latter is an interesting case, because it may require pulling people off other work to help the person assigned that task push through and complete that work.&lt;/p&gt;

&lt;p&gt;Normally as a manager this isn't easy to discover unless you are regularly polling your people about their progress, but that behaviour is incredibly annoying to be on the receiving end of.&lt;/p&gt;

&lt;p&gt;The board is updated in real time as people in the team do work, which means as a manager I can get out of their way and let them Get Shit Done while having a passive visual indicator of any blockers in the system.&lt;/p&gt;
</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2013/05/20/how-we-do-kanban/</feedburner:origLink></entry>
 
 <entry>
   <title>Escalating Complexity</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/Ujdt8FOpJ8Q/" />
   <updated>2013-05-15T00:00:00+10:00</updated>
   <id>http://holmwood.id.au/~lindsay/2013/05/15/escalating-complexity-af447</id>
   <content type="html">&lt;p&gt;Back in 2009 when I was backpacking around Europe I remember waking up on the morning of June 1 and reading about how an Air France flight had disappeared somewhere over the Atlantic.&lt;/p&gt;

&lt;p&gt;The lack of information on what happened to the flight intrigued me, and given the traveling I was doing, I was left wondering "what if I was on that plane?"&lt;/p&gt;

&lt;p&gt;Keeping an ear out for updates, in December 2011 I stumbled upon the &lt;a href="http://www.popularmechanics.com/technology/aviation/crashes/what-really-happened-aboard-air-france-447-6611877"&gt;Popular Mechanics article&lt;/a&gt; describing the final moments of the flight. I was left fascinated by how a technical system so advanced could fail so horribly, apparently because of the faulty meatware operating it.&lt;/p&gt;

&lt;p&gt;Around the same time I began reading the works of &lt;a href="http://sidneydekker.com/"&gt;Sidney Dekker&lt;/a&gt;. I was left in a state of cognitive dissonance, trying to reconcile the mainstream explanation of what happened in the final moments of AF447 (the pilots were poorly trained, inexperienced, and simply incompetent) with the New View that the operators were merely locally rational actors within a complex system, and that "root cause is simply the place you stop looking further" - with that cause far too commonly attributed to humans.&lt;/p&gt;

&lt;p&gt;I decided to do my own research, which resulted in me producing a talk that has received the strongest reaction of any talk I've ever given.&lt;/p&gt;

&lt;iframe width="560" height="315" src="http://www.youtube.com/embed/P8hZOHtrHn0" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;


&lt;blockquote&gt;&lt;p&gt;On June 1, 2009 Air France 447 crashed into the Atlantic ocean killing all 228 passengers and crew. The 15 minutes leading up to the impact were a terrifying demonstration of how thick the fog of war is in complex systems.&lt;/p&gt;

&lt;p&gt;Mainstream reports of the incident put the blame on the pilots - a common motif in incident reports that conveniently ignore a simple fact: people were just actors within a complex system, doing their best based on the information at hand.&lt;/p&gt;

&lt;p&gt;While the systems you build and operate likely don't control the fate of people's lives, they share many of the same complexity characteristics. Dev and Ops can learn an abundance from how the feedback loops between these aviation systems are designed and how these systems are operated.&lt;/p&gt;

&lt;p&gt;In this talk Lindsay will cover what happened on the flight, why the mainstream explanation doesn't add up, how design assumptions can impact people's ability to respond to rapidly developing situations, and how to improve your operational effectiveness when dealing with rapidly developing failure scenarios.&lt;/p&gt;&lt;/blockquote&gt;

&lt;iframe src="http://www.slideshare.net/slideshow/embed_code/18183459" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen webkitallowfullscreen mozallowfullscreen&gt; &lt;/iframe&gt;



&lt;p&gt;The subject matter is heavy, and while it's something I'm passionate about, it was an emotionally taxing talk to prepare, and a talk that angers me when presenting.&lt;/p&gt;

&lt;p&gt;Time to let it sit and rest.&lt;/p&gt;
</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2013/05/15/escalating-complexity-af447/</feedburner:origLink></entry>
 
 <entry>
   <title>Data failures, compartmentalisation challenges, monitoring pipelines</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/6IRVZpFwQmk/" />
   <updated>2013-03-25T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2013/03/25/data-failures-compartments-pipelines</id>
   <content type="html">&lt;p&gt;To recap, &lt;a href="http://holmwood.id.au/~lindsay/2013/03/22/monitoring-pipelines/"&gt;pipelines are a useful way of modelling monitoring systems&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each compartment of the pipeline manipulates monitoring data before making it available to the next.&lt;/p&gt;

&lt;p&gt;At a high level, this is how data flows between the compartments:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm9.staticflickr.com/8370/8579331916_e698523190_o.png" alt="basic pipeline" /&gt;&lt;/p&gt;

&lt;p&gt;This design gives us a nice separation of concerns that enables scalability, fault tolerance, and clear interfaces.&lt;/p&gt;

&lt;h3&gt;The problem&lt;/h3&gt;

&lt;p&gt;What happens when there is no data available for the checks to query?&lt;/p&gt;

&lt;p&gt;In this very concrete case, we can divide the problem into two distinct classes of failure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Latency when accessing the metric storage layer&lt;/strong&gt;, manifested as &lt;a href="http://holmwood.id.au/~lindsay/2012/01/09/monitoring-sucks-latency-sucks-more/"&gt;checks timing out&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency or failure when pushing metrics into the storage layer&lt;/strong&gt;, manifested as checks being unable to retrieve fresh data.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;There are two outcomes from this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We need to provide clearer feedback to the people responding to alerts, to give them more insight into what's happening within the pipeline&lt;/li&gt;
&lt;li&gt;We need to make the technical system more robust when dealing with either of the above cases&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;Alerting severity levels aren't granular or accurate in a modern monitoring context&lt;/h3&gt;

&lt;p&gt;There are entire classes of monitoring problems (like the one we're dealing with here) that map poorly into the existing levels. This is an artefact of an industry-wide cargo culting of the alerting levels from Nagios, and these levels may not make sense in a modern monitoring pipeline with distinctly compartmentalised stages.&lt;/p&gt;

&lt;p&gt;For example, the &lt;a href="http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN76"&gt;Nagios plugin development guidelines&lt;/a&gt; state that &lt;code&gt;UNKNOWN&lt;/code&gt; from a check can mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Invalid command line arguments were supplied to the plugin&lt;/li&gt;
&lt;li&gt;Low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from performing the specified operation.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;"Low-level failures" is extremely broad, and it's important operationally to provide precise feedback to the people maintaining the monitoring system.&lt;/p&gt;

&lt;p&gt;Adding an additional level (or levels) with contextual debugging information would help close this feedback loop.&lt;/p&gt;

&lt;p&gt;In defence of the current practice, there are operational benefits to mapping problems into just 4 levels. For example, there are only ever 4 levels that an engineer needs to be aware of, as opposed to a system where there are 5 or 10 different levels that capture the nuance of a state, but engineers don't understand what that nuance actually is.&lt;/p&gt;

&lt;h3&gt;Compartmentalisation as the saviour and bane&lt;/h3&gt;

&lt;p&gt;The core idea driving the pipeline approach is compartmentalisation. We want to split out the different functions of monitoring into separate reliable compartments that have clearly defined interfaces.&lt;/p&gt;

&lt;p&gt;The motivation for this approach comes from the performance limitations of traditional monitoring systems where all the functions essentially live on a single box that can only be scaled vertically. Eventually you will reach the vertical limit of hardware capacity.&lt;/p&gt;

&lt;p&gt;This is bad.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm9.staticflickr.com/8085/8579405596_46095fa5cc_o.png" alt="a monolithic monitoring system" /&gt;&lt;/p&gt;

&lt;p&gt;Thus the &lt;a href="http://holmwood.id.au/~lindsay/2013/03/22/monitoring-pipelines/"&gt;pipeline approach&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Each stage of the pipeline is handled by a different compartment of monitoring infrastructure that analyses and manipulates the data before deciding whether to pass it onto the next compartment.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;This sounds great, except that now we have to deal with the relationships between compartments both in the normal mode of operation (fetching metrics, querying metrics, sending notifications, etc) and during failure scenarios (one or more compartments being down, incorrect or delayed information passed between compartments, etc).&lt;/p&gt;

&lt;p&gt;The pipeline attempts to take this into account:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Ideally, failures and scalability bottlenecks are compartmentalised.&lt;/p&gt;

&lt;p&gt;Where there are cascading failures that can't be contained, safeguards can be implemented in the surrounding compartments to dampen the effects.&lt;/p&gt;

&lt;p&gt;For example, if the data storage infrastructure stops returning data, this causes the check infrastructure to return false negatives. Or false positives. Or false UNKNOWNs. Bad times.&lt;/p&gt;

&lt;p&gt;We can contain the effects in the event processing infrastructure by detecting a mass failure and only sending out a small number of targeted notifications, rather than sending out alerts for each individual failing check.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;While the design is in theory meant to allow this containment, the practicalities of doing this are not straightforward.&lt;/p&gt;

&lt;p&gt;Some simple questions that need to be asked of each compartment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How does the compartment deal with a response it hasn't seen before?&lt;/li&gt;
&lt;li&gt;What is the &lt;a href="http://en.wikipedia.org/wiki/Adaptive_capacity"&gt;adaptive capacity&lt;/a&gt; of each compartment? How robust is each compartment?&lt;/li&gt;
&lt;li&gt;Does a failure in one compartment cascade into another? How far?&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The initial answers won't be pretty, and the solutions won't be simple (ideal as that would be) or easily discovered.&lt;/p&gt;

&lt;p&gt;Additionally, the robustness of each compartment in the pipeline &lt;em&gt;will be different&lt;/em&gt;, so making each compartment fault tolerant is a hard slog with its own unique challenges.&lt;/p&gt;

&lt;h3&gt;How are people solving this problem?&lt;/h3&gt;

&lt;p&gt;Netflix recently &lt;a href="https://github.com/Netflix/Hystrix/wiki"&gt;open sourced a project called Hystrix&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;Specifically, Netflix talk about how they make this happen:&lt;/p&gt;

&lt;blockquote&gt;&lt;h4&gt;How does Hystrix accomplish this?&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Wrap all calls to external systems (dependencies) in a HystrixCommand object (command pattern) which typically executes within a separate thread.&lt;/li&gt;
&lt;li&gt;Time-out calls that take longer than defined thresholds. A default exists but for most dependencies is custom-set via properties to be just slightly higher than the measured 99.5th percentile performance for each dependency.&lt;/li&gt;
&lt;li&gt;Maintain a small thread-pool (or semaphore) for each dependency and if it becomes full commands will be immediately rejected instead of queued up.&lt;/li&gt;
&lt;li&gt;Measure success, failures (exceptions thrown by client), timeouts, and thread rejections.&lt;/li&gt;
&lt;li&gt;Trip a circuit-breaker automatically or manually to stop all requests to that service for a period of time if error percentage passes a threshold.&lt;/li&gt;
&lt;li&gt;Perform fallback logic when a request fails, is rejected, timed-out or short-circuited.&lt;/li&gt;
&lt;li&gt;Monitor metrics and configuration change in near real-time.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
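
&lt;p&gt;To make the circuit-breaker idea from that list a little more concrete, here is a minimal Python sketch. It is not Hystrix's API (Hystrix is a Java library), and it simplifies things: it trips on a count of consecutive failures rather than an error percentage. The &lt;code&gt;CircuitBreaker&lt;/code&gt; class and its parameters are hypothetical names chosen for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
import time


class CircuitBreaker:
    """Stop calling a dependency for a while once too many calls have failed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to stay open once tripped
        self.clock = clock                # injectable so it can be faked in tests
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-off period is over: close the breaker and allow calls again.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, func, fallback):
        if self.is_open():
            return fallback()  # short-circuit: fail fast without calling func
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A caller wraps each request to a dependency in &lt;code&gt;breaker.call(fetch, fallback)&lt;/code&gt;. While the breaker is open the dependency isn't touched at all, which gives it room to recover.&lt;/p&gt;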

&lt;h3&gt;Potential Solutions&lt;/h3&gt;

&lt;p&gt;We can apply many of the strategies from Hystrix to the monitoring pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrap all monitoring checks with a timeout that returns an &lt;code&gt;UNKNOWN&lt;/code&gt; (assuming you stick with the existing severity levels)&lt;/li&gt;
&lt;li&gt;Add some sort of signalling mechanism to the checks so they fail faster, e.g.

&lt;ul&gt;
&lt;li&gt;Stick a load balancer like HAProxy or Nginx in front of the data storage compartment&lt;/li&gt;
&lt;li&gt;Cache the state of the data storage compartment, and have every monitoring check consult that cache before querying the compartment&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Detect mass failures, and notify on-call and the monitoring system owners directly to shorten the &lt;a href="http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/"&gt;MTTR&lt;/a&gt;. This is something &lt;a href="https://github.com/flpjck/flapjack"&gt;Flapjack&lt;/a&gt; aims to do &lt;a href="http://holmwood.id.au/~lindsay/2013/03/15/rebooting-flapjack/"&gt;as part of the reboot&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
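
&lt;p&gt;The first strategy is the easiest to sketch. The Python below wraps a check in a timeout and maps a timeout (or a crash) to &lt;code&gt;UNKNOWN&lt;/code&gt;, using the standard Nagios plugin exit codes. The &lt;code&gt;run_check&lt;/code&gt; helper and the shape of a check (a callable returning a status and a message) are assumptions for illustration, not part of any existing tool.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(check, timeout_seconds):
    """Run `check`, a callable returning (status, message), with a time limit."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return UNKNOWN, "check timed out after %ss" % timeout_seconds
    except Exception as exc:
        return UNKNOWN, "check failed to run: %s" % exc
    finally:
        # Don't wait for a hung check. Note the worker thread is not killed,
        # it keeps running in the background until the check returns.
        pool.shutdown(wait=False)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the timed-out check keeps running in its thread, this only protects the caller. A check that shells out to a plugin would also want to kill the child process.&lt;/p&gt;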


&lt;p&gt;I don't profess to have all (or even any) of the answers. This is new ground, and I'm very curious to hear how other people are solving this problem.&lt;/p&gt;
</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2013/03/25/data-failures-compartments-pipelines/</feedburner:origLink></entry>
 
 <entry>
   <title>Pipelines: a modern approach to modelling monitoring</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/ZMTF_DaSmu0/" />
   <updated>2013-03-22T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2013/03/22/monitoring-pipelines</id>
   <content type="html">&lt;p&gt;Over the last few years I have been experimenting with different approaches for scaling systems that monitor large numbers of heterogeneous hosts, specifically in hosting environments.&lt;/p&gt;

&lt;p&gt;This post outlines a pipeline approach for modelling and manipulating monitoring data.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Monitoring can be represented as a pipeline which data flows through, and is eventually turned into a notification for a human.&lt;/p&gt;

&lt;p&gt;This approach has several benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failures are compartmentalised&lt;/li&gt;
&lt;li&gt;Compartments can be scaled independently from one another&lt;/li&gt;
&lt;li&gt;Clear interfaces are required between compartments, enabling composability&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Each stage of the pipeline is handled by a different compartment of monitoring infrastructure that analyses and manipulates the data before deciding whether to pass it onto the next compartment.&lt;/p&gt;

&lt;p&gt;These components are the bare minimum required for a monitoring pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data collection infrastructure&lt;/strong&gt; is generally a collection of agents on target systems, or standalone tools that extract metrics from opaque systems (preferably via an API).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data storage infrastructure&lt;/strong&gt; provides a place to push collected metrics, which are almost always numerical. These metrics are then queried and fetched for graphing, monitoring checks, and reporting - thus enabling &lt;a href="http://agilesysadmin.net/pillar-one"&gt;"We alert on what we draw"&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check execution infrastructure&lt;/strong&gt; runs the monitoring checks configured for each host, which query the data storage infrastructure. Checks that query textual data often poll the target system directly, which can have &lt;a href="http://holmwood.id.au/~lindsay/2012/01/09/monitoring-sucks-latency-sucks-more/"&gt;effects on latency&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Notification infrastructure&lt;/strong&gt; processes check results from the check execution infrastructure to send notifications to engineers or stakeholders. Ideally the notification infrastructure can also feed back actions from engineers to acknowledge, escalate, or resolve alerts.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;At a high level, this is how data flows between the compartments:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm9.staticflickr.com/8370/8579331916_e698523190_o.png" alt="basic pipeline" /&gt;&lt;/p&gt;

&lt;p&gt;When using Nagios, the check + notification infrastructure are generally collapsed into one compartment (with the exception of &lt;a href="http://exchange.nagios.org/directory/Addons/Monitoring-Agents/NRPE--2D-Nagios-Remote-Plugin-Executor/details"&gt;NRPE&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Many monitoring pipelines start out with the data collection + storage infrastructure decoupled from the check infrastructure. Monitoring checks query the same targets that are being graphed, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Because the check intervals don't necessarily match up to the data collection intervals, it can be hard to correlate monitoring alerts to features on the graphs.&lt;/li&gt;
&lt;li&gt;The more systems poll the target system, the more the &lt;a href="http://en.wikipedia.org/wiki/Observer_effect_(physics)"&gt;observer effect&lt;/a&gt; is amplified.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;There are two other compartments that are becoming increasingly common:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Event processing infrastructure&lt;/strong&gt;. Sitting between the check execution and notification infrastructure, this compartment processes events generated from the check infrastructure, identifies trends and emergent behaviours, and forwards the alerts to the notification infrastructure. It may also make decisions on who to send alerts to.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Management infrastructure&lt;/strong&gt; provides command + control facilities across all the compartments, as well as being the natural place for graphing and dashboards of metrics in the data storage infrastructure to live. If the target audience is non-technical or strongly segmented (e.g. many customers on a shared monitoring infrastructure), it can also provide an abstracted pretty public face to all the compartments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This is how event processing + management fit into the pipeline:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm9.staticflickr.com/8095/8579345790_663f5d3e09_o.png" alt="event processing + management added to the pipeline" /&gt;&lt;/p&gt;

&lt;p&gt;The management infrastructure can likely be broken up into different compartments as well, but for now it serves as a placeholder.&lt;/p&gt;

&lt;p&gt;Let's explore the benefits of this pipeline design.&lt;/p&gt;

&lt;h3&gt;Failures are compartmentalised&lt;/h3&gt;

&lt;p&gt;Ideally, failures and scalability bottlenecks are compartmentalised.&lt;/p&gt;

&lt;p&gt;Where there are cascading failures that can't be contained, safeguards can be implemented in the surrounding compartments to dampen the effects&lt;sup&gt;&lt;a href="#blah"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;For example, if the data storage infrastructure stops returning data, this causes the check infrastructure to return false negatives. Or false positives. Or false UNKNOWNs. Bad times.&lt;/p&gt;

&lt;p&gt;We can contain the effects in the event processing infrastructure by detecting a mass failure and only sending out a small number of targeted notifications, rather than sending out alerts for each individual failing check.&lt;/p&gt;
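
&lt;p&gt;A very rough sketch of that containment logic in Python might look like the following. The &lt;code&gt;plan_notifications&lt;/code&gt; function, the 50% threshold, and the message formats are all hypothetical; a real event processor would also need to consider time windows and which checks share dependencies.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
def plan_notifications(results, mass_failure_ratio=0.5):
    """Collapse a mass failure into a single notification.

    `results` maps a check name to its state ("OK", "WARNING", "CRITICAL"
    or "UNKNOWN"). Returns the list of notification messages to send.
    """
    failing = sorted(name for name, state in results.items() if state != "OK")
    if not failing:
        return []
    if len(failing) >= mass_failure_ratio * len(results):
        # Many checks failing at once most likely share a cause, so send
        # one targeted notification instead of one alert per check.
        summary = "MASS FAILURE: %d of %d checks failing"
        return [summary % (len(failing), len(results))]
    return ["%s is %s" % (name, results[name]) for name in failing]
```
&lt;/code&gt;&lt;/pre&gt;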

&lt;p&gt;This problem is tricky, interesting, and fodder for further blog posts. :-)&lt;/p&gt;

&lt;h3&gt;Compartments can be scaled independently&lt;/h3&gt;

&lt;p&gt;Monolithic monitoring architectures are a pain to scale. Viewing a monolithic architecture through the prism of the pipeline model, all of the compartments are squeezed onto a single machine. Quite often there isn't a data collection or storage layer either.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm9.staticflickr.com/8085/8579405596_46095fa5cc_o.png" alt="a monolithic monitoring system" /&gt;&lt;/p&gt;

&lt;p&gt;Monolithic architectures often use the same moving parts under the hood, but they tend to be very closely entwined. Each tool has very distinct performance characteristics, but because they all run on a single machine and are poorly separated, the only way to improve performance is by throwing expensive hardware at the problem.&lt;/p&gt;

&lt;p&gt;If you've ever worked with a monolithic monitoring system, you will likely be experiencing painful flashbacks right about now.&lt;/p&gt;

&lt;p&gt;To generalise the workload of the different compartments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check execution, notifications, and event processing tend to be very CPU intensive + network latency sensitive&lt;/li&gt;
&lt;li&gt;Data storage is IO intensive + disk space expensive&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Making sure each compartment is humming along nicely is super important when providing a consistent and reliable monitoring service.&lt;/p&gt;

&lt;p&gt;Splitting the compartments onto separate infrastructure enables us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimise the performance of each component individually, either through using hardware that's more appropriate for the workloads (SSDs, multi-CPU physical machines), or tuning the software stack at the kernel and user space level.&lt;/li&gt;
&lt;li&gt;Expose data through well defined APIs, which leads into the next point:&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;Clear interfaces are required between compartments&lt;/h3&gt;

&lt;p&gt;I like to think of this as "the Duplo approach" - compartments with well defined interfaces you can plug together to compose your pipeline.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm3.staticflickr.com/2518/3999316430_8df5fdda1f_z.jpg" alt="a Duplo brick" /&gt;&lt;/p&gt;

&lt;p&gt;Clear interfaces abstract the tools used in each compartment of the pipeline, which is essential for chaining tools in a composable way.&lt;/p&gt;

&lt;p&gt;Clear interfaces help us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace underperforming tools that have reached their scalability limits&lt;/li&gt;
&lt;li&gt;Test new tools in parallel with the old tools by verifying their inputs + outputs&lt;/li&gt;
&lt;li&gt;Better identify input that could be considered erroneous, and react appropriately&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Concepts like &lt;a href="http://en.wikipedia.org/wiki/Design_by_contract"&gt;Design by Contract&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Service-oriented_architecture"&gt;Service Oriented Architecture&lt;/a&gt;, or &lt;a href="http://en.wikipedia.org/wiki/Defensive_programming"&gt;Defensive Programming&lt;/a&gt; then have direct applicability to the design of individual components and the pipeline overall.&lt;/p&gt;
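&lt;p&gt;As a rough illustration of a contract at a compartment boundary, the sketch below validates an incoming event before it is handed to the next stage. The field names and the list of valid states are assumptions made up for this example, not a format used by any particular tool.&lt;/p&gt;

```ruby
# Hypothetical event contract: field names and states are assumptions.
REQUIRED_FIELDS = [:entity, :check, :state, :timestamp].freeze
VALID_STATES = ["ok", "warning", "critical", "unknown"].freeze

# Return a list of contract violations for an incoming event hash.
# An empty list means the event is safe to pass to the next compartment.
def contract_violations(event)
  missing = REQUIRED_FIELDS.reject { |field| event.key?(field) }
  errors = missing.map { |field| "missing #{field}" }

  if event.key?(:state)
    unless VALID_STATES.include?(event[:state])
      errors.push("invalid state #{event[:state].inspect}")
    end
  end

  if event.key?(:timestamp)
    unless event[:timestamp].is_a?(Integer)
      errors.push("timestamp must be an integer")
    end
  end

  errors
end

event = { entity: "web-01", check: "http", state: "ok", timestamp: 1363910400 }
puts contract_violations(event).empty? ? "accepted" : "rejected"
```

&lt;p&gt;Rejecting or quarantining events that fail a check like this is one way to stop erroneous input in one compartment from spreading to the next.&lt;/p&gt;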

&lt;hr /&gt;

&lt;p&gt;It's not all rainbows and unicorns. There are some downsides to the pipeline approach.&lt;/p&gt;

&lt;h3&gt;Greater Cost&lt;/h3&gt;

&lt;p&gt;There will almost certainly be a bigger initial investment in building a monitoring system with the pipeline approach.&lt;/p&gt;

&lt;p&gt;You'll be using more components, thus more servers, thus the cost is greater. While the cost of scaling out may be greater up-front, you limit the need to scale up later on.&lt;/p&gt;

&lt;p&gt;You can counteract some of these effects by starting small and dividing up compartments over time as part of a piecemeal strategy, but this takes time + persistence.&lt;/p&gt;

&lt;p&gt;I can tell you from personal project management experience of rolling out this pipeline design that it's hard work keeping a model of the complexity both in your head and well documented.&lt;/p&gt;

&lt;h3&gt;More Complexity&lt;/h3&gt;

&lt;p&gt;The pipeline makes it easier to eliminate scalability bottlenecks at the expense of more moving parts. The more moving parts, the greater the likelihood of failure.&lt;/p&gt;

&lt;p&gt;Operationally it will be more difficult to troubleshoot when failures occur, and this becomes worse as you increase the safeguards and fault tolerance within your compartments.&lt;/p&gt;

&lt;p&gt;This is the cost of scalability, and there is no easy fix.&lt;/p&gt;

&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;The pipeline model maps nicely to existing monitoring infrastructures, but also to larger distributed monitoring systems.&lt;/p&gt;

&lt;p&gt;It provides scalability, fault tolerance, and composability at the cost of a larger upfront investment.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;a id="blah"&gt;1&lt;/a&gt;: This is a vast simplification of a very complex topic. Thinking of failure as an energy to be contained by barriers was a popular perspective in accident prevention circles from the 1960s to the 1980s, &lt;a href="https://www.msb.se/Upload/Kunskapsbank/Forskningsrapporter/Slutrapporter/2009%20Resilience%20Engineering%20New%20directions%20for%20measuring%20and%20maintaining%20safety%20in%20complex%20systems.pdf"&gt;but the concept doesn't necessarily apply to complex systems&lt;/a&gt;.&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/AuxesisMusings/~4/ZMTF_DaSmu0" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2013/03/22/monitoring-pipelines/</feedburner:origLink></entry>
 
 <entry>
   <title>Rebooting Flapjack</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/uhKu5kYZ9EI/" />
   <updated>2013-03-15T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2013/03/15/rebooting-flapjack</id>
   <content type="html">&lt;p&gt;This is the first time I've actually blogged about Flapjack.&lt;/p&gt;

&lt;h3&gt;The past&lt;/h3&gt;

&lt;p&gt;In 2008 I started talking with &lt;a href="https://twitter.com/imprecise_matt"&gt;Matt Moor&lt;/a&gt; about building a "next generation monitoring system" that would be simple to set up &amp;amp; operate, and provide obvious paths to scale.&lt;/p&gt;

&lt;p&gt;In 2009 I started hacking on Flapjack while backpacking, and by mid 2009 I had a working prototype running basic monitoring checks.&lt;/p&gt;

&lt;p&gt;The fundamental idea was simple: decouple the check execution from the alerting and notification, and use message queues to distribute the check execution across lots of machines.&lt;/p&gt;
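&lt;p&gt;The shape of that idea can be sketched in a few lines of Ruby. This is only an in-process illustration that uses Ruby's built-in &lt;code&gt;Queue&lt;/code&gt; as a stand-in for a real message broker. The &lt;code&gt;run_check&lt;/code&gt; method and the host names are made up, and none of this is Flapjack's actual implementation.&lt;/p&gt;

```ruby
require "thread" # harmless on modern Ruby, where Queue is always available

# In-process stand-ins for a message broker's queues.
checks  = Queue.new # work waiting for a check executor
results = Queue.new # results waiting for the notifier

# Hypothetical check: pretend hosts whose names start with "db" are down.
def run_check(host)
  { host: host, state: host.start_with?("db") ? :critical : :ok }
end

# Executors only know how to run checks and publish the results.
executors = 3.times.map do
  Thread.new do
    while (host = checks.pop) != :done
      results.push(run_check(host))
    end
  end
end

hosts = ["web-01", "web-02", "db-01", "app-01"]
hosts.each { |host| checks.push(host) }
executors.size.times { checks.push(:done) } # one stop marker per executor
executors.each { |thread| thread.join }

# The notifier only consumes results; it never runs a check itself.
alerts = []
until results.empty?
  result = results.pop
  alerts.push("#{result[:host]} is #{result[:state]}") unless result[:state] == :ok
end
puts alerts
```

&lt;p&gt;Because the executors and the notifier only share the two queues, either side could be moved onto other machines or replaced without the other noticing, provided the queue lives somewhere both can reach.&lt;/p&gt;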

&lt;p&gt;It seems simple and obvious now, but at the time nobody was really talking about doing this, so Flapjack gathered a reasonable amount of attention relatively quickly after I started talking about it at conferences.&lt;/p&gt;

&lt;p&gt;2010 rolled around and I was unable to maintain a good development pace and hold that attention gained by talking at conferences due to some &lt;a href="http://www.flickr.com/photos/auxesis/7104782937"&gt;fairly significant life changes&lt;/a&gt;. Pretty much all of my open source projects suffered, and in the space of 12 months:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://cucumber-nagios.org"&gt;cucumber-nagios&lt;/a&gt; maintainership was handed over&lt;/li&gt;
&lt;li&gt;&lt;a href="http://visage-app.com"&gt;Visage&lt;/a&gt; got a small trickle of bug fixes&lt;/li&gt;
&lt;li&gt;&lt;a href="http://flapjack-project.com"&gt;Flapjack&lt;/a&gt; was wound up and &lt;a href="https://github.com/flpjck/flapjack/commit/661dbd84d2d94a67b6cea58e5f6e86c82b6b316b"&gt;I considered it a dead project&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;There were plenty of other interesting projects like &lt;a href="http://sensuapp.org/"&gt;Sensu&lt;/a&gt; that were achieving similar goals excellently, so while winding up Flapjack was a source of bitter personal disappointment, it was offset by seeing other people doing awesome work in the monitoring space.&lt;/p&gt;

&lt;h3&gt;The present&lt;/h3&gt;

&lt;p&gt;Mid &lt;abbr title='2012'&gt;last year&lt;/abbr&gt;, an interesting problem arose at work:&lt;/p&gt;

&lt;p&gt;In a modern "monitoring system", how do you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Notify a dynamic group of people on a variety of media based on monitoring events?&lt;/strong&gt; &lt;a href="http://bulletproof.net"&gt;Bulletproof&lt;/a&gt; has thousands of people that may need to be notified by our monitoring system, depending on what monitoring checks are failing. While the thresholds on each monitoring check are universal, each of these people can have different notification settings based on time of day or week, the type of service affected, or the severity of the failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dampen or roll up common events so on-call isn't bombarded during outages?&lt;/strong&gt; When one system deep in the stack fails, it has significant flow-on effects to everything else that depends on it. This generally manifests as thousands (or tens of thousands, in extremely bad cases) of alerts being sent to on-call in a very short period of time (&amp;lt;60 seconds). Obviously this is bad, and we simply want to detect cases like these, and wake up people involved in the incident response process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do the above in an API driven way?&lt;/strong&gt; We need to solve both problems in a way that works in a multitenant environment with strong segregation between customers, and integrates with an existing monitoring &amp;amp; customer self-service stack.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Thus, &lt;a href="https://github.com/flpjck/flapjack"&gt;Flapjack was rebooted&lt;/a&gt; with a significantly altered focus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event processing&lt;/li&gt;
&lt;li&gt;Correlation &amp;amp; rollup&lt;/li&gt;
&lt;li&gt;API driven configuration&lt;/li&gt;
&lt;/ul&gt;
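&lt;p&gt;To make the first of the problems above a little more concrete, here is a rough sketch of per-contact notification rules. The &lt;code&gt;Rule&lt;/code&gt; and &lt;code&gt;Contact&lt;/code&gt; structs and the &lt;code&gt;media_for&lt;/code&gt; method are hypothetical names invented for this example. They are not Flapjack's API or data structures.&lt;/p&gt;

```ruby
# Hypothetical data model, for illustration only.
Rule = Struct.new(:media, :severities, :hours)
Contact = Struct.new(:name, :rules)

# Return the media a contact should be notified on for an event of the
# given severity that occurs at the given hour of the day (0-23).
def media_for(contact, severity, hour)
  matching = contact.rules.select do |rule|
    rule.severities.include?(severity) and rule.hours.include?(hour)
  end
  matching.map { |rule| rule.media }.uniq
end

ada = Contact.new("ada", [
  Rule.new(:email, [:warning, :critical], 0..23), # email at any hour
  Rule.new(:sms,   [:critical],           9..17)  # SMS only during business hours
])

puts media_for(ada, :critical, 10).inspect
puts media_for(ada, :critical, 22).inspect
```

&lt;p&gt;Day of week and type of service could be handled the same way by adding further fields to each rule.&lt;/p&gt;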


&lt;p&gt;We've been actively working on the reboot since July last year, and have been sending alerts from Flapjack to customers since January.&lt;/p&gt;

&lt;p&gt;We're developing Flapjack as a &lt;a href="http://en.wikipedia.org/wiki/MIT_License"&gt;fully Open Source&lt;/a&gt; &lt;a href="https://github.com/flpjck/flapjack/wiki/USING"&gt;composable platform&lt;/a&gt; that you can &lt;a href="https://github.com/flpjck/flapjack/wiki/IMPORTING"&gt;adapt&lt;/a&gt; and build on to suit your organisation's needs by hooking it into your existing check execution infrastructure (we ship a Nagios event processor) and your self service and provisioning automation tools.&lt;/p&gt;

&lt;p&gt;Because we care deeply about people integrating Flapjack into their existing environments, we have invested a lot of time and energy into writing quality documentation that covers &lt;a href="https://github.com/flpjck/flapjack/wiki/API"&gt;working with the API&lt;/a&gt;, &lt;a href="https://github.com/flpjck/flapjack/wiki/DEBUGGING"&gt;debugging production issues&lt;/a&gt;, and &lt;a href="https://github.com/flpjck/flapjack/wiki/DATA_STRUCTURES"&gt;the data structures&lt;/a&gt; used behind the scenes. That's all on top of the &lt;a href="https://github.com/flpjck/flapjack/wiki/USING"&gt;usage documentation&lt;/a&gt;, of course.&lt;/p&gt;

&lt;p&gt;Flapjack is built on Redis, and funnily enough &lt;a href="https://twitter.com/ripienaar"&gt;R.I. Pienaar&lt;/a&gt; did a post &lt;a href="http://www.devco.net/archives/2013/01/06/solving-monitoring-state-storage-problems-using-redis.php"&gt;earlier this year&lt;/a&gt; that investigates using Redis to solve the same problem in an extremely similar way. R.I.'s post provides a good primer on some of the thinking behind Flapjack, so I recommend giving it a read.&lt;/p&gt;

&lt;h3&gt;The future&lt;/h3&gt;

&lt;p&gt;Fundamentally, Flapjack is trying to plug a notification hole in the monitoring ecosystem that I don't believe is being adequately addressed by other tools, but the key to doing this is to play nicely with other tools and build a composable pipeline.&lt;/p&gt;

&lt;p&gt;The above is merely a glimpse of Flapjack that leaves quite a few questions unanswered (e.g. &lt;em&gt;"Why aren't you using $x feature of $y check execution engine to do roll-up?"&lt;/em&gt;, &lt;em&gt;"Do Flapjack and &lt;a href="http://riemann.io/"&gt;Riemann&lt;/a&gt; play nicely with one another?"&lt;/em&gt;), so stay tuned for more:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://24.media.tumblr.com/tumblr_lx2uc33Q0Z1qb6v7mo1_500.gif" alt="more waffles" /&gt;&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/AuxesisMusings/~4/uhKu5kYZ9EI" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2013/03/15/rebooting-flapjack/</feedburner:origLink></entry>
 
 <entry>
   <title>Upcoming speaking engagements and travel</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/h3n-_iOu5tc/" />
   <updated>2013-03-05T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2013/03/05/upcoming-speaking-and-travel</id>
   <content type="html">&lt;p&gt;My next 2 months is going to be jam packed with conferences and travel!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.devopsdays.org/events/2013-newzealand/"&gt;Devopsdays NZ&lt;/a&gt;, &lt;strong&gt;March 8 2013&lt;/strong&gt;. I will be giving a talk that analyses &lt;a href="http://www.devopsdays.org/events/2013-newzealand/proposals/LessonsCollaborativeMaintenance/"&gt;AA261 through a DevOps lens&lt;/a&gt;, looking at the collaborative maintenance and operation of the MD-83 in the crash.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://monitorama.com/"&gt;Monitorama&lt;/a&gt;, &lt;strong&gt;March 28-29 2013&lt;/strong&gt;. I'm looking forward to slowing down and listening at Monitorama, which has a tremendous line up of speakers. I'll be keen to hear what others think of the work &lt;a href="http://bulletproof.net"&gt;we've&lt;/a&gt; been doing &lt;a href="http://github.com/flpjck/flapjack"&gt;on Flapjack&lt;/a&gt; the &lt;a href="https://speakerdeck.com/auxesis/zombie-pancakes-rebooting-flapjack-lindsay-holmwood"&gt;last 6 months&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://mtnwestrubyconf.org/"&gt;Mountain West Ruby Conf 2013&lt;/a&gt;, &lt;strong&gt;April 3-5 2013&lt;/strong&gt;. MWRC has added an extra day of DevOps content to the conference this year, and I'll be joining an esteemed speaker lineup to talk about what both dev and ops can learn from &lt;a href="http://en.wikipedia.org/wiki/Air_France_Flight_447"&gt;AF447&lt;/a&gt; when responding to rapidly evolving failure scenarios.&lt;/li&gt;
&lt;li&gt;I'll be staying in the Netherlands for a little under a week between conferences, visiting family and friends. Hopefully I can visit a meetup or two.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.netways.de/en/osdc/osdc_2013/overview/"&gt;Open Source Data Center Conference 2013&lt;/a&gt;, &lt;strong&gt;April 17-18 2013&lt;/strong&gt;. This will be my first time in Nürnberg, and I'm really looking forward to saying I have attended &lt;a href="http://en.wikipedia.org/wiki/Open_Source_Developers'_Conference"&gt;both&lt;/a&gt; &lt;a href="http://www.netways.de/en/osdc/osdc_2013/overview/"&gt;OSDCs&lt;/a&gt;. I'll be talking about &lt;a href="http://github.com/auxesis/ript"&gt;Ript&lt;/a&gt;, a DSL for describing firewall rules, and a tool for incrementally applying them.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.netways.de/puppetcamp"&gt;Puppet Camp Nürnberg 2013&lt;/a&gt;, &lt;strong&gt;April 19 2013&lt;/strong&gt;. Straight after OSDC I'll be talking about how we are using &lt;a href="http://bulletproof.net/"&gt;Puppet at Bulletproof Networks&lt;/a&gt; in multi-tenant, isolated environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;img src="http://feeds.feedburner.com/~r/AuxesisMusings/~4/h3n-_iOu5tc" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2013/03/05/upcoming-speaking-and-travel/</feedburner:origLink></entry>
 
 <entry>
   <title>How I make interesting technical presentations</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/YSLrgmVZhfM/" />
   <updated>2013-01-26T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2013/01/26/how-i-make-interesting-technical-presentations</id>
   <content type="html">&lt;p&gt;Whenever I talk at conferences, I am routinely asked how I go about preparing and making my presentations.&lt;/p&gt;

&lt;p&gt;There are no hard and fast rules, but these are some things I have learnt:&lt;/p&gt;

&lt;h3&gt;Start analog&lt;/h3&gt;

&lt;p&gt;The most limiting thing you can do when you start putting together a presentation is to reach for slideware. I use a paper notebook to brainstorm my ideas with multicoloured pens, then scan it so I can refer back to it quickly when putting the slides together.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm9.staticflickr.com/8098/8414483097_871baba740.jpg" alt="mindmapping a talk" /&gt;&lt;/p&gt;

&lt;h3&gt;Don't create slides linearly&lt;/h3&gt;

&lt;p&gt;I focus on an idea in the brainstorm that surprised me the most when I wrote it down, and use it as a jump-off point for creating slides. I've found exploring that initial idea helps set the tone for the rest of the presentation.&lt;/p&gt;

&lt;h3&gt;Weave a story&lt;/h3&gt;

&lt;p&gt;Kathy Sierra used to bang on &lt;a href="http://headrush.typepad.com/creating_passionate_users/2006/02/where_theres_pa.html"&gt;about this heaps&lt;/a&gt;. We're wired as a species to find stories interesting, so use this to your advantage.&lt;/p&gt;

&lt;p&gt;But don't concoct a story just for the talk - try to relate the content back to your own experiences. Nobody wants to hear about &lt;a href="http://en.wikipedia.org/wiki/Alice_and_Bob"&gt;Alice and Bob&lt;/a&gt;; they want to hear how you and your co-workers rose above adversity, and about the setbacks you had along the way.&lt;/p&gt;

&lt;p&gt;Chris Fegan's &lt;a href="http://www.nbnco.com.au/"&gt;NBNCo&lt;/a&gt; talk at Puppet Camp Sydney 2013 was a good example of how to weave technical detail into an organisational growth story.&lt;/p&gt;

&lt;h3&gt;Use slides appropriately&lt;/h3&gt;

&lt;p&gt;They are a visual aid, and a visual aid alone. People's attention should be on you - you are the speaker after all! Use lots of supporting visuals, and minimal text. No bullet point lists! Put each point on a separate slide.&lt;/p&gt;

&lt;p&gt;I use &lt;a href="http://www.flickr.com/search/?q=wave&amp;amp;l=cc&amp;amp;ss=0&amp;amp;ct=0&amp;amp;mt=all&amp;amp;w=all&amp;amp;adv=1"&gt;Flickr's Creative Commons search&lt;/a&gt; to find relevant images, and favourite them when I want to use them again across multiple presentations. Sometimes they even provide a visual trigger that moves the presentation in a direction I wasn't expecting.&lt;/p&gt;

&lt;p&gt;If I post the slides after the presentation, it's always nice to comment on the picture on Flickr to let the photographer know I appreciate their contributions to Open Culture.&lt;/p&gt;

&lt;h3&gt;Don't rely on the slides&lt;/h3&gt;

&lt;p&gt;Ideally if your laptop died 5 minutes before the talk, you should know your material well enough that you could deliver it by voice alone.&lt;/p&gt;

&lt;h3&gt;Be thorough&lt;/h3&gt;

&lt;p&gt;Shortcuts are obvious to your audience. I spend at least 20 hours preparing each presentation.&lt;/p&gt;

&lt;p&gt;A lot of that time is research (I spent 10 hours alone doing research on &lt;a href="http://en.wikipedia.org/wiki/AF447"&gt;AF447&lt;/a&gt; before I created a single slide, and that research was probably too little given the depth of subject matter), and a lot of it is finding images on Flickr. :-)&lt;/p&gt;

&lt;p&gt;Maybe 20 hours is a lot, but every minute you put into preparation pays off.&lt;/p&gt;

&lt;h3&gt;Tailor your content&lt;/h3&gt;

&lt;p&gt;It's ok to give the same talk at multiple conferences, but make sure you alter the content so it's relevant to your audience.&lt;/p&gt;

&lt;p&gt;I gave my &lt;a href="http://www.slideshare.net/auxesis/monitoring-web-application-behaviour-with-cucumbernagios"&gt;cucumber-nagios talk&lt;/a&gt; tens of times over an 18 month period, but the talk was different every time.&lt;/p&gt;

&lt;p&gt;If I was at a developer conference, I would talk about how to reuse your existing tests as monitoring checks. If I was at a sysadmin conference, I would talk about testing systems infrastructure. If I was at a DevOps conference, I would talk about encoding &amp;amp; communicating business processes in your monitoring.&lt;/p&gt;

&lt;h3&gt;Practice, practice, practice&lt;/h3&gt;

&lt;p&gt;Know the timing of your talk. Work out the average time you should spend on each slide. I generally rehearse each talk at least 3-5 times before I give it the first time, and will revise and rehearse at least 1-2 times on subsequent presentations.&lt;/p&gt;

&lt;p&gt;Don't wait until you've finished the presentation before you start practicing. I'll often practice the 20% I've put together and discover it feels mechanical, or the ideas don't flow well into one another. Refactor.&lt;/p&gt;

&lt;h3&gt;Test your equipment&lt;/h3&gt;

&lt;p&gt;Plug your laptop into the projector at least once, preferably twice, before your talk. I carry multiple adapters for every conceivable display type out there, some display cables, a power board, and &lt;a href="http://www.logitech.com/en-au/product/professional-presenter-r800?crid=11"&gt;a clicker&lt;/a&gt;. Test everything, then test it again.&lt;/p&gt;

&lt;h3&gt;Mirror your display&lt;/h3&gt;

&lt;p&gt;It's tempting to use your laptop screen for presenter notes and stopwatch widgets. Don't. Know your material. Use a physical stopwatch. Split displays will break unexpectedly, and you'll lose your flow. Besides, mirroring is always easier than craning your neck to see what your audience is seeing.&lt;/p&gt;

&lt;h3&gt;Watch yourself&lt;/h3&gt;

&lt;p&gt;If you're lucky enough to talk at a conference where your talk is recorded, go back and watch your talk. This is vitally important for working out what bits flowed well and what bits were stilted.&lt;/p&gt;

&lt;p&gt;--&lt;/p&gt;

&lt;p&gt;The most important thing is to speak at many events as often as possible. You're only going to get better at presenting if you present. Start working towards that &lt;a href="http://en.wikipedia.org/wiki/Outliers_(book)"&gt;10,000 hours of mastery&lt;/a&gt;!&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/AuxesisMusings/~4/YSLrgmVZhfM" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2013/01/26/how-i-make-interesting-technical-presentations/</feedburner:origLink></entry>
 
 <entry>
   <title>DevOps Down Under 2012 - what happened?</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/A7NTdpy0Jyg/" />
   <updated>2012-11-22T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2012/11/22/devops-down-under-2012-what-happened</id>
   <content type="html">&lt;p&gt;Almost 2 days ago &lt;a href="https://twitter.com/patrickdebois"&gt;Patrick&lt;/a&gt; kicked off a discussion about organising another Australian DevOps conference in 2013 amongst a small group of passionate DevOps who are actively involved in the Australian community.&lt;/p&gt;

&lt;p&gt;While the discussion was trundling on without me, I felt I owed everyone involved an explanation of what happened with this year's &lt;a href="http://devopsdownunder.org"&gt;unrealised conference&lt;/a&gt;, and why the conference fell flat.&lt;/p&gt;

&lt;p&gt;Let's start at the beginning.&lt;/p&gt;

&lt;p&gt;Having come back from a year of backpacking around Europe and attending the first DevOpsDays conference, I took it upon myself to try and replicate the success by organising the first DevOps Down Under conference in 2010.&lt;/p&gt;

&lt;p&gt;It was a relatively small affair held downstairs at Atlassian's Corn Exchange offices in Sydney, and I put the thing together on a shoestring budget in my spare time with some on-the-ground help from Atlassian's &lt;a href="https://twitter.com/nickmuldoon"&gt;Nicholas Muldoon&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The event was successful, with people from all across Australia and New Zealand attending. At the end of the conference, each attendee was asked to write down one thing they loved, and one thing they hated about the conference.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm5.staticflickr.com/4079/5448860355_1e98e1c647_z.jpg" alt="Stacks of love and hate" /&gt;&lt;/p&gt;

&lt;p&gt;This gave me a great starting point to build another conference on, and in early 2011 I started getting the itch to do another. At the same time, &lt;a href="https://twitter.com/evanbottcher"&gt;Evan Bottcher&lt;/a&gt; pinged me about ThoughtWorks lending a hand to organise another DevOps Down Under in Melbourne later in 2011.&lt;/p&gt;

&lt;p&gt;The most consistent feedback we got from the 2010 conference was that the coffee was "a little bit shit", so we fixed that by moving the whole conference to Melbourne.&lt;/p&gt;

&lt;p&gt;After an initial planning meeting, ThoughtWorks kindly lent &lt;a href="https://twitter.com/chrisbushelloz"&gt;Chris Bushell&lt;/a&gt; and &lt;a href="http://www.linkedin.com/pub/natalie-drucker/2a/233/911"&gt;Natalie Drucker&lt;/a&gt; to assist with organising.&lt;/p&gt;

&lt;p&gt;I was just starting a new position at work, and wasn't able to dedicate nearly as much time to organising as I had in 2010. I provided the initial vision and direction, but without Chris and Natalie's tireless efforts and persistent pestering of me to get my arse into gear, the conference would have been but a shadow of itself.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm9.staticflickr.com/8070/8206025708_56cf336d68_z.jpg" alt="Attendees at #dodu2011" /&gt;&lt;/p&gt;

&lt;p&gt;By the time DevOps Down Under 2011 wrapped up in July, I was tired and wasn't feeling fired up about putting on another conference just yet. I decided to wait and see how I felt in the new year.&lt;/p&gt;

&lt;p&gt;Around March this year I started thinking about doing another conference, but the spark wasn't there like in other years. I decided to press on regardless, motivated by my perceived expectation that people wanted another conference.&lt;/p&gt;

&lt;p&gt;The vision for DevOps Down Under 2012 was to build a quiet, intimate, and safe atmosphere that was removed from the rat race. To achieve this, the plan was to cap the number of attendees at 140, find a venue outside a major capital city, and source high quality talks.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm9.staticflickr.com/8489/8206032538_cfc53dfa14_z.jpg" alt="Venue shot for #dodu2012" /&gt;&lt;/p&gt;

&lt;p&gt;The venue &amp;amp; budget were in place, and we got a really great collection of talks submitted. I simply failed to execute on anything beyond that.&lt;/p&gt;

&lt;p&gt;The main reasons why execution failed were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I had lost the passion for organising the conference, and was motivated by the wrong reasons.&lt;/li&gt;
&lt;li&gt;I had even less time to commit.&lt;/li&gt;
&lt;li&gt;Everyone involved was similarly time poor.&lt;/li&gt;
&lt;li&gt;There was no organisational cadence.&lt;/li&gt;
&lt;li&gt;I didn't lean enough on other people to help me do the grunt work.&lt;/li&gt;
&lt;li&gt;I didn't have the time to fix any of these problems.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;With the benefit of hindsight, I simply shouldn't have tried to put it on.&lt;/p&gt;

&lt;p&gt;Seeing people putting their hands up to organise a 2013 conference takes a huge mental weight off my shoulders.&lt;/p&gt;

&lt;p&gt;Through my own actions and inactions, I have felt the responsibility of leading the conference organisation year-on-year has fallen to me. In 2012 that pressure became paralysing, and my eventual coping mechanism was to ignore the conference entirely.&lt;/p&gt;

&lt;p&gt;As for my future involvement: I am still burnt out, and it would simply be unfair to myself, the organisers, speakers, and attendees to commit to taking an active role in organising a 2013 conference.&lt;/p&gt;

&lt;p&gt;I have provided the current crop of potential organisers a collection of resources to get them started, and I am extremely confident they will manage to pull off something spectacular.&lt;/p&gt;

&lt;p&gt;Drawing on my battered experience of organising several conferences, these are the key actionable things I believe you need to make an event like DevOps Down Under happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Have at least 3 people who can each dedicate 2+ hours a week to doing the grunt work.&lt;/strong&gt; Anyone who tells you organising a conference is anything but a hard slog is either lying to you, or doesn't know what they are talking about.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Do weekly catchup meetings to keep things on track.&lt;/strong&gt; Increase the frequency of these closer to the conference date.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use a mailing list for asynchronous organisation.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nominate someone to lead &amp;amp; own the conference vision &amp;amp; organisation.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I hope the above arms you with enough information to avoid falling into the same traps I did.&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/AuxesisMusings/~4/A7NTdpy0Jyg" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2012/11/22/devops-down-under-2012-what-happened/</feedburner:origLink></entry>
 
 <entry>
   <title>Ript: quick, reliable, and painless firewalling</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/l1XZb3Q_hDo/" />
   <updated>2012-11-12T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2012/11/12/ript-quick-reliable-painless-firewalling</id>
   <content type="html">&lt;p&gt;Running your own servers? Hate managing firewall rules?&lt;/p&gt;

&lt;p&gt;For the last year at &lt;a href="http://bulletproof.net"&gt;Bulletproof Networks&lt;/a&gt; I've been working on a little tool called &lt;a href="http://github.com/bulletproofnetworks/ript"&gt;Ript&lt;/a&gt; to make writing firewall rules a joy, and applying them quick, reliable, and painless.&lt;/p&gt;

&lt;p&gt;Ript is a clean and opinionated &lt;a href="http://en.wikipedia.org/wiki/Domain-specific_language"&gt;Domain Specific Language&lt;/a&gt; for describing firewall rules, and a tool with database migrations-like functionality for applying these rules with zero downtime.&lt;/p&gt;

&lt;h3&gt;The DSL&lt;/h3&gt;

&lt;p&gt;At Ript's core is an easy to use Ruby DSL for describing both simple and complex sets of iptables firewall rules. After defining the hosts and networks you care about:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="n"&gt;partition&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;joeblogsco&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;www.joeblogsco.com&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;172.19.56.216&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;app-01&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;192.168.5.230&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;joeblogsco uat subnet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;192.168.5.0/24&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;joeblogsco stage subnet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;10.60.2.0/24&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;joeblogsco prod subnet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;10.60.3.0/24&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bad guy&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;172.19.110.247&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bad guys&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;10.0.0.0/8&amp;quot;&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;...you use Ript's helpers for accepting, dropping, &amp;amp; rejecting packets, as well as for performing DNAT and SNAT:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="n"&gt;partition&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;joeblogsco&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;www.joeblogsco.com&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;172.19.56.216&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;app-01&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;192.168.5.230&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;joeblogsco uat subnet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;192.168.5.0/24&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;joeblogsco stage subnet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;10.60.2.0/24&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;joeblogsco prod subnet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;10.60.3.0/24&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bad guy&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;172.19.110.247&amp;quot;&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bad guys&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="ss"&gt;:address&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;10.0.0.0/8&amp;quot;&lt;/span&gt;

  &lt;span class="n"&gt;rewrite&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;public website + ssh access&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;ports&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
    &lt;span class="n"&gt;dnat&lt;/span&gt;  &lt;span class="s2"&gt;&amp;quot;www.joeblogsco.com&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;app-01&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rewrite&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;private to public&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;snat&lt;/span&gt;  &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;joeblogsco uat subnet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;joeblogsco stage subnet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s2"&gt;&amp;quot;joeblogsco prod subnet&amp;quot;&lt;/span&gt;  &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;www.joeblogsco.com&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;reject&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bad guy&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;from&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bad guy&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;to&lt;/span&gt;   &lt;span class="s2"&gt;&amp;quot;www.joeblogsco.com&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;drop&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;bad guys&amp;quot;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="n"&gt;protocols&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;udp&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;from&lt;/span&gt;      &lt;span class="s2"&gt;&amp;quot;bad guys&amp;quot;&lt;/span&gt;
    &lt;span class="n"&gt;to&lt;/span&gt;        &lt;span class="s2"&gt;&amp;quot;www.joeblogsco.com&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The DSL provides many &lt;a href="https://github.com/bulletproofnetworks/ript#shortcuts"&gt;helpful shortcuts&lt;/a&gt; for DRYing up your firewall rules, and tries to do as much of the heavy lifting for you as possible.&lt;/p&gt;

&lt;p&gt;Part of Ript being opinionated is that it doesn't expose all the underlying features of iptables. This was done for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The DSL would become complex, and thus harder to use.&lt;/li&gt;
&lt;li&gt;Not all features within iptables map to Ript's DSL.&lt;/li&gt;
&lt;li&gt;Ript caters for the simple-to-moderately complex use cases that 80% of users have. If you need to use iptables features documented deep within the man pages, Ript is almost certainly not the tool for you.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;Rule application&lt;/h3&gt;

&lt;p&gt;While the DSL is pretty, we didn't write Ript because of it - we wrote it because we're working with tens of thousands of iptables rules &amp;amp; making several changes a day to those rules, and the traditional way of applying changes doesn't cut it at scale.&lt;/p&gt;

&lt;p&gt;Most tools try to apply firewall rules by flushing all the loaded rules and loading in new ones. This works fine if you only have a few hundred rules, but as soon as you start scaling into thousands of rules, the load time becomes very noticeable.&lt;/p&gt;

&lt;p&gt;The effects of this are fairly simple: the rule load time manifests itself as downtime.&lt;/p&gt;

&lt;p&gt;Because the ruleset has to be applied serially, rules at the end of the set are held up by rules still being applied at the beginning of the set. From a service provider's perspective, this means that a rule change for one customer can end up causing downtime for other completely unrelated customers. Not cool.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;iptables-save&lt;/code&gt; and &lt;code&gt;iptables-restore&lt;/code&gt; help with this, but you still end up writing + applying rules by hand - a tedious task if you're making lots of firewall changes every day.&lt;/p&gt;

&lt;p&gt;Ript's killer feature is incrementally applying rules.&lt;/p&gt;

&lt;p&gt;Ript generates firewall chains in a very specific way that allows it to apply new rules incrementally, and clean out old rules intelligently. Here's an example session:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="bash"&gt;&lt;span class="c"&gt;# Output all the generated rules by interpreting all files under /etc/firewall&lt;/span&gt;
ript rules generate /etc/firewall
&lt;span class="c"&gt;# Output a diff of rules to apply, based on what rules are currently loaded in memory&lt;/span&gt;
ript rules diff /etc/firewall
&lt;span class="c"&gt;# Apply the aforementioned diff&lt;/span&gt;
ript rules apply /etc/firewall
&lt;span class="c"&gt;# Output the currently loaded rules in iptables-restore format&lt;/span&gt;
ript rules save
&lt;span class="c"&gt;# Output a diff of rules to delete&lt;/span&gt;
ript clean diff /etc/firewall
&lt;span class="c"&gt;# Apply the aforementioned diff&lt;/span&gt;
ript clean apply /etc/firewall
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
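&lt;p&gt;To make the incremental idea concrete, here's a minimal sketch of diffing two rulesets. This is not Ript's implementation: &lt;code&gt;rule_changes&lt;/code&gt; is a hypothetical helper and the rule strings are made up for illustration. It only shows that a diff touches the rules that changed and leaves everything else loaded.&lt;/p&gt;

```ruby
require 'set'

# Hypothetical helper (not part of Ript): compare the rules currently loaded
# with the rules we want, and return only the changes. Rules are plain
# strings here purely for illustration.
def rule_changes(loaded, desired)
  loaded  = loaded.to_set
  desired = desired.to_set
  {
    add:    (desired - loaded).to_a.sort,  # new rules to apply
    delete: (loaded - desired).to_a.sort   # stale rules to clean out
  }
end

loaded  = ["-A joeblogsco -p tcp --dport 80 -j ACCEPT",
           "-A joeblogsco -p tcp --dport 22 -j ACCEPT"]
desired = ["-A joeblogsco -p tcp --dport 80 -j ACCEPT",
           "-A joeblogsco -p tcp --dport 443 -j ACCEPT"]

changes = rule_changes(loaded, desired)
changes[:add].each    { |rule| puts "+ #{rule}" }
changes[:delete].each { |rule| puts "- #{rule}" }
```

&lt;p&gt;In this toy model only the port 443 rule is added and the port 22 rule removed. The port 80 rule is never touched, which is the property you want when one customer's change shouldn't disturb another's traffic.&lt;/p&gt;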


&lt;h3&gt;Getting started&lt;/h3&gt;

&lt;p&gt;Ript has been Open Sourced under an MIT license, and is &lt;a href="http://github.com/bulletproofnetworks/ript"&gt;available on GitHub&lt;/a&gt;. To get you going, Ript ships with &lt;a href="https://github.com/bulletproofnetworks/ript#the-dsl"&gt;extensive DSL usage documentation&lt;/a&gt;, and a &lt;a href="https://github.com/bulletproofnetworks/ript/tree/master/examples"&gt;boatload of examples&lt;/a&gt; used by the &lt;a href="https://github.com/bulletproofnetworks/ript/tree/master/features"&gt;tests&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'll also be giving a &lt;a href="https://lca2013.linux.org.au/schedule/30293/view_talk?day=None"&gt;talk about Ript at linux.conf.au&lt;/a&gt; in Canberra in January 2013.&lt;/p&gt;

&lt;p&gt;Happy Ripting!&lt;/p&gt;
</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2012/11/12/ript-quick-reliable-painless-firewalling/</feedburner:origLink></entry>
 
 <entry>
   <title>Incentivising automated changes</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/BTvgAJAWR8o/" />
   <updated>2012-10-29T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2012/10/29/incentivising-automated-changes</id>
   <content type="html">&lt;p&gt;Matthias Marschall wrote a great piece last week on the &lt;a href="http://www.agileweboperations.com/devops-protocol-no-manual-changes"&gt;pitfalls of making manual changes&lt;/a&gt; to production systems. &lt;strong&gt;TL;DR: Making manual changes in the heat of the moment will bite you at the most inopportune times&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The article finishes with this suggestion:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;You should have your configuration management tool (like Puppet or Chef) setup so that you can try out possible solutions without having to go in and do it manually.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;In my experience, this is the key to solving the problem.&lt;/p&gt;

&lt;p&gt;Rather than coercing people to follow a "no manual changes" policy, you make the incentives for making changes with automation better than for making changes manually.&lt;/p&gt;

&lt;p&gt;Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Make it simple.&lt;/em&gt; Reduce the number of steps to make the change with automation. It should be quicker to find the place in your Chef or Puppet code and deploy than to log into the box, edit a file, and restart a service.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Make it fast.&lt;/em&gt; The time from thinking about the change to the change being applied should be shorter with automation than by doing it manually.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Make it safe.&lt;/em&gt; Provide a rollback mechanism for changes. A safety harness can be as simple as a thin process around "git revert" + deploy.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;It's a perfect example of how tools should complement culture.&lt;/p&gt;
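&lt;p&gt;As a rough illustration of the "make it safe" point, a rollback harness can be a few lines of Ruby. This is a sketch under assumptions: &lt;code&gt;rollback_commands&lt;/code&gt;, &lt;code&gt;rollback!&lt;/code&gt; and &lt;code&gt;./deploy.sh&lt;/code&gt; are placeholder names, and your own deploy step will differ.&lt;/p&gt;

```ruby
# Hypothetical rollback wrapper: revert a commit, then redeploy.
# "./deploy.sh" is a placeholder for whatever your deploy step really is.
def rollback_commands(sha, deploy = "./deploy.sh")
  [
    ["git", "revert", "--no-edit", sha],
    [deploy]
  ]
end

def rollback!(sha, dry_run = true)
  rollback_commands(sha).each do |command|
    puts command.join(" ")
    next if dry_run
    # Stop at the first failure so we never deploy a half-reverted tree.
    system(*command) or abort("failed: #{command.join(' ')}")
  end
end

# Dry run by default: print what would happen without touching anything.
rollback!("HEAD")
```

&lt;p&gt;Defaulting to a dry run keeps the harness safe to try in the heat of the moment. Pass &lt;code&gt;false&lt;/code&gt; as the second argument once you trust it.&lt;/p&gt;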
</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2012/10/29/incentivising-automated-changes/</feedburner:origLink></entry>
 
 <entry>
   <title>Instrumenting your monitoring checks with New Relic</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/SmwVcEMY2Io/" />
   <updated>2012-01-13T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2012/01/13/instrumenting-your-monitoring-checks-with-new-relic</id>
   <content type="html">&lt;p&gt;&lt;em&gt;This post is part 3 of 3 in a series on monitoring scalability.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In parts 1 and 2 of this series I talked about &lt;a href="http://holmwood.id.au/~lindsay/2012/01/09/monitoring-sucks-latency-sucks-more/"&gt;check latency&lt;/a&gt; and how you can mitigate its effects by splitting data collection + storage out from alerting, while looking at monitoring systems &lt;a href="http://holmwood.id.au/~lindsay/2012/01/11/monitoring-system-equal-web-app-when-diagnosing-performance-bottlenecks/"&gt;through the prism&lt;/a&gt; of an MVC web application.&lt;/p&gt;

&lt;p&gt;This final post in the series provides a concrete example of how to instrument your monitoring checks so you can identify which exact parts of your checks are inducing latency in your monitoring system.&lt;/p&gt;

&lt;p&gt;When debugging performance bottlenecks, I tend to use a simple but effective workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;observe the system&lt;/li&gt;
&lt;li&gt;analyse the results&lt;/li&gt;
&lt;li&gt;optimise the bottleneck that is having the most impact&lt;/li&gt;
&lt;li&gt;rinse and repeat until the system is performing within the expected performance parameters&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;What if we continue to look at monitoring checks as micro MVC web applications? What tools exist to aid this optimisation workflow, and how can we hook instrumentation into our checks?&lt;/p&gt;

&lt;p&gt;The cr&amp;egrave;me de la cr&amp;egrave;me of web app performance monitoring + optimisation tools is &lt;a href="http://newrelic.com"&gt;New Relic&lt;/a&gt;, boasting an incredibly rich feature set that lets you drill down deep into your application while also providing a high level view of app-wide performance.&lt;/p&gt;

&lt;p&gt;But is it possible to hook New Relic into applications that aren't web apps? Let's give it a go.&lt;/p&gt;

&lt;p&gt;Here's an example monitoring check:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="c1"&gt;#!/usr/bin/env ruby&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;# Usage: check.rb &amp;lt;time&amp;gt;&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Check&lt;/span&gt;
  &lt;span class="kp"&gt;attr_reader&lt;/span&gt; &lt;span class="ss"&gt;:opts&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="vi"&gt;@opts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:time&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
    &lt;span class="nb"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="no"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;StandardError&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="nb"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
    &lt;span class="nb"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="no"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;ArgumentError&lt;/span&gt;&lt;span class="o"&gt;][&lt;/span&gt;&lt;span class="nb"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

    &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;OK: we made it!&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="vi"&gt;@opts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="no"&gt;Check&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:time&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;ARGV&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;].&lt;/span&gt;&lt;span class="n"&gt;to_i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;As you can see, it's &lt;a href="http://www.urbandictionary.com/define.php?term=flat%20out%20like%20a%20lizard%20drinking"&gt;flat out like a lizard drinking&lt;/a&gt; inducing latency by sleeping and spicing things up by randomly throwing exceptions. All things considered, it's actually a pretty good example of a monitoring check that aims to misbehave.&lt;/p&gt;

&lt;p&gt;Let's start instrumenting!&lt;/p&gt;

&lt;p&gt;First up we need to load some libraries:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="c1"&gt;#!/usr/bin/env ruby&lt;/span&gt;

&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rubygems&amp;#39;&lt;/span&gt;
&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;newrelic_rpm&amp;#39;&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Check&lt;/span&gt;
  &lt;span class="kp"&gt;include&lt;/span&gt; &lt;span class="no"&gt;NewRelic&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Instrumentation&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;ControllerInstrumentation&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Reading through the New Relic API documentation...&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="c1"&gt;# When the app environment loads, so does the Agent. However, the&lt;/span&gt;
&lt;span class="c1"&gt;# Agent will only connect to the service if a web front-end is found. If&lt;/span&gt;
&lt;span class="c1"&gt;# you want to selectively monitor ruby processes that don&amp;#39;t use&lt;/span&gt;
&lt;span class="c1"&gt;# web plugins, then call this method in your code and the Agent&lt;/span&gt;
&lt;span class="c1"&gt;# will fire up and start reporting to the service.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;...it looks like we need to manually start up the agent:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Check&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="no"&gt;NewRelic&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manual_start&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Now we need to tell the New Relic agent what to instrument. The API provides methods to do this at the transaction and method level:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Check&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;

  &lt;span class="n"&gt;add_transaction_tracer&lt;/span&gt; &lt;span class="ss"&gt;:run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="ss"&gt;:name&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;run&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:class_name&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#{self.class}&amp;#39;&lt;/span&gt;
  &lt;span class="n"&gt;add_method_tracer&lt;/span&gt;      &lt;span class="ss"&gt;:model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;Nagios/#{self.class.name}/model&amp;#39;&lt;/span&gt;
  &lt;span class="n"&gt;add_method_tracer&lt;/span&gt;      &lt;span class="ss"&gt;:view&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="s1"&gt;&amp;#39;Nagios/#{self.class.name}/view&amp;#39;&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;In New Relic parlance, a transaction is an end-to-end process that is composed of many smaller units of work, and a method is an individual unit of work. In this monitoring check scenario, a transaction is an invocation of the check.&lt;/p&gt;

&lt;p&gt;When using the New Relic agent with Rails, by default it captures the query parameters passed to the controller action. This helps massively when debugging why a certain transaction takes longer to complete on particular inputs.&lt;/p&gt;

&lt;p&gt;Wouldn't it be cool if we could treat the command line arguments to the monitoring check as query parameters to the controller action? That way we could identify which services are running slowly and holding up the check.&lt;/p&gt;

&lt;p&gt;Turns out this is just another option to &lt;code&gt;add_transaction_tracer&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="n"&gt;add_transaction_tracer&lt;/span&gt; &lt;span class="ss"&gt;:run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:name&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;run&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:class_name&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#{self.class}&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:params&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;self.opts&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Provided you store all your options in an instance variable with an &lt;code&gt;attr_reader&lt;/code&gt;, you can capture whatever data is passed to the check on execution.&lt;/p&gt;

&lt;p&gt;One piece of data the New Relic agent captures is an Apdex score for each request. An Apdex score is a measurement of user satisfaction when interacting with an application or service.&lt;/p&gt;
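&lt;p&gt;For the curious, the standard Apdex calculation is simple enough to sketch. Samples at or under a threshold T count as satisfied, samples at or under 4T count as tolerating at half weight, and anything slower counts as frustrated. The &lt;code&gt;apdex&lt;/code&gt; method, the sample durations and the 0.5 second threshold below are assumptions for illustration, not something the New Relic agent exposes.&lt;/p&gt;

```ruby
# Apdex = (satisfied + tolerating / 2) / total samples, where a sample is
# "satisfied" at or under the threshold T and "tolerating" at or under 4T.
# Illustrative only: this is not how the New Relic agent is invoked.
def apdex(durations, threshold)
  satisfied  = durations.count { |d| d.between?(0, threshold) }
  tolerating = durations.count { |d| d.between?(0, 4 * threshold) } - satisfied
  (satisfied + tolerating / 2.0) / durations.size
end

# Made-up check durations in seconds, scored against an assumed T of 0.5s:
# two satisfied, two tolerating, one frustrated.
puts apdex([0.2, 0.4, 1.1, 1.9, 3.0], 0.5)
```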

&lt;p&gt;In this particular scenario, the "user" is actually a monitoring system, so the score may not be that meaningful. Let's disable it for now:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Check&lt;/span&gt;
  &lt;span class="c1"&gt;# ...&lt;/span&gt;

  &lt;span class="n"&gt;newrelic_ignore_apdex&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;So far everything has been very smooth - we've taken an existing check and added some instrumentation points with New Relic - but we're about to hit a complication.&lt;/p&gt;

&lt;p&gt;Internally the New Relic agent spawns a separate thread from which it sends all this instrumented data to the New Relic service. Establishing a connection to the New Relic service actually takes a while (15+ seconds in the worst cases), which doesn't quite fit the paradigm we're working in where monitoring checks are returning sub-second results.&lt;/p&gt;

&lt;p&gt;Essentially this means that we're collecting all this interesting data with the New Relic agent but it's never actually sent to the New Relic service.&lt;/p&gt;

&lt;p&gt;In the PHP world this is a very real problem as PHP processes will exit at the end of each request. In the PHP edition of New Relic there's quite a cute workaround for exactly this problem - each PHP process sends data to a daemon running in the background that buffers it and sends it to New Relic at a regular interval.&lt;/p&gt;

&lt;p&gt;Let's emulate this functionality in Ruby:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="nb"&gt;at_exit&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="no"&gt;NewRelic&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;save_data&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This will serialise the captured data to &lt;code&gt;log/newrelic_agent_store.db&lt;/code&gt; as a marshalled Ruby object. The last step is to send this data to New Relic at a regular interval:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="c1"&gt;#!/usr/bin/env ruby&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;# Usage: collector.rb&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;

&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rubygems&amp;#39;&lt;/span&gt;
&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;newrelic_rpm&amp;#39;&lt;/span&gt;

&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;NewRelic&lt;/span&gt;
  &lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="nn"&gt;Agent&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nc"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connected?&lt;/span&gt;
      &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connected?&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="vg"&gt;$stdout&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sync&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kp"&gt;true&lt;/span&gt;
&lt;span class="no"&gt;NewRelic&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manual_start&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Waiting to connect to the NewRelic service&amp;quot;&lt;/span&gt;
&lt;span class="k"&gt;until&lt;/span&gt; &lt;span class="no"&gt;NewRelic&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connected?&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="nb"&gt;print&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.&amp;#39;&lt;/span&gt;
  &lt;span class="nb"&gt;sleep&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="nb"&gt;puts&lt;/span&gt;

&lt;span class="no"&gt;NewRelic&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load_data&lt;/span&gt;
&lt;span class="no"&gt;NewRelic&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shutdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:force_send&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="kp"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This waits for the New Relic agent to establish a connection to the New Relic service, loads the data serialised by the checks, and sends it to New Relic.&lt;/p&gt;
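&lt;p&gt;If the save-then-forward pattern feels opaque, here's a stripped-down sketch of it using nothing but Ruby's &lt;code&gt;Marshal&lt;/code&gt;. It stands in for the agent's &lt;code&gt;save_data&lt;/code&gt; and &lt;code&gt;load_data&lt;/code&gt; pair rather than reproducing them: the spool file name, the &lt;code&gt;save_sample&lt;/code&gt; and &lt;code&gt;drain_samples&lt;/code&gt; helpers, and the sample format are all invented for this example, and it ignores file locking between concurrent checks.&lt;/p&gt;

```ruby
require 'tmpdir'

# Short-lived checks append samples to a spool file, and a long-lived
# collector drains it. File name and sample format are invented here.
SPOOL = File.join(Dir.tmpdir, "check_spool_example.db")

# Called by each check just before it exits.
def save_sample(sample, path = SPOOL)
  File.open(path, "ab") { |f| Marshal.dump(sample, f) }
end

# Called by the collector: read every buffered sample, then clear the spool.
def drain_samples(path = SPOOL)
  return [] unless File.exist?(path)
  samples = []
  File.open(path, "rb") do |f|
    samples.push(Marshal.load(f)) until f.eof?
  end
  File.delete(path)
  samples
end

File.delete(SPOOL) if File.exist?(SPOOL)
save_sample({ check: "model", seconds: 1.02 })
save_sample({ check: "view",  seconds: 0.31 })
drain_samples.each { |s| puts "#{s[:check]} took #{s[:seconds]}s" }
```

&lt;p&gt;The check only pays for a local file write, and the slow network connection is the collector's problem.&lt;/p&gt;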

&lt;p&gt;Just for testing, we can run our pseudo collector like this:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="bash"&gt;&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Sending&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ruby collector.rb &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;Sleeping 30&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; sleep 30 ; &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;And invoke the monitoring check like this:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="bash"&gt;&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; ; &lt;span class="k"&gt;do &lt;/span&gt;&lt;span class="nv"&gt;RACK_ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;development bundle &lt;span class="nb"&gt;exec &lt;/span&gt;ruby check.rb 5 ; &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Now we've got all this set up, we can log into New Relic to view some pretty visualisations of our monitoring check latency:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm8.staticflickr.com/7030/6682260777_ae93ba89f3_o.jpg" alt="New Relic dashboard screenshot" /&gt;&lt;/p&gt;

&lt;p&gt;New Relic automatically identifies which transactions are the slowest, and lets you deep dive to identify where the slowness is:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm8.staticflickr.com/7005/6682278901_ca151d8508_o.jpg" alt="New Relic transaction deep dive screenshot" /&gt;&lt;/p&gt;

&lt;p&gt;If you haven't got a &lt;a href="http://en.wikipedia.org/wiki/Brass_razoo"&gt;brass razoo&lt;/a&gt;, there are plenty of Open Source alternatives to New Relic, but you'll have to do a bit more grunt work to get them going.&lt;/p&gt;

&lt;p&gt;This post concludes this series on monitoring scalability! The &lt;strong&gt;TL;DR&lt;/strong&gt; series summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check latency is the monitoring system killer.&lt;/li&gt;
&lt;li&gt;Even in simple environments check latency slows down your monitoring system and obfuscates incidents.&lt;/li&gt;
&lt;li&gt;To eliminate latency, separate data collection from alerting.&lt;/li&gt;
&lt;li&gt;Make your monitoring checks as non-blocking as possible.&lt;/li&gt;
&lt;li&gt;Whenever debugging monitoring performance problems, think of your monitoring system as an MVC web app.&lt;/li&gt;
&lt;li&gt;Instrument your monitoring checks to identify sources of latency.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;You can find the above code examples &lt;a href="https://gist.github.com/1598357"&gt;on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you've enjoyed this series of posts, you can find more of my keen insights, witty banter, and Australian colloquialisms &lt;a href="http://twitter.com/auxesis"&gt;on Twitter&lt;/a&gt;, or &lt;a href="http://holmwood.id.au/~lindsay/feed"&gt;subscribe to my blog&lt;/a&gt;.&lt;/p&gt;
</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2012/01/13/instrumenting-your-monitoring-checks-with-new-relic/</feedburner:origLink></entry>
 
 <entry>
   <title>monitoring system == web app (when diagnosing performance bottlenecks)</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/TE1cuYBDG4A/" />
   <updated>2012-01-11T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2012/01/11/monitoring-system-equal-web-app-when-diagnosing-performance-bottlenecks</id>
   <content type="html">&lt;p&gt;&lt;em&gt;This post is part 2 of 3 in a series on monitoring scalability.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="http://holmwood.id.au/~lindsay/2012/01/09/monitoring-sucks-latency-sucks-more/"&gt;part 1&lt;/a&gt; of this series I talked about check latency, and how it can batter you operationally if it gets out of hand.&lt;/p&gt;

&lt;p&gt;In this post I'm going to propose an alternative way of looking at monitoring systems that can hopefully shed light on some typical performance bottlenecks.&lt;/p&gt;

&lt;p&gt;Architecturally, monitoring systems and web applications share many of the same design characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A check is a request to an action on a controller&lt;/li&gt;
&lt;li&gt;Actions fetch data from a model, and expose a result through a view&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;img src="http://farm8.staticflickr.com/7009/6672989429_1eece65303_o.jpg" alt="Overview diagram of monitoring system/web application request lifecycle" /&gt;&lt;/p&gt;

&lt;p&gt;If you look at monitoring systems through this prism, many monitoring performance and scalability problems become simpler to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Poorly optimised actions can take a variable amount of time to return a response&lt;/li&gt;
&lt;li&gt;You get the best performance out of your monitoring system by optimising actions that are slow, and working towards a consistent throughput across all your monitoring checks&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;img src="http://farm8.staticflickr.com/7016/6672989701_30356eaf6f_o.jpg" alt="Diagram explaining how latency at one end of the pipeline affects the other" /&gt;&lt;/p&gt;

&lt;p&gt;Bearing this in mind, what &lt;a href="http://www.amazon.com/Scalable-Internet-Architectures-Theo-Schlossnagle/dp/067232699X/ref=sr_1_7?ie=UTF8&amp;amp;qid=1326103530&amp;amp;sr=8-7"&gt;methodologies do we use&lt;/a&gt; to remove performance bottlenecks from a web application? Can we apply those same techniques to monitoring systems?&lt;/p&gt;

&lt;p&gt;One very common technique is to precompile data to eliminate computationally expensive operations when serving up a result. The precompilation should almost always be a separate process from the main process serving requests.&lt;/p&gt;

&lt;p&gt;This has multiple benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You shift the computationally expensive and latency inducing work in a monitoring check to a separate process. This makes achieving a low and consistent monitoring check response time vastly easier.&lt;/li&gt;
&lt;li&gt;You can throw specialised hardware at particular parts of the monitoring pipeline. For example, use a SAN with a huge memory cache or SSDs exclusively in your data storage layer to speed up reads + writes, and beefy multicore machines in your alerting layer to increase your check parallelism.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;img src="http://farm8.staticflickr.com/7027/6672989965_43fe9411f3_o.jpg" alt="Diagram explaining where to focus optimisation efforts" /&gt;&lt;/p&gt;

&lt;p&gt;Separating data collection + storage from thresholding + notifications &lt;em&gt;is the most crucial part&lt;/em&gt; of ensuring consistent check throughput in your monitoring system.&lt;/p&gt;

&lt;p&gt;In September of 2011 &lt;a href="http://twitter.com/LordCope"&gt;Stephen Nelson-Smith&lt;/a&gt; covered why this separation is so important in his article &lt;a href="http://agilesysadmin.net/pillar-one"&gt;We alert on what we draw&lt;/a&gt;. The article can be boiled down to "Your graphs and your alerts should be created from the same data source. This simplifies incident response and analysis."&lt;/p&gt;

&lt;p&gt;The other advantage that Stephen didn't cover was the massive throughput boost this gives your monitoring system. It's tempting to say that the throughput boost is a bigger advantage than the operational gains, however the two are inextricably linked. You have massive operational issues if your monitoring system is "running late" on executing monitoring checks, but you've got &lt;a href="http://en.wikipedia.org/wiki/William_Buckley_(convict)"&gt;Buckley's chance&lt;/a&gt; of effectively responding to incidents if you have no visibility of those incidents.&lt;/p&gt;

&lt;p&gt;My preference is to collect + store the data with collectd + OpenTSDB, however the DevOps community as a whole seems to be very keen on Ganglia + Graphite. YMMV, do your research and use what's best for you.&lt;/p&gt;

&lt;p&gt;The most time consuming part of adopting this separation strategy is reworking your monitoring checks to fetch from these data stores. I'd highly recommend writing a small DSL for doing common things like fetching data and comparing results.&lt;/p&gt;
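&lt;p&gt;As a rough illustration of what such a DSL could look like, here is a minimal sketch. Everything in it is hypothetical: the &lt;code&gt;Check&lt;/code&gt; class, its &lt;code&gt;fetch&lt;/code&gt;, &lt;code&gt;warning&lt;/code&gt; and &lt;code&gt;critical&lt;/code&gt; methods, and the in-memory Hash standing in for a real data store client.&lt;/p&gt;

```ruby
# A hypothetical mini-DSL for threshold checks. The Hash passed in as
# `store` stands in for a real time series database client.
class Check
  SEVERITY = [:ok, :warning, :critical].freeze

  attr_reader :state

  def initialize(store)
    @store = store
    @state = :ok
    @value = nil
  end

  # Build a check, hand it to the caller's block, and return the final state.
  def self.run(store)
    check = new(store)
    yield check
    check.state
  end

  # Look up the latest value for a metric; a missing metric is UNKNOWN.
  def fetch(metric)
    @value = @store[metric]
    @state = :unknown if @value.nil?
  end

  def warning(threshold)
    escalate(:warning, threshold)
  end

  def critical(threshold)
    escalate(:critical, threshold)
  end

  private

  # Raise the state when the fetched value meets or exceeds the threshold.
  # The state is only ever escalated, never downgraded.
  def escalate(level, threshold)
    return unless @value
    return unless @value >= threshold
    @state = level if SEVERITY.index(level) > SEVERITY.index(@state)
  end
end

store = { "web01.load" => 7.2 }

state = Check.run(store) do |check|
  check.fetch "web01.load"
  check.warning 5
  check.critical 10
end

puts state
```

&lt;p&gt;The point of a wrapper like this is that every check fetches and compares data the same way, so behaviour such as treating missing data as unknown lives in one place rather than in each check.&lt;/p&gt;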

&lt;p&gt;No approach is perfect, and separating your data from your alerting introduces a different set of problems.&lt;/p&gt;

&lt;p&gt;Even by separating the collection from the alerting, your monitoring checks are still essentially going to block when retrieving data from your storage layer. Keeping in mind you will never be able to truly eliminate blocking checks, it is &lt;em&gt;imperative&lt;/em&gt; you ensure these new checks block as little as possible, otherwise you'll be subjecting yourself to the same problems.&lt;/p&gt;

&lt;p&gt;Write your checks with the expectation that your data store &lt;em&gt;will become unreachable.&lt;/em&gt; The biggest drawback to separation is that when your data store becomes unreachable, all of your checks will fail &lt;em&gt;simultaneously.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm8.staticflickr.com/7025/6672990219_2e4c6b7c38_o.jpg" alt="Diagram explaining where things will break" /&gt;&lt;/p&gt;

&lt;p&gt;Operationally this can be a complete nightmare. I have seen many a pager and mobile phone melt under a deluge of notifications saying that data for a check could not be read.&lt;/p&gt;

&lt;p&gt;There are two workarounds for this problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up a parent check for &lt;em&gt;all&lt;/em&gt; your monitoring checks that simply reads a value out of the data store, and goes critical if the data store can't be accessed. If your monitoring system does parenting properly and you have a good check throughput, this should minimise the explosion of alerts.&lt;/li&gt;
&lt;li&gt;Build a manual or automatic notification kill switch into your monitoring system so if the shit does hit the fan and your storage layer disappears, you don't suffer from information overload and &lt;a href="http://www.popularmechanics.com/technology/aviation/crashes/what-really-happened-aboard-air-france-447-6611877"&gt;do something fatally stupid&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
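&lt;p&gt;A parent check along these lines can be very small. The sketch below is an assumption about how you might write one, not code from any particular monitoring system: &lt;code&gt;parent_check&lt;/code&gt; is a hypothetical name, the block stands in for whatever client call reads from your data store, and the return codes follow the Nagios plugin convention (0 for OK, 2 for CRITICAL). It uses Ruby's standard &lt;code&gt;Timeout&lt;/code&gt; module so the check only blocks for a bounded amount of time.&lt;/p&gt;

```ruby
require 'timeout'

# Nagios plugin return codes.
OK       = 0
CRITICAL = 2

# Hypothetical parent check: try to read one value from the data store,
# and go critical if the read fails or takes longer than `limit` seconds.
# The block stands in for a real data store client call.
def parent_check(limit)
  value = Timeout.timeout(limit) { yield }
  return [CRITICAL, "CRITICAL: data store returned no data"] if value.nil?
  [OK, "OK: data store readable (read #{value})"]
rescue Timeout::Error
  [CRITICAL, "CRITICAL: data store did not answer within #{limit}s"]
rescue StandardError => e
  [CRITICAL, "CRITICAL: data store unreachable (#{e.class})"]
end

# Simulate a healthy store and a store that has hung.
healthy = parent_check(1) { 42 }
hung    = parent_check(0.1) { sleep 1 }

puts healthy.last
puts hung.last
```

&lt;p&gt;Keeping the timeout short matters here: a parent check that itself blocks for a long time reintroduces the check latency it is meant to guard against.&lt;/p&gt;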


&lt;p&gt;So how do you ensure your monitoring checks aren't suffering from check latency?&lt;/p&gt;

&lt;p&gt;In the &lt;a href="http://holmwood.id.au/~lindsay/2012/01/13/instrumenting-your-monitoring-checks-with-new-relic/"&gt;next post&lt;/a&gt; in this series, we'll look at instrumenting your monitoring checks themselves to identify which parts of the checks have bottlenecks.&lt;/p&gt;
</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2012/01/11/monitoring-system-equal-web-app-when-diagnosing-performance-bottlenecks/</feedburner:origLink></entry>
 
 <entry>
   <title>Monitoring Sucks. Latency Sucks More.</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/FrmOoLu_mr4/" />
   <updated>2012-01-09T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2012/01/09/monitoring-sucks-latency-sucks-more</id>
   <content type="html">&lt;p&gt;&lt;em&gt;This post is part 1 of 3 in a series on monitoring scalability.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://lusislog.blogspot.com/2011/06/why-monitoring-sucks.html"&gt;Monitoring Sucks&lt;/a&gt; conversation has been an awesome step in the &lt;a href="https://github.com/monitoringsucks"&gt;right direction&lt;/a&gt; for defining a common language for describing monitoring concepts and documenting the available tools.&lt;/p&gt;

&lt;p&gt;The reasons monitoring sucks are many and varied - poor configuration, poor visualisation, poor scalability, poor data retention - there is a lot of well-founded hate for the available tools (some of which I have authored!)&lt;/p&gt;

&lt;p&gt;I want to take a closer look into a problem I grapple with on a daily basis as part of my job: monitoring scalability.&lt;/p&gt;

&lt;p&gt;What do I mean by "monitoring scalability"?&lt;/p&gt;

&lt;p&gt;For a monitoring system to be considered scalable, I would expect it to execute large volumes of monitoring checks under a variety of conditions (good + bad) with a consistent throughput.&lt;/p&gt;

&lt;p&gt;Why is monitoring scalability a problem? Are there deeper, subtler problems that underlie monitoring system architectures in general?&lt;/p&gt;

&lt;p&gt;Nagios handles 6000+ checks like a champ. I say this with a completely straight face. At &lt;a href="http://bulletproof.net"&gt;Bulletproof&lt;/a&gt;, we have several large instances of Nagios that have been running for years with thousands of checks.&lt;/p&gt;

&lt;p&gt;There is one caveat, and it is pretty massive - if your monitoring checks take a variable amount of time to return a result (they have high check latency), you will get reduced throughput, and thus your incident response times become unreliable. This leads to a lack of trust in the monitoring system which can kill you operationally if you don't nip it in the bud.&lt;/p&gt;

&lt;p&gt;Let's work through some of the scalability problems by looking at a hypothetical and simplified monitoring system:&lt;/p&gt;

&lt;p&gt;Imagine you have a very small monitoring system with 150 checks running. The type of check is irrelevant (in Nagios parlance they could be "service" or "host" checks), however each check is scheduled to be executed every 300 seconds (for the sake of argument, let's just ignore that a 300 second interval is &lt;em&gt;way&lt;/em&gt; too long).&lt;/p&gt;

&lt;p&gt;To simplify this hypothetical, let's posit that all the checks are running serially in a single thread, and each check takes 1 second to execute and return a result.&lt;/p&gt;

&lt;p&gt;At this point, you're golden. All checks are executing in 150 seconds, well within the 300 second window.&lt;/p&gt;

&lt;p&gt;Now double the number of checks to 300.&lt;/p&gt;

&lt;p&gt;That's one check executed every second. All the checks execute within the execution window, but things are getting tight, and you don't have any spare capacity to add more checks.&lt;/p&gt;

&lt;p&gt;Worst of all: &lt;em&gt;what happens when the check response time goes up to 2 seconds?&lt;/em&gt; Now you can only execute 50% of your checks within the 300 second window, and your monitoring is 300 seconds "behind".&lt;/p&gt;
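&lt;p&gt;The arithmetic in this hypothetical is easy to sketch in Ruby. This is a minimal illustration rather than anything a real monitoring system runs: the &lt;code&gt;cycle_time&lt;/code&gt; and &lt;code&gt;fraction_on_time&lt;/code&gt; helpers are made up for this example, and they assume checks run serially in a single thread.&lt;/p&gt;

```ruby
# Hypothetical helpers for the serial, single-threaded scheduler described above.

# Seconds needed to run every check once, back to back.
def cycle_time(checks, seconds_per_check)
  checks * seconds_per_check
end

# Fraction of checks that complete within the scheduling interval (capped at 1.0).
def fraction_on_time(checks, seconds_per_check, interval)
  [interval.to_f / cycle_time(checks, seconds_per_check), 1.0].min
end

interval = 300

[[150, 1], [300, 1], [300, 2]].each do |checks, seconds|
  total   = cycle_time(checks, seconds)
  on_time = (fraction_on_time(checks, seconds, interval) * 100).round
  behind  = [total - interval, 0].max
  puts "#{checks} checks at #{seconds}s each: #{total}s per cycle, #{on_time}% on time, #{behind}s behind"
end
```

&lt;p&gt;The last case is the painful one: 300 checks at 2 seconds each need 600 seconds per cycle, so only half of them fit inside the 300 second window.&lt;/p&gt;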

&lt;p&gt;Now you're suffering from &lt;em&gt;check latency&lt;/em&gt;  - a world of pain filled with plenty of insidious edge cases to cut yourself on.&lt;/p&gt;

&lt;p&gt;My favourite edge case is when a service failure occurs just after a check has executed and returned an OK result. In the above hypothetical, you would be unaware of the failure for 599 seconds. In a monitoring system suffering heavily from check latency, that period of time could be much much longer. Furthermore, the problem is amplified when you're using soft/hard states to eliminate false-positives.&lt;/p&gt;

&lt;p&gt;The above hypothetical is a tad contrived as pretty much all monitoring systems execute checks in parallel, but it illustrates the scalability challenges even in a simple scenario.&lt;/p&gt;

&lt;p&gt;Executing checks in parallel certainly helps stave off this type of bottleneck, but as you increase the number of checks and the parallelism of your monitoring system, you start running into operating system limitations such as context switching, memory exhaustion (if you use a language that gobbles up memory), or simply running out of CPU time to execute all the checks.&lt;/p&gt;

&lt;p&gt;The other enormous gotcha is that when catastrophic failures happen, it's very common to have monitoring checks that simply time out because various network resources between your monitoring server and the machine you're checking are down or misbehaving.&lt;/p&gt;

&lt;p&gt;The last thing you want in an emergency situation is delayed alerts that may hide the root cause or feed you bad information.&lt;/p&gt;

&lt;p&gt;So how do you mitigate check latency problems to improve your monitoring scalability?&lt;/p&gt;

&lt;p&gt;In the &lt;a href="http://holmwood.id.au/~lindsay/2012/01/11/monitoring-system-equal-web-app-when-diagnosing-performance-bottlenecks/"&gt;next post&lt;/a&gt; in this series, we'll look at monitoring systems as a type of complex web application, and investigate some performance optimisation techniques you can apply.&lt;/p&gt;
</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2012/01/09/monitoring-sucks-latency-sucks-more/</feedburner:origLink></entry>
 
 <entry>
   <title>Treetop PEG for Puppet resources</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/wXU0G4UW8f0/" />
   <updated>2011-10-13T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2011/10/13/treetop</id>
   <content type="html">&lt;p&gt;Earlier this year at Puppet Camp EU, &lt;a href="http://twitter.com/sonofhans"&gt;Randall Hansen&lt;/a&gt; ran an open space session on improving the Puppet user experience.&lt;/p&gt;

&lt;p&gt;Lots of sharp edges were identified, but one issue that I raised was the annoying need for trailing commas to break up parameters in resource declarations.&lt;/p&gt;

&lt;p&gt;I chatted about this briefly with &lt;a href="http://twitter.com/puppetmasterd"&gt;Luke&lt;/a&gt; and for a laugh I decided to write a &lt;a href="http://treetop.rubyforge.org/"&gt;Treetop&lt;/a&gt; Parsing Expression Grammar (PEG) for Puppet resources that supported newlines as the parameter delimiter:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="c1"&gt;# puppet.treetop&lt;/span&gt;
&lt;span class="n"&gt;grammar&lt;/span&gt; &lt;span class="no"&gt;Puppet&lt;/span&gt;
  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;
    &lt;span class="n"&gt;whitespace&lt;/span&gt;
    &lt;span class="n"&gt;type&lt;/span&gt;
    &lt;span class="n"&gt;whitespace&lt;/span&gt;
    &lt;span class="nb"&gt;open&lt;/span&gt;
    &lt;span class="n"&gt;whitespace&lt;/span&gt;
    &lt;span class="nb"&gt;name&lt;/span&gt;
    &lt;span class="n"&gt;whitespace&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;
    &lt;span class="n"&gt;whitespace&lt;/span&gt;
    &lt;span class="n"&gt;close&lt;/span&gt;
    &lt;span class="n"&gt;whitespace&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;resource_type&lt;/span&gt;
        &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_value&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;

      &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;resource_name&lt;/span&gt;
        &lt;span class="nb"&gt;name&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_value&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;
    &lt;span class="n"&gt;word&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;{&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;close&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;}&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="nb"&gt;name&lt;/span&gt;
    &lt;span class="n"&gt;quotes&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="n"&gt;quotes&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;:&amp;quot;&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;
        &lt;span class="n"&gt;word&lt;/span&gt;
      &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;
    &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;zA&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="o"&gt;]+&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;quotes&lt;/span&gt;
   &lt;span class="s2"&gt;&amp;quot;&amp;#39;&amp;quot;&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;quot;&amp;#39;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;
    &lt;span class="n"&gt;newline&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;whitespace&lt;/span&gt; &lt;span class="n"&gt;parameter&lt;/span&gt; &lt;span class="n"&gt;comma_or_newline&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;parameter&lt;/span&gt;
    &lt;span class="n"&gt;whitespace&lt;/span&gt;
    &lt;span class="n"&gt;word&lt;/span&gt;
    &lt;span class="n"&gt;whitespace&lt;/span&gt;
    &lt;span class="n"&gt;arrow&lt;/span&gt;
    &lt;span class="n"&gt;whitespace&lt;/span&gt;
    &lt;span class="n"&gt;word&lt;/span&gt;
    &lt;span class="n"&gt;whitespace&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;arrow&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;=&amp;gt;&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;comma_or_newline&lt;/span&gt;
    &lt;span class="n"&gt;comma&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;newline&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;comma&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;,&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;newline&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="n"&gt;whitespace&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;It's throwaway code, but as far as I'm aware it's relatively idiomatic Treetop.&lt;/p&gt;

&lt;p&gt;It came in handy earlier this week when explaining PEGs to a &lt;a href="http://twitter.com/jessereynolds"&gt;new recruit&lt;/a&gt; into the R&amp;amp;D team at &lt;a href="http://bulletproof.net"&gt;work&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Said recruit suggested that I publish it, as there aren't too many examples of Treetop PEGs floating around.&lt;/p&gt;

&lt;p&gt;To run the PEG over an example snippet:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="ruby"&gt;&lt;span class="c1"&gt;#!/usr/bin/env ruby&lt;/span&gt;

&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;rubygems&amp;#39;&lt;/span&gt;
&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;bundler/setup&amp;#39;&lt;/span&gt;
&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;polyglot&amp;#39;&lt;/span&gt;
&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;treetop&amp;#39;&lt;/span&gt;

&lt;span class="no"&gt;Treetop&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;puppet&amp;quot;&lt;/span&gt;

&lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt;&lt;span class="no"&gt;SNIPPET&lt;/span&gt;
&lt;span class="sh"&gt;  package { &amp;quot;foobar&amp;quot;:&lt;/span&gt;
&lt;span class="sh"&gt;    ensure =&amp;gt; present, another =&amp;gt; bar, spoons =&amp;gt; doom&lt;/span&gt;
&lt;span class="sh"&gt;    foo    =&amp;gt; bar&lt;/span&gt;
&lt;span class="sh"&gt;  }&lt;/span&gt;
&lt;span class="no"&gt;SNIPPET&lt;/span&gt;

&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;PuppetParser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vi"&gt;@root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;success&amp;#39;&lt;/span&gt;
  &lt;span class="nb"&gt;p&lt;/span&gt; &lt;span class="vi"&gt;@root&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt;
  &lt;span class="nb"&gt;p&lt;/span&gt; &lt;span class="vi"&gt;@root&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_name&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
  &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;failure&amp;#39;&lt;/span&gt;
  &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_reason&lt;/span&gt;
  &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_column&lt;/span&gt;
  &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_line&lt;/span&gt;
  &lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failure_index&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The Gemfile for running it, along with all the above code, is in a &lt;a href="https://gist.github.com/d4fb0770775086145db0"&gt;Gist&lt;/a&gt;.&lt;/p&gt;
</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2011/10/13/treetop/</feedburner:origLink></entry>
 
 <entry>
   <title>Standing desk adventures</title>
   <link href="http://feedproxy.google.com/~r/AuxesisMusings/~3/sK-6a7W0uwI/" />
   <updated>2011-10-03T00:00:00+11:00</updated>
   <id>http://holmwood.id.au/~lindsay/2011/10/03/standing-desk-adventures</id>
   <content type="html">&lt;p&gt;I've been using a standing desk for a bit over four months now, and thus far it's been quite successful.&lt;/p&gt;

&lt;p&gt;I decided to test out the idea because I spend a lot of time in front of the screen, and my back has been getting progressively sorer over the last year. Not wanting to transform into the Hunchback of Notre Dame before I turn 30, and aware of the current research suggesting that sitting down for long stretches &lt;a href="http://www.abc.net.au/news/2011-03-21/sit-less-to-lower-heart-disease-risk/2649106"&gt;increases the risk of heart disease&lt;/a&gt;, a standing desk seemed like a good alternative.&lt;/p&gt;

&lt;p&gt;Because I work from home and the office, I actually have two standing desk setups - one for each location.&lt;/p&gt;

&lt;p&gt;The home setup is incredibly makeshift, with the monitor placed on a box of our things left over from the last move, and the laptop sitting on a discontinued IKEA storage box not too dissimilar from the current &lt;a href="http://www.ikea.com/au/en/catalog/products/10184838/"&gt;Prant&lt;/a&gt; offering.&lt;/p&gt;

&lt;p&gt;The reason for the home setup dodginess is twofold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I wanted to try out the standing desk thing without a large financial commitment&lt;/li&gt;
&lt;li&gt;We're between houses, and have to use what we've got on hand&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Once I discovered that the standing desk was the way I wanted to work for the foreseeable future, I decided to up the ante and buy a real desk for work.&lt;/p&gt;

&lt;p&gt;There are plenty of purpose built standing desk options that are far beyond my budget, so the search was on for finding a desk for a reasonable price.&lt;/p&gt;

&lt;p&gt;I stumbled across a Frankenstein IKEA desk &lt;a href="http://lifehacker.com/5739296/build-a-diy-wide-adjustable-height-ikea-standing-desk-on-the-cheap"&gt;on Lifehacker&lt;/a&gt;, but it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was too long for the space in the office&lt;/li&gt;
&lt;li&gt;Required a non-trivial amount of construction with tools not on hand&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The same Lifehacker article linked to &lt;a href="http://thingsthatwelearn.com/#334875/Standing-Desk-Project"&gt;another blog&lt;/a&gt; about repurposing an Utby kitchen table as a standing desk:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://farm7.static.flickr.com/6162/6206718881_5b11409b7b.jpg" alt="Standing desktop and base" /&gt;&lt;/p&gt;

&lt;p&gt;This was the model I settled on.&lt;/p&gt;

&lt;p&gt;The Utby kitchen table is sold as two separate products, the 105cm high &lt;a href="http://www.ikea.com/aa/en/catalog/products/20172236"&gt;stainless steel underframe&lt;/a&gt; (not to be confused with the 90cm one), and the 120x60x3.4cm &lt;a href="http://www.ikea.com/aa/en/catalog/products/40162222/#/40162222/"&gt;Vika Amon table top&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At the time the local IKEA store did not have stock of the Vika Amon table top, so based on the advice of a shop assistant I picked up the &lt;a href="http://www.ikea.com/aa/en/catalog/products/S49871209/#/S49871209/"&gt;Galant table top&lt;/a&gt; instead, minus the normal Galant frame.&lt;/p&gt;

&lt;p&gt;I was assured in-store this would be a reasonable substitute, however as I was finishing off the construction in the office I discovered that the Galant table top I purchased is 2cm thick, as opposed to the Vika Amon which is 3.4cm. This meant that the supplied screws to mount the table top to the underframe would have broken through the surface, so alternative screws were required.&lt;/p&gt;

&lt;p&gt;Chalk that one up to a lack of research.&lt;/p&gt;

&lt;p&gt;To date I'm very impressed with the desk, with the bottom rail of the frame providing a suitable place to rest my feet against and store my bag behind.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.flickr.com/photos/auxesis/6094908180/"&gt;&lt;img src="http://farm7.static.flickr.com/6073/6094908180_d86925410f.jpg" alt="Standing desk at office" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As for the longer term effects of using a standing desk 12 hours a day 5 days a week: I haven't been able to find many others sharing their experiences. The sum of what you generally read online is "I just switched over to a standing desk a few hours ago and it's feeling great!!!".&lt;/p&gt;

&lt;p&gt;In my experience, the two key factors are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start out with a comfortable, well worn pair of shoes&lt;/li&gt;
&lt;li&gt;Make sure the surface you stand on isn't too hard (timber floorboards or carpet are best)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;My desk at home has me standing on tiles in Chucks, and boy can I feel it if I'm working long hours. If I use the desk for more than 12 hours at a time, my feet start aching pretty badly.&lt;/p&gt;

&lt;p&gt;I see this as a good thing though: if my feet are aching, it means I need to stop work for the day.&lt;/p&gt;

&lt;p&gt;I've tried working around this by wearing in different types of shoes (Birkenstock Arizonas and &lt;a href="http://www.cbdcycles.com.au/clothingdetails.php?id=47"&gt;Shimano SH-MT40s&lt;/a&gt;), but I generally end up with a pretty &lt;a href="http://twitter.com/#!/auxesis/status/118224439325368321"&gt;nasty headache within an hour&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once we move into our new house I'll be working on timber floorboards, so I have my fingers crossed that the pain will ease up.&lt;/p&gt;

&lt;p&gt;I find that I'm shifting my weight between legs every 5-10 minutes, and am much more inclined to bop along to music now that I'm standing up.&lt;/p&gt;

&lt;p&gt;There's also an unspoken advantage to having a standing desk in a busy office environment: &lt;em&gt;people will interrupt you for much shorter periods of time if they have nowhere to sit&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you're doing pair programming this can be pretty brutal on your partner if they're not used to standing up for long stretches, but it has a distinct advantage when you're trying to shut the world out and keep in the zone.&lt;/p&gt;

&lt;p&gt;I also find that by the end of the day I have a mild tingly sensation in my calves, not too dissimilar from the sensation felt when returning from a bike ride with lots of climbs.&lt;/p&gt;

&lt;p&gt;Since setting up the standing desk in the office, two more setups have shown up on &lt;a href="http://www.ikeahackers.net/"&gt;IKEA Hackers&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.ikeahackers.net/2011/09/sitting-standing-desk-combo.html"&gt;An Expedit-based sitting/standing desk combo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.ikeahackers.net/2011/09/another-expedit-standing-desk-with-cds.html"&gt;Another Expedit-based standing desk with CDs as risers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I quite like the idea of the extra storage space gained with the CD-riser design, and may opt for that design when we move.&lt;/p&gt;

&lt;p&gt;Would I go back to sitting at a desk? Not in the foreseeable future.&lt;/p&gt;

&lt;p&gt;I find most of my back pain has gone, and I now value sitting a lot more. :-)&lt;/p&gt;

&lt;p&gt;Would I recommend standing desks for others? If you are working long hours in front of a screen and have trouble finding a comfortable setup to work from, it might be worth a shot.&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/AuxesisMusings/~4/sK-6a7W0uwI" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://holmwood.id.au/~lindsay/2011/10/03/standing-desk-adventures/</feedburner:origLink></entry>
 

</feed>
