<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Pragmatic Dictator</title>
	
	<link>http://www.dancres.org/blitzblog</link>
	<description />
	<lastBuildDate>Fri, 29 Mar 2013 21:42:21 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/dancres/sweh" /><feedburner:info uri="dancres/sweh" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Need</title>
		<link>http://feedproxy.google.com/~r/dancres/sweh/~3/EHGusmG2tV4/</link>
		<comments>http://www.dancres.org/blitzblog/2013/03/29/need/#comments</comments>
		<pubDate>Fri, 29 Mar 2013 16:13:53 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[delivery]]></category>
		<category><![CDATA[Philosophy]]></category>
		<category><![CDATA[product]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=523</guid>
		<description><![CDATA[It&#8217;s not about profit &#8211; making money does not help customers. It&#8217;s not about cost-saving &#8211; saving money does not help customers. It&#8217;s not about shareholders &#8211; pleasing shareholders does not help customers. It&#8217;s not about features &#8211; delivering features does not help customers. It&#8217;s not about testability &#8211; that something is tested does not ...</p><p><a href="http://www.dancres.org/blitzblog/2013/03/29/need/" class="more-link">Continue reading &#8216;Need&#8217; &#187;</a>]]></description>
				<content:encoded><![CDATA[<p>It&#8217;s not about profit &#8211; making money does not help customers.</p>
<p>It&#8217;s not about cost-saving &#8211; saving money does not help customers.</p>
<p>It&#8217;s not about shareholders &#8211; pleasing shareholders does not help customers.</p>
<p>It&#8217;s not about features &#8211; delivering features does not help customers.</p>
<p>It&#8217;s not about testability &#8211; that something is tested does not help customers.</p>
<p>Satisfying a need does help customers. Amongst other things it might make their lives easier, make something possible, educate them or entertain them. It engages them, enthrals them, creates emotion within them. From all of this, many good things will come.</p>
<p>If you are developing a system in the absence of a focus on satisfying needs, you&#8217;ve lost already. First question then:</p>
<p>Who are your customers?</p>
<p>And if you think the only customers are those paying for what you build, you&#8217;ve lost once again. In fact, you&#8217;ve signed your own happiness away.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancres.org/blitzblog/2013/03/29/need/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dancres.org/blitzblog/2013/03/29/need/</feedburner:origLink></item>
		<item>
		<title>On The Practice of Design</title>
		<link>http://feedproxy.google.com/~r/dancres/sweh/~3/Zi94AVHDH6k/</link>
		<comments>http://www.dancres.org/blitzblog/2013/03/23/on-the-practice-of-design/#comments</comments>
		<pubDate>Sat, 23 Mar 2013 18:22:50 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[Architecture]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[practice]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=519</guid>
		<description><![CDATA[Technology is not architecture or indeed design, it is a means for implementing a design. Various technologies (e.g. languages or frameworks) will be more or less compatible with implementing a specific design. Design is an abstract exercise. It becomes constrained by our own choices which can include using existing technology or creating anew. By default, ...</p><p><a href="http://www.dancres.org/blitzblog/2013/03/23/on-the-practice-of-design/" class="more-link">Continue reading &#8216;On The Practice of Design&#8217; &#187;</a>]]></description>
				<content:encoded><![CDATA[<p>Technology is not architecture or indeed design, it is a means for implementing a design. Various technologies (e.g. languages or frameworks) will be more or less compatible with implementing a specific design.</p>
<p>Design is an abstract exercise. It becomes constrained by our own choices, which can include using existing technology or creating anew. By default it should be unconstrained, since that is closer to the ideal: the more constraint technology exerts, the less ideal things are likely to be. Less than ideal can be acceptable (there are always cost limits and such), but it should never be accepted without consideration of the consequences.</p>
<p>Some argue that design cannot be an abstract exercise at all because real-world considerations demand otherwise. Performance is often cited as being too significant to ignore. Are they right?</p>
<p>The nature of performance in a system can be generalised into a set of guiding principles. In the case of computing, there&#8217;s a well known hierarchy driven by locality (starting with the fastest component):</p>
<ol>
<li>Register-based CPU instruction</li>
<li>On CPU cache access</li>
<li>Off CPU cache access</li>
<li>Main memory access</li>
<li>I/O (network, conventional disk, SSD)</li>
</ol>
<p>Jeff Dean and others have expressed this in a table of <a href="https://gist.github.com/jboner/2841832">&#8220;Numbers every programmer should know&#8221;</a>. The performance relationship amongst these components is sufficient guidance for design work. Clearly, incrementing a number across a network connection is something to be avoided (though there are ways to make this work). It would generate significant chatter as would naive distribution of an OO design.</p>
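<p>To make the point about incrementing a number across a network concrete, here is a minimal sketch of the arithmetic. The figures are the approximate ones from the commonly circulated version of the table linked above (circa 2012) and will differ on real hardware; the function and key names are purely illustrative.</p>

```python
# Approximate latencies in nanoseconds, as given in the widely circulated
# "numbers every programmer should know" table (circa 2012). Treat them as
# orders of magnitude, not measurements of any particular system.
LATENCY_NS = {
    "l1_cache_reference": 0.5,
    "main_memory_reference": 100,
    "datacenter_round_trip": 500_000,
}


def total_cost_ns(operation: str, count: int) -> float:
    """Cost of performing `count` sequential operations of the given kind."""
    return LATENCY_NS[operation] * count


def slowdown(slow: str, fast: str) -> float:
    """How many times slower one kind of operation is than another."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


if __name__ == "__main__":
    increments = 1_000
    local = total_cost_ns("main_memory_reference", increments)
    remote = total_cost_ns("datacenter_round_trip", increments)
    print(f"{increments} increments in main memory: {local / 1e6:.1f} ms")
    print(f"{increments} increments over the network: {remote / 1e6:.1f} ms")
    ratio = slowdown("datacenter_round_trip", "main_memory_reference")
    print(f"each remote increment is roughly {ratio:.0f}x more expensive")
```

<p>Nothing more precise than these ratios is needed at design time: they are enough to rule out chatty remote interactions without committing to any particular technology.</p>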
<p>So to answer the question: Performance is too important to ignore in a design but the amount of consideration required is no more than other aspects such as coupling and cohesiveness.</p>
<p>Apple decide what they want to build first, then create and/or <a href="http://en.wikipedia.org/wiki/Gorilla_Glass">select the technologies</a> they need, leading to great products. They ask themselves: how do we make this idea, this concept we have in mind, real? This is the point at which technology becomes relevant. NASA, when set the moon-shot challenge, created the technologies they needed to deliver the end result over a period of years. They iterated on engine, flight control and many other aspects to get the ultimate embodiment. 37Signals ended up creating Rails, embodying a new way of building product to deliver their vision.</p>
<p>The best designs start out as concepts or ideas and are largely un-constrained by technology (there are limits of course, e.g. a phone must have certain components possessing certain properties). They retain their elegance, a sense of style and taste. Designs that are forced to fit with early, uninformed technological choices are likely to be brittle and die.</p>
<p>Developers have a bad habit of selecting tools and technology well ahead of any consideration of the problem and potential design approaches. History is littered with examples of the consequences: <a href="http://www.laputan.org/mud/">balls of mud</a> and expensive &#8220;surprise&#8221; failures of projects that should have been &#8220;easily dispatched&#8221; because of this or that silver-bullet technology.</p>
<p>There is nothing harmful in the general discussion of technology tradeoffs; it&#8217;s what leads to useful guidance such as that of Jeff Dean above. It also makes sense in one&#8217;s early career to work with a variety of technologies to help gain an understanding of tradeoffs and patterns that work or don&#8217;t. However, excessive technology fixation is destructive for quality design work.</p>
<p>One can certainly design from a technology-driven perspective (choose your language, frameworks etc. and constrain your design to fit them) but that won&#8217;t be good enough for a moonshot, a class-leading product or a high-quality solution.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancres.org/blitzblog/2013/03/23/on-the-practice-of-design/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dancres.org/blitzblog/2013/03/23/on-the-practice-of-design/</feedburner:origLink></item>
		<item>
		<title>200</title>
		<link>http://feedproxy.google.com/~r/dancres/sweh/~3/vXZBGFaowuE/</link>
		<comments>http://www.dancres.org/blitzblog/2013/02/06/200/#comments</comments>
		<pubDate>Wed, 06 Feb 2013 14:50:34 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Engineering]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=511</guid>
		<description><![CDATA[In aviation circles there is a thing known as the 200th hour rule. It goes something like this: After 200 hours of flight time you are expert enough to feel confident in what you do but amateur enough to still screw up. Most worryingly it&#8217;s said that come the 200th hour you will screw up ...</p><p><a href="http://www.dancres.org/blitzblog/2013/02/06/200/" class="more-link">Continue reading &#8216;200&#8217; &#187;</a>]]></description>
				<content:encoded><![CDATA[<p>In aviation circles there is a thing known as the 200th hour rule. It goes something like this:</p>
<p><em>After 200 hours of flight time you are expert enough to feel confident in what you do but amateur enough to still screw up. Most worryingly it&#8217;s said that come the 200th hour you will screw up and quite possibly in grand style.</em></p>
<p>I&#8217;m figuring that applies in many other situations and there&#8217;s probably more than one 200th hour event in many cases. One would hope that pilots, should they survive the experience, learn from the mistake and improve. Can we all say we do the same?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancres.org/blitzblog/2013/02/06/200/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dancres.org/blitzblog/2013/02/06/200/</feedburner:origLink></item>
		<item>
		<title>Blueprints</title>
		<link>http://feedproxy.google.com/~r/dancres/sweh/~3/NL9xWECb6kw/</link>
		<comments>http://www.dancres.org/blitzblog/2013/01/27/blueprints/#comments</comments>
		<pubDate>Sun, 27 Jan 2013 16:00:23 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[development]]></category>
		<category><![CDATA[practice]]></category>
		<category><![CDATA[stagnation]]></category>
		<category><![CDATA[Systems]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=474</guid>
		<description><![CDATA[For a long time, I&#8217;ve wanted to write something about the state of our software practices. It&#8217;s always proven quite challenging as I find myself unerringly drawn towards philosophy, creativity, engineering and a myriad of other voluminous subjects. Producing something succinct has proven consistently elusive. They say you can&#8217;t force these things and so it would ...</p><p><a href="http://www.dancres.org/blitzblog/2013/01/27/blueprints/" class="more-link">Continue reading &#8216;Blueprints&#8217; &#187;</a>]]></description>
				<content:encoded><![CDATA[<p>For a long time, I&#8217;ve wanted to write something about the state of our software practices. It&#8217;s always proven quite challenging as I find myself unerringly drawn towards philosophy, creativity, engineering and a myriad of other voluminous subjects. Producing something succinct has proven consistently elusive. They say you can&#8217;t force these things and so it would appear as it&#8217;s taken some writing from <a href="http://research.microsoft.com/en-us/um/people/lamport/">Leslie Lamport</a> to help me distil out some specific points that I want to make.</p>
<p>The article that started this chain of events is <a href="http://www.wired.com/opinion/2013/01/code-bugs-programming-why-we-need-specs/">Blueprints</a>, in which Dr Lamport discusses the practice of coding. I found it thought-provoking, yet judging by the comments many felt it was irrelevant, out of date or simply wrong. Reading through those comments and a tweet discussion with <a href="https://twitter.com/nicferrier">Nic Ferrier</a> led me to a bunch of observations which appear below.</p>
<p><span class="s2">Foundations</span></p>
<p>It appears that the focus on “practical” aspects of systems building (e.g. knowing how to code in popular industry languages rather than the fundamentals that underpin them all) has significantly impacted the corpus of common knowledge. Specification as a practice is not well understood:</p>
<ul>
<li>It can be formal or informal &#8211; ultimately the end-goal determines what is appropriate. Formal specifications provide the opportunity for proof and verification which in critical systems is highly desirable. The relevance goes beyond critical systems though to any situation where high confidence in a piece of behaviour is required.</li>
<li>Specification in its various forms isn’t a theoretical exercise &#8211; there are a number of examples of its application in real systems. Google mention it in a variety of circumstances including <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.116.9219">Chubby</a> and <a href="http://research.google.com/archive/spanner.html">Spanner</a>.</li>
<li>Proving correctness can be done via formal specification; it cannot be achieved by testing. Imagine standing in a dark room with a pencil-beam torch trying to establish what&#8217;s in the room, its dimensions etc. This will take a long time and things are easily missed unless you cover the entire room with that pencil beam (which will take forever). Formal specs allow you to simply turn the light on in the room.</li>
<li>BDD, TDD and the like are testing processes and thus cannot be directly compared with specification (certainly not the formal variety) which is a tool.</li>
</ul>
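<p>As a small illustration of the torch-versus-light-switch point, the sketch below exhaustively explores every interleaving of two processes performing a non-atomic increment and checks an invariant on the result. This is a toy in the spirit of a model checker, not a proof and not one of the specification tools themselves; the process names, step names and functions are all hypothetical.</p>

```python
from itertools import permutations

# Each process performs a non-atomic increment:
# read the shared counter, then write back (value read + 1).
STEPS = ("read", "write")


def interleavings(processes=("A", "B")):
    """Yield every schedule in which each process keeps its own program order."""
    events = [(p, step) for p in processes for step in STEPS]
    for order in permutations(events):
        if all(order.index((p, "read")) < order.index((p, "write")) for p in processes):
            yield order


def run(schedule):
    """Execute one schedule and return the final value of the shared counter."""
    counter = 0
    local = {}
    for process, step in schedule:
        if step == "read":
            local[process] = counter
        else:
            counter = local[process] + 1
    return counter


def violations(expected=2):
    """Return every schedule whose final counter breaks the invariant."""
    return [s for s in interleavings() if run(s) != expected]


if __name__ == "__main__":
    bad = violations()
    print(f"{len(bad)} of {len(list(interleavings()))} schedules lose an update")
    for schedule in bad:
        print(" -> ".join(f"{p}.{step}" for p, step in schedule))
```

<p>A conventional test exercises whichever schedule the machine happens to produce on the day. Enumerating the whole (here, tiny) state space surfaces every schedule that loses an update, which is the kind of coverage the pencil beam cannot give.</p>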
<p><span class="s2">Abstraction</span></p>
<p>The ability to deal in abstraction is important for the disciplines of architecture and design. However, there is a more fundamental need to address when coding: the limited capacity of the individual mind to retain and reason about detail. Our only tool for coping with systems of detail larger than can be held in an individual mind is abstraction. Abstraction creates coarser constructs that hide some of the detail and allow us to scale our reasoning to broader levels. It also makes it possible to communicate and test our reasoning with others.</p>
<p>Reading the responses to Lamport’s article shows that some of us are too literal in our interpretation, not pausing to consider the more abstract possibilities. We are unbalanced in our view, focused on detail and specifics in a world where grey and uncertainty (a natural consequence of dropping detail for the sake of abstraction) play a critical role. Some examples:</p>
<ul>
<li>Software systems have nothing in common with buildings. Consider for a moment the challenge of changing an old or large system to cope with a radical new requirement say going from <a href="http://www.cs.mcgill.ca/~carl/impossible.pdf">single machine to massively distributed</a>. Is that so different from taking an old Victorian-age school and putting in the trunking and cabling required for modern systems development? In both cases, there will be a desire for an understanding of the current structure (dare I say blueprints?), then some consideration of options, perhaps some testing out of tools and practices before actually doing the work (which undoubtedly will be iterative, component by component or room by room).</li>
<li class="li2">Skyscrapers are big systems, toolsheds are small ones. Lamport himself states otherwise in the article: “<i>While the specs I write are almost all informal, occasionally a piece of code is sufficiently subtle, or sufficiently critical, that it should be specified formally — either for precision or for using tools to check it. I’ve only had to do that about a half dozen times during the past dozen years.</i>” He’s clearly talking about pieces of code, could be one method or a couple of classes or indeed entire systems.</li>
</ul>
<p>Being too literal, ignoring the grey and reducing abstractions to strict constructs (ironic considering the vehement resistance to formal specification because it’s too constraining) has ramifications beyond design quality for aspects such as human communication, essential in any good team, agile or otherwise. We stop ourselves from considering the greater context, the bigger possibilities which might explain why some techies cannot easily relate to customer needs.</p>
<p><span class="s2">Research</span></p>
<p>A couple of Google searches reveal that Lamport has a <a href="http://gist.github.com/4617660/">body</a> <a href="http://research.microsoft.com/en-us/um/people/lamport/tla/tla-intro.html">of</a> <a href="http://research.microsoft.com/en-us/um/people/lamport/tla/hyperbook.html">work</a> (including <a href="http://research.microsoft.com/en-us/um/people/lamport/tla/tla.html#tools">tools</a>) related to reasoning about concurrent systems using specifications. This is notable because it focuses on the non-functional, something not <a href="http://dannorth.net/2012/05/31/bdd-is-like-tdd-if/">often</a> <a href="http://en.wikipedia.org/wiki/Specification_by_example">seen</a> in discussions pertaining to TDD or BDD. For example, have you ever run across a test specification like this?</p>
<p><i>Will sort n items, distributed in any of the following orders (already sorted, exponential etc), in O(n log n) time subject to the availability of memory being sufficient to hold 4 * n.</i></p>
<p>Returning to Lamport&#8217;s work, how often would you see any explicit treatment of concurrency or parallelism in applications of BDD or TDD? Isn’t consideration of these non-functionals relevant?</p>
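<p>For what it&#8217;s worth, a non-functional test along the lines of the sort specification above might be sketched as follows. It is only an approximation: it counts comparisons as a stand-in for time, uses Python&#8217;s built-in sort as the thing under test, picks an arbitrary constant factor of 2, and does not check the memory clause at all. Every name in it is illustrative.</p>

```python
import math
import random
from functools import cmp_to_key


def count_comparisons(items):
    """Sort a copy of `items`; return (sorted list, number of comparisons made)."""
    calls = 0

    def compare(a, b):
        nonlocal calls
        calls += 1
        return (a > b) - (a < b)

    result = sorted(items, key=cmp_to_key(compare))
    return result, calls


def inputs(n, seed=42):
    """The input orders named in the hypothetical spec, plus two obvious extras."""
    rng = random.Random(seed)
    return {
        "already sorted": list(range(n)),
        "reversed": list(range(n, 0, -1)),
        "uniform": [rng.random() for _ in range(n)],
        "exponential": [rng.expovariate(1.0) for _ in range(n)],
    }


def within_bound(n, constant=2.0):
    """True if every input sorts correctly within constant * n * log2(n) comparisons."""
    budget = constant * n * math.log2(n)
    for data in inputs(n).values():
        result, calls = count_comparisons(data)
        if result != sorted(data) or calls > budget:
            return False
    return True


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        print(n, "within budget" if within_bound(n) else "over budget")
```

<p>Even this much says something that a purely functional example-based test does not: it states the input distributions and the cost budget explicitly, so a regression in either becomes visible.</p>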
<p>Some have argued that Lamport as the author has the responsibility for including all of this in his article. Really? Are we saying that an audience has a right to a complete, finished work that they can just apply verbatim, without thought or further development? Do all the best films have a definitive ending? Of course not, because there&#8217;s value in allowing a viewer to invent and go further.</p>
<p>Lamport opened the door to an opportunity for personal development and an improvement in the quality of one&#8217;s work, maybe some innovation too. Those who sought no further reading (a mere google away) and pronounced what he was writing irrelevant or covered by BDD or TDD have missed out.</p>
<p>These failures to dig deeper and put into context lead to stagnation of our practice (ironic given the focus on “practical” aspects such as coding). Research is essential to learning and growth but seemingly is becoming a lost art to many.</p>
<p><span class="s2">Wrapping Up</span></p>
<p>Lamport wrote in his article: “<i>Few programmers write even a rough sketch of what their programs will do before they start coding. Most programmers regard anything that doesn’t generate code to be a waste of time.</i>”</p>
<p>Meanwhile, Wired headed the piece with this statement: “<i>With widespread access to free, online coding courses and tools, “coding” has become the new writing – the everyman’s skill.</i>”</p>
<p>Given many of the responses to the article it seems that the readership proved Lamport right and Wired wrong. Not everyone possesses the skill to code competently and those that think it’s just about code are missing the key factors to make themselves so.</p>
<style type="text/css"><!--
span.s2 {text-decoration: underline}
--></style>
]]></content:encoded>
			<wfw:commentRss>http://www.dancres.org/blitzblog/2013/01/27/blueprints/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dancres.org/blitzblog/2013/01/27/blueprints/</feedburner:origLink></item>
		<item>
		<title>Motivations</title>
		<link>http://feedproxy.google.com/~r/dancres/sweh/~3/bGzbnXoT5Dw/</link>
		<comments>http://www.dancres.org/blitzblog/2013/01/23/motivations/#comments</comments>
		<pubDate>Wed, 23 Jan 2013 13:50:20 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[Architecture]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=473</guid>
		<description><![CDATA[Almost as soon as the discussion of building services starts, there are questions about latency of remote calls and what is or is not a service I have a basic rule of thumb for remote call performance tradeoffs: if the total compute time (that includes background/offline work to keep things up to date) required to ...</p><p><a href="http://www.dancres.org/blitzblog/2013/01/23/motivations/" class="more-link">Continue reading &#8216;Motivations&#8217; &#187;</a>]]></description>
				<content:encoded><![CDATA[<p>Almost as soon as the discussion of building services starts, there are questions about latency of remote calls and what is or is not a service.</p>
<p>I have a basic rule of thumb for remote call performance tradeoffs:</p>
<p><em>if the total compute time (that includes background/offline work to keep things up to date) required to service a remote request is greater than the round-trip time + a little fudge, it&#8217;s acceptable to make it a service.</em></p>
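<p>Written down as code, the rule of thumb is a single comparison. This is a minimal sketch: the function name, the millisecond units and the default value for the &#8220;fudge&#8221; are assumptions made for illustration only.</p>

```python
def worth_making_a_service(compute_ms: float, round_trip_ms: float, fudge_ms: float = 0.5) -> bool:
    """Rule of thumb: a remote call is acceptable when the work done per request
    (including background/offline upkeep) outweighs the cost of getting there
    and back, plus a small safety margin."""
    return compute_ms > round_trip_ms + fudge_ms


if __name__ == "__main__":
    # 5 ms of work behind a 1 ms round trip: the network cost is in the noise.
    print(worth_making_a_service(compute_ms=5.0, round_trip_ms=1.0))
    # 0.2 ms of work behind a 1 ms round trip: the call costs more than the work.
    print(worth_making_a_service(compute_ms=0.2, round_trip_ms=1.0))
```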
<p style="text-align: left;">However, there are other things to consider outside of local performance concerns. An optimised protocol running over even 100Mbps ethernet can manage round-trips of 1ms or less. To put that into perspective consider that <a href="http://www.igvita.com/slides/2012/webperf-crash-course.pdf">Google reports</a> average worldwide round-trips to their site are ~100ms and within US ~50-60ms. Google has presence across the globe and does all the right things in terms of content serving etc. Most other sites don&#8217;t do nearly that well. These numbers don&#8217;t cover mobile internet either which can have substantially worse performance. The point is that losing a couple of ms on a remote call in the context of round-trips outside the firewall is no big deal.</p>
<p style="text-align: left;">That said, many point to the fact that once we make a reasonable number of service calls, the latency of customer response can increase significantly. That is true if we choose a synchronous model of remote invocation, where we wait for each call to complete before we dispatch the next. But that&#8217;s not necessary: asynchronous requests make more sense as they allow for better support of timeouts (important for failure and load handling) to protect thread pools and such. Further, asynchronous requests give us the ability to dispatch work in parallel, which is exactly what Amazon does in order to ensure that the <a href="http://queue.acm.org/detail.cfm?id=1388773">100+ service calls</a> they make don&#8217;t severely impact page rendering.</p>
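<p>A minimal sketch of that asynchronous, parallel dispatch with per-call timeouts is shown below, using Python&#8217;s asyncio. The service names, latencies and timeout are made up, and the remote call is simulated with a sleep rather than real network I/O; it is not a description of how Amazon implements this.</p>

```python
import asyncio
import time


async def call_service(name: str, latency_s: float) -> str:
    """Stand-in for a remote call; sleeps instead of doing network I/O."""
    await asyncio.sleep(latency_s)
    return f"{name}: ok"


async def call_with_timeout(name: str, latency_s: float, timeout_s: float) -> str:
    """Give up on a slow service rather than letting it hold up the page."""
    try:
        return await asyncio.wait_for(call_service(name, latency_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return f"{name}: timed out"


async def render_page(services: dict, timeout_s: float = 0.05) -> list:
    """Dispatch every call at once; the total wait is bounded by the slowest
    call or the timeout, not by the sum of all the latencies."""
    calls = (call_with_timeout(name, latency, timeout_s) for name, latency in services.items())
    return await asyncio.gather(*calls)


def demo():
    # Hypothetical services with simulated latencies in seconds.
    services = {"catalogue": 0.01, "reviews": 0.02, "recommendations": 0.5}
    start = time.perf_counter()
    results = asyncio.run(render_page(services))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = demo()
    print(results)
    print(f"elapsed: {elapsed * 1000:.0f} ms")
```

<p>Called one after another with no timeout, the same three calls would take over half a second; dispatched together, the page waits for roughly the timeout and degrades by dropping the slow component.</p>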
<p style="text-align: left;">Beyond round-trip performance there are a bunch of other motivations that contribute to a decision to make something a service or not. In my experience, these sorts of things rarely come up in architectural conversations but are absolutely essential to getting good results:</p>
<ul>
<li>Performance &#8211; a single application that contains disparate functionality can become difficult to tune meaningfully as load patterns interfere with each other &#8211; e.g. caching policies, storage performance characteristics, I/O versus compute intensive</li>
<li>Scalability &#8211; an architecture that supports substantial scale for one type of function may not work well for another &#8211; e.g. a read-mostly load can be served from disk with a filesystem cache. This is entirely inappropriate for volatile, rapidly-changing, transactional information. These conflicts are further exacerbated by geographical dispersion.</li>
<li>Availability &#8211; a single application containing disparate functionality can be brought to its knees by a single fault affecting a shared resource. The result is a situation where one functional problem ensures nothing functions at all &#8211; e.g. memory leaks, all data for all functions is retained in a single database or a single code fault generating exceptions that eventually exhausts thread-pools or causes a JVM to exit.</li>
</ul>
<p>All the above have implications for our ability to support high-quality SLAs.</p>
<ul>
<li>Manageability &#8211; dependency chains become deep or wide or both such that it becomes impossible to accurately predict the consequences of a change. Refactoring is difficult, automated testing options are limited. Builds take longer and longer. A single update of a library requires all teams to co-ordinate the change to ensure all code is brought up to date. All technical staff must become expert in all aspects and all technologies. This creates long training cycles and makes recruitment difficult, or alternatively requires tight control/standardisation of tools to limit technological proliferation, which in turn inhibits correct solution construction and innovation.</li>
<li>Data Management &#8211; some data is externally regulated and requires specific policy and infrastructure (e.g. PCI). When this data is mixed with other data (as is typical with monolithic codebases), the entire codebase and all data become subject to the same stringent requirements slowing development, increasing infrastructure costs etc.</li>
<li>Operational &#8211; as for development, ops staff are compelled to understand all aspects of everything for the purposes of diagnosis. The sheer quantity of information produced from the single codebase can make separating signal from noise challenging &#8211; e.g. log files containing all messages for all functionality. Releases must necessarily be slow and careful because there can never be much certainty of code quality and rollback is difficult. Further, staging out of updates in small chunks is made challenging by virtue of the number of things that can cause compatibility problems during upgrade. </li>
</ul>
<p>The basic force underlying all of this commentary is:</p>
<p><em>A one-size-fits-all approach where everything sits in one big build and is replicated across many app-servers all backed by a single database fronted by caches (and other variant architectures) will only get you so far.</em></p>
<p>Thus there is a point in growth (load, features, infrastructure, availability demands etc) at which this one-size-fits-all approach starts to hinder progress and increase cost. Smart techies will watch out for relevant trends and make plans to transition to a more service-oriented style as appropriate.</p>
<p>[ Sidenote: I'd normally advocate not doing an SOA from the get-go. However, infrastructures with properties like EC2 make typical one-size-fits-all architectures less appealing. They don't cope well with failure or relocation like a decent SOA can. ]</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancres.org/blitzblog/2013/01/23/motivations/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.dancres.org/blitzblog/2013/01/23/motivations/</feedburner:origLink></item>
		<item>
		<title>Postmortem</title>
		<link>http://feedproxy.google.com/~r/dancres/sweh/~3/fYJALVnc314/</link>
		<comments>http://www.dancres.org/blitzblog/2013/01/03/postmortem/#comments</comments>
		<pubDate>Thu, 03 Jan 2013 08:58:54 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[hosting]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=461</guid>
		<description><![CDATA[I&#8217;ve been noodling a little on placement of infrastructure to serve a worldwide customer base. One can house that infrastructure in one&#8217;s own purpose-built datacentres, various colos or a cloud provider. Cloud is hot right now, Amazon is thus a heavily favoured option for many. In terms of locations, they have pretty good coverage: US, ...</p><p><a href="http://www.dancres.org/blitzblog/2013/01/03/postmortem/" class="more-link">Continue reading &#8216;Postmortem&#8217; &#187;</a>]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve been noodling a little on placement of infrastructure to serve a worldwide customer base. One can house that infrastructure in one&#8217;s own purpose-built datacentres, various colos or a cloud provider. Cloud is hot right now, so Amazon is a heavily favoured option for many.</p>
<p>In terms of locations, they have <a href="http://aws.amazon.com/about-aws/globalinfrastructure/">pretty good coverage</a>: US, South America, APAC, EU.</p>
<p>But there are skewing factors that mean these locations are not all equal:</p>
<ul>
<li>Some of them <a href="http://aws.amazon.com/about-aws/globalinfrastructure/regional-product-services/">don&#8217;t have all the services on the AWS menu</a></li>
<li>Some of them have limited capacity (2 EC2 Zones in South America vs 5 in US-EAST for example)</li>
<li>Some of them are lower down the pecking order in terms of new feature rollout</li>
<li>Cost is not uniform</li>
</ul>
<p>
<div>[Adrian Cockcroft has a succinct <a href="https://twitter.com/adrianco/status/260777233503907840">tweet</a> on the subject]</div>
</p>
<div>Then there&#8217;s latency and <a href="http://www.igvita.com/2012/07/19/latency-the-new-web-performance-bottleneck/">its effects on the customer experience</a>. If you want to get truly global coverage, you&#8217;d probably want to put kit in all the locations available, but that will get expensive, so you&#8217;ll want to trim a little if you can. If the vast majority of your customers are US and Europe based, US-EAST and EU-WEST would appear to be the best choices.</div>
<p>
<div>Ah, but we all know that US-EAST is terrible for reliability, right? Well maybe. Here&#8217;s a breakdown of failures from the last couple of years (I wouldn&#8217;t claim it&#8217;s complete as finding all the AWS postmortems is not straightforward):</div>
</p>
<table>
<tbody>
<tr>
<th>Date</th>
<th>Region</th>
<th>Issue</th>
</tr>
<tr>
<td><a href="http://aws.amazon.com/message/680587/">Dec 24 2012</a></td>
<td>US-EAST</td>
<td>Human error causes deletion of state, propagation of faulty configuration</td>
</tr>
<tr>
<td><a href="http://aws.amazon.com/message/680342/">Oct 22 2012</a></td>
<td>US-EAST</td>
<td>Hardware failure/replacement triggering memory-leak</td>
</tr>
<tr>
<td><a href="http://aws.amazon.com/message/67457/">Jun 29 2012</a></td>
<td>US-EAST</td>
<td>Generator failure in multiple zones leading to software failures induced by load and/or recovery actions</td>
</tr>
<tr>
<td>Jun 14 2012</td>
<td>US-EAST</td>
<td>Generator fan failure, incorrect breaker configuration, leading to software failures induced by recovery actions</td>
</tr>
<tr>
<td><a href="http://aws.amazon.com/message/2329B7/">Aug 7 2011</a></td>
<td>EU-WEST</td>
<td>Generator failure leading to software failures induced by load, recovery actions and a bug</td>
</tr>
<tr>
<td><a href="http://aws.amazon.com/message/65649/">Jun 13 2011</a></td>
<td>US-EAST</td>
<td>Power failure leading to software failures induced by load in SimpleDB</td>
</tr>
<tr>
<td><a href="http://aws.amazon.com/message/65648/">Apr 21 2011</a></td>
<td>US-EAST</td>
<td>Human error causes network issue inducing substantial recovery action and spiralling load</td>
</tr>
</tbody>
</table>
<p>When you look at it, the actual root causes are such that they could occur in any of the regions. They all run the same software (accepting there&#8217;s some skew in releases/fixes as pointed out by Adrian Cockcroft, see above), which would no doubt suffer the same ailments given the same triggers. Some claim the age of US-EAST is a factor, but the root causes suggest otherwise (power supply infrastructure has performed about the same for years). There is, of course, another key difference, which is that most customers are in US-EAST alone. It&#8217;s the biggest region and, if you wanted to cover Europe and the US using AWS for the least amount of cash, latency would drive you here. If you were savvy about Availability Zones and Regions, you&#8217;d probably design in another region, perhaps EU-WEST, but <a href="http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html">we know that&#8217;s rarefied air</a>.</p>
<p>That load going into US-EAST likely amplifies the effects of recovery storms in clusters in ways not seen in other regions. So is US-EAST such a bad-boy? I believe the answer is no, but environmental factors make it appear so. If you wanted to avoid US-EAST, you&#8217;d probably select one or both US-WEST regions paired with EU, but you&#8217;ll pay Amazon more and your engineering complexity will be higher.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancres.org/blitzblog/2013/01/03/postmortem/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.dancres.org/blitzblog/2013/01/03/postmortem/</feedburner:origLink></item>
		<item>
		<title>Manifesto</title>
		<link>http://feedproxy.google.com/~r/dancres/sweh/~3/3I5qgfQzyc0/</link>
		<comments>http://www.dancres.org/blitzblog/2012/12/20/manifesto/#comments</comments>
		<pubDate>Thu, 20 Dec 2012 09:22:37 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[implementation]]></category>
		<category><![CDATA[soa]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=452</guid>
		<description><![CDATA[For various reasons that will become apparent over the coming months I&#8217;m drafting up my third set of SOA kick-off bits and pieces. There&#8217;s loads of stuff that ultimately needs to be thrashed out and detailed but I&#8217;m not one for doing all that up front, it&#8217;s decidedly anti-agile. All tech teams are different as ...</p><p><a href="http://www.dancres.org/blitzblog/2012/12/20/manifesto/" class="more-link">Continue reading &#8216;Manifesto&#8217; &#187;</a>]]></description>
				<content:encoded><![CDATA[<p>For various reasons that will become apparent over the coming months, I&#8217;m drafting up my third set of SOA kick-off bits and pieces. There&#8217;s loads of stuff that ultimately needs to be thrashed out and detailed, but I&#8217;m not one for doing all that up front; it&#8217;s decidedly anti-agile. All tech teams are different, as are the organisations they work in, so while the basic questions are always the same, the rest is always &#8220;as you find it&#8221;. Writing a big chunk of stuff up front just isn&#8217;t worth it.</p>
<p>Okay, so my chosen minimum is a manifesto and, thanks to Messrs Bezos and Yegge, a <span style="color: #0000ff;"><a href="https://plus.google.com/112678702228711889851/posts/eVeouesvaVX"><span style="color: #0000ff;">chunk of the work</span></a></span> has been done for me. I&#8217;ve added a couple of other points that reflect my experience. In some ways, they can be inferred from the other points if you have the right experience, but people in that category wouldn&#8217;t need this manifesto to take aim at.</p>
<ol>
<li>All teams will henceforth expose their data and functionality through remote service interfaces.</li>
<li>Services must do all communication with each other through these interfaces.</li>
<li>There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team&#8217;s data store, no shared-memory model, no back-doors whatsoever.</li>
<li>Any state transitions relevant to a service user that happen asynchronously to their requests will be made available via an appropriate push mechanism.</li>
<li>It doesn&#8217;t matter what technology they use. HTTP, Corba, Pubsub, custom protocols &#8212; doesn&#8217;t matter.</li>
<li>All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world.</li>
<li>All services will be designed with operations in mind. That is they must log appropriately, provide relevant monitoring and be easily started, stopped, installed and removed.</li>
<li>All services must be designed to handle failure gracefully. They must define and enforce SLAs, throttle appropriately and account for failure (slowness, loss of endpoint etc) in downstream resources.</li>
</ol>
<p>[ <strong>Update:</strong> After some feedback (thanks Asher) I've added a point regarding state transitions. It is often tempting to make timing assumptions about state transitions that will not hold in the presence of challenges such as downstream failure or excessive load. ]</p>
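<p>As a rough illustration of the final point, here is a minimal sketch of one way a service might account for failure in a downstream resource. It assumes a circuit breaker is an acceptable technique; the manifesto itself does not name one. The class name <code>CircuitBreaker</code>, its parameters and the thresholds are all hypothetical, and the current time is passed in explicitly so the behaviour is easy to reason about.</p>

```python
class CircuitBreaker:
    """Fail fast against a downstream resource that keeps failing.

    Hypothetical sketch: names and thresholds are illustrative assumptions.
    """

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.failures = 0
        self.opened_at = None             # time the breaker opened, None if closed

    def call(self, fn, fallback, now):
        """Run fn(), or return fallback() if the downstream is considered down."""
        if self.opened_at is not None:
            if self.reset_after > now - self.opened_at:
                return fallback()         # still cooling off: leave downstream alone
            # Cool-off elapsed: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now
            return fallback()
        self.failures = 0
        return result
```

<p>In real use <code>now</code> would typically come from <code>time.monotonic()</code>, and the fallback might serve cached data or a degraded response in line with whatever SLA the service has defined.</p>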
]]></content:encoded>
			<wfw:commentRss>http://www.dancres.org/blitzblog/2012/12/20/manifesto/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.dancres.org/blitzblog/2012/12/20/manifesto/</feedburner:origLink></item>
		<item>
		<title>Pathology</title>
		<link>http://feedproxy.google.com/~r/dancres/sweh/~3/v5DGUHhlAF8/</link>
		<comments>http://www.dancres.org/blitzblog/2012/06/27/pathology/#comments</comments>
		<pubDate>Wed, 27 Jun 2012 15:58:39 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Business]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=439</guid>
		<description><![CDATA[When the underlying demand for IT is cost reduction, the result will ultimately be increased cost as the business short-changes itself in a variety of arenas.   The core of any technology group is its people. It is their mindset, experience and capability that ultimately dictates the outcome of any work undertaken. It is essential ...</p><p><a href="http://www.dancres.org/blitzblog/2012/06/27/pathology/" class="more-link">Continue reading &#8216;Pathology&#8217; &#187;</a>]]></description>
				<content:encoded><![CDATA[<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;">When the underlying demand for IT is cost reduction, the result will ultimately be increased cost as the business short-changes itself in a variety of arenas.</p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;"> </p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;">The core of any technology group is its people. It is their mindset, experience and capability that ultimately dictate the outcome of any work undertaken. It is essential that the right people are acquired and developed; tight budgets for training and wages make this almost impossible. The best one can do with such an approach is to hire inexperienced, intelligent individuals, such as undergraduates, who will hopefully learn faster on the job than most. One might be tempted by outsourcing, but similar issues exist. In the absence of talent, how does one assure the level of talent and experience, or indeed the quality of work produced, by a company over which there is limited influence?</p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;"> </p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;">Many companies maintain approved supplier and technology lists with onerous processes for making changes. The justifications are many and varied, including the desire for a single point of contact for issues, bulk purchase discounts and the belief that a mature product is more stable. The effect, however, is to constrain the options for tackling any particular challenge, typically leading to inappropriate use of a technology, bending it to solve a problem it was never intended to address. The result is a sting-in-the-tail architectural compromise that saves money momentarily, after which a potentially growing long-term cost is suffered. Another undesirable side effect is that staff are discouraged from experimenting with new technology, limiting the potential for innovation.</p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;"> </p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;">An attitude of cost reduction often drives misguided attempts at automation where the intention is to eliminate staff. What follows is a death-march attempt to build an end-to-end system that implements all the rules of thumb, guidelines and exceptions that computers are so poorly equipped to handle. If one takes cost reduction out of the equation, it becomes clear that the better option is to focus on eliminating waste and drudgery from repetitive tasks, freeing staff to be more creative and innovative in their contributions. Taiichi Ohno describes it thus:</p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;"> </p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;">&#8220;First, work and equipment improvement should be considered. Work improvement alone should contribute half or one-third of total cost reduction. Next autonomation or equipment improvement should be considered. I repeat that we should be careful not to reverse work improvement and equipment improvement. If equipment improvement is done first, cost will go up &#8211; not down.&#8221;</p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;"> </p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;">A climate of cost reduction will also lead to a variety of software development compromises under the umbrella of &#8220;do something quick and cheap now, pay the price later&#8221;. One obvious example would be when a bug fix is done crudely in the name of saving time rather than doing the necessary small amount of re-design. System maintainability is increasingly compromised such that development costs for any given change increase over time. Quality also suffers as testing and reviewing are cut back.</p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;"> </p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;">The list of undesirable consequences of a focus on cost reduction is very long; let&#8217;s cut to the chase with a quote from John Seddon: &#8220;If you manage costs, costs go up.&#8221;</p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;"> </p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;">What then do we do about this? We work with the board to develop a focus within the business on delivery of value to customers (e.g. innovative products, a reliable website and friendly support) which ultimately underpins revenue. Anything that does not improve the overall customer experience should be viewed as costly and ineffective.</p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;"> </p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;">Even the humble phone system can be evaluated in the context of customer value. Consider how it allows for more direct interaction with customers and provides a better environment for discussion of product ideas than, say, email. Importantly, the value-based approach makes it easier for staff to relate their work to the bigger picture with the potential for increased quality, engagement and motivation.</p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;"> </p>
<p style="font: normal normal normal 12px/normal Arial; color: #232323; font-family: 'Lucida Grande'; margin: 0px;">The value-based approach is exemplified by the likes of Jeff Bezos and the late Steve Jobs. Can anyone say it doesn&#8217;t work?</p>
<p>[Originally published <a href="http://cioevent.wordpress.com/2012/06/26/the-pathology-of-cost-reduction/">here</a>]</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancres.org/blitzblog/2012/06/27/pathology/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dancres.org/blitzblog/2012/06/27/pathology/</feedburner:origLink></item>
		<item>
		<title>Product</title>
		<link>http://feedproxy.google.com/~r/dancres/sweh/~3/7diwyYuNRQU/</link>
		<comments>http://www.dancres.org/blitzblog/2012/05/29/product/#comments</comments>
		<pubDate>Tue, 29 May 2012 17:27:38 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Engineering]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[development]]></category>
		<category><![CDATA[product]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=423</guid>
		<description><![CDATA[All developers love to code, me included, it&#8217;s a fun part of the job. I wouldn&#8217;t say it&#8217;s the most fun, I find that in the design work maybe or perhaps in shipping product. Shipping code is easy, shipping a product, that&#8217;s hard. You can write some code, shove it in an .msi or .jar ...</p><p><a href="http://www.dancres.org/blitzblog/2012/05/29/product/" class="more-link">Continue reading &#8216;Product&#8217; &#187;</a>]]></description>
				<content:encoded><![CDATA[<p>All developers love to code, me included; it&#8217;s a fun part of the job. I wouldn&#8217;t say it&#8217;s the most fun, though: I find that in the design work, or perhaps in shipping product.</p>
<p>Shipping code is easy; shipping a product, that&#8217;s hard. You can write some code, shove it in an .msi or .jar and install it on a box no problem but, to me at least, that&#8217;s less than half the job. Shipping a product means dealing with a bunch of additional concerns driven by one simple requirement:</p>
<p style="text-align: center;"><em>We want our product in front of our customer, working well and rapidly getting better with each release.</em></p>
<p>This simple requirement has a bunch of consequences including&#8230;</p>
<p><strong>We need to build little and often.</strong> Code degrades over time, and that includes unreleased code. The larger the body of code, the harder it is to get it back into shape if it&#8217;s been allowed to degrade. For example, if one leaves testing until late in the project, there is a high chance of many errors being found, followed by much re-writing and re-testing. This is incredibly wasteful and breaches our simple requirement to get product upgrades out of the door fast. We want small pieces of code tested in tight loops.</p>
<p>Functional testing in tight loops with code is not enough. We need to focus on the non-functional aspects as well, with a collection of automated performance tests and regular use of a profiler to check on memory usage.</p>
<p><strong>Problem resolution must be efficient.</strong> The first step in problem resolution? Answering the question &#8220;what changed?&#8221;. We need to know what we put out in a release. We might also engineer feature flags that allow us to slowly enable features and assess health as we go. If we start seeing problems, there&#8217;s a reasonable chance the last thing we enabled is the culprit. Obviously, releasing large bodies of code with all features enabled makes answering this question difficult, leading to extended periods of problem resolution.</p>
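<p>To make the feature flag idea concrete, here is a minimal sketch of percentage-based enablement with a change history that helps answer &#8220;what changed?&#8221;. Everything in it (the <code>FeatureFlags</code> class, the hashing scheme, the feature name in the usage lines) is a hypothetical illustration rather than a description of any particular product.</p>

```python
import hashlib


class FeatureFlags:
    """Percentage-based feature flags with an ordered change history."""

    def __init__(self):
        self._rollout = {}   # feature name -> percentage of users enabled (0..100)
        self.history = []    # ordered (feature, percent) changes: "what changed?"

    def set_rollout(self, feature, percent):
        if percent > 100 or 0 > percent:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[feature] = percent
        self.history.append((feature, percent))

    def is_enabled(self, feature, user_id):
        # Hash feature and user into a stable bucket 0..99, so a user who is
        # enabled at a low percentage stays enabled as the percentage is raised.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return self._rollout.get(feature, 0) > bucket


flags = FeatureFlags()
flags.set_rollout("new-checkout", 10)   # enable for roughly a tenth of users
flags.set_rollout("new-checkout", 50)   # health looks fine, widen the rollout
```

<p>If problems appear, the last entry in <code>flags.history</code> is the first suspect, and setting the percentage back to zero switches the feature off without a new release.</p>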
<p>Having worked so hard to get our product in front of the customer, the last thing we want is to have failures persisting and disrupting their usage. That means we want substantial insight into what our product is doing, how it is being used and what&#8217;s wrong. For web applications that means we need quality monitoring data (covering function and performance) from all levels of the system, clean logging (we don&#8217;t want routine, harmless exceptions being logged as errors) and user activity logging (conveniently that helps us understand more about our customers and what they like or dislike &#8211; A/B testing anyone?). For a desktop application, &#8220;phone-home&#8221; type infrastructure is immensely powerful. One needn&#8217;t go as far as Microsoft <a href="http://research.microsoft.com/pubs/81176/sosp153-glerum-web.pdf">have</a> but some means to capture errors, reporting them back with relevant state and no customer involvement (fill in this bug form) is a useful minimum.</p>
<p>There are some other things we can do in this area. In the case of online applications in particular, it&#8217;s possible to employ dark testing. More generally, limited user-testing via early releases can help here, though one must be careful with customer data (corrupting files is a no-no, for example).</p>
<p>Performance &#8211; customers use things in all sorts of unintended ways, and data-gathering on their exact habits can be limited. One must be artful in finding means to gain insight and react. One must also have in place some key indicators of system stress; think of queueing theory as a way to identify flows and choke-points for measurement/tracking.</p>
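<p>As a small worked example of the queueing theory angle, the sketch below computes two basic stress indicators for a single stage of a system: utilisation (arrival rate over service rate) and the mean number of requests in flight via Little&#8217;s law. The function name and the sample figures are assumptions chosen purely for illustration.</p>

```python
def stress_indicators(arrival_rate, service_rate, mean_time_in_system):
    """Return (utilisation, mean requests in flight) for one stage.

    arrival_rate and service_rate are in requests per second,
    mean_time_in_system is in seconds.
    """
    utilisation = arrival_rate / service_rate        # nears 1.0 as the stage saturates
    in_flight = arrival_rate * mean_time_in_system   # Little's law: L = lambda * W
    return utilisation, in_flight


# Illustrative figures only: 80 req/s arriving at a stage able to serve 100 req/s,
# with each request spending half a second in the stage on average.
utilisation, in_flight = stress_indicators(80.0, 100.0, 0.5)
```

<p>Tracking these per stage over time is one way to spot which choke-point is approaching saturation before customers notice.</p>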
<p><strong>Keep the release process simple.</strong> Having got to the point where you want to make the product available, the last thing you need is an onerous, error-prone mechanism by which this is achieved. Large quantities of configuration should be avoided: favour convention over configuration and, where possible, have the product figure things out for itself. Manual work should be replaced with automation, as even the simplest repetitive work is error-prone with human involvement (humans aren&#8217;t robots). For desktop applications, we don&#8217;t want emails about upgrades and user downloads from websites; much better to do automated updates &#224; la Chrome or Tower and others.</p>
<p><strong>Understand the environment.</strong> The interaction between software and hardware (in which I include network and storage) affects the overall consumer experience. One cannot design, build and test on machines with 16 cores, 64 GB of memory and multiple cinema displays if the desktop of the average customer has 4 cores and 8 GB of memory. To have such a difference is asking for trouble in algorithm selection, memory usage, storage utilisation and so on. Developers must spend time with the hardware their software will run on.</p>
<h2>A Word On The Nature Of The Beast</h2>
<p>A product is something we deliver into the hands of customers and once it&#8217;s out in the wild, our levels of control and insight are greatly reduced. We can&#8217;t make changes as and when suits us nor can we be sure that our product will be used in ways that match our expectations. Metaphorically speaking we screw all the bits and pieces together, do some testing, package them up and ship them off hoping that once the assembled product is in the hands of customers it&#8217;ll serve them well and they&#8217;ll be back for more.</p>
<h2>Real-World Examples</h2>
<p>Think about Apple: when they build a product, they intend for it to be low maintenance, almost a sealed unit with no user-serviceable parts. This isn&#8217;t dissimilar to having software in production: once it&#8217;s out there, it&#8217;s largely frozen, sealed from changes. When things break, just as with Apple, one must understand what went wrong through various forms of forensics, come up with a fix and rev out another version.</p>
<p>When one is not good at forensics, or when one&#8217;s rollout of new versions is painful for customers or operational staff or both, the result is a poor reputation leading to reduced revenue and ultimately job losses.</p>
<p>Consider Formula 1 where the cars are covered in sensors to drive telematics that allow the pit crew to see what&#8217;s going on and start reacting to it quickly during testing, practice, qualifying and the race. There are considerably fewer outright failures these days as a result. The level of effort they go to in monitoring and testing is huge as they even <a href="http://www.evo.co.uk/features/features/226401/20_things_you_didnt_know_about_f1.html">sample fuel and oil over the course of a weekend</a> to identify the onset of engine or gearbox issues, check on wear rates etc.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dancres.org/blitzblog/2012/05/29/product/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dancres.org/blitzblog/2012/05/29/product/</feedburner:origLink></item>
		<item>
		<title>Messaging</title>
		<link>http://feedproxy.google.com/~r/dancres/sweh/~3/tG2sy_VlCtQ/</link>
		<comments>http://www.dancres.org/blitzblog/2012/05/16/messaging/#comments</comments>
		<pubDate>Wed, 16 May 2012 18:56:26 +0000</pubDate>
		<dc:creator>Dan Creswell</dc:creator>
				<category><![CDATA[Distributed Systems]]></category>
		<category><![CDATA[messaging]]></category>
		<category><![CDATA[recovery]]></category>

		<guid isPermaLink="false">http://www.dancres.org/blitzblog/?p=415</guid>
		<description><![CDATA[Assume we wish to consume an ordered stream of messages. Assume one or more messages may be lost from the stream. Assume that we wish to develop and retain some state built against those messages. Assume each message contains a monotonically increasing sequence number or some other mechanism by which ordering and gaps in sequence ...</p><p><a href="http://www.dancres.org/blitzblog/2012/05/16/messaging/" class="more-link">Continue reading &#8216;Messaging&#8217; &#187;</a>]]></description>
				<content:encoded><![CDATA[<p>Assume we wish to consume an ordered stream of messages.</p>
<p>Assume one or more messages may be lost from the stream.</p>
<p>Assume that we wish to develop and retain some state built against those messages.</p>
<p>Assume each message contains a monotonically increasing sequence number or some other mechanism by which ordering and gaps in sequence can be identified.</p>
<p>If we encounter a gap in the sequence of messages, we need to recover all the messages between the last message we saw and the one we&#8217;ve just seen that allowed us to deduce messages were missing.</p>
<p>The simplest way to recover these messages is to query the originator of the messages. If the number of messages required is substantial this can be a costly operation. We can reduce the cost of this operation by having the source maintain a checkpoint that summarises the message stream prior to a particular point in history. Each checkpoint includes the sequence number of the last message that is digested into it.</p>
<p>When we detect a gap in the sequence, we query the source for its checkpoint and compare the low end of our gap with the sequence number in the checkpoint. If the low end of the gap is above the checkpoint&#8217;s sequence number, we request a replay of the message stream for just the gap; otherwise we recover the checkpoint and request replay of all messages from that point on.</p>
<p>This mechanism works well if we are, for example, trying to generate a somewhat partial price history from a market. One can afford gaps in this circumstance, and replay from a point in time (a checkpoint) is sufficient to get back in sync and recover a certain amount of lost history.</p>
<p>A complete history would require the ability to recover any range of log records from the earliest point in time for which the history is required. This presents the source with a problem, as it may need to retain all records for the duration of a market&#8217;s life. A related problem is one where the consumer of the messages is constructing some form of aggregated state that is not available in the checkpoint, such as a full order history for a market.</p>
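<p>The gap-versus-checkpoint decision can be sketched as follows. This is a minimal illustration that assumes integer sequence numbers increasing by one per message; the <code>Checkpoint</code> type, the <code>plan_recovery</code> function and the plan tuples it returns are hypothetical names, not part of any real messaging API.</p>

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    last_seq: int   # sequence number of the last message digested into the summary
    state: dict     # summarised state up to and including last_seq


def plan_recovery(last_seen, just_seen, checkpoint):
    """Return None when there is no gap, otherwise a recovery plan.

    The gap is the run of missing sequence numbers strictly between
    last_seen and just_seen.
    """
    gap_low, gap_high = last_seen + 1, just_seen - 1
    if gap_low > gap_high:
        return None  # consecutive messages, nothing was lost
    if gap_low > checkpoint.last_seq:
        # The whole gap is newer than the checkpoint: replay just the gap.
        return ("replay", gap_low, gap_high)
    # Some missing messages are already folded into the checkpoint: adopt the
    # checkpoint and replay everything after it (no upper bound).
    return ("checkpoint_then_replay", checkpoint.last_seq + 1, None)
```

<p>For example, having seen message 10 and then message 15, a consumer would replay 11 to 14 if the source checkpoint covers up to message 5, but would take the checkpoint and replay from 13 onwards if the checkpoint covers up to message 12.</p>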
<p>In such cases, it is desirable to have a number of replica consumers each of which independently produces an aggregate. These replicas shouldn&#8217;t all be located in the same area of the network as, if they are, it is highly possible they will all lose the same messages.</p>
<p>Each replica maintains a checkpoint of its aggregated state which, as per the case for the source, should contain the sequence number of the most recent message digested. A replica that finds itself out of date now contacts another replica for a recent checkpoint and requests replay of messages since the sequence number in the checkpoint from the originating source. The replica can then re-synthesise its state by applying the recovered messages to the checkpoint.</p>
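<p>A minimal sketch of that re-synthesis step, assuming a checkpoint is a pair of sequence number and state, and that replayed messages arrive in order as pairs of sequence number and payload. The <code>resync</code> and <code>record_order</code> names, and the shape of the aggregate, are hypothetical.</p>

```python
import copy


def resync(peer_checkpoint, replayed_messages, apply):
    """Rebuild state from a peer's checkpoint plus messages replayed by the source.

    peer_checkpoint: (last_seq, state) obtained from another replica
    replayed_messages: iterable of (seq, payload) in sequence order
    apply: function(state, payload) that folds one message into state
    """
    last_seq, state = peer_checkpoint
    state = copy.deepcopy(state)  # never mutate the peer's copy
    for seq, payload in replayed_messages:
        if last_seq >= seq:
            continue  # already digested into the checkpoint
        if seq != last_seq + 1:
            raise ValueError(f"gap during replay: expected {last_seq + 1}, got {seq}")
        apply(state, payload)
        last_seq = seq
    return last_seq, state


def record_order(state, payload):
    # Example aggregate: a full order history for one market.
    state["orders"].append(payload)


seq, state = resync((2, {"orders": ["a", "b"]}),
                    [(2, "b"), (3, "c"), (4, "d")],
                    record_order)
```

<p>The returned pair is itself a valid checkpoint, so the recovered replica can immediately serve as a recovery source for its peers.</p>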
<p>Because a consumer detects a loss of messages only from the arrival of later messages, natural pauses in the message stream (e.g. because market activity is absent) can result in the consumer being out of date for a significant time. A source can assist in this situation by emitting a checkpoint (or heartbeat) message periodically during inactivity. If the consumer knows the frequency of checkpoints, it can compute the maximum period of silence it should endure before seeing a message that is either a genuine update or a checkpoint. Should it see such an extended period of silence, it can immediately assume messages are lost and set about recovery.</p>
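<p>The silence check might look something like the sketch below. It assumes the consumer knows the heartbeat interval and adds a margin for jitter; the <code>SilenceDetector</code> name and the <code>tolerance</code> parameter are assumptions for illustration, and timestamps are passed in explicitly rather than read from a clock.</p>

```python
class SilenceDetector:
    """Flag when the stream has been quiet for longer than heartbeats allow."""

    def __init__(self, heartbeat_interval, tolerance=0.5):
        # Maximum silence is the heartbeat interval plus a margin for delay
        # and jitter (the tolerance factor is an illustrative assumption).
        self.max_silence = heartbeat_interval * (1 + tolerance)
        self.last_arrival = None

    def on_message(self, now):
        # Call for every arrival, whether a genuine update or a heartbeat.
        self.last_arrival = now

    def should_recover(self, now):
        if self.last_arrival is None:
            return False  # nothing seen yet, so there is no baseline to judge by
        return now - self.last_arrival > self.max_silence
```

<p>A consumer would call <code>should_recover</code> on a timer, passing something like <code>time.monotonic()</code>, and start the checkpoint and replay procedure as soon as it returns true.</p>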
]]></content:encoded>
			<wfw:commentRss>http://www.dancres.org/blitzblog/2012/05/16/messaging/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.dancres.org/blitzblog/2012/05/16/messaging/</feedburner:origLink></item>
	</channel>
</rss>
