<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>hstack</title>
	
	<link>http://hstack.org</link>
	<description>Blogging about the Hadoop software stack</description>
	<lastBuildDate>Wed, 25 May 2011 14:59:21 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/Hstack" /><feedburner:info uri="hstack" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Bucharest HBase/Hadoop Meetup with Lars George at Adobe</title>
		<link>http://feedproxy.google.com/~r/Hstack/~3/0GdqLp7AhgQ/</link>
		<comments>http://hstack.org/bucharest-hstack-meetup-with-lars-george/#comments</comments>
		<pubDate>Wed, 25 May 2011 14:59:21 +0000</pubDate>
		<dc:creator>Cosmin Lehene</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[meetup]]></category>

		<guid isPermaLink="false">http://hstack.org/?p=116</guid>
		<description><![CDATA[Next Tuesday (31st of May 2011) we&#8217;ll host an HBase/Hadoop meetup at the Adobe office in Bucharest. We&#8217;ll have Lars George &#8211; HBase committer, author of &#8220;HBase: The Definitive Guide&#8221;, Cloudera Solution Architect for Europe as a special guest. Our hope is to get to meet more HBase/Hadoop local users to share knowledge. So if [...]]]></description>
			<content:encoded><![CDATA[
<p>Next Tuesday (31st of May 2011) we&#8217;ll host an HBase/Hadoop meetup at the Adobe office in Bucharest. We&#8217;ll have <a href="http://www.larsgeorge.com/">Lars George</a> &#8211; HBase committer, author of &#8220;<a href="http://amzn.com/1449396100">HBase: The Definitive Guide</a>&#8221;, Cloudera Solution Architect for Europe as a special guest.</p>
<p>Our hope is to get to meet more local HBase/Hadoop users to share knowledge. So if you&#8217;re using HBase or Hadoop, or plan to use them, you&#8217;re welcome.</p>
<p>Leave a comment if you want to sign up for a lightning talk of up to 10 minutes.</p>
<p><strong>Agenda:</strong><br />
HBase intro &#8211; Lars George<br />
Big Data with HBase and Hadoop at Adobe<br />
Talk 3<br />
Lightning talks (10m each)<br />
HBase status and roadmap &#8211; Lars George<br />
Q&#038;A/Open discussion</p>
<p>After: beers at Rock&#8217;n Pasta or downtown</p>
<p><a href="http://hstack.eventbrite.com/">Register here</a></p>
]]></content:encoded>
			<wfw:commentRss>http://hstack.org/bucharest-hstack-meetup-with-lars-george/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://hstack.org/bucharest-hstack-meetup-with-lars-george/</feedburner:origLink></item>
		<item>
		<title>Hadoop/HBase automated deployment using Puppet</title>
		<link>http://feedproxy.google.com/~r/Hstack/~3/QPmj-2ODU8k/</link>
		<comments>http://hstack.org/hstack-automated-deployment-using-puppet/#comments</comments>
		<pubDate>Fri, 21 May 2010 14:38:46 +0000</pubDate>
		<dc:creator>Cristian Ivascu</dc:creator>
				<category><![CDATA[deployment]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[operations]]></category>
		<category><![CDATA[adobe]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[high-availability]]></category>
		<category><![CDATA[hstack]]></category>
		<category><![CDATA[nosql]]></category>
		<category><![CDATA[puppet]]></category>
		<category><![CDATA[recipes]]></category>
		<category><![CDATA[zookeeper]]></category>

		<guid isPermaLink="false">http://hstack.org/?p=93</guid>
		<description><![CDATA[Introduction Deploying and configuring Hadoop and HBase across clusters is a complex task. In this article I will show what we do to make it easier, and share the deployment recipes that we use. For the tl;dr crowd: go get the code here. Cool tools Before going into how we do things, here is the [...]]]></description>
			<content:encoded><![CDATA[
<div id="introduction">
<h2>Introduction</h2>
<p>Deploying and configuring Hadoop and HBase across clusters is a complex task. In this article I will show what we do to make it easier, and share the deployment recipes that we use.</p>
<p>For the <a title="Too long; didn't read" href="http://en.wikipedia.org/wiki/Wikipedia:TLDR" target="_blank">tl;dr</a> crowd: go get the code <a title="Repository for the puppet recipes" href="http://github.com/hstack/puppet" target="_blank">here</a>.</p>
</div>
<div id="cool-tools">
<h2>Cool tools</h2>
<p>Before going into how we do things, here is the list of tools that we are using, and which I will mention in this article. I will try to put a link next to any tool-specific term, but you can always refer to its specific home-page for further reference.</p>
<ul>
<li><a href="http://hudson-ci.org">Hudson</a> &#8211; this is a great CI server, and we are using it to build Hadoop, HBase, Zookeeper and more</li>
<li>The <a href="http://wiki.hudson-ci.org/display/HUDSON/Promoted+Builds+Plugin">Hudson Promoted Builds Plug-in</a> &#8211; allows defining operations that run after the build has finished, manually or automatically</li>
<li><a href="http://www.puppetlabs.com/">Puppet</a> &#8211; configuration management tool</li>
</ul>
<p>We don&#8217;t have a dedicated operations team to which we can hand off a list of instructions describing how we want our machines to look. The operations team helping us just makes sure the servers are in the rack, networked and powered up, but once we have a set of IPs (usually from <a href="http://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface">IPMI</a> cards) we&#8217;re good to go ourselves. We are our own <a href="http://www.jedi.be/blog/2010/02/12/what-is-this-devops-thing-anyway/">devops</a> team, and as such we try to automate as much as possible; the tools above help a lot.</p>
</div>
<div id="history">
<h2><span id="more-93"></span></h2>
<h2>History</h2>
<p>When we started with hstack, we did everything manually &#8211; created users, tweaked system properties and deployed Hadoop/HBase. By far the most time-consuming part was making sure the configuration files were the same on all machines. We wrote a basic shell script that we ran on the machine where we modified the configuration files, and it used scp to copy the files to the rest of the machines. Once the initial excitement wore off, we looked for a better way. The tools we looked at first were <a href="http://wiki.opscode.com/display/chef/Home">Chef</a> and <a href="http://www.puppetlabs.com/">Puppet</a>. We eventually chose Puppet, which at the time was richer and much better maintained. We went through their site wiki for documentation, used the fantastic <a href="http://apress.com/book/view/1590599780">Pulling Strings with Puppet</a> book, and in under a week we had created puppet recipes that could take an empty machine and turn it into an hstack node. Any functionality that we decided to implement later on (like the NameNode High-availability recipe) got its own puppet manifests right from the start.</p>
</div>
<div id="how-we-use-puppet">
<h2>How we use Puppet</h2>
<p>We tend to overuse puppet a bit, by using it as both our configuration management and deployment tool. To illustrate, these are the steps we take when deploying a new version of hstack to our cluster:</p>
<ol style="list-style-type: decimal;">
<li>Trigger a build in Hudson for Hadoop, HBase or anything else we want to deploy.</li>
<li>We click a link next to the newly completed build to push the resulting archives to the Puppet Master repository.</li>
<li>Using ssh, we start up the Puppet clients on each machine. In the future we want to keep them running at all times, but since we&#8217;re doing development on the current cluster, a running Puppet tends to mess with our tests (restarting daemons when we kill them for testing is an example).</li>
<li>The last step is to trigger the change in the configuration. The Puppet Master pulls the configuration from a git repository, so we just have to change the version number and push the new file back to source control.</li>
</ol>
<p>The heavy lifting in all of this process is done by Puppet &#8211; it figures out what nodes need to be upgraded, pushes the new archives, changes configuration files and restarts services. All in all you can just do the steps above, go for a coffee and when you come back the cluster is running the new version.</p>
<p>We are now open-sourcing, on GitHub, the puppet recipes for:</p>
<ul>
<li>creating the user under which the entire hstack runs.</li>
<li>changing system settings, like the ssh keys, authorizing machines to talk to each other, aliases for hadoop and hbase executables, /tmp rules.</li>
<li>standalone puppet module to deploy <a href="http://hadoop.apache.org/">Hadoop</a></li>
<li>standalone puppet module to configure the Hadoop NameNode in High-Availability mode via <a href="http://www.drbd.org/">DRBD</a>, <a href="http://www.linux-ha.org/wiki/Heartbeat">heartbeat</a> and <a href="https://mon.wiki.kernel.org/index.php/Main_Page">mon</a>. For more details on this recipe, check out the <a href="http://www.cloudera.com/blog/2009/07/hadoop-ha-configuration/">Cloudera blog post on this topic</a>.</li>
<li>standalone puppet module to deploy <a href="http://hadoop.apache.org/hbase/">HBase</a></li>
<li>standalone puppet module to deploy <a href="http://hadoop.apache.org/zookeeper/">Zookeeper</a>.</li>
</ul>
<p>All of the above are what we use day-to-day, and have been tested on CentOS 5.3 with puppet 0.25.1 and 0.25.4. To use them you need a functional puppet master (see details on how to configure puppet&#8217;s daemons <a href="http://docs.puppetlabs.com/guides/configuring.html">here</a>). Once you have that working, drop the code from our repository on top of your puppet master, and read on for the various options of each module / recipe.</p>
</div>
<div id="usage">
<h2>Usage</h2>
<p>The repository contains the three typical puppet folders:</p>
<ul>
<li><em>manifests</em> &#8211; aside from the mandatory site.pp and nodes.pp files, our repository also stores in this folder some recipes that weren&#8217;t big enough to warrant a module, as well as some utility definitions.</li>
<li><em>templates</em> &#8211; this folder stores configuration templates for the user and system. Hadoop and HBase templates are stored with their respective modules. Here you will have to add your own ssh keys and tweak system variables.</li>
<li><em>modules</em> &#8211; the bulk of the recipes lies here. A Puppet module is a self-contained unit that does one thing only. We named each folder by its purpose: <em>hadoop, hbase, mon, zookeeper</em>.</li>
</ul>
<p>In order to do a full deployment using these recipes, your nodes have to include both the general user and system recipes, as well as the modules. To adapt the final outcome to your specific cluster setup, there are a few variables that you can set, which I will explain for each module in turn.</p>
<div id="basic-node-setup">
<h3>Basic node setup</h3>
<p>In this brief section I will explain what needs to be defined in the puppet recipe in order to perform the basic setup of a machine (getting a dedicated user account to run hstack and setting the correct ssh permissions). The recipes we&#8217;re going to use are stored in <em>manifests/virtual.pp</em> and <em>manifests/environment.pp</em>.</p>
<p>First up, adding a new user to the system. The recipe for creating users does not work fully out of the box &#8211; you need to give it your own ssh keys to use. To do so, replace the files under <em>templates/ssh_keys/keys</em> with actual content (they&#8217;re just dummy files right now). You might also want to change the password &#8211; you need to write the already encrypted version into manifests/virtual.pp. The default password is <strong>hadoop</strong>, although it might not seem so.</p>
<p>Another customization you can make is to the system settings. These are controlled through the <em>environment.pp</em> file. In this file you can change the maximum number of file descriptors (default 200,000), the aliases for commands and the <a href="http://linux.about.com/library/cmd/blcmdl8_tmpwatch.htm">tmpwatch</a> daemon configuration.</p>
<p>Once this is done, you can add your first node, run the configuration and enjoy the results &#8211; a brand new user ready to take on Hadoop. :)</p>
</div>
<div id="deploying-hadoop">
<h3>Deploying Hadoop</h3>
<p>The hadoop module has options and configuration templates that you need to adapt to your setup. In the <em>modules/hadoop/templates/conf/</em> folder you can create a sub-folder for each separate environment you want to manage. This is where the hadoop (core, hdfs and map-reduce) configuration templates are stored, and it allows you to keep completely different templates while using the same code to push them to their respective locations. By default only the critical properties are filled in: some have values, but most use variables (you can spot them by the &lt;%= %&gt; marks). These variables can be set on a per-node basis, but I just define a base node with the common values and have all the other nodes in a cluster extend it. Some of the variables that you need to set are:</p>
<ul>
<li><code>$hadoop_version</code> &#8211; the version of hadoop to deploy. The recipes take the version to be the part of the archive name after hadoop-. For example, if your generated archive is called hadoop-core-0.21.0-31, the version is &#8220;core-0.21.0-31&#8221;. Another assumption is that the archive contains a folder with the same name, which will be unpacked in the user&#8217;s home (<em>/home/hadoop</em>) and then symlinked to the final <code>$HADOOP_HOME</code> destination. This tactic allows you to retain all versions ever deployed, and you can easily switch back to an older one by just re-creating the symbolic link to its folder.</li>
<li><code>$hadoop_home</code> &#8211; final directory to which the unpacked hadoop will be symlinked</li>
<li><code>$environment</code> &#8211; match this with the subfolder in the templates/conf that you want to use</li>
<li><code>$hadoop_default_fs_name</code> &#8211; the uri to the NameNode; this gets pushed into hdfs-site.xml and core-site.xml</li>
<li><code>$hadoop_namenode_dir</code> &#8211; the folder where the NameNode stores its files</li>
<li><code>$hadoop_datastore</code> &#8211; the folders where the datanode stores its files. This is expressed as a list: <code>["/var/folder1", "/var/folder2"]</code></li>
<li><code>$mapred_job_tracker</code> &#8211; uri to the Hadoop Map-Reduce job-tracker.</li>
<li><code>$hadoop_mapred_local</code> &#8211; the local folders where the TaskTrackers should store their files. This is also a list of folders.</li>
</ul>
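<p>To make the version and symlink convention described for <code>$hadoop_version</code> concrete, here is a minimal Python sketch. It is not part of the recipes (those are Puppet manifests); the function names, the accepted archive extensions and the way the link is swapped are assumptions made purely for illustration.</p>

```python
import os
import tempfile


def version_from_archive(archive_name):
    """Return the part of the archive name after 'hadoop-', minus any tar extension."""
    prefix = "hadoop-"
    if not archive_name.startswith(prefix):
        raise ValueError("unexpected archive name: " + archive_name)
    name = archive_name
    # Assumed extensions; the article only shows the bare archive name.
    for ext in (".tar.gz", ".tgz", ".tar"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    return name[len(prefix):]


def switch_version(user_home, hadoop_home, version):
    """Point the hadoop_home symlink at the unpacked folder for the given version."""
    target = os.path.join(user_home, "hadoop-" + version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    # Build the new link next to the old one, then rename it into place,
    # so hadoop_home never points at nothing while switching.
    tmp_link = hadoop_home + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, hadoop_home)
    return target
```

<p>With every deployed version kept side by side in the user&#8217;s home, rolling back amounts to calling <code>switch_version</code> again with the older version string.</p>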
</div>
<div id="hadoop-hdfs-high-avalability">
<h3>Hadoop HDFS High Availability</h3>
<p>To use this you need to dedicate two machines to the NameNode role and set some other variables (on both machines):</p>
<ul>
<li><code>$virtual_ip</code> &#8211; the common IP address that both computers will be sharing</li>
<li><code>$hostname_primary | $hostname_secondary</code> &#8211; the hostnames of the two machines to use</li>
<li><code>$ip_primary | $ip_secondary</code> &#8211; the IP addresses of both machines</li>
<li><code>$disk_dev_primary | $disk_dev_secondary</code> &#8211; the device to use as the starting partition for drbd</li>
<li><code>$hadoop_namenode_dir</code> &#8211; where the namenode will store its files; this will also be the mount point for the newly created <code>/dev/drbd</code></li>
</ul>
</div>
<div id="deploying-zookeeper">
<h3>Deploying Zookeeper</h3>
<p>A standalone zookeeper deployment is recommended for HBase. You should have at least 3 servers running zookeeper &#8211; if your cluster is low on load, you can just use 3 of your regionservers. To define a zookeeper node, import the zookeeper module and set these variables:</p>
<ul>
<li><code>$zookeeper_version</code> &#8211; same as for hadoop, the version to deploy</li>
<li><code>$zookeeper_datastore</code> &#8211; where to store the zookeeper data files</li>
<li><code>$zookeeper_datastore_log</code> &#8211; where to store the zookeeper log</li>
<li><code>$zookeeper_myid</code> &#8211; this needs to be configured for each individual node, as it assigns a unique id to the machine</li>
</ul>
</div>
<div id="deploying-hbase">
<h3>Deploying HBase</h3>
<p>The HBase deployment is very similar to the Hadoop one. You need to adjust the HBase configuration stored under <em>modules/hbase/templates/conf/<code>$environment</code></em> to add or remove any properties specific to your installation. In the configuration files you will spot some of the variables you have to set, like:</p>
<ul>
<li><code>$hbase_version</code> &#8211; what version to deploy; similar to the Hadoop case, this is part of the archive and folder name. If you use the default build script, it should pack it just right</li>
<li><code>$hbase_home</code> &#8211; final directory to which the unpacked hbase will be symlinked</li>
<li><code>$zookeeper_quorum</code> &#8211; comma-separated list of zookeeper nodes</li>
<li><code>$hbase_rootdir</code> &#8211; HBase directory in HDFS, usually <code>/hbase</code></li>
</ul>
<p>This was a quick run through some, but not all, of the options that each recipe provides and requires. I strongly encourage you to look at the provided <em>nodes.pp</em>, which has a sample node configured with all of the options. Also, as a best practice, when adding your own properties to the template configuration files, you should try to use variables and set those on a per-node basis where applicable.</p>
</div>
</div>
<div id="conclusion">
<h2>Conclusion</h2>
<p>Using puppet to deploy the entire stack in an easy, predictable way has helped us a lot: we no longer delay cluster-wide upgrades just because they would be too hard to do. If you are deploying hstack on your clusters and decide to use these recipes, let us know if you find any bugs. If you&#8217;re using another way of pushing data and configuration to your cluster, we&#8217;d like to hear about it as well in the comments.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://hstack.org/hstack-automated-deployment-using-puppet/feed/</wfw:commentRss>
		<slash:comments>23</slash:comments>
		<feedburner:origLink>http://hstack.org/hstack-automated-deployment-using-puppet/</feedburner:origLink></item>
		<item>
		<title>HBase Performance Testing</title>
		<link>http://feedproxy.google.com/~r/Hstack/~3/ZOoDLM9bCJI/</link>
		<comments>http://hstack.org/hbase-performance-testing/#comments</comments>
		<pubDate>Mon, 26 Apr 2010 18:18:13 +0000</pubDate>
		<dc:creator>Andrei Dragomir</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[adobe]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[nosql]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://hstack.org/?p=51</guid>
		<description><![CDATA[Performance is one of the most interesting characteristics in a system&#8217;s behavior. It&#8217;s challenging to talk about it, because performance measurements need to be accurate and in depth. Our purpose is to share our reasons for doing performance testing, our methodology as well as our initial results, and their interpretation. Hopefully, this will come in [...]]]></description>
			<content:encoded><![CDATA[
<p>Performance is one of the most interesting characteristics in a system&#8217;s behavior. It&#8217;s challenging to talk about it, because performance measurements need to be accurate and in depth.</p>
<p>Our purpose is to share our reasons for doing performance testing, our methodology as well as our initial results, and their interpretation. Hopefully, this will come in handy for other people.</p>
<p>The key take-aways here are that:</p>
<ul>
<li>Performance testing helps us determine the <strong>cost</strong> of our system; it helps size the hardware appropriately, so we don&#8217;t introduce hardware bottlenecks or spend too much money on expensive equipment.</li>
<li>A black-box approach (only the actual test results: average response time) is not enough. You need to validate the results by doing an in-depth analysis.</li>
<li>We test not only <strong>our</strong> software, but try to look at all the levels, including libraries and operating system. Don&#8217;t take anything for granted.</li>
</ul>
<p><span id="more-51"></span></p>
<div id="reasons">
<h2>Reasons</h2>
<p>So, why do we run performance tests? Apart from giving us peace of mind about whether our solution is good enough :D, here are some pragmatic reasons that will hopefully explain what we measure and how:</p>
<p>To keep our customers happy, our service needs to work fast. The hardware has to be used efficiently (more bang for your buck); performance tests help us detect the limits of our entire stack (software / operating system / hardware), as well as estimate and plan capacity.</p>
<p>Unfortunately, testing only isolated components can hide issues that appear only in a distributed context. Cluster load tests are sometimes the only way to ensure that the system behaves properly.</p>
<p>When you provide a service, you need to think about KPIs and SLAs. You have to tell your prospective clients how fast your system is, and what performance they can expect under a specific load. You can only guarantee performance under a certain load, so you have to design your cluster for it.</p>
<p>We run these tests regularly, so we can compare performance. This allows us to maintain the initial performance baseline, or at least know how our changes impact it.</p>
</div>
<div id="our-view-on-types-of-performance-tests">
<h1>Our view on types of performance tests</h1>
<p>We run several kinds of performance tests:</p>
<ol style="list-style-type: decimal;">
<li><strong>Benchmark</strong>: we run these light tests automatically, on the <a href="http://hudson-ci.org">continuous integration server</a>. These give us &#8220;comparative&#8221; results between any pair of HStack builds / deployments. They give a high-level, approximate measurement of the performance increase / decrease from commit to commit. So, if anybody adds a synchronized statement that serializes everything, we&#8217;ll know :).</li>
<li><strong>Load testing</strong>: tests should emulate a load similar to the normal or peak production conditions. We first run a performance test with 1 client. Then, when running with multiple clients, we can examine the behavior under concurrent access. Up to a threshold, the results should be the same.</li>
<li><strong>Stress testing</strong>: test with harder conditions than the normal load, and determine the limits of the system. Sometimes the resources of the system are intentionally reduced, to see the effect of the load. How does the system behave once you pull the plug on a server? Or three?</li>
<li><strong>Longevity tests</strong>: run a set of tests (for example, load tests), for a long time, to see how the system behaves under sustained load (for days). Does Java GC come into play? Will it swap?</li>
</ol>
<p>We have two main use cases we care about: <strong>real-time read / write</strong>, and <strong>processing (map/reduce jobs)</strong>.</p>
</div>
<div id="random-read--write-results">
<h1>Random Read / Write Results</h1>
<p>All of our tests have been performed on our 7-server cluster, using <a href="http://grinder.sourceforge.net">Grinder</a>. Each machine has dual Nehalem CPUs (8 cores), 32GB DDR3 and 24 x 146GB SAS 2.0 10K RPM disks.</p>
<p>The HBase test table is a <strong>long table</strong><a id="fnref1" class="footnoteRef" href="#fn1"><sup>1</sup></a> with <strong>3 billion records</strong> in about <strong>6600 regions</strong>. The data size is <strong>128-256</strong> bytes per row, spread over 1 to 5 columns. We&#8217;ll probably start testing with 1KB records for the next tests.</p>
<p>We use 4 injector servers (the thread count in the test results below is split equally between them). The injectors hit an F5 load balancer that sits in front of our cluster. We run get, put and combined tests (a mix of 70% gets and 30% puts).</p>
<p>We use a uniform random distribution to generate the keys for reads and writes, trying to hit a balanced amount of &#8220;hot&#8221; and &#8220;cold&#8221; data<a id="fnref2" class="footnoteRef" href="#fn2"><sup>2</sup></a>. We make sure that the data is at least one order of magnitude larger than the cluster’s RAM capacity in order to exercise the disks. Another good approach is to try to hit 100% IO utilization on your cluster’s disks (or as high as possible), so that you understand how your system behaves under load with a real-world data set size.</p>
<p>The <strong>load tests</strong> were performed using HBase 0.21 (TRUNK), the one available on 02.25.2010. HBase was running on Hadoop 0.21 trunk (same thing, the version available around that date). We&#8217;ll publish updated results for newer versions here or in another post.</p>
<table style="height: 201px;" width="797">
<tbody>
<tr class="header">
<th style="width: 11%;" align="left">Type</th>
<th style="width: 9%;" align="left">Threads</th>
<th style="width: 11%;" align="left">Mean Time (ms)</th>
<th style="width: 17%;" align="left">Time Standard Deviation (ms)</th>
<th style="width: 17%;" align="left">90th percentile line (ms)</th>
<th style="width: 23%;" align="left">Throughput<a id="fnref3" class="footnoteRef" href="#fn3"><sup>3</sup></a> (req / s)</th>
<th style="width: 10%;" align="left">Number of Requests</th>
</tr>
<tr class="odd">
<td align="left">Get</td>
<td align="left">300</td>
<td align="left">17.87</td>
<td align="left">29.87</td>
<td align="left">38.793</td>
<td align="left">16788</td>
<td align="left">2999374</td>
</tr>
<tr class="even">
<td align="left">Put</td>
<td align="left">300</td>
<td align="left">7.50</td>
<td align="left">15.35</td>
<td align="left">12.096</td>
<td align="left">40000</td>
<td align="left">2999978</td>
</tr>
<tr class="odd">
<td align="left">Get / Put</td>
<td align="left">300</td>
<td align="left">14.62</td>
<td align="left">15.83</td>
<td align="left">33.976</td>
<td align="left">20520</td>
<td align="left">2999994</td>
</tr>
<tr class="even">
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr class="odd">
<td align="left">Get</td>
<td align="left">600</td>
<td align="left">55.68</td>
<td align="left">66.92</td>
<td align="left">161.571</td>
<td align="left">10776</td>
<td align="left">5999737</td>
</tr>
<tr class="even">
<td align="left">Put</td>
<td align="left">600</td>
<td align="left">9.96</td>
<td align="left">17.08</td>
<td align="left">15.812</td>
<td align="left">60240</td>
<td align="left">5999891</td>
</tr>
<tr class="odd">
<td align="left">Get / Put</td>
<td align="left">600</td>
<td align="left">26.48</td>
<td align="left">33.10</td>
<td align="left">64.336</td>
<td align="left">22659</td>
<td align="left">5999783</td>
</tr>
</tbody>
</table>
<p>Here&#8217;s what the request time distribution looks like overall:</p>
<p style="text-align: center;"><img class="size-full wp-image-53 aligncenter" title="gets_300_threads" src="http://hstack.org/wp-content/uploads/2010/04/gets_300_threads.png" alt="" width="600" height="500" /></p>
<p>And here&#8217;s what it looks like zoomed in on the requests that took less than 20 ms:</p>
<p style="text-align: center;"><img class="size-full wp-image-54 aligncenter" title="gets_zoomed_300_threads" src="http://hstack.org/wp-content/uploads/2010/04/gets_zoomed_300_threads.png" alt="" width="600" height="500" /></p>
<p>Mean request time by itself doesn&#8217;t say much about the <strong>real</strong> performance of your system. We also output data distribution indicators.</p>
<p>The <a href="http://en.wikipedia.org/wiki/Standard_deviation">standard deviation</a> can expose problems hidden by the average, such as a big skew between request times. Take two datasets, [5ms, 5ms] and [0.1ms, 9.9ms]: they both have the same average of 5ms, but the standard deviation is different.</p>
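<p>As a quick check of that example, here is a small Python sketch (our own numbers were crunched with R; the choice of Python and of the population standard deviation here is just an assumption for illustration):</p>

```python
import statistics

# The two data sets from the example above, request times in ms.
steady = [5.0, 5.0]
skewed = [0.1, 9.9]


def summarize(times_ms):
    """Return (mean, population standard deviation) of request times in ms."""
    return statistics.mean(times_ms), statistics.pstdev(times_ms)


for name, sample in (("steady", steady), ("skewed", skewed)):
    mean, stdev = summarize(sample)
    print(f"{name}: mean={mean:.2f} ms, stdev={stdev:.2f} ms")
```

<p>Both samples report the same mean, while the standard deviation is zero for the first and close to the mean itself for the second, which is exactly the kind of skew the average hides.</p>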
<p>We calculate the 90th percentile line<a id="fnref4" class="footnoteRef" href="#fn4"><sup>4</sup></a> with <a href="http://www.r-project.org/">R</a>. This excludes from the data set the longest-running 10% of requests, which can distort your average. We also use R to generate the histograms of the entire distribution of request times.</p>
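<p>For readers who don&#8217;t use R, here is a rough Python sketch of the same idea. It uses the simple nearest-rank definition of a percentile, which is an assumption on my part (R offers several quantile definitions that can give slightly different values), and the sample data is made up.</p>

```python
import math
import statistics


def percentile_nearest_rank(times_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(times_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def trimmed_mean(times_ms, pct):
    """Mean of the samples up to and including the pct-th percentile."""
    ordered = sorted(times_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return statistics.mean(ordered[:rank])


# Made-up sample: nine fast requests and one slow outlier (ms).
sample = [4, 5, 5, 6, 6, 7, 7, 8, 9, 250]
print("mean of all requests:", statistics.mean(sample))
print("90th percentile line:", percentile_nearest_rank(sample, 90))
print("mean without the slowest 10%:", trimmed_mean(sample, 90))
```

<p>A single slow request is enough to pull the overall mean far away from what most clients experience, which is why we look at the percentile line next to the average.</p>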
<p>Statistics can help you augment the initial results. A basic data set can be &#8220;enriched&#8221; and carved in different ways that help you draw smarter conclusions. The least we can do is look at some indicators and decide if we are measuring the right things.</p>
<p>Overall, writes are faster than reads. See <a href="http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html">here</a> for an explanation of the write path in HBase. Basically, the WAL is an HDFS file, which is kept in memory on 3 replicas (depends on your replication factor), and flushed asynchronously. What we&#8217;d like to see here in terms of performance is single-row write latency approaching the hard drive seek time.</p>
<p>For reads, the average request time increases with the number of threads<a id="fnref5" class="footnoteRef" href="#fn5"><sup>5</sup></a>. This is somewhat expected, but only from a certain number of threads. Again, here we&#8217;d like to see almost linear performance up to, let&#8217;s say, 200 concurrent clients. In our case, the growth in request time is pretty much linear. Combined with the low utilization on the cluster, this leads us to believe that there are still contention points in the HBase code. <a href="#investigation">We solved some of them</a>, but there is still work to be done here.</p>
</div>
<div id="lies-damned-lies-and-statistics">
<h1>Lies, damned lies, and statistics</h1>
<p>In and of itself, saying &#8220;we can do X requests/sec with a 95th percentile line of Y ms&#8221; won&#8217;t tell you much about what&#8217;s happening under the covers. The kind of test that we showed above is much like a black box. The indicators you get or compute can tell you, at best, whether your system is &#8220;broken&#8221;. There are other things that we&#8217;d like to know, the most important being hardware usage.</p>
<p>The numbers are the tip of the iceberg; things become <strong>really interesting</strong> once we start looking under the hood, and interpreting the results.</p>
<p>When investigating performance issues you have to assume that &#8220;everybody lies&#8221;. It is crucial that you don&#8217;t stop at a simple capacity or latency result; you need to investigate every layer: the performance tool, your code, their code, third-party libraries, the OS and the hardware. Here&#8217;s how we went about it:</p>
<p>The first potential liar is your test, then your test tool &#8211; they could both have bugs so you need to double-check.</p>
<p>After you make sure that your tests are correct, you need to determine the testing tool overhead. For example, in Grinder, you write a test function in a Jython class, which Grinder wraps and instruments in order to get performance statistics. We added our own &#8220;witness&#8221;, in the inner most loop of the test function to give us the test time with <code>java.lang.System.nanoTime</code> to make sure that it&#8217;s correct.</p>
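<p>The &#8220;witness&#8221; idea can be sketched as follows. This is not our Grinder script: it is plain Python 3, where <code>time.perf_counter_ns</code> stands in for <code>java.lang.System.nanoTime</code>, and <code>fake_request</code> is a hypothetical placeholder for the real call under test.</p>

```python
import time

def timed_call(func, *args):
    """Run func(*args) and return (result, elapsed nanoseconds)."""
    start = time.perf_counter_ns()
    result = func(*args)
    elapsed_ns = time.perf_counter_ns() - start
    return result, elapsed_ns

def fake_request(row_key):
    # Stand-in for the real HBase/Thrift call; purely hypothetical.
    time.sleep(0.002)
    return "value-for-" + row_key

samples_ns = []
for i in range(5):
    _, elapsed = timed_call(fake_request, "row%d" % i)
    samples_ns.append(elapsed)

witness_mean_ms = sum(samples_ns) / len(samples_ns) / 1e6
print("witness mean: %.3f ms" % witness_mean_ms)
```

<p>Comparing the witness mean with the figure reported by the test tool for the same run gives an estimate of the overhead the tool adds.</p>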
<p>As I mentioned, our service is a thin layer, a shim over HBase that offers a thrift interface, and a couple of other things. We did the tests both with and without this layer(thrift), so we can spot any potential problems in it. It&#8217;s a good exercise to ask yourself: what&#8217;s the time added by each layer of my system? For us, right now, it&#8217;s negligible, but in the future, we can expect this overhead to grow, and we have a baseline for its value.</p>
<p><a id="investigation"></a></p>
<div id="down-and-up-again">
<h2>Down and up again</h2>
<p>At first we had erratic results: no two performance runs were the same, and our servers were almost idling. We started to get rid of the <strong>moving pieces</strong> in order to identify the hot spot. First we removed thrift, then the entire testing framework, and wrote a simple test case (basically a long Java loop) which we ran on the cluster. Then we got rid of the cluster and ran the test on a single node. We profiled the code, and then got rid of the large table too. We ended up running the test program locally, on a table with one row (which took the I/O overhead out of the equation, since everything was served from memory).</p>
<p>Finally, after looking at the profiler logs and with the help of the guys in <a href="irc://irc.freenode.net:6667/hbase"><code>#hbase</code></a>, we got to <a href="https://issues.apache.org/jira/browse/HBASE-2180">HBASE-2180</a>. After applying the patch, we went back up the stack, and added all the layers of complexity back, one by one. In the end server load increased, mean request time decreased and the results were consistent.</p>
<p>One problem with this approach is that by trying to pinpoint the issue, you can lose track of the big picture and might even end up hiding the initial cause. You have to measure performance every step of the way to make sure the problem still reproduces.</p>
<p>The second big problem was that sometimes the throughput went down to 0 req/s for short periods of time (blackouts). We first looked at all logs on all the nodes, and saw some weird timeouts between them. At the time, the servers and network were not under heavy load. After a lot of debugging we turned to the OS logs (there was nothing left :): in the <code>dmesg</code> output, iptables was screaming out <code>ip_conntrack: table full, dropping packet.</code> It turns out that on busy network servers, the firewall can get to a point where it will merrily drop connections<a id="fnref6" class="footnoteRef" href="#fn6"><sup>6</sup></a> if the number of tracked connections exceeds the default cap.</p>
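<p>A check for this condition is easy to automate. The sketch below is an illustration, not something we ran: the sample log lines and the table size are hypothetical, the message text is the one quoted above, and the 350 bytes per tracked connection is the approximate figure from footnote 6.</p>

```python
def count_conntrack_drops(log_lines):
    """Count kernel log lines reporting a full connection-tracking table."""
    needle = "table full, dropping packet"
    return sum(1 for line in log_lines if "conntrack" in line and needle in line)

def conntrack_memory_bytes(max_connections, bytes_per_entry=350):
    """Rough non-swappable kernel memory needed for a given table size."""
    return max_connections * bytes_per_entry

# Hypothetical dmesg excerpt; only the ip_conntrack message text comes from the post.
sample_dmesg = [
    "eth0: link up, 1000Mbps, full-duplex",
    "ip_conntrack: table full, dropping packet.",
    "ip_conntrack: table full, dropping packet.",
]
drops = count_conntrack_drops(sample_dmesg)
print("conntrack drops:", drops)

# Hypothetical table size of 262144 entries.
print("table memory: %.1f MiB" % (conntrack_memory_bytes(262144) / (1024 * 1024)))
```

<p>On a real machine the lines would come from the <code>dmesg</code> output; a non-zero count during a blackout points at the firewall rather than at HBase.</p>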
<p>We were so fixated on identifying problems in the upper layers, that we totally forgot about the operating system. It usually <a href="http://www.codinghorror.com/blog/2008/03/the-first-rule-of-programming-its-always-your-fault.html">IS your code</a>, but don&#8217;t take anything for granted when identifying performance problems. Low-level components, like network and file system play a big part in performance tests.</p>
</div>
<div id="performance-numbers-hard-mode">
<h2>Performance numbers: Hard Mode</h2>
<p>How do you determine when you hit the limits of your system? More than that, how do you identify the actual problems starting from the initial results? Simply pushing on until something crashes or stops won&#8217;t do it. What&#8217;s worse, it won&#8217;t tell you anything about whether you are walking through the hardware sweet spot, or if you are spending money in the right place.</p>
<p>In order to identify the limits or bottlenecks you need a way to look inside the system &#8211; something like an x-ray. For code there are profilers, instrumentation and log files. Linux and derivatives have plenty of tools that you can use to get information on how CPU, memory and I/O cycles (both HDD and network) are spent. We use an ad-hoc combination of <a href="http://en.wikipedia.org/wiki/Top_%28Unix%29">top</a>, <a href="http://htop.sourceforge.net/">htop</a>, <a href="http://en.wikipedia.org/wiki/Iostat">iostat</a>, <a href="http://dag.wieers.com/home-made/dstat/">dstat</a>, <a href="http://en.wikipedia.org/wiki/Netstat">netstat</a>, <a href="http://iptraf.seul.org/">iptraf</a>, <a href="http://linux.die.net/man/1/sar">sar</a>. Most of these are frontends to <code>/proc</code>, and functionality is duplicated between them. We look at different things, like overall network bandwidth used, CPU %, number of page faults, I/O queues average length, etc.</p>
<div id="can-you-give-me-an-example">
<h3>Can you give me an example?</h3>
<p>Here are some interesting statistics obtained by real-time monitoring of our cluster when doing load tests:</p>
<ul>
<li>During writes, the most relevant statistic is the network bandwidth: 600 Mbps &#8211; 800 Mbps, peaking over 1000Mbps, ~670Mbps incoming traffic<a id="fnref7" class="footnoteRef" href="#fn7"><sup>7</sup></a>. The iostat service time ( <code>svctm</code> ) rises to a couple of milliseconds, CPU usage is split in half (50%-50%) between user and kernel space, and overall at 30% (or 70% idle).</li>
<li>Cumulative read/write statistics are about 300-600 Mbps, peaking at 900 Mbps.</li>
<li>During read tests, things are different; network bandwidth utilization is about 10 times less (60 Mbps) &#8211; a row is small, there is no region reassignment churn. Also iostat await time is about 7ms, and average queue size is about 1. Again, the CPU is not used.</li>
</ul>
<p>Locks, or rather sub-optimal locks in the code are an &#8220;invisible resource&#8221;. Utilization patterns where CPU is low, network bandwidth is low and the disks can keep up with the requests indicate some type of resource or lock contention in a system.</p>
<p>Another problem we identified using iostat was that on one hard drive, we had twice the number of requests and twice the wait time. It was due to a bug in our kickstart<a id="fnref8" class="footnoteRef" href="#fn8"><sup>8</sup></a> bootstrap scripts, which instead of setting up one partition on each drive, created two partitions on the last but one. Performance testing not only gives you a baseline but it helps <strong>validate</strong> the configuration of your system.</p>
</div>
</div>
</div>
<div id="map-reduce">
<h1>Map-Reduce</h1>
<p>Map-Reduce performance, and implicitly scan performance is very good. For testing, we are just running the map-reduce job itself. We have enough data and machines that we think we don&#8217;t need another test harness. We simply run the map-reduce jobs and compute statistics at the end.</p>
<p>Our real world jobs crunch through a large table, with ~530M large rows<a id="fnref9" class="footnoteRef" href="#fn9"><sup>9</sup></a>. For map/reduce, we are crunching through this data at about 50000 req/s. We are able to process the entire table for statistics in about 2h 40min. For processing smaller rows (like reading all the rows in the 3 billion record test table), we see 200000 req/s throughput (peaking at 400000 req/s).</p>
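<p>As a sanity check, the large-table numbers can be cross-checked with simple arithmetic. This sketch assumes one request per row and reads the duration as 2 hours and 40 minutes.</p>

```python
rows = 530_000_000                 # ~530M rows, from the text
duration_s = 2 * 3600 + 40 * 60    # about 2 h 40 min, read as 9600 seconds

# Assumption: one request per row, so rows/s equals req/s.
rows_per_s = rows / duration_s
print("average rate: %.0f rows/s" % rows_per_s)
```

<p>This gives roughly 55000 rows/s, which is in line with the quoted throughput of about 50000 req/s.</p>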
<p>Tight Hadoop integration is one of HBase&#8217;s core features, and we want to see this translated in great Map/Reduce performance. The most important statistic here is CPU usage, which is high (60%-70%), with all cores used, followed by network bandwidth utilization<a id="fnref10" class="footnoteRef" href="#fn10"><sup>10</sup></a>.</p>
<div id="cost">
<h2>Cost</h2>
<p>We are very careful what we spend our money on. Under-dimensioned hardware is a performance killer, but over-dimensioned hardware increases our costs. There&#8217;s no point in sticking 20 hard drives in a server if the CPU / mainboard can&#8217;t handle it. In the same vein, having huge processing power and memory is useless when your network communication is strangled by a 100Mbps network board, or even worse, by another poorly dimensioned piece of equipment down the road (core switch, etc). All the components of a machine (and by extension, of the cluster) have to be the correct size.</p>
<p>Our configuration uses 24 hard disks per server. Increasing the number of spindles significantly improves I/O by parallelization. But there&#8217;s always a push / pull relation between failover, performance and costs.</p>
<p>With lots of nodes, processing power (MapReduce) is higher, but you eat more power / network traffic in the data center (which translates into bigger costs). On the other hand, with a smaller number of beefier servers, power utilization is smaller, but you run the risk of having big churn during failover (a machine dies)<a id="fnref11" class="footnoteRef" href="#fn11"><sup>11</sup></a>, because there&#8217;s a lot of data that needs to get re-balanced in HDFS.</p>
<p>The point is that we&#8217;re careful about costs. We try to be thorough, get an in-depth look on how our hardware is performing, and decide whether we need to upscale or downscale our components for optimal usage.</p>
<p>Therefore in our production cluster, we decided to go with 16 2.5&#8243; disks instead of 24, trading spindles for less power usage. 16 disks are enough for our current usage, and we can always add more when we run out of space.</p>
</div>
<div id="hammers-anvils-and-other-assorted-sundries">
<h2>Hammers, anvils and other assorted sundries</h2>
<p>We use a bunch of great tools that really made life easier for us:</p>
<p>The main performance tool that we use is <a href="http://grinder.sourceforge.net">Grinder</a>. Grinder lets you write your tests in Jython, and plug them in with whatever you want to test (DNS speed or weird protocols? No problem). HTTP is built in; it has a console that lets you orchestrate your tests across multiple injector nodes. Grinder has a steep learning curve, but it&#8217;s worth it, and we highly recommend it.</p>
<p>I already mentioned all the system monitoring tools that we use. When the performance tests are finished, we use <a href="http://www.r-project.org/">R</a> to load the data sets. R is great because it allows for an incremental approach to data analysis. It&#8217;s really fast, and you can dice the data, look at the distributions, eliminate outliers, etc. Also highly recommended.</p>
<p>More specific details on how we actually use everything, with recipes, in a later post. Also, now that <a href="http://wiki.github.com/brianfrankcooper/YCSB/">YCSB</a> just came out as open-source, we&#8217;ll give that a go as well.</p>
</div>
</div>
<div id="conclusions">
<h1>Conclusions</h1>
<p>We&#8217;re happy with our current results. We know that there are a lot of small things we need to do that will improve performance. Our current focus is to shape the whole load test methodology so that we can decide on the best cost / performance hardware configuration.</p>
<p>Andrei.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1">because of the columnar storage, tables grow in both dimensions, like an excel sheet <a class="footnoteBackLink" title="Jump back to footnote 1" href="#fnref1">↩</a></li>
<li id="fn2">The initial 3 billion rows in the table were added with a map/reduce job. <a class="footnoteBackLink" title="Jump back to footnote 2" href="#fnref2">↩</a></li>
<li id="fn3">The throughput in the table is computed by using the formula <code>(1000 / average request time ms) * number of threads</code>. <a class="footnoteBackLink" title="Jump back to footnote 3" href="#fnref3">↩</a></li>
<li id="fn4">what is the smallest number bigger than 90% of my (ordered) data set ? <a class="footnoteBackLink" title="Jump back to footnote 4" href="#fnref4">↩</a></li>
<li id="fn5">the growth correlation is kept, we only show here some results. We have more, but it&#8217;s pretty much more of the same. <a class="footnoteBackLink" title="Jump back to footnote 5" href="#fnref5">↩</a></li>
<li id="fn6">you can easily increase the maximum number of tracked connections, but be aware that each tracked connection eats about 350 bytes of non-swappable kernel memory! <a class="footnoteBackLink" title="Jump back to footnote 6" href="#fnref6">↩</a></li>
<li id="fn7">These should be taken with a grain of salt, because we have 1Gbps network equipment, so theoretically, we should hit a practical limit at about 800 Mbps. The &#8220;over 1000Mbps&#8221; results are probably a glitch in iptraf. <a class="footnoteBackLink" title="Jump back to footnote 7" href="#fnref7">↩</a></li>
<li id="fn8">We have a script that generates kickstart files according to machine configuration. The kickstart files install the base CentOS image, and setup the partitions using the raid card command line tool. <a class="footnoteBackLink" title="Jump back to footnote 8" href="#fnref8">↩</a></li>
<li id="fn9">The rows represent photo meta data, and most of them have EXIF information, including EXIF comments, etc. <a class="footnoteBackLink" title="Jump back to footnote 9" href="#fnref9">↩</a></li>
<li id="fn10">(HBase does not yet have data locality, but after running the cluster for a while, due to compactions, most regions will be local to the regionserver) <a class="footnoteBackLink" title="Jump back to footnote 10" href="#fnref10">↩</a></li>
<li id="fn11">When a machine in an HBase cluster dies, the master reassigns the regions that were served by that machine elsewhere. All the machines run both a regionserver and an HDFS datanode, which also gets replicated when a machine dies. <a class="footnoteBackLink" title="Jump back to footnote 11" href="#fnref11">↩</a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://hstack.org/hbase-performance-testing/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		<feedburner:origLink>http://hstack.org/hbase-performance-testing/</feedburner:origLink></item>
		<item>
		<title>Why we’re using HBase: Part 2</title>
		<link>http://feedproxy.google.com/~r/Hstack/~3/AHNXE2qFEoM/</link>
		<comments>http://hstack.org/why-were-using-hbase-part-2/#comments</comments>
		<pubDate>Tue, 16 Mar 2010 12:50:09 +0000</pubDate>
		<dc:creator>Cosmin Lehene</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[adobe]]></category>
		<category><![CDATA[hstack]]></category>
		<category><![CDATA[nosql]]></category>

		<guid isPermaLink="false">http://hstack.org/?p=27</guid>
		<description><![CDATA[The first part of this article is about our success with the technologies we have chosen. Here are some more arguments (by no means exhaustive :P) about why we think HBase is the best fit for our team. We are trying to explain our train of thought, so other people can at least ask the questions [...]]]></description>
			<content:encoded><![CDATA[
<div id="why-were-using-hbase-part-2">
<p><a href="http://hstack.org/why-were-using-hbase-part-1/">The first part of this article</a> is about our success with the technologies we have chosen. Here are some more arguments (by no means exhaustive :P) about why we think HBase is the best fit for our team. We are trying to explain our train of thought, so other people can at least <strong>ask</strong> the questions that we did, even if they don&#8217;t reach the same conclusion.</p>
<div id="how-we-work">
<h2>How we work</h2>
<p>We usually develop against trunk code (for both Hadoop and HBase) using a mirror of the Apache Git repositories. We don&#8217;t confine ourselves to released versions only, because we implement <a href="https://issues.apache.org/jira/browse/HDFS-630">fixes</a>, and there are always new features we need or want to evaluate. We test a large variety of conditions and find a variety of problems &#8211; from HBase or HDFS corruption to data loss etc. Usually we <a href="https://issues.apache.org/jira/browse/HBASE">report them</a>, fix them and move on. Our latest headache from working with unreleased versions was <a href="https://issues.apache.org/jira/browse/HDFS-909">HDFS-909</a> that causes the corruption of the NameNode &#8220;edits&#8221; file by losing a byte. We were comfortable enough with the system to manually fix the &#8220;edits&#8221; binary file in a hex editor so we could bring the cluster back online quickly, and then track the actual cause by analyzing the code. It wasn&#8217;t a critical situation per se, but this kind of &#8220;training&#8221; and deep familiarity with the code gives us a certain level of trust regarding our abilities to handle real situations.</p>
<p>It&#8217;s great to see that it gets harder and harder to find critical bugs these days, however, we still brutalize our clusters and take all precautions when it comes to data integrity<sup><a id="fnref1" href="#fn1">1</a></sup>.<span id="more-27"></span></p>
<p>Testing distributed systems is <strong>hard</strong>. There aren&#8217;t many tools or resources. There are some promises for performance and scalability benchmarking tools, thanks to Yahoo! (we too await the open-sourcing of the YCSB tool), but right now you have to roll your own, and it takes time. There&#8217;s no clear test plan for distributed systems, no failover benchmarking tool, nothing on reliability, availability or data consistency.</p>
<p>Your system can fail no matter how well you thought you tested it, even if it&#8217;s sunny outside and you&#8217;re throwing a party (especially then). Google, Twitter, Amazon &#8211; all have had downtime. Everyone fails once in a while. It&#8217;s only a matter of time and users tend to tolerate a short downtime or performance degradations. On the other hand, what users will not tolerate is <strong>losing their data</strong><sup><a id="fnref2" href="#fn2">2</a></sup>. We are completely paranoid about losing data. If other failure scenarios resulting in degraded performance or even a little downtime are bearable, losing data is not.</p>
<p>We try to learn our systems by heart and be able to fix anything fast or even while the system is running. After more than a year, we&#8217;re OK with keeping data for our clients<sup><a id="fnref3" href="#fn3">3</a></sup>, but we&#8217;re still testing and taking precautions.</p>
<p>Really thorough testing of ALL the solutions that we use has paid off. It&#8217;s a lot of work, because you have to build the scaffolding for it, but going back and forth keeping up with the changes and pushing fixes is a sure way to know the system in depth.</p>
</div>
<div id="demystifying-hbase-data-integrity-availability-and-performance">
<h2>Demystifying HBase Data integrity, Availability and Performance</h2>
<p>It&#8217;s good to know the strengths of a system, but it&#8217;s more important to be <strong>aware</strong> of and <strong>understand</strong> its limitations. We have extensive suites of tests covering both of them. Testing performance is pretty straightforward, but testing data integrity is the hardest, and this is where we spend most of our time.</p>
<div id="data-integrity">
<h3>Data Integrity</h3>
<p>Integrity implies the data has to reach &#8220;safety&#8221; before confirming the request to the client, regardless of any hardware failures that might happen.</p>
<p>HBase confirms a write after its write-ahead log reaches 3<sup><a id="fnref4" href="#fn4">4</a></sup> in-memory HDFS replicas<sup><a id="fnref5" href="#fn5">5</a></sup>. Statistically, it&#8217;s a rather small<sup><a id="fnref6" href="#fn6">6</a></sup> probability to lose 3 machines at the same time (unless all of your racks are on the same electrical circuit, power transfer switch, PDU or UPS). HDFS is rack aware so it will place your replicas on different racks. If you place and power your machines correctly, this is safe in most cases. If this won&#8217;t be enough for our clients, in certain critical applications, we will come up with stronger guarantees. (E.g. make sure that data is flushed to disk on all 3 replicas &#8211; not here today).</p>
<p>There are many questions that arise even if you do <strong>flush to disk</strong>. In a full power loss scenario, even if you flush to disk you need to consider OS cache, file system journaling, RAID cache and then disk cache. So it&#8217;s debatable whether your data is safe after a flush to disk. We use battery-backed write cache RAID cards that disable the disk caches. However, we&#8217;d rather make sure our racks are powered correctly than rely on disk flush.</p>
<p>Most of our development efforts go towards data integrity. We have a draconian set of failover scenarios. We try to guarantee every byte regardless of the failure and we&#8217;re committed to fixing any HBase or HDFS bug that would imply data corruption or data loss before letting any of our production clients write a single byte.</p>
</div>
<div id="availability">
<h3>Availability</h3>
<p>We feel the same about availability. When a machine dies, data served by that box will be unavailable for a short window<sup><a id="fnref7" href="#fn7">7</a></sup>. This is a current limitation. We know how to make the system 100% available when you lose a box or two, but we have prioritized other work and have not put effort into it yet. The reality is that we can afford a short<sup><a id="fnref8" href="#fn8">8</a></sup> downtime for data partitions &#8211; as long as we don&#8217;t lose any data. We also don&#8217;t expect machines to fail too often.</p>
<p>So, for us, <a href="http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/">juggling with Consistency, Availability and Partition tolerance</a> is not as important as making sure that data is 100% safe.</p>
</div>
<div id="random-reads">
<h3>Random Reads</h3>
<p>HBase<sup><a id="fnref9" href="#fn9">9</a></sup> performance is good enough for us. That is, it&#8217;s more than we need right now. Would you strive to reach 5ms read performance instead of 10ms, or 1 second max unavailability instead of a few minutes when a server crashes, if you can&#8217;t guarantee data safety? We wouldn&#8217;t. Just as you&#8217;d accept a credit card failure once, you wouldn&#8217;t accept your accounts being wiped out anytime. So, we choose to spend our resources on ensuring data integrity.</p>
<p>Getting close to the 7ms average <strong>disk</strong> response time<sup><a id="fnref10" href="#fn10">10</a></sup>, for small records, is possible with the current architecture. As always, the devil is in the implementation details. The architecture promises linear scalability, but it&#8217;s the implementation that makes it reliable. Moreover, we all know that data isn&#8217;t accessed uniformly at random &#8211; that is the worst case scenario. We get ~1ms reads for data in memory today, and the read performance and throughput can be improved tenfold by adding caching (see <a href="http://highscalability.com/how-google-taught-me-cache-and-cash">this article</a>).</p>
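<p>A minimal sketch of how the average read latency moves between the two figures quoted above as more reads are served from memory. The hit ratios are hypothetical, and the model is deliberately simple: it ignores queueing and assumes every read costs either the in-memory time or the disk time. A dedicated cache in front of HBase could be faster than the ~1ms figure; pass a smaller <code>memory_ms</code> to model that.</p>

```python
def expected_read_ms(hit_ratio, memory_ms=1.0, disk_ms=7.0):
    """Average read latency when a fraction `hit_ratio` of reads is served from memory."""
    return hit_ratio * memory_ms + (1.0 - hit_ratio) * disk_ms

# Hit ratios are hypothetical; the 1 ms and 7 ms defaults are the figures quoted above.
for hit_ratio in (0.0, 0.5, 0.9, 0.99):
    print("hit ratio %.2f: %.2f ms" % (hit_ratio, expected_read_ms(hit_ratio)))
```

<p>With a 90% hit ratio the average drops from 7 ms to about 1.6 ms, which is why non-uniform access patterns make the worst-case figure pessimistic.</p>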
<p>Our performance results are notably better than the ones in the <a href="http://www.brianfrankcooper.net/pubs/ycsb.pdf">YCSB test paper</a>. That&#8217;s for another post though.</p>
<p>Availability and Random Read Performance are possible &#8220;limitations&#8221; that we are OK with (for now); we are extremely happy with <strong>random write</strong> and <strong>sequential read</strong> performance against billions of rows<sup><a id="fnref11" href="#fn11">11</a></sup>, however.</p>
</div>
<div id="writes">
<h3>Writes</h3>
<p>While you can cache for reads, scaling writes is harder. Write performance and sequential read performance enable two of the most important use cases: heavy write volumes<sup><a id="fnref12" href="#fn12">12</a></sup> and efficient distributed processing.</p>
<p>HBase has great random-write performance. We are using HBase 0.21, which <strong>DOES</strong> sync the write-ahead log after every put call in the RegionServer, so the data is in the write buffer of at least 3 nodes<sup><a id="fnref13" href="#fn13">13</a></sup>. In an RDBMS for example, you can replicate the data for improved read performance, but you can&#8217;t scale writes and total data size, unless you partition it. And when you partition the data you lose the original properties such as transactions and consistency, and your operational costs can skyrocket.</p>
<p>As systems mature, great write performance will not be solely an HBase advantage; we expect other storages to reach this performance, just as we expect HBase to reach the optimal random-read performance.</p>
<p>But what use is in being able to keep such large amounts of data without being able to process them efficiently?</p>
</div>
<div id="sequential-reads-scans">
<h3>Sequential Reads (Scans)</h3>
<p>Again, our tests using MapReduce show great performance. HBase is built on the <a href="http://labs.google.com/papers/bigtable-osdi06.pdf">Bigtable</a> architecture, which was designed to work with <a href="http://labs.google.com/papers/mapreduce.html">MapReduce</a>, which also makes it a great fit for OLAP. Data location is deterministic and sequential rows are stored sequentially on disk, so HBase can read every 256 MB (configurable) of your table in a single request because data is not fragmented. It can do it in parallel too. So given enough processing power you can have each disk reading at full throttle.</p>
</div>
</div>
<div id="full-consistency">
<h2>Full Consistency</h2>
<p>HBase is an inherently consistent system. After you write something, modifications are immediately available. You can&#8217;t get stale data, or have to reason about <a href="http://en.wikipedia.org/wiki/Quorum_(Distributed_Systems)">quorum reads</a>. We think consistency is <a href="http://perspectives.mvdirona.com/2010/02/24/ILoveEventualConsistencyBut.aspx">good</a>, for a multitude of reasons: if you write an application over a consistent system, application logic is much simpler. One doesn&#8217;t have to take into account stale data, it&#8217;s just like single threaded programming: you&#8217;re going to read what you&#8217;ve written earlier<sup><a id="fnref14" href="#fn14">14</a></sup>. Also, consistency is a solid base to build more complex primitives: transactions and indexes, increment operations, test-and-set semantics, etc.</p>
<p>It all comes down to engineering choices: it&#8217;s a good exercise for the reader to determine if a system which defaults to eventual consistency, that can accept writes at any moment on any node (e.g. using consistent hashing) and has data fragmented across the cluster, can perform optimally when it comes to sequential reads. How much network chat is needed to do a table scan?</p>
<p>It&#8217;s all about what we think is a sensible default: availability and partition tolerance deal with relatively isolated scenarios: you can compute the probability of losing a node or getting your network split in two. It&#8217;s relatively low. However, consistency is something you deal with in <strong>every</strong> operation you do.</p>
<p>This is all getting a little philosophical, but here&#8217;s a list of questions (not rhetorical ones), related to this:</p>
<p>An eventually consistent system could be configured to support full consistency and/or data ordering. Would this impact or degrade other attributes like availability and performance? A system that can juggle C, A and P is quite flexible. But what part of CAP do you want to support by default, and what&#8217;s the impact when you change it afterwards?</p>
<p>Our assumption is that building on consistency is an appealing and sound decision, and any architecture that doesn&#8217;t handle this in its default design will lose the performance and availability when forcing it later on. Partition tolerance is not something that we think is worth handling within a single datacenter (redundant datacenter equipment investments are pretty much the norm for both electricity and networking). We do however <a href="http://issues.apache.org/jira/browse/HBASE-2223">care</a> about partitioning when doing multi-datacenter replication.</p>
<p>Which of C, A or P do you think will hold the greatest impact? (Hint: for us it&#8217;s C :D)</p>
</div>
<div id="hbases-edge-is-in-the-h">
<h2>HBase&#8217;s edge is in the <strong>H</strong></h2>
<p>We created a fair amount of tests that we maltreat our system with. It takes effort to implement correct fault tolerance and there&#8217;s an advantage in relying on Hadoop for it. Also, in the last 4 years, there was a large client base<sup><a id="fnref15" href="#fn15">15</a></sup> that validated Hadoop by using it in their production systems (especially Yahoo!). This had a real impact in the stability and fault tolerance of the system.</p>
<p>Now consider a different system, built from scratch. It has to enable and test all that Hadoop does starting from 0 (again, architecture is a promise, but implementation is what you USE). In a best case scenario, this system will gather a critical mass, a community will be created, and it will evolve organically, etc. Hadoop and HBase have that today, and it&#8217;s a big advantage.</p>
<p>We pride ourselves in keeping up with new technologies, but we think that Hadoop and HBase are over the &#8220;safety&#8221; threshold, for what we need to do.</p>
<p>HBase has an &#8220;edge&#8221; in Hadoop over other technologies, in that, just like Hadoop, it fills the gap between storage scalability, fast processing and cost-efficiency. Why was Hadoop successful? We think it&#8217;s because it didn&#8217;t rely on a narrow vertical need. Hadoop did not build something that was impossible before. We had NAS systems, and OLAP cubes for data processing, but Hadoop made this possible for any development group, with little initial<sup><a id="fnref16" href="#fn16">16</a></sup> investment, hence <strong>democratizing</strong> scalable data processing.</p>
</div>
<div id="about-support">
<h2>About support</h2>
<p>Many Hadoop developers are paid by companies which use Hadoop and see its value. Corporate sponsorship is a catalyst for progress in open-source systems (see MySQL, Eclipse etc.). Hadoop got started as a component in the <a href="http://lucene.apache.org/nutch/">Nutch search engine</a>, but it was Yahoo that invested resources, and helped make it a success story.</p>
<p>You can dig into Hadoop&#8217;s architecture and learn how it works (just like we did), or you could take advantage of the large ecosystem around it. The community reports bugs and helps people get started, and there are books and even paid support (look at <a href="http://www.cloudera.com">Cloudera</a>).</p>
<p>How does this relate to HBase? HBase is the Hadoop database. It has the best Hadoop integration. It uses HDFS for storage and MapReduce for distributed processing. Once you have a Hadoop cluster, you already have one half of an HBase cluster. It&#8217;s only natural that companies that are using Hadoop will be looking at HBase, if they aren&#8217;t already using it. And, following Hadoop&#8217;s model, they will invest resources and money, adding to HBase&#8217;s momentum. The <a href="http://wiki.apache.org/hadoop/Hbase/PoweredBy">companies</a> that use HBase today sustain the core HBase development team. We, too, are contributing back to both HBase and Hadoop. It&#8217;s only natural to invest in something that supports your business.</p>
</div>
<div id="complexity">
<h2>Complexity</h2>
<p>HBase is more complex than other systems (you need Hadoop, Zookeeper, cluster machines have multiple roles). We believe that for HBase, this is not accidental complexity and that the argument that &#8220;HBase is not a good choice because it is complex&#8221; is irrelevant. The advantages far outweigh the problems. Relying on decoupled components plays nicely with the Unix philosophy: do one thing and do it well. Distributed storage is delegated to HDFS, distributed processing to MapReduce, and cluster state to Zookeeper. All these systems are developed and tested separately, and are good at what they do. More than that, this allows you to scale your cluster on separate vectors. This is not optimal, but it <strong>allows for incremental investment</strong> in either spindles, CPU or RAM. You don&#8217;t have to add them all at the same time.</p>
<p>The HStack can be a pain to deploy. We took some time to understand the problem, and now we have Puppet recipes for everything. We can set up a cluster completely unattended. We&#8217;ll try to push all these back to the community and help other users have it easier, so stay tuned.</p>
<p>Zookeeper, Hadoop etc., let us implement transactions, simple queries and data processing. We want to have a system with such capabilities (these are requirements for large applications). Yes, you can drop some of them but you can&#8217;t drop them all. We don&#8217;t want a tool that&#8217;s missing too much. We want the good parts from an RDBMS, like queries and transactions, while still having distributed processing and cheap scalability. We don&#8217;t drink the &#8220;NoSQL&#8221; kool-aid. We&#8217;re not running away <strong>from</strong> SQL, we&#8217;re running <strong>towards</strong> something that is built from scratch on the premises of scalability and high availability.</p>
</div>
<div id="about-community-and-leadership">
<h2>About Community and Leadership</h2>
<p>This is the biased part of the article, and it should be, because it&#8217;s about our relationship with the HBase development team. Stack, Ryan and JD were always very receptive. They always help with the issues that we have, whether it&#8217;s a bug or a new feature that we need. There&#8217;s an open and democratic decision process when prioritizing work with HBase. The team is well balanced and there&#8217;s not a single company that drives HBase&#8217;s direction.</p>
<p>They are genuinely passionate about their work and strive to have it used by people. We attended one of their regular developer meetups and it was eye opening that developers coming from different backgrounds and companies are working together as a team. We think open-source projects benefit from good leadership and Michael Stack has done a great job with HBase.</p>
<p>Another aspect that appeals to us is the maturity of the development team. They focus on long term benefits. For example, the current focus is to improve the architecture of HMaster and multi-datacenter replication. However, in light of recent <a href="http://www.brianfrankcooper.net/pubs/ycsb.pdf">performance benchmark reports</a> they took the time to understand the situation, validated with the community that it&#8217;s OK to stick to the current plan and didn&#8217;t switch focus.</p>
<p>Maturity is also shown in the way that the team positions HBase in relation to other competing projects; they let facts speak rather than opinions. They don&#8217;t engage in holy wars and this, to us, seems the right way to build a healthy community.</p>
<p>In the end we&#8217;d hope technologies wouldn&#8217;t be dismissed based on superficial or biased perception, FUD, or tweets. We don&#8217;t like it when talks are based on assumptions without knowing ALL the details of a certain problem or technical choice, and this seems to become a vicious trend in some circles. Hopefully, the reasons explained in this article can help you make your own informed assessments, and see what works for you.</p>
<p>Cosmin.</p>
</div>
</div>
<div>
<hr />
<ol>
<li id="fn1">If anyone knows how to remotely &#8220;break&#8221; a network card, or RAM stick, please, let us know :) <a title="Jump back to footnote 1" href="#fnref1">↩</a></li>
<li id="fn2">Ever heard of Sidekick? <a href="http://www.sophos.com/blogs/gc/g/2009/10/13/catastrophic-data-loss-sidekick-users/">http://www.sophos.com/blogs/gc/g/2009/10/13/catastrophic-data-loss-sidekick-users/</a> <a title="Jump back to footnote 2" href="#fnref2">↩</a></li>
<li id="fn3">By clients we mean our <strong>internal</strong> clients. Even though they have public data, our system is not publicly available. <a title="Jump back to footnote 3" href="#fnref3">↩</a></li>
<li id="fn4">&#8220;3&#8221; is also configurable; it&#8217;s the default replication factor in HDFS. <a title="Jump back to footnote 4" href="#fnref4">↩</a></li>
<li id="fn5">This behavior is only available on HDFS version 0.21.0, or 0.20.4 with patches. Take a look at <a href="http://issues.apache.org/jira/browse/HDFS-826">HDFS-826</a> <a title="Jump back to footnote 5" href="#fnref5">↩</a></li>
<li id="fn6">By rather small, we mean so small that even if you use 4 replicas, the &#8220;cost&#8221; surpasses the benefit. <a title="Jump back to footnote 6" href="#fnref6">↩</a></li>
<li id="fn7">Depends on cluster load and configuration. It takes ~40 seconds for 800 regions out of 5300 when 1 out of 7 regionservers dies in our test. We used <code>hbase.regions.percheckin</code> at 100. We&#8217;ll do some thorough measuring as well and document it. <a title="Jump back to footnote 7" href="#fnref7">↩</a></li>
<li id="fn8">It&#8217;s usually a minute or two, but it depends on how many regions need to be reassigned by HMaster. We&#8217;ll get back with some metrics on this too. <a title="Jump back to footnote 8" href="#fnref8">↩</a></li>
<li id="fn9">HBase 0.20 and 0.21 <a title="Jump back to footnote 9" href="#fnref9">↩</a></li>
<li id="fn10">It&#8217;s a fact that if you have more data than RAM, uniform random read latency approaches the storage latency, at best. Our average response time is <strong>7ms for a 10K RPM SATA</strong> disk. We haven&#8217;t tested SSDs yet, because they are not economically viable for us right now. <a title="Jump back to footnote 10" href="#fnref10">↩</a></li>
<li id="fn11">We tested with approx 3B rows (approx two orders of magnitude more data than available RAM &#8211; so data wasn&#8217;t served from the cache). See the cluster configuration in this <a href="http://hstack.org/why_we...">article</a> <a title="Jump back to footnote 11" href="#fnref11">↩</a></li>
<li id="fn12">Why do you need heavy write performance? See <a href="http://perspectives.mvdirona.com/2010/02/13/ScalingFarmVille.aspx">here for a description of Farmville&#8217;s architecture</a> <a title="Jump back to footnote 12" href="#fnref12">↩</a></li>
<li id="fn13">This behavior is only available on HDFS version 0.21.0, or 0.20.4 with patches. Take a look at <a href="http://issues.apache.org/jira/browse/HDFS-826">HDFS-826</a> <a title="Jump back to footnote 13" href="#fnref13">↩</a></li>
<li id="fn14">We don&#8217;t want to push the analogy too far, but multi-threaded programming does not yet offer a simple and clear programming model: threading, actors, STM, etc. There is no clear winner, and they all make the application code complex. <a title="Jump back to footnote 14" href="#fnref14">↩</a></li>
<li id="fn15">See the Hadoop <a href="http://wiki.apache.org/hadoop/PoweredBy">&#8220;Powered by&#8221;</a> page, as well as the Hadoop Summit proceedings, for more companies that are using Hadoop: Visa, IBM, Reuters, NY Times, etc. <a title="Jump back to footnote 15" href="#fnref15">↩</a></li>
<li id="fn16">Of course, TANSTAAFL, if you want to do heavy processing, you need beefy machines, and lots of them. <a title="Jump back to footnote 16" href="#fnref16">↩</a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://hstack.org/why-were-using-hbase-part-2/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		<feedburner:origLink>http://hstack.org/why-were-using-hbase-part-2/</feedburner:origLink></item>
		<item>
		<title>Why we’re using HBase: Part 1</title>
		<link>http://feedproxy.google.com/~r/Hstack/~3/ni36w3d7nKs/</link>
		<comments>http://hstack.org/why-were-using-hbase-part-1/#comments</comments>
		<pubDate>Tue, 16 Mar 2010 12:45:19 +0000</pubDate>
		<dc:creator>Cosmin Lehene</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[adobe]]></category>
		<category><![CDATA[hstack]]></category>
		<category><![CDATA[nosql]]></category>

		<guid isPermaLink="false">http://hstack.org/?p=24</guid>
		<description><![CDATA[Our team builds infrastructure services for many clients across Adobe. We have services ranging from commenting and tagging to structured data storage and processing. We need to make sure that data is safe and always available; the services have to work fast regardless of the data volume. This article is about how we got started [...]]]></description>
			<content:encoded><![CDATA[
<div id="why-were-using-hbase-part-1">
<p>Our team builds infrastructure services for many clients across Adobe. We have services ranging from commenting and tagging to structured data storage and processing. We need to make sure that data is safe and always available; the services have to work fast regardless of the data volume.</p>
<p>This article is about how we got started using HBase and where we are now. More in-depth reasoning can be found in the <a href="http://hstack.org/why-were-using-hbase-part-2/">second part of the article</a>.</p>
<div id="lucky-shot">
<h2>Lucky shot</h2>
<p>If someone had asked me a couple of days ago why or how we chose <a href="http://hadoop.apache.org/hbase/">HBase</a>, I would have answered in a blink that it was about reliability, performance, costs, etc. (a bit brainwashed after answering &#8220;correctly&#8221; and &#8220;objectively&#8221; too many times). However, as the subject has become rather popular lately<sup><a id="fnref1" class="footnoteRef" href="#fn1">1</a></sup>, I reflected more deeply on the &#8220;how&#8221; and &#8220;why&#8221;.</p>
<p>The truth is that, in the beginning, we were attracted to working with bleeding edge technology and it was fun. It was a projection of the success we were hoping to have that motivated us. We all knew stories about Google File System, Bigtable, GMail and what made them possible. I guess we wanted a piece of that, and Hadoop and HBase were one logical step to reach that.<span id="more-24"></span></p>
<p>We didn&#8217;t even have a cluster when we started. I begged and bribed for hardware from teams that had extra cycles on their testing machines. We were going to use them just as <a href="http://setiathome.berkeley.edu">SETI@Home</a> does, well, sort of. Once we got 7 machines, we had a cluster<sup><a id="fnref2" class="footnoteRef" href="#fn2">2</a></sup> running the Hadoop and HBase stack (HStack<sup><a id="fnref3" class="footnoteRef" href="#fn3">3</a></sup>). We even went on and refurbished some old broken machines to work as extra test agents, besides our own laptops.</p>
<p>Technology driven decisions tend to fall over when assessed from a business perspective. I never thought about costs, data loss, etc. We were somehow assuming that these were all fine. If others ran it, why wouldn&#8217;t we be able to do it? We knew this architecture would enable scalability, but we didn&#8217;t challenge whether the implementation actually worked.</p>
<p>Scalability and performance &#8220;lured&#8221; us in. But in reality it&#8217;s the implementation that dictates <strong>costs</strong>, <strong>consistency</strong>, <strong>availability</strong> and <strong>performance</strong>. A good and scalable architecture is just a long term promise, unless it is backed up by the implementation. In our case the architecture choice paid off, as you&#8217;ll see.</p>
<p>Once we realized the potential of HBase through early experiments, we subjected it to a full analysis. It took a while to get an objective opinion, but after all the tests, we really knew we were on to something.</p>
</div>
<div id="the-40m-mark">
<h2>The 40M mark</h2>
<p>We had already scaled MySQL, so denormalization, data partitioning and replication weren&#8217;t all that new to us. When, in mid-2008, one of our clients asked us to provide a service that could handle 40M records, real time access, aggregation and all that, we thought we had an answer. This was our first step towards doing &#8220;big data&#8221;.</p>
<p>There were no benchmarking reports<sup><a id="fnref4" class="footnoteRef" href="#fn4">4</a></sup> then, no &#8220;NoSQL&#8221; moniker, therefore no hype :). We had to do our own objective assessment. The list was (luckily) short: HBase, Hypertable and Cassandra were on the table. MySQL was there as a last resort.</p>
<p>We abstracted the implementation details, and made a stub (so that the clients could start developing), and we started testing each technology stack.</p>
<p>Cassandra was out first. It had just come out, was barely usable, lacked any decent resources or active mailing lists and it could keep only one table per instance. This has changed a lot over time, but we had a deadline then.</p>
<p>Funnily enough, HBase was the second one out :).</p>
<p>When we started pushing 40 million records, HBase<sup><a id="fnref5" class="footnoteRef" href="#fn5">5</a></sup> squeaked and cracked. After 20M inserts it failed so badly it wouldn&#8217;t respond or restart, it mangled the data completely and we had to start over. It was performing badly and seemed to lose data.</p>
<p>I literally dreamt logs for a week, trying to identify the issues. It was looking as if we would discard HBase, but I insisted we should be able to switch later even if we went ahead with Hypertable. The team agreed, even though HBase didn&#8217;t look like a viable choice compared with Hypertable, which handled the data rather well.</p>
<p>The HBase community turned out to be great: they jumped in and helped us, and upgrading to a new HBase version fixed our problems. Hypertable<sup><a id="fnref6" class="footnoteRef" href="#fn6">6</a></sup>, on the other hand, seemed to perform better.</p>
</div>
<div id="plan-for-failure">
<h2>Plan for failure</h2>
<p>When testing failover scenarios, HBase started to gain ground, handling node failures gracefully and consistently. Hypertable had problems bringing data back up after node failures and required manual intervention. We were, once more, left with a single option.</p>
<p>But there were more questions to be answered, concerns that we never had with MySQL:</p>
<p>What&#8217;s the guarantee that every bit you put in comes back in the same form no matter what?</p>
<p>We had to be able to detect corruption and fix it. We had an encryption expert on the team (who authored a <a href="http://adal.chiriliuc.com/bc_iv_flaw.php">watermarking attack</a>), and he designed a model that would check consistency on the fly with CRC checksums and allow versioning. The Thrift-serialized data was wrapped in another layer that contained both the inner data types and the HBase location (row, family and qualifier). That is pretty much what <a href="http://hadoop.apache.org/avro/">Avro</a> does. (He&#8217;s a bit paranoid sometimes, but that tends to come in handy when disaster strikes :)</p>
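<p>To make the idea concrete, here is a minimal sketch of such an envelope. It is not our actual code: the real layer wrapped Thrift-serialized data, while this example assumes a JSON header and the hypothetical helpers <code>wrap_cell</code> and <code>unwrap_cell</code>, purely for illustration. The checksum covers the location as well as the payload, so a value that ends up in the wrong cell is caught just like a flipped bit.</p>
<pre>
```python
import json
import zlib


def wrap_cell(row, family, qualifier, type_name, payload):
    """Wrap serialized bytes with their HBase location, type and a CRC32."""
    # Assumption: a JSON header stands in for the real Thrift-based layer.
    header = json.dumps(
        {
            "version": 1,  # envelope version, so the layout can evolve
            "type": type_name,
            "row": row,
            "family": family,
            "qualifier": qualifier,
        },
        sort_keys=True,
    ).encode("utf-8")
    body = len(header).to_bytes(4, "big") + header + payload
    # The checksum covers the location header as well as the payload.
    return zlib.crc32(body).to_bytes(4, "big") + body


def unwrap_cell(blob, row, family, qualifier):
    """Return (type_name, payload); raise ValueError if a check fails."""
    stored_crc = int.from_bytes(blob[:4], "big")
    body = blob[4:]
    if zlib.crc32(body) != stored_crc:
        raise ValueError("checksum mismatch: value is corrupted")
    header_len = int.from_bytes(body[:4], "big")
    header = json.loads(body[4:4 + header_len].decode("utf-8"))
    found = (header["row"], header["family"], header["qualifier"])
    if found != (row, family, qualifier):
        raise ValueError("value belongs to a different cell")
    return header["type"], body[4 + header_len:]
```
</pre>
<p>A reader calls <code>unwrap_cell</code> with the location it asked for. A corrupted value or a value stored under the wrong row, family or qualifier raises <code>ValueError</code> instead of being silently returned.</p>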
<p>Another scenario we had to cover was a total failure of the Hadoop Distributed File System (HDFS) that HBase relies on.</p>
<p>We adapted an outdated HBase export tool to ensure data consistency during and after backup. We kept backups in 3 places: locally, in HDFS and on another distributed file system. We also had it prepared in MySQL format so we could switch to a MySQL cluster in case of disaster. We had scripts that would take the latest backup and bring up an alternative storage cluster (HBase or MySQL).</p>
<p>Things went pretty smoothly. We created <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> jobs to compute recommendations out of the data that was stored, we had an automated backup, and a mechanism for disaster recovery.</p>
<p>In October 2008 our system went live on time.</p>
</div>
<div id="disasters-will-happen">
<h2>Disasters WILL happen</h2>
<p>On the 3rd of December 2008, around midnight<sup><a id="fnref7" class="footnoteRef" href="#fn7">7</a></sup>, sanity alerts started pouring in: the service was running in degraded mode. Our HBase cluster would write data but couldn&#8217;t answer reads correctly. Following the procedure, I was able to make another backup and restore it on a MySQL cluster. Then I enabled the MySQL cluster backup. We were paranoid enough to have a backup procedure for the MySQL &#8220;backup&#8221; cluster as well :)</p>
<p>We were up and running about half an hour later. MySQL had saved the day, and we congratulated ourselves. Except that our thorough, tested backup plan had a glitch: the <a href="http://www.howtoforge.com/mysql_master_master_replication">master-master replication</a> was not set up for the new tables. We soon got two different data sets, one on each server. We had to stop the replication, switch to a single node and fix the consistency issue. Master-master replication is pretty much a hack and needs proper care. Otherwise, it&#8217;s pretty easy to screw up your data.</p>
<p>After a thorough investigation<sup><a id="fnref8" class="footnoteRef" href="#fn8">8</a></sup>, a postmortem and a couple of patches that brought both HBase and HDFS to the latest released version, we switched back to HStack<sup><a id="fnref9" class="footnoteRef" href="#fn9">9</a></sup> on the 5th of December. We had no data loss and our clients only experienced a short interruption. Our client team congratulated us for the fast response.</p>
<p>Most importantly, this was our first reality check and a new lesson learnt. The system still runs today and we have had no problems with it since. After just a few months our biggest client switched direction and never got to the 40 million records.</p>
<p>The system never reached its planned capacity. Even though we took new clients on board and implemented new services on top of it, we weren&#8217;t yet the &#8220;stars&#8221; we&#8217;d imagined we would be.</p>
<p>In reality, all we had would have been easy to handle with a MySQL cluster and just a little operational overhead.</p>
</div>
<div id="the-billion-mark">
<h2>The Billion Mark</h2>
<p>We decided to switch focus at the beginning of 2009. We were going to provide a generic, real-time, structured data storage and processing system that could handle any data volume. By this time we had caught the attention of bigger potential clients and the requirements changed a bit. 40 million became <em>1 billion</em>, with access times under 50ms and serious processing power. All this with no downtime and definitely <em>no data loss</em>.</p>
<p>This time, we were going to do it right:</p>
<p>We wrote down all the failure scenarios<sup><a id="fnref10" class="footnoteRef" href="#fn10">10</a></sup> that we could think of: bad disk, bad memory, packet loss, network card failure, machine and rack power failure, disk failure, RAID controller failure, etc. Nothing was &#8220;sacred&#8221;. At one point, in sprint demos, we would ask for two random numbers (two machines in the cluster) and then unplug the power cable, unplug the network cable, or just randomly take out hard drives while the cluster was running.</p>
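<p>The sprint-demo drill is easy to script if you don&#8217;t have an audience to shout out numbers. The sketch below is only an illustration: the <code>plan_drill</code> helper and the host names are made up for this example, and the actual unplugging stays a manual job.</p>
<pre>
```python
import random

# Failure types mentioned in the text; extend the tuple for other scenarios.
FAILURES = (
    "unplug the power cable",
    "unplug the network cable",
    "pull out a hard drive",
)


def plan_drill(machines, count=2, rng=None):
    """Pick `count` distinct machines and a random failure for each one."""
    rng = rng or random.Random()
    victims = rng.sample(machines, count)
    return [(machine, rng.choice(FAILURES)) for machine in victims]


cluster = ["node%d" % n for n in range(1, 8)]  # hypothetical host names
for machine, failure in plan_drill(cluster):
    print(machine, failure)
```
</pre>
<p>Passing a seeded <code>random.Random</code> as <code>rng</code> makes a drill repeatable, which helps when you want to replay the exact scenario that exposed a bug.</p>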
<p>We used <a href="http://www.drbd.org/">DRBD</a> and <a href="http://www.linux-ha.org/wiki/Heartbeat">Heartbeat</a> to have <a href="http://www.cloudera.com/blog/2009/07/hadoop-ha-configuration/">HA for the Hadoop NameNode</a>, because this was the last Single Point of Failure (SPOF) in our system.</p>
<p>We initially failed to reach 1B records as we filled up all the disks after less than 500M records. However, the performance and resilience to failures under rough conditions helped us &#8220;bootstrap&#8221; for something bigger. So, we got a new cluster<sup><a id="fnref11" class="footnoteRef" href="#fn11">11</a></sup> installed, that could handle the necessary capacity.</p>
<p>We also looked at the operational overhead; we always try to automate as much as possible. Our latest deployment system took any number of barebone machines (using the IPMI interface) and would deploy the OS, partition everything and then deploy and configure Hadoop, HBase along with our systems, completely unattended. Everything was set up using a combination of Kickstart scripts and Puppet for service deployment. Even now, deploying a new cluster is basically a <code>git push</code> away.</p>
<p>We had a 3B record table, on which we ran all our benchmarks: cold start (no memory caches), reads, writes, combined tests in different proportions. Random reads under 15 ms, huge throughput, all the cool stuff you&#8217;d imagine.</p>
<p>We started contributing to HBase, HDFS and MapReduce to support all our failover and performance scenarios and to make appropriate fixes when necessary.</p>
<p>Right now, our system is in &#8220;beta&#8221; inside the organization. We&#8217;re still testing it, but it already supports distributed data import, random access and distributed data processing; we have the APIs and web interfaces, and it&#8217;s running fast.</p>
<p>This is a historical account of how we started working with Hadoop and HBase. The second part of this article will show the more practical and objective reasons why we stick with this technology stack.</p>
<p>Cosmin.</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1">Especially in the &#8220;NoSQL&#8221; community. Check out a <a href="http://www.google.com/trends?q=nosql">Google trend</a> and a <a href="http://www.ubervu.com/nosql/social-media/">social media trend</a> <a class="footnoteBackLink" title="Jump back to footnote 1" href="#fnref1">↩</a></li>
<li id="fn2">We&#8217;ve been watching Hadoop since 2007, but it was <a href="http://techcrunch.com/2008/02/20/yahoo-search-wants-to-be-more-like-google-embraces-hadoop/">this article</a> that triggered me to move from playing with it to actually deploying it on a cluster. On the 16th of April 2008 our first Hadoop/HBase cluster (called HStack) was operational. <a class="footnoteBackLink" title="Jump back to footnote 2" href="#fnref2">↩</a></li>
<li id="fn3">Our umbrella term for Hadoop, HBase, Zookeeper and friends. <a class="footnoteBackLink" title="Jump back to footnote 3" href="#fnref3">↩</a></li>
<li id="fn4">Like <a href="http://www.brianfrankcooper.net/pubs/ycsb.pdf">this</a>. Mind you, our results differ noticeably from these, stay tuned :). <a class="footnoteBackLink" title="Jump back to footnote 4" href="#fnref4">↩</a></li>
<li id="fn5">Our first real tests were running against HBase-0.2.0. HBase later changed its versioning scheme to mirror Hadoop&#8217;s versions, so HBase-0.3 became HBase-0.18. <a class="footnoteBackLink" title="Jump back to footnote 5" href="#fnref5">↩</a></li>
<li id="fn6">Hypertable-0.9.27 <a class="footnoteBackLink" title="Jump back to footnote 6" href="#fnref6">↩</a></li>
<li id="fn7">11:51PM (GMT+2), to be more exact :) <a class="footnoteBackLink" title="Jump back to footnote 7" href="#fnref7">↩</a></li>
<li id="fn8">The problem had started at least a month before, when a bug silently disabled <code>.META.</code> compactions. So, the <code>.META.</code> table had a lot of <a href="http://github.com/apache/hbase/blob/trunk/core/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java">StoreFiles</a> and, from some point onward, HDFS started to throttle open file handles. A full cluster restart would have temporarily fixed it, and proper log monitoring would have alerted us long before it got too late. <a class="footnoteBackLink" title="Jump back to footnote 8" href="#fnref8">↩</a></li>
<li id="fn9">Our umbrella term for Hadoop, HBase, Zookeeper and friends. <a class="footnoteBackLink" title="Jump back to footnote 9" href="#fnref9">↩</a></li>
<li id="fn10">Like <a href="http://labs.google.com/papers/disk_failures.pdf">this</a>. This kind of research comes in very handy, when you try to see how <strong>expensive</strong> a failure is. <a class="footnoteBackLink" title="Jump back to footnote 10" href="#fnref10">↩</a></li>
<li id="fn11">7 machines, each with dual quad-core hyper-threaded CPUs, 32GB RAM, 24 10K RPM SATA disks, battery-backed RAID controllers, and IPMI. Commodity hardware does <strong>NOT</strong> mean crappy hardware. When you want performance, you need lots of spindles and memory. <a class="footnoteBackLink" title="Jump back to footnote 11" href="#fnref11">↩</a></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://hstack.org/why-were-using-hbase-part-1/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		<feedburner:origLink>http://hstack.org/why-were-using-hbase-part-1/</feedburner:origLink></item>
	</channel>
</rss>

