<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0"><channel><title>Latest posts for Acunu</title><link>http://www.acunu.com/blogs/</link><description>Latest posts for Acunu</description><language>en-gb</language><copyright>Copyright (c) Acunu Ltd. 2012. All rights reserved.</copyright><lastBuildDate>Mon, 21 May 2012 00:00:00 +0100</lastBuildDate><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/BigDataInsights" /><feedburner:info uri="bigdatainsights" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item><title>Tell Me Something I Don't Already Know</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/7iRmdMrwckE/</link><description>&lt;p&gt;We should expect answers to more interesting questions from the current generation of NoSQL databases. Having told them things we know, why can't they tell us new things that we don't already know? Machine learning techniques like clustering, classification and collaborative filtering, which learn subtle structure and make predictions about as-yet-unknown entities by processing very large amounts of data, are becoming more mainstream.&lt;/p&gt;
&lt;p&gt;But these are advanced techniques, and hard questions to answer. Today's NoSQL databases lack the ability to efficiently answer even slightly sophisticated questions, like "what's the sum of all clicks recorded in my data?" This is a useful kind of question, which asks for a fact that wasn't explicitly recorded into the database. It is maybe the simplest analytics-style query imaginable. The RDBMS, using SQL queries, could answer this question with an aggregate query, albeit not at the same scale that NoSQL databases operate in. But what about analytics in the age of NoSQL databases and huge data?&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;LOOKING FOR HAYSTACKS IN BIGGER HAYSTACKS&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;It's not just that it would be &lt;i&gt;nice &lt;/i&gt;if NoSQL databases could answer basic analytic queries like this; these may be the only kind of query that matters for truly huge data sets.&lt;/p&gt;
&lt;p&gt;Consider a database of a million customers. A million customers is a lot of people. But a million is not a large number, for databases. Even sophisticated relational databases can easily handle a million things. When storing a million things, whatever they are, chances are that each one matters, as a unit of information that might be accessed by itself; it's probable that this database will be asked for one customer's information at some point. And an RDBMS or NoSQL store can certainly return one datum.&lt;/p&gt;
&lt;p&gt;It seems like a point almost too trivial to make until you consider a database of a trillion things. Imagine 10,000 servers recording status on 10 metrics each, once a second. Over about 4 months, this will have generated about a trillion data points. It's unlikely that it particularly matters that server 2340, on January 13, at 09:50:01, reported a certain value for metric 9, even if it would be nice to retrieve this information. It is much more likely that the average value, or max value, of the metrics matter, across all servers, by minute or by hour.&lt;/p&gt;
&lt;p&gt;That is, when storing a trillion things, whatever they are, chances are that aggregate results over the data matter, rather than individual data. Accessing the data that was inserted becomes less important than accessing computed statistics about subsets of that data. &lt;i&gt;Simple analytics &lt;/i&gt;becomes more important than simple queries for data.&lt;/p&gt;
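&lt;p&gt;The trillion-point estimate above is easy to check with a few lines of arithmetic. The sketch below takes "about 4 months" to mean 120 days, which is an assumption made only for this illustration.&lt;/p&gt;
&lt;pre&gt;
```python
# Back-of-the-envelope check of the monitoring example.
servers = 10_000
metrics_per_server = 10
seconds_per_day = 24 * 60 * 60
days = 120  # "about 4 months", taken here as 120 days (an assumption)

# One data point per metric, per server, per second.
points = servers * metrics_per_server * seconds_per_day * days
print(f"{points:,}")  # 1,036,800,000,000
```
&lt;/pre&gt;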
&lt;p&gt;So, why do today's NoSQL databases have primitives for simple queries, but no primitives for analytics?&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;IN FAVOR OF INCREMENTALISM&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;At first glance, it's not apparent why it should be difficult to simply count or sum or average data in a NoSQL database. It isn't difficult, and counting a million things takes little time. Counting a trillion things isn't difficult in theory either; it just takes a very long time, and that makes it difficult in practice.&lt;/p&gt;
&lt;p&gt;Apache Hadoop has become popular for its ability to parallelize operations as simple as counting across many machines, and Hadoop can be deployed to compute aggregate statistics, like sums and averages, from a large data set. In fact, the Apache Hive project's whole purpose is to provide a SQL-like language for querying large data sets (like those held in a NoSQL database) via Hadoop. Still, Hadoop is built for large long-running processes that complete in at least minutes, and more usually hours or days. A simple ad-hoc query in Hive may not finish for a week! And this is no fault of Hive or Hadoop; it just takes a long time to do anything with a huge amount of data.&lt;/p&gt;
&lt;p&gt;Thankfully, analytics is not usually about ad-hoc queries. Analytics, in contrast, is often concerned with tracking the same metrics across time or other dimensions -- for example, tracking ad clicks on a&amp;nbsp;web site, by geography and by minute. The queries are known in advance, even as the data is being collected, and do not change. This is the good news, because if the desired statistics are known ahead of time, it's possible to compute them and simply update them as new data arrives, rather than start from scratch when an ad-hoc query arrives. In this example, each click simply adds 1 to a running count, broken down by geography and minute. At each moment the aggregated result is up-to-date and easy to access quickly.&lt;/p&gt;
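&lt;p&gt;The click-counting example can be sketched in a few lines of Python. This is only a minimal in-memory illustration of the incremental idea, not a description of any particular database: the names &lt;code class="inline"&gt;click_counts&lt;/code&gt;, &lt;code class="inline"&gt;record_click&lt;/code&gt; and &lt;code class="inline"&gt;clicks_in&lt;/code&gt; are hypothetical, and a real system would keep the counters in a distributed store.&lt;/p&gt;
&lt;pre&gt;
```python
from collections import Counter

# Hypothetical in-memory stand-in for the store that holds the aggregates.
click_counts = Counter()

def record_click(country, epoch_seconds):
    """Fold one click into the running count for its (country, minute) bucket."""
    minute = epoch_seconds // 60
    click_counts[(country, minute)] += 1

def clicks_in(country, minute):
    """Answer the pre-agreed query with one lookup, without scanning raw events."""
    return click_counts[(country, minute)]

# Four clicks arrive; each one only touches a single counter.
for country, ts in [("GB", 0), ("GB", 59), ("US", 61), ("GB", 65)]:
    record_click(country, ts)
```
&lt;/pre&gt;
&lt;p&gt;Each write costs one counter update, and each read costs one lookup, however many raw clicks have been recorded.&lt;/p&gt;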
&lt;p&gt;&lt;i&gt;Incremental&lt;/i&gt; analytics is entirely viable even in the context of a massive data set, and a torrent of new input. Incremental analytics will be the way in which NoSQL databases start doing something more interesting in 2012.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;ACUNU (INCREMENTAL) ANALYTICS&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;And Acunu's already there, hard at work on Acunu Analytics, an incremental analytics engine built from Acunu's NoSQL store based on Cassandra, and elements of the familiar Apache Hive. There's so much to explain about it that Andrew Byde has already made it into a presentation. Rather than repeat it, stop now to view his slides:&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.slideshare.net/acunu/acunu-analytics"&gt;http://www.slideshare.net/acunu/acunu-analytics&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;IN CONCLUSION&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;The era of huge data is here, and NoSQL databases have gained popularity for their ability to prioritize efficient scalability. Yet they are only able to repeat the data they've been given, and so still lack even the basic analytic capability of an RDBMS. And, it is exactly analytic queries that are most relevant at the huge data scales that NoSQL databases promise to support. These databases will need to evolve to support an incremental approach to computing analytics to offer timely answers to the interesting questions we need to ask of our data. Fortunately, Acunu Analytics is ready to provide such an incremental analytics system for your big data analytics needs.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/7iRmdMrwckE" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Sean Owen</dc:creator><pubDate>Mon, 21 May 2012 00:00:00 +0100</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/sean-owen/tell-me-something-i-dont-already-know/</guid><feedburner:origLink>http://www.acunu.com/blogs/sean-owen/tell-me-something-i-dont-already-know/</feedburner:origLink></item><item><title>Acunu is pleased to announce v2 of the Acunu Data Platform!</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/p2SRJ8Ivte4/</link><description>&lt;p&gt;Acunu Data Platform v2 brings major new features and many improvements across the board, designed to make it the easiest to use, highest performance Big Data database available.&lt;/p&gt;
&lt;p&gt;&lt;br /&gt; Acunu Data Platform v2 includes Cassandra 1.0, Acunu Storage Engine&amp;nbsp;and Acunu Control Center in a conveniently packaged OS image, for both hardware and the Amazon cloud.&lt;br /&gt; &lt;br /&gt; Acunu Data Platform v2 has received over 3.5 thousand machine days of extensive testing and is in production use with some of the world&amp;rsquo;s largest companies.&lt;br /&gt; &lt;br /&gt; New features in v2 include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.acunu.com/blogs/dr-andrew-byde/faster-disk-rebuilds/"&gt;Fast rebuild&lt;/a&gt;: safely use large SATA disks; rebuild 2TB in minutes, not hours&lt;/li&gt;
&lt;li&gt;Advanced distributed versioning, allowing cluster-wide snapshots and clones even in the event of network partitions or down nodes&lt;/li&gt;
&lt;li&gt;Control Center Alerts: get preemptively told about common&amp;nbsp;configuration and operation issues&lt;/li&gt;
&lt;li&gt;Munin, Nagios &amp;amp; Syslog integration - making it easier to incorporate Acunu into existing infrastructure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br /&gt; Other improvements include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Re-designed Control Center: makes deploying, managing and monitoring your cluster easy!&lt;/li&gt;
&lt;li&gt;Improved performance: across the board improvements include&amp;nbsp;order-of-magnitude faster get_slice queries, faster inserts, increased in-cache performance and enhanced prefetching for large range queries&lt;/li&gt;
&lt;li&gt;Improved space utilization: removes more than 2x overheads, allowing use of more of your disks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;br /&gt;&lt;b&gt;&lt;a href="https://www.acunu.com/download/acunu-cassandra/"&gt; Click here to get your hands on the Acunu Data Platform&lt;/a&gt;&lt;/b&gt;. The&amp;nbsp;Acunu Data Platform is available free for 90 days, and is free for non-production use. Acunu offers both 8x5 and 24x7 support and training and consulting services.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/p2SRJ8Ivte4" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Acunu</dc:creator><pubDate>Mon, 07 May 2012 00:00:00 +0100</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/acunu/acunu-pleased-announce-v2-acunu-data-platform/</guid><feedburner:origLink>http://www.acunu.com/blogs/acunu/acunu-pleased-announce-v2-acunu-data-platform/</feedburner:origLink></item><item><title>Acunu Analytics Ready to Preview!</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/-PvI8YCi3K8/</link><description>&lt;h2&gt;Acunu Analytics is ready to preview!&lt;/h2&gt;
&lt;p&gt;We're delighted to announce that Acunu Analytics is ready for its preview release!&lt;/p&gt;
&lt;h3&gt;What is it?&lt;/h3&gt;
&lt;p&gt;Acunu Analytics solves the problem of computing simple analytics over large datasets, incrementally, and in realtime. For many common use cases of Hadoop, it answers queries in milliseconds rather than the minutes needed by Hadoop. It has a simple JSON interface and supports a basic range of queries. One defines a "query template" in advance (a query where some parameters are variables, see below) and then instantiates the variables at query time. Updates are either posted as JSON objects to an HTTP REST endpoint, or use the Flume plugin provided. Acunu Analytics does all the work so that queries can be answered with different parameters in realtime, as the data set changes, and takes care of things like cluster management, failure tolerance, load balancing, and so on. Acunu Analytics was inspired by&amp;nbsp;&lt;a href="http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011"&gt;Twitter's Rainbird&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Use cases&lt;/h3&gt;
&lt;p&gt;We'd love to get feedback on great use cases for Acunu Analytics - currently, we know about&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tracking trending topics (eg top-10 by category)&lt;/li&gt;
&lt;li&gt;Network monitoring (eg looking for infected nodes)&lt;/li&gt;
&lt;li&gt;Analytics dashboards (eg for tracking ad impressions)&lt;/li&gt;
&lt;li&gt;Operational intelligence (eg infrastructure monitoring)&lt;/li&gt;
&lt;li&gt;and more...&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Get the preview release and let us know what you use it for.&lt;/p&gt;
&lt;h3&gt;How does it compare to Hadoop?&lt;/h3&gt;
&lt;p&gt;&lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; is a great example of distributed, batch-oriented computing. You have lots of data, spread over a cluster (in this case, HDFS), and want to answer some query (eg "Show me the top 10 IP addresses, ordered by the number of distinct destination IP addresses they contacted in the last hour"). Tools such as &lt;a href="http://hive.apache.org/"&gt;Hive&lt;/a&gt; make it easy to express this sort of query, but each query still needs minutes or hours to complete - it runs in time linear in the amount of data processed.&lt;/p&gt;
&lt;p&gt;In contrast, Acunu Analytics runs on Acunu's &lt;a href="http://www.acunu.com/acunu-data-platform/"&gt;next-generation distribution for Cassandra&lt;/a&gt;. Instead of handling arbitrary queries, it targets common use cases where the "query template" is known in advance. The following query template allows us to find source IP addresses, ordered by the number of distinct destinations contacted within some time period, possibly grouped by time buckets and prefixes of the source IP address.&lt;/p&gt;
&lt;p&gt;&lt;code class="inline"&gt;ex_schema:&amp;nbsp;{&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code class="inline"&gt;&amp;nbsp;types: &amp;nbsp;src_ip PATH(.), dest_ip PATH(.), bytes LONG, timestamp TIME (1ms,1s,1m,1h,1d),&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code class="inline"&gt;&amp;nbsp;select: src_ip, COUNT(DISTINCT dest_ip),&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code class="inline"&gt;}&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;We can now submit queries with WHERE and GROUP fields. For example, we can query "SCHEMA: ex_schema, WHERE: timestamp=[100,1000], src_ip=128.1, GROUP: timestamp(1s), src_ip". This returns a list of src_ip, count(distinct dest_ip) pairs during the given timestamps, where src_ip begins with 128.1 and the count distincts are grouped by 1 second timestamps.&amp;nbsp;See the&amp;nbsp;&lt;a href="http://techdocs.acunu.com/analytics/"&gt;documentation&lt;/a&gt;&amp;nbsp;for more details.&lt;/p&gt;
&lt;p&gt;The crucial difference from a regular DB query is the following. Instead of reevaluating the query for each evaluation (as Hadoop does), Acunu Analytics continuously keeps up-to-date &lt;i&gt;all possible&lt;/i&gt; parameterizations of each query template, as new updates are performed. This means that different evaluations of the same query template can be answered by performing only a few lookups into the datastore, potentially &lt;a href="http://www.acunu.com/blogs/andy-twigg/log-file-systems-and-ssds-made-each-other/"&gt;stored on SSDs&lt;/a&gt; if you want even lower latency. As a result, &lt;i&gt;queries can be answered in milliseconds instead of minutes.&lt;/i&gt; More precisely, query time is now logarithmic in the amount of data stored.&lt;/p&gt;
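&lt;p&gt;To make the idea of keeping every parameterization up to date more concrete, here is a toy Python sketch. It is not how Acunu Analytics is implemented: the function names, the in-memory dictionary and the use of exact sets for COUNT(DISTINCT) are all assumptions chosen for illustration, and it aggregates per source prefix rather than returning one row per source address.&lt;/p&gt;
&lt;pre&gt;
```python
from collections import defaultdict

# Bucket widths in seconds, mirroring the 1s, 1m and 1h granularities of the template.
GRANULARITIES = {"1s": 1, "1m": 60, "1h": 3600}

# (source prefix, granularity, bucket index) maps to the set of destinations seen.
distinct_dests = defaultdict(set)

def prefixes(ip):
    """Yield every dotted prefix of an address: 128, 128.1, 128.1.2, and so on."""
    parts = ip.split(".")
    for i in range(1, len(parts) + 1):
        yield ".".join(parts[:i])

def record(src_ip, dest_ip, ts_seconds):
    """Update every precomputed cell that this event contributes to."""
    for prefix in prefixes(src_ip):
        for name, width in GRANULARITIES.items():
            distinct_dests[(prefix, name, ts_seconds // width)].add(dest_ip)

def count_distinct(prefix, granularity, bucket):
    """Answer one parameterization of the template with a single lookup."""
    return len(distinct_dests.get((prefix, granularity, bucket), ()))

record("128.1.2.3", "10.0.0.1", 100)
record("128.1.9.9", "10.0.0.2", 100)
record("128.1.2.3", "10.0.0.1", 130)
```
&lt;/pre&gt;
&lt;p&gt;The work moves to write time: each event updates a fixed number of cells (prefixes times granularities), so a query for any prefix and time bucket is a lookup rather than a scan.&lt;/p&gt;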
&lt;h3&gt;Where can I find out more details?&lt;/h3&gt;
&lt;p&gt;See Andrew Byde's&lt;a href="http://www.slideshare.net/acunu/acunu-analytics"&gt;&amp;nbsp;presentation at Cassandra EU&amp;nbsp;&lt;/a&gt;and the current&amp;nbsp;&lt;a href="http://techdocs.acunu.com/analytics/"&gt;documentation&lt;/a&gt;&amp;nbsp;(it has a tutorial).&lt;/p&gt;
&lt;h3&gt;What's the future roadmap / will you support &amp;lt;x&amp;gt; ?&lt;/h3&gt;
&lt;p&gt;We are still extending the range of supported queries. Currently it permits data types including numerics, strings, hierarchical paths,&amp;nbsp;hierarchical time buckets and functions&amp;nbsp;count, count distinct, max, min, sum, sum_squares, average, variance, std_dev. We are also working on dynamically-changing schemas. See the&amp;nbsp;&lt;a href="http://techdocs.acunu.com/analytics/"&gt;documentation&lt;/a&gt;&amp;nbsp;for more details.&lt;/p&gt;
&lt;p&gt;Our aim is to support, in due course, the majority of use cases of Apache Hive, but incrementally rather than batch-oriented. We are not currently targeting complex analytics such as &lt;a href="http://www.acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/"&gt;machine learning&lt;/a&gt; and &lt;a href="http://www.acunu.com/blogs/sean-owen/recommending-cassandra/"&gt;recommender systems&lt;/a&gt;&amp;nbsp;- for that, systems such as &lt;a href="http://mahout.apache.org/"&gt;Apache Mahout&lt;/a&gt;, &lt;a href="http://www.skytreecorp.com/"&gt;SkyTree&lt;/a&gt; and &lt;a href="https://bigml.com/"&gt;BigML&lt;/a&gt; are good choices. In due course, we hope to integrate with some of these analytics products in order to make the fundamental power of Acunu Analytics available to a wider range of use cases.&lt;/p&gt;
&lt;h3&gt;Okay, I'm interested! What next?&lt;/h3&gt;
&lt;p&gt;While Acunu Analytics is in preview release, we're rolling it out to a limited number of users, so that we can be sure to give them the support they need and we can make the most of their feedback. Please send an email to &lt;a href="mailto:analytics@acunu.com"&gt;analytics@acunu.com&lt;/a&gt; to register your interest, and we'll get in touch.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/-PvI8YCi3K8" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andy Twigg</dc:creator><pubDate>Wed, 28 Mar 2012 00:00:00 +0100</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/andy-twigg/acunu-analytics-preview/</guid><feedburner:origLink>http://www.acunu.com/blogs/andy-twigg/acunu-analytics-preview/</feedburner:origLink></item><item><title>Big Data meets Big Opportunities at SXSW</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/3v6L1A7EY-0/</link><description>&lt;p&gt;Crowds of 16,000 people, scorching heat, dehydration, 15 miles walking, blistering feet: you'd think people would be nuts to subscribe to this daily routine but such is the appeal of SXSW Interactive, one of the world's largest interactive festivals, taking place in Austin, Texas.&lt;/p&gt;
&lt;p&gt;So why all the fuss? Simple. This is an oasis of opportunities, be it&amp;nbsp;new business&amp;nbsp;opportunities, fundraising, increased recognition, or entertainment. SXSW is where Twitter and Foursquare rose to fame. Listening to some of the veterans at the SXSW pre-mission briefing in London boasting that this year will be their twentieth-plus visit, you can sense the atmosphere is highly addictive.&lt;/p&gt;
&lt;p&gt;Acunu is proudly returning for the second time; selected as one of the best UK companies to be part of the&lt;a href="http://chinwag.com/blogs/lauren-cotton/ukti-mission-sxsw-2012-companies-announced" target="_blank"&gt;&amp;nbsp;UK Trade and Investment Digital Mission&lt;/a&gt;. We're thrilled to be waving the British flag, but also to be fortifying our presence in the US. In addition to our Texan outpost, we recently opened&amp;nbsp;&lt;a href="http://www.itpro.co.uk/638626/big-data-boosts-acunu" target="_blank"&gt;offices in San Francisco&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This year we're looking to make an even greater impact at SXSW as we'll be presenting with Twitter on Mar 13. Ryan King, Tech lead on the Storage team at Twitter, and Tom Wilkie, Acunu's VP of Engineering, are stealing the show with a riveting talk on Cassandra:&amp;nbsp;&lt;a href="http://schedule.sxsw.com/2012/events/event_IAP13044" target="_blank"&gt;"Freaking Fast Cassandra, How do they do it?"&lt;/a&gt;. Tue Mar 13th, 3:30pm to 4:30pm in Ballroom BC, Austin Convention Center.&amp;nbsp;If you'll be around in the area, make sure to drop by and say hi or if you want to schedule a time to meet with Tom, email&amp;nbsp;&lt;a href="mailto:contact@acunu.com" target="_blank"&gt;contact@acunu.com&lt;/a&gt;.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/3v6L1A7EY-0" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Acunu</dc:creator><pubDate>Wed, 29 Feb 2012 00:00:00 +0000</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/acunu/big-data-meets-big-opportunities-sxsw/</guid><feedburner:origLink>http://www.acunu.com/blogs/acunu/big-data-meets-big-opportunities-sxsw/</feedburner:origLink></item><item><title>Welcome to the party, DynamoDB</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/KfrzkxzrasQ/</link><description>&lt;p&gt;Amazon's recent announcement of &lt;a href="http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html"&gt;DynamoDB&lt;/a&gt; was noted with positive interest by a number of observers, and rightly so, in my view.  It's a technically interesting achievement, for one.  But more importantly, it's another step towards mainstream maturity for the Big Data database market, and another sign that the problems that we're trying to solve are important.&lt;/p&gt;
&lt;p&gt;What effect might DynamoDB have on the Big Data / NoSQL space?&lt;/p&gt;
&lt;p&gt;At Acunu our customers fall into two broad groups.  The first are traditional enterprises that have a system deployed on relational technology, often Oracle, backed by shared storage, that hits a cost pain-point when it comes to scalability.  The second group consists of start-ups who see Big Data as an opportunity, rather than a challenge: they are building their core business around an infrastructure for collecting, analysing and serving data.&lt;/p&gt;
&lt;h2&gt;In the enterprise&lt;/h2&gt;
&lt;p&gt;Enterprises considering NoSQL have found that their legacy relational systems cannot deliver the necessary combination of performance and capacity at a price point that justifies the use-case of the data.  It's at this point that they may look to alternative solutions.  An increasing number are finding that NoSQL databases fit their need -- at Acunu, we're experiencing first-hand a real wave-front of adoption of Big Data technology in enterprises that you may not have considered as early adopters.&lt;/p&gt;
&lt;p&gt;So what will DynamoDB mean for enterprise folk? I expect not much in the short term, until some key issues are solved. First, the sources and consumers of data are still on-site.  These guys are tackling a specific technical limitation, not necessarily looking to re-architect their wider systems, which are often complex and inter-dependent.  Second, security and regulatory concerns may need addressing.  Third, the TCO needs to stack up. A quick and dirty back of the envelope calculation suggests that although it's free to get started with DynamoDB, for the sort of deployment sizes we're seeing, DynamoDB works out  considerably more expensive than alternatives like Acunu deployed on hardware (even after accounting for typical full costing for outsourced data centers).&lt;/p&gt;
&lt;p&gt;In time, these issues may be addressed. This great &lt;a href="http://reports.informationweek.com/abstract/5/8637/Cloud-Computing/research-state-of-database-technology.html"&gt;Information Week report on database technology&lt;/a&gt; (registration required) surveyed 'technology professionals' and found that the starting point is low: 2% are "using the cloud for a fully managed database service," a description that matches what Amazon is providing with DynamoDB.&lt;/p&gt;
&lt;h2&gt;In start-ups&lt;/h2&gt;
&lt;p&gt;Big Data start-ups are more likely to be interested in a cloud NoSQL offering, because small (initial) scale and a lack of legacy infrastructure makes the transition to the cloud much cheaper.  Here, the pain around using MySQL or Oracle is more anticipated than real: without a "legacy" data-set, start-ups have the opportunity to pick their tools and design for scale from the outset.&lt;/p&gt;
&lt;p&gt;Acunu aims to help these organisations focus on building their services, and to reduce the effort needed to maintain the platform in production. This is clearly Werner Vogels' motivation with DynamoDB -- and one market where I think it's likely to get traction quickly.&lt;/p&gt;
&lt;p&gt;To be successful here, DynamoDB will still need to prove it can offer benefits over and above technologies like Cassandra which are advancing rapidly. Packaged solutions including Acunu already let you deploy on EC2 and other cloud providers, so avoiding CapEx and operations effort but without risking &lt;a href="http://www.forbes.com/sites/joemckendrick/2011/11/20/cloud-computings-vendor-lock-in-problem-why-the-industry-is-taking-a-step-backwards/"&gt;cloud lock-in&lt;/a&gt;. And other Platforms-as-a-Service are doubtless not far behind.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Even though DynamoDB is still in beta, Amazon's effort confirms that Big Data distributed databases are the way forward.  So, welcome to the party DynamoDB!&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/KfrzkxzrasQ" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Tim Moreton</dc:creator><pubDate>Fri, 20 Jan 2012 00:00:00 +0000</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/tim-moreton/welcome-party-dynamodb/</guid><feedburner:origLink>http://www.acunu.com/blogs/tim-moreton/welcome-party-dynamodb/</feedburner:origLink></item><item><title>The Math of Secret Santa</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/5f64e7yQVSw/</link><description>&lt;p&gt;&lt;img src="http://media.acunu.com/library/santa.png" border="0" alt="Andy Ormsby IS Santa" title="Andy Ormsby IS Santa" width="219" height="230" style="float: left; margin: 0px 10px; border: 1px solid black;" /&gt;This Christmas, my housemates and I took it in turns to pick names out of a hat for our annual &lt;a href="http://en.wikipedia.org/wiki/Secret_Santa"&gt;Secret Santa&lt;/a&gt;.&amp;nbsp; We agreed that if someone picked their own name we would put all the names back and try again.&amp;nbsp; It took us 5 attempts before we found a 'collision-free' assignment.&amp;nbsp; This made me think, is this to be expected?&amp;nbsp; And how many times would it take us with more people?&lt;/p&gt;
&lt;p&gt;This problem can be rephrased as counting the number of permutations that have no fixed points.&amp;nbsp; Such permutations are called &lt;i&gt;derangements&lt;/i&gt; and can be calculated with a reasonably simple recurrence, as follows.&lt;/p&gt;
&lt;p&gt;Imagine there are &lt;i&gt;n&lt;/i&gt; people, and use the notation &lt;i&gt;!n&lt;/i&gt; to denote the number of derangements of &lt;i&gt;n&lt;/i&gt; objects.&amp;nbsp; The first person chooses a name, and has &lt;i&gt;n-1&lt;/i&gt; names to choose from without finding their own; the person whose name they choose either takes the first person's name or doesn't.&amp;nbsp; In the first case, the remaining &lt;i&gt;n-2&lt;/i&gt; people have to choose from their &lt;i&gt;n-2&lt;/i&gt; names, so can do this in &lt;i&gt;!(n-2)&lt;/i&gt; ways.&amp;nbsp; In the second case, there are &lt;i&gt;!(n-1)&lt;/i&gt; possibilities: treat the first person's name as belonging to the chosen person, so that each of the &lt;i&gt;n-1&lt;/i&gt; remaining people has exactly one name in the hat that they must not choose.&amp;nbsp; This gives &lt;i&gt;!n = (n-1)(!(n-1) + !(n-2))&lt;/i&gt;, with the base cases &lt;i&gt;!0 = 1&lt;/i&gt; (by convention) and &lt;i&gt;!1 = 0&lt;/i&gt;.&lt;/p&gt;
&lt;p&gt;Solving this, which is simple to prove by induction, we find the following:&lt;/p&gt;
&lt;p&gt;&lt;img class="img_block" src="http://media.acunu.com/library/secret-santa-eq.png" border="0" alt="!n = n! \sum_{i=0}^n \frac{(-1)^i}{i!}" width="157" height="63" style="display: block; margin-left: auto; margin-right: auto; border: 0pt none;" /&gt;&lt;/p&gt;
&lt;p&gt;You may recognize the sum as the first &lt;i&gt;n+1&lt;/i&gt; terms in the &lt;a href="http://en.wikipedia.org/wiki/Taylor_series#Examples"&gt;Taylor expansion&lt;/a&gt; of &lt;i&gt;e^x&lt;/i&gt; evaluated at &lt;i&gt;-1&lt;/i&gt;.&amp;nbsp; This sum converges extremely quickly, so the probability of a successful Secret Santa assignment rapidly approaches &lt;i&gt;1/e&lt;/i&gt; &amp;asymp; 37% as the number of people grows. The number of trials until we find a successful assignment is thus &lt;i&gt;Geometric(1/e)&lt;/i&gt;, which has expectation &lt;i&gt;e&lt;/i&gt;&amp;nbsp;&amp;asymp; 2.72.&amp;nbsp;Indeed, for &lt;i&gt;n=5&lt;/i&gt;, the exact sum already gives expectation 2.7.&lt;/p&gt;
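&lt;p&gt;For readers who prefer to check this numerically, the following Python sketch computes &lt;i&gt;!n&lt;/i&gt; from the recurrence, compares it with the closed-form sum, and derives the success probability. The function names are made up for this illustration.&lt;/p&gt;
&lt;pre&gt;
```python
from fractions import Fraction
from math import factorial

def derangements(n):
    """Count permutations of n items with no fixed point, using
    the recurrence !n = (n-1) * (!(n-1) + !(n-2))."""
    prev2, prev1 = 1, 0  # base cases !0 = 1 and !1 = 0
    if n == 0:
        return prev2
    for k in range(2, n + 1):
        prev2, prev1 = prev1, (k - 1) * (prev1 + prev2)
    return prev1

def closed_form(n):
    """Evaluate n! * sum_{i=0}^{n} (-1)^i / i! exactly, with rational arithmetic."""
    return factorial(n) * sum(Fraction((-1) ** i, factorial(i)) for i in range(n + 1))

def success_probability(n):
    """Probability that one draw among n people has nobody picking their own name."""
    return Fraction(derangements(n), factorial(n))

print(derangements(5))          # 44
print(success_probability(5))   # 11/30, so about 2.73 draws are expected
```
&lt;/pre&gt;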
&lt;p&gt;So what about our 5 attempts? &lt;a href="http://en.wikipedia.org/wiki/Markov's_inequality"&gt;Markov's inequality&lt;/a&gt; says that the probability that a non-negative random variable X is at least &lt;i&gt;k&lt;/i&gt; times its mean is at most &lt;i&gt;1/k&lt;/i&gt;, so the probability that we need more than 5 attempts is at most &lt;i&gt;e&lt;/i&gt;/5 &amp;asymp; 0.54. A better, and easier, bound can be obtained for the geometric, since the probability that we need more than 5 attempts is the probability that we fail 5 times in succession, which is (1-1/e)&lt;sup&gt;5&lt;/sup&gt;&amp;nbsp;&amp;asymp; 0.10, and this decreases exponentially rather than linearly in the number of attempts.&amp;nbsp;(Question for reader: how many rounds do we need to succeed with probability &amp;gt; 0.99 ?) Most importantly, this is essentially independent of the number of people &lt;i&gt;n&lt;/i&gt; involved, so&amp;nbsp;in Acunu's Secret Santa, we should expect a similar number of rounds, even as Acunu continues to double in size every year!&lt;/p&gt;
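&lt;p&gt;The two bounds can be compared directly. This short Python sketch uses the limiting success probability &lt;i&gt;1/e&lt;/i&gt; from above; the helper name &lt;code class="inline"&gt;prob_all_fail&lt;/code&gt; is invented for the example.&lt;/p&gt;
&lt;pre&gt;
```python
from math import e

p_success = 1 / e  # limiting per-round success probability from the text

def prob_all_fail(rounds):
    """Probability that every one of `rounds` independent draws has a collision."""
    return (1 - p_success) ** rounds

markov_bound = e / 5           # Markov: mean e divided by 5, roughly 0.54
exact_tail = prob_all_fail(5)  # geometric tail, roughly 0.10

print(round(markov_bound, 2), round(exact_tail, 2))
```
&lt;/pre&gt;
&lt;p&gt;Calling &lt;code class="inline"&gt;prob_all_fail&lt;/code&gt; with larger arguments is one way to explore the question posed above.&lt;/p&gt;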
&lt;p&gt;If you eat this sort of thing for breakfast, &lt;a href="http://www.acunu.com/careers/"&gt;you should join us!&lt;/a&gt;&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/5f64e7yQVSw" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Richard Low</dc:creator><pubDate>Thu, 15 Dec 2011 00:00:00 +0000</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/richard-low/secret-santa/</guid><feedburner:origLink>http://www.acunu.com/blogs/richard-low/secret-santa/</feedburner:origLink></item><item><title>CQL talk at Cassandra NYC 2011</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/CHsBEbHRGuo/</link><description>&lt;p&gt;Last week we travelled to New York for Cassandra NYC. The conference, which spun off from Cassandra SF, drew a crowd of nearly 200. &lt;a href="http://www.acunu.com/team/eric-evans/"&gt;Eric Evans&lt;/a&gt;, primary developer of CQL and a Cassandra committer, delivered a talk on CQL. Slides are now available:&lt;/p&gt;
&lt;div style="width: 425px;"&gt;&lt;b&gt;&lt;a href="http://www.slideshare.net/jericevans/cql-sql-in-cassandra" target="_blank" title="CQL: SQL In Cassandra"&gt;CQL: SQL In Cassandra&lt;/a&gt;&lt;/b&gt; &lt;iframe frameborder="0" height="355" marginheight="0" marginwidth="0" scrolling="no" src="http://www.slideshare.net/slideshow/embed_code/10489017" width="425"&gt;&lt;/iframe&gt;
&lt;div style="padding: 5px 0 12px;"&gt;View more &lt;a href="http://www.slideshare.net/" target="_blank"&gt;presentations&lt;/a&gt; from &lt;a href="http://www.slideshare.net/jericevans" target="_blank"&gt;Eric Evans&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/CHsBEbHRGuo" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Acunu</dc:creator><pubDate>Tue, 13 Dec 2011 00:00:00 +0000</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/acunu/cassandra-nyc/</guid><feedburner:origLink>http://www.acunu.com/blogs/acunu/cassandra-nyc/</feedburner:origLink></item><item><title>CQL benchmarking</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/mWrmmLWhe_0/</link><description>&lt;h2&gt;The CQL Value Proposition&lt;/h2&gt;
&lt;p&gt;CQL (Cassandra Query Language) is the relatively new SQL-like query language for Apache Cassandra.  It's meant as a user-friendly alternative to Cassandra's Thrift-based RPC interface, which, frankly, sucks.  The case for CQL is that of greater environmental stability and ease-of-use, two areas that Cassandra has rightly been criticized for in the past.  To demonstrate the latter of these, look at the code (in Java) for writing a single column using the RPC interface:&lt;/p&gt;
&lt;div style="background-color: #e0f0ff;"&gt;
&lt;pre&gt;  // Build the column: name, value and timestamp
  Column col = new Column(ByteBuffer.wrap("name".getBytes()));
  col.setValue(ByteBuffer.wrap("value".getBytes()));
  col.setTimestamp(System.currentTimeMillis());

  // Wrap the column in the union type that a Mutation expects
  ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
  cosc.setColumn(col);

  Mutation mutation = new Mutation();
  mutation.setColumn_or_supercolumn(cosc);

  List&amp;lt;Mutation&amp;gt; mutations = new ArrayList&amp;lt;Mutation&amp;gt;();
  mutations.add(mutation);

  // Row key to column family to mutations: the shape batch_mutate() takes
  Map&amp;lt;String, List&amp;lt;Mutation&amp;gt;&amp;gt; cf_map = new HashMap&amp;lt;String, List&amp;lt;Mutation&amp;gt;&amp;gt;();
  cf_map.put("Standard1", mutations);
  Map&amp;lt;ByteBuffer, Map&amp;lt;String, List&amp;lt;Mutation&amp;gt;&amp;gt;&amp;gt; mutations_map = new HashMap&amp;lt;ByteBuffer, Map&amp;lt;String, List&amp;lt;Mutation&amp;gt;&amp;gt;&amp;gt;();
  mutations_map.put(ByteBuffer.wrap("key".getBytes()), cf_map);
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;In contrast, the corresponding query in CQL looks like this:&lt;/p&gt;
&lt;div style="background-color: #e0f0ff;"&gt;
&lt;pre&gt;  INSERT INTO Standard1 (KEY, name) VALUES (key, value) &lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;CQL is the clear winner here from a usability standpoint. It may seem obvious to some, but the reason is that it adds a lot of server-side abstraction.  It may help to think of the query as a graph.&lt;/p&gt;
&lt;p&gt;&lt;img class="img_block" src="http://media.acunu.com/library/query_graph1.png" border="0" alt="Query graph" title="Query graph" width="65%" style="display: block; margin-left: auto; margin-right: auto; border: 0pt none;" /&gt;&lt;/p&gt;
&lt;p&gt;In the case of the RPC interface (the first code snippet above), the application developer is being asked to construct the query graph manually, in a form that is directly consumable by Cassandra.  The same query graph is represented in the CQL query, only in a human-readable text format, which is then parsed by Cassandra to create the structures it needs to process the request.  One of the more frequent suppositions regarding CQL is that because of this query string parsing, it &lt;i&gt;must&lt;/i&gt; perform poorly compared to the Thrift RPC.  I've frequently cautioned people against jumping to such conclusions, and pointed out that it's not a strict game of performance figures.  Developer time being the most valuable of project resources, it's only important that it performs &lt;i&gt;well enough&lt;/i&gt;.&lt;/p&gt;
&lt;h2&gt;Benchmarking CQL against Thrift RPC&lt;/h2&gt;
&lt;p&gt;Performance testing in Cassandra is typically done with the &lt;i&gt;stress&lt;/i&gt; utility (located in &lt;i&gt;tools/&lt;/i&gt;).  I recently found some time to &lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-2268"&gt;extend the stress utility for CQL&lt;/a&gt;, and now have some concrete results.&lt;/p&gt;
&lt;h3&gt;Test #1: Insert 20M rows x 5 columns&lt;/h3&gt;
&lt;p&gt;When run against the RPC interface, an insert is performed using a &lt;i&gt;batch_mutate()&lt;/i&gt; call with one row each.  For CQL, this translates to a single &lt;i&gt;UPDATE&lt;/i&gt; statement.  These tests stuck mostly to the defaults, so column names are ascii and 2 characters long, keys and values are binary of 7 and 34 bytes in length respectively.&lt;/p&gt;
&lt;p&gt;&lt;img src="http://media.acunu.com/library/insert_20mx5_noidx_t50.png" border="0" alt="Insert 20M rows x 5 columns (no index)" title="Insert 20M rows x 5 columns (no index)" width="100%" style="display: block; margin-left: auto; margin-right: auto; border: 0pt none;" /&gt;&lt;/p&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;&lt;br /&gt;&lt;/th&gt;&lt;th&gt;Average OP Rate&lt;/th&gt;&lt;th&gt;Average Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPC&lt;/td&gt;
&lt;td&gt;20,953/s&lt;/td&gt;
&lt;td&gt;1.6 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CQL&lt;/td&gt;
&lt;td&gt;19,176/s (-8%)&lt;/td&gt;
&lt;td&gt;1.7 ms (+9%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Test #2: Insert 10M rows x 5 columns with KEYS-type secondary index&lt;/h3&gt;
&lt;p&gt;This test is identical to the previous one with one exception: there is an index on one of the five test columns.  The gap in the results is a bit narrower here (-6% versus -8%); I believe this is because updating the index leaves the node more I/O bound for this test.&lt;/p&gt;
&lt;p&gt;&lt;img src="http://media.acunu.com/library/insert_10mx5_keysidx_t50.png" border="0" alt="Insert 10M rows x 5 columns (KEYS index)" title="Insert 10M rows x 5 columns (KEYS index)" width="100%" style="display: block; margin-left: auto; margin-right: auto; border: 0pt none;" /&gt;&lt;/p&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;&lt;br /&gt;&lt;/th&gt;&lt;th&gt;Average OP Rate&lt;/th&gt;&lt;th&gt;Average Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPC&lt;/td&gt;
&lt;td&gt;9,850/s&lt;/td&gt;
&lt;td&gt;5.3 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CQL&lt;/td&gt;
&lt;td&gt;9,290/s (-6%)&lt;/td&gt;
&lt;td&gt;5.5 ms (+4%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Test #3: Counter increments for 10M rows x 5 columns&lt;/h3&gt;
&lt;p&gt;Like the insert tests, the stress tool performs a &lt;i&gt;batch_mutate()&lt;/i&gt; of one row each when benching RPC, and for CQL uses an &lt;i&gt;UPDATE&lt;/i&gt; statement with the &lt;i&gt;&amp;lt;name&amp;gt; = &amp;lt;name&amp;gt; + 1&lt;/i&gt; syntax.&lt;/p&gt;
&lt;p&gt;&lt;img src="http://media.acunu.com/library/counter-add_10mx5_t50.png" border="0" alt="Counter increment, 10M rows x 5 columns" title="Counter increment, 10M rows x 5 columns" width="100%" style="display: block; margin-left: auto; margin-right: auto; border: 0pt none;" /&gt;&lt;/p&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;&lt;br /&gt;&lt;/th&gt;&lt;th&gt;Average OP Rate&lt;/th&gt;&lt;th&gt;Average Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPC&lt;/td&gt;
&lt;td&gt;18,052/s&lt;/td&gt;
&lt;td&gt;1.7 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CQL&lt;/td&gt;
&lt;td&gt;17,635/s (-2%)&lt;/td&gt;
&lt;td&gt;1.7 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Test #4: Read 20M rows x 5 columns&lt;/h3&gt;
&lt;p&gt;For RPC, this test translates to a &lt;i&gt;get_slice()&lt;/i&gt; with open-ended start and end columns, limited to 5 results.  For CQL, &lt;i&gt;stress&lt;/i&gt; performs a &lt;i&gt;SELECT&lt;/i&gt; in the form of &lt;i&gt;SELECT FIRST 5 ".." FROM...&lt;/i&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src="http://media.acunu.com/library/read_20mx5_t50.png" border="0" alt="Read 20M rows x 5 columns" title="Read 20M rows x 5 columns" width="100%" style="display: block; margin-left: auto; margin-right: auto; border: 0pt none;" /&gt;&lt;/p&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th&gt;&lt;br /&gt;&lt;/th&gt;&lt;th&gt;Average OP Rate&lt;/th&gt;&lt;th&gt;Average Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RPC&lt;/td&gt;
&lt;td&gt;22,726/s&lt;/td&gt;
&lt;td&gt;2.0 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CQL&lt;/td&gt;
&lt;td&gt;20,272/s (-11%)&lt;/td&gt;
&lt;td&gt;2.3 ms (+10%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Taking a closer look&lt;/h2&gt;
&lt;p&gt;First off, there are a few things which may have influenced these tests (against CQL), for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;i&gt;stress&lt;/i&gt; was written specifically with the RPC interface in mind and so makes the assumption that column names and values will always need to be converted to bytes.  This means that the CQL tests are enduring unnecessary round-trips from &lt;i&gt;String&lt;/i&gt;, to &lt;i&gt;ByteBuffer&lt;/i&gt;, and back to &lt;i&gt;String&lt;/i&gt; (which is actually quite expensive).&lt;/li&gt;
&lt;li&gt;Query compression wasn't implemented in &lt;i&gt;stress&lt;/i&gt;, and might have some effect on the larger queries.&lt;/li&gt;
&lt;li&gt;Client and server were run on the same machine.  Since CQL query parsing tends to push CPU usage a bit higher, a truer test would have the benchmark running on a different host.  That said, it isn't difficult to find some CQL-specific hot-spots when running Cassandra in a profiler.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Term parsing&lt;/h2&gt;
&lt;p&gt;A term here means things like keys and column names and values.  Imagine a sample query from Test #1 above:&lt;/p&gt;
&lt;div style="background-color: #e0f0ff; font-size: 8;"&gt;
&lt;pre&gt; UPDATE Standard1 USING CONSISTENCY ONE SET
    C1=d41d8cd98f00b204e9800998ecf8427ed41d8cd98f00b204e9800998ecf8427efd34,
    C2=03c7c0ace395d801803c7c0ace395d80182db07ae2c30f0342db07ae2c30f03407ae,
    C3=3691308f2a4c2f6983f2880d32369133691308f2a4c2f6983f2880d32e29c8408f2a,
    C4=9f6e6800cfae7749eb6c49f6e689f6e6800cfae7749eb6c486619254b9c00cfae774,
    C5=8f60c8102d29fcd52518f60c8102d298f60c8102d29fcd525162d02eed4566bfcd51
    WHERE KEY=00000000000001
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Column names are 2-character wide ASCII types.  Column values are 34 byte binary, which for CQL means they have been hex-encoded.  The key is also binary, so it is hex-encoded as well.  Of the 11 terms that need to be parsed in the above statement, you might think the bulk of the time was spent in parsing the longish binary values from hex-encoded strings, but surprisingly it's the 5 ASCII column names (C1, C2, etc) that are most expensive.  The time spent is attributable to Java's &lt;i&gt;String.getBytes(Charset)&lt;/i&gt;.  I don't know what to make of that.&lt;/p&gt;
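&lt;p&gt;To make the two kinds of term conversion concrete, here is a minimal sketch of what each involves. It is not Cassandra's actual parsing code: the class and method names are hypothetical, and input validation is left out.&lt;/p&gt;
&lt;div style="background-color: #e0f0ff;"&gt;
&lt;pre&gt;
```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from the Cassandra code base.
public class TermParsing {
    // Decode a hex-encoded term (a key or column value) into raw bytes.
    // Assumes an even number of valid hex digits; no validation is done.
    static ByteBuffer hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i != out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            out[i] = (byte) (hi * 16 + lo);
        }
        return ByteBuffer.wrap(out);
    }

    // Encode an ASCII term (such as the column name "C1") via String.getBytes(Charset).
    static ByteBuffer asciiToBytes(String term) {
        return ByteBuffer.wrap(term.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        ByteBuffer key = hexToBytes("00000000000001");
        ByteBuffer name = asciiToBytes("C1");
        // prints "7 key bytes, 2 name bytes"
        System.out.println(key.remaining() + " key bytes, " + name.remaining() + " name bytes");
    }
}
```
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;The hex path is a plain loop over characters, while the ASCII path goes through the JDK's charset encoder, which is where the profiler attributed the time.&lt;/p&gt;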
&lt;h3&gt;Copying and conversion&lt;/h3&gt;
&lt;p&gt;There is also a not inconsiderable amount of time that is spent in copying and conversion to and from bytes.  One example that I found quite surprising is the conversion of the query from &lt;i&gt;ByteBuffer&lt;/i&gt; to &lt;i&gt;String&lt;/i&gt; upon receiving the request.  Despite this being a once-per-query event, it ranks relatively high.  This could turn out to be important, since the only reason the query argument is binary (as opposed to a UTF8 string) is to support compression.  It &lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-1707"&gt;remains to be seen&lt;/a&gt; what the benefits of compression are, and it's possible this conversion could offset some or all of them.  One way or another, this will likely boil down to an argument in favor of a &lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-2478"&gt;custom wire protocol&lt;/a&gt;.&lt;/p&gt;
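&lt;p&gt;For reference, the once-per-query conversion looks roughly like the sketch below. This is an assumption about the general shape of the step, not the server's actual code; the class name, method name and shortened query string are made up for illustration, and compression is ignored.&lt;/p&gt;
&lt;div style="background-color: #e0f0ff;"&gt;
&lt;pre&gt;
```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; assumes an uncompressed UTF-8 query.
public class QueryDecoding {
    // Turn the binary query argument back into text before it can be parsed.
    static String decodeQuery(ByteBuffer query) {
        // duplicate() leaves the caller's buffer position untouched
        return StandardCharsets.UTF_8.decode(query.duplicate()).toString();
    }

    public static void main(String[] args) {
        String cql = "UPDATE Standard1 USING CONSISTENCY ONE SET C1=d41d WHERE KEY=00000000000001";
        ByteBuffer wire = ByteBuffer.wrap(cql.getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeQuery(wire).equals(cql)); // prints "true"
    }
}
```
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Every request pays for this decode (a full copy of the query into a new character buffer) before any parsing begins.&lt;/p&gt;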
&lt;h2&gt;So is it worth it?&lt;/h2&gt;
&lt;p&gt;Personally, I'm pretty pleased with these results.  It's easy to fall into the trap of making it all about the performance numbers, but the difference here is small enough that it hardly seems like reasonable justification for choosing one interface over the other.  Worst-case, this is a requirement of an additional node in a medium-sized cluster, and that can't possibly be more costly than the developer time you save from using CQL.  Also keep in mind that for these tests, the node is parsing the same query string for each and every request.  &lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-2475"&gt;Prepared statements&lt;/a&gt; will allow us to parse a statement just once, and send along the columns and keys for subsequent requests.  If there were no other optimizations available, this one alone would surely be enough.&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/mWrmmLWhe_0" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Eric Evans</dc:creator><pubDate>Mon, 12 Dec 2011 00:00:00 +0000</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/eric-evans/cql-benchmarking/</guid><feedburner:origLink>http://www.acunu.com/blogs/eric-evans/cql-benchmarking/</feedburner:origLink></item><item><title>How to rebuild 2TB disks in 30mins</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/BW89Yn-1BZ4/</link><description>&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;p&gt;One of the advantages of the Acunu Data Platform over any database relying on RAID for its underlying data redundancy is disk rebuild speed. ADP brings a new alternative to RAID which is much faster. Here's a graph showing how long it takes from the moment a disk fails to the moment that data is once again protected, showing that with ADP, your disk rebuild will be up to&amp;nbsp;&lt;b&gt;5 times faster&lt;/b&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;br /&gt;&lt;img src="http://media.acunu.com/library/rda.png" border="0" alt="RDA vs RAID rebuild performance" /&gt;&lt;/p&gt;
&lt;p&gt;And now for the all-important small-print!&amp;nbsp;&lt;/p&gt;
&lt;p&gt;We compared a vanilla Cassandra with Linux &lt;i&gt;md &lt;/i&gt;RAID to a pre-release v2 of our product, which uses a layout known as Randomised Duplicate Allocation (RDA).&amp;nbsp; In the 2-RDA mode, each block of data is duplicated, and the 2 copies allocated at random among the available devices (other schemes can use more than 2 copies, or a variable number of copies depending on the popularity of the data and space constraints).&lt;/p&gt;
&lt;p&gt;For RAID we rebuild to a new hot-spared disk (i.e. from 7 to 8 disks in the case of an 8 disk test).&amp;nbsp; Acunu uses a &lt;i&gt;distributed&lt;/i&gt; hot-spare model: when a disk fails, the blocks that were on it are duplicated elsewhere within the remaining disks, so that once the process is finished all blocks are again on 2 disks.&amp;nbsp;&amp;nbsp; In either case, the rebuild time reported is the window during which a second disk failure will cause data loss.&lt;/p&gt;
&lt;table align="center" border="0" cellpadding="5" cellspacing="5"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th align="right"&gt;Redundancy&lt;/th&gt;
&lt;td&gt;RAID-10&lt;/td&gt;
&lt;td&gt;RAID-5&lt;/td&gt;
&lt;td&gt;2-RDA&lt;/td&gt;
&lt;td&gt;2-RDA&lt;/td&gt;
&lt;td&gt;2-RDA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th align="right"&gt;Disks&lt;/th&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th align="right"&gt;Disk size (TB)&lt;/th&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th align="right"&gt;Total Utilisation&lt;/th&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;td&gt;44%&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th align="right"&gt;Total data (TB)&lt;/th&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th align="right"&gt;Exposure (h:mm)&lt;/th&gt;
&lt;td&gt;4:03&lt;/td&gt;
&lt;td&gt;3:38&lt;/td&gt;
&lt;td&gt;0:48&lt;/td&gt;
&lt;td&gt;0:44&lt;/td&gt;
&lt;td&gt;0:22&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Vanilla Cassandra at 100% capacity using RAID-10 would use 25% of the disk space for unique data -- 50% lost through duplication and 50% of what's left lost through Cassandra's requirement to keep disks half full at most.&amp;nbsp; Likewise, vanilla Cassandra using RAID-5 on 8 disks uses 44% of disk space for unique data.&lt;br /&gt;&lt;br /&gt;Acunu Cassandra at 100% capacity (using 2-RDA) would use 50% of the disk space for unique data -- 50% lost through duplication, but only trivial additional losses for merge overheads, since we do in-place merges.&lt;br /&gt;&lt;br /&gt;Utilisation is an important consideration for RDA because unlike RAID we only rebuild areas of disk that are used for storing data.&amp;nbsp; And in contrast to RAID, the data that needs to be duplicated can be distributed to all remaining disks -- thus is not limited by the bandwidth of a single device.&amp;nbsp; As a consequence, while RAID rebuild times are proportional to the&amp;nbsp;&lt;b&gt;size&lt;/b&gt;&amp;nbsp;of the failed device, RDA rebuilds are proportional to the&amp;nbsp;&lt;b&gt;amount of data&lt;/b&gt;&amp;nbsp;on the failed device, and inversely proportional to the&amp;nbsp;&lt;b&gt;number of devices&lt;/b&gt;&amp;nbsp;that remain.&amp;nbsp; This is why rebuild time is roughly the same for columns 3 and 4 above -- the amount of data per device has doubled, but so has the number of devices -- and smaller for the fifth column, in which we examine rebuild time for a node at only 50% capacity.&lt;/p&gt;
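&lt;p&gt;The proportionality argument can be checked with a back-of-the-envelope model. The sketch below is only that: it works in relative units rather than hours, assumes every device has the same bandwidth, and reads the table's utilisation figure as the share of raw space holding unique data (so a 2-RDA device holds twice that share). The class and method names are invented for illustration.&lt;/p&gt;
&lt;div style="background-color: #e0f0ff;"&gt;
&lt;pre&gt;
```java
// Hypothetical model in relative units; not a measurement and not Acunu's code.
public class RebuildModel {
    // Data held on one device in a 2-RDA layout: every unique byte is stored twice.
    static double dataPerDeviceTb(double diskSizeTb, double uniqueUtilisation) {
        return diskSizeTb * 2.0 * uniqueUtilisation;
    }

    // Relative RDA exposure: data on the failed device, shared among the survivors.
    static double rdaRelativeCost(int disks, double diskSizeTb, double uniqueUtilisation) {
        return dataPerDeviceTb(diskSizeTb, uniqueUtilisation) / (disks - 1);
    }

    // Relative RAID exposure: the whole device is rebuilt onto a single hot spare.
    static double raidRelativeCost(double diskSizeTb) {
        return diskSizeTb;
    }

    public static void main(String[] args) {
        System.out.println("RAID,   1TB disks:      " + raidRelativeCost(1.0));
        System.out.println("2-RDA,  8 x 1TB at 50%: " + rdaRelativeCost(8, 1.0, 0.50));
        System.out.println("2-RDA, 16 x 2TB at 50%: " + rdaRelativeCost(16, 2.0, 0.50));
        System.out.println("2-RDA, 16 x 2TB at 25%: " + rdaRelativeCost(16, 2.0, 0.25));
    }
}
```
&lt;/pre&gt;
&lt;/div&gt;
&lt;p&gt;Under these assumptions the two 50% configurations come out within about 7% of each other (1/7 versus 2/15), and the 25% configuration costs exactly half of the 16-disk 50% case, which matches the pattern of 0:48, 0:44 and 0:22 in the table.&lt;/p&gt;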
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/BW89Yn-1BZ4" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Byde</dc:creator><pubDate>Sat, 03 Dec 2011 00:00:00 +0000</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/dr-andrew-byde/faster-disk-rebuilds/</guid><feedburner:origLink>http://www.acunu.com/blogs/dr-andrew-byde/faster-disk-rebuilds/</feedburner:origLink></item><item><title>Acunu Data Platform v1.2 released!</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/0w1T-DneRPU/</link><description>&lt;p&gt;We're excited to announce the release of version 1.2 of the Acunu Data Platform, incorporating Apache Cassandra -- the fastest and lowest-risk route to building a production-grade Cassandra cluster.&lt;/p&gt;
&lt;p&gt;The Acunu Data Platform (ADP) is an all-in-one distributed database solution, delivered as a software appliance for your own data center or an Amazon Machine Image (AMI) for cloud deployments. It includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A hardened version of Apache Cassandra that is 100% compatible with existing Cassandra applications&lt;/li&gt;
&lt;li&gt;The Acunu Core, a file system and embedded database designed from the ground-up for Big Data workloads&lt;/li&gt;
&lt;li&gt;A web-based management console that simplifies deployment, monitoring and scaling of your cluster.&lt;/li&gt;
&lt;li&gt;Standard CentOS Linux&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This release lets you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Build a production-grade cluster quickly. Our installer configures the whole machine for you, from database down to disks. Then point a web browser at the management UI that runs on every machine to manage or grow your cluster. With the Acunu Core, your Cassandra cluster is inherently tuning-free, reducing the risk and lowering the barrier to entry to dependable production deployments.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;Deliver consistent performance, even under high load. Thanks to the Acunu Core, ADP&amp;nbsp;eliminates garbage collection pauses and optimises caching and prefetching. It delivers performance on slow SATA disks that is blisteringly fast, yet consistent and dependable under sustained peak load.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;Scale up, as well as scale out. ADP is horizontally scalable, meaning you can add nodes to meet additional capacity or performance needs. But scaling out is only half the story. ADP delivers higher utilisation of hardware resources than even enterprise storage appliances. This means you need fewer nodes to meet your needs.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;Monitor your whole stack at once. It ships complete with cluster-wide management tools that significantly simplify deploying and running a tuned, production-grade Cassandra cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;Robust and tested end-to-end: This release has undergone more than a thousand hours of automated and manual testing, including multi-terabyte workloads on hundreds of billions of keys. We test on EC2 and on hardware, to make Acunu more robust than any other Big Data or NoSQL database.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other new features in version 1.2 include improved redistribution of data when adding new nodes to the cluster, better handling of distributed counters, and a new backup utility.&lt;/p&gt;
&lt;p&gt;To find out more, &lt;a href="http://www.acunu.com/company/contact/"&gt;contact us&lt;/a&gt; or &lt;a href="https://www.acunu.com/download/acunu-cassandra/"&gt;download&lt;/a&gt; the release now!&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/0w1T-DneRPU" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Manu Marchal</dc:creator><pubDate>Tue, 22 Nov 2011 00:00:00 +0000</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/manu-marchal/acunu-data-platform-version-1-2-released/</guid><feedburner:origLink>http://www.acunu.com/blogs/manu-marchal/acunu-data-platform-version-1-2-released/</feedburner:origLink></item><item><title>The Hadoop Universe</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/bd_1TSgZ9oA/</link><description>&lt;div&gt;
&lt;p&gt;&lt;b&gt;Making Sense of the Big Data Universe&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Part 1: In the orbit of Apache Hadoop&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;When I started as a software engineer, forever ago in the early 2000s, the world was simpler. Building a web-based application at a bank? You could choose any architecture you wanted, as long as it was a three-tier architecture, and any language you wanted to from the following list: Java 2 Enterprise Edition. Picking your tools was like ordering from a prix fixe menu at a restaurant:&lt;/p&gt;
&lt;div style="text-align: center;"&gt;
&lt;p&gt;&lt;img src="http://media.acunu.com/library/tier.png" border="0" width="200" /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;But today, with Big Data all the rage, it's hard to know where to begin to make sense of the possible architectures, let alone tools. If you attend a tech meetup, you're likely to hear the strangest sentences in conversation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;i&gt;"... yeah we're running Hive on Hadoop and writing to Cassandra ..."&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;"... oh have you looked into Mongo? We are having some success with Couch though &amp;hellip;"&lt;/i&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;"... it's a graph so we're looking at Hama or maybe Neo4J &amp;hellip;"&amp;nbsp; &amp;nbsp;&lt;/i&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is still uncertainty, even amongst practicing engineers, about just what one does with these wonderful, free tools that have sprung up, promising easy answers to ... something or other related to lots of data.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;In particular, there is still confusion about what to think of the biggest names thrown around in Big Data -- Hadoop and NoSQL. Do I store my data in Hadoop? Or is it that NoSQL makes my database queries faster? Or can MapReduce speed up my web server? (No, to all three, by the way.)&lt;/p&gt;
&lt;p&gt;In two articles, I will try to survey these two quite different domains -- Hadoop, and NoSQL -- and a few interesting insights about what to make of the most popular open-source tools in each of them. After this, you'll be able to name-drop with the best of the big data geeks at your next cocktail party!&lt;/p&gt;
&lt;p&gt;It's not comprehensive: there are even more projects out there, some gaining a great deal of momentum -- not to mention a number of excellent proprietary products. In two years the landscape will no doubt be quite different. We're undergoing a Cambrian explosion of ideas, projects and tools -- we don't yet know what the surviving "species" in this world are likely to be. But among the dominant creatures in 2011, and the topic of this first part, is &amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Apache Hadoop&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;Apache Hadoop is often described as an implementation of MapReduce, which is a distributed computation paradigm popularized and honed inside of Google. Many articles and tutorials have been written about MapReduce itself; I won't repeat them here.&lt;/p&gt;
&lt;p&gt;But in reality, Hadoop is more than MapReduce. It is the center of a miniature solar system of open-source projects from Apache, orbiting the core Hadoop projects. A diagram may explain it better than anything:&lt;/p&gt;
&lt;p&gt;&lt;img src="http://stg-media.acunu.com/library/apachehadooporbit.png" border="0" alt="The Hadoop Universe" title="The Hadoop Universe" width="664" height="496" /&gt;&lt;/p&gt;
&lt;p&gt;Hadoop itself is, in fact, at least two sub-projects: MapReduce and HDFS. &lt;b&gt;Hadoop MapReduce&lt;/b&gt; manages &lt;i&gt;computation&lt;/i&gt; in the MapReduce paradigm. It concerns starting computation tasks and overseeing their progress. It does not, by itself, have anything to do with storing the data that is input to or output from the MapReduce job.&amp;nbsp;&lt;/p&gt;
&lt;p&gt;This has traditionally been the role of &lt;b&gt;Hadoop HDFS&lt;/b&gt;, or the Hadoop Distributed File System. As its name implies, HDFS acts like a file system, but one that is by nature distributed over many machines. Chunks of data are replicated across several computers, for reliability and performance. HDFS, like other file systems, represents files as many chunks of bytes; in HDFS's case, these chunks are huge (64MB or more) compared to those of your computer's file system, as it stores files whose size is measured in terabytes.&lt;/p&gt;
&lt;p&gt;It is good for rapid, sequential reads through big files -- which is conveniently exactly how Hadoop MapReduce's workers like to read and write data. HDFS files can't be changed after being written; its files are write-once and append-only. HDFS is the default and perhaps best choice for storing data that will be used with MapReduce. It excels at distributed storage of massive unstructured data like log files.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Apache HBase&lt;/b&gt; is also a storage system, with roots in Hadoop, from which it gets its "H". Though HBase uses HDFS for underlying storage, HBase is designed much more for fast and frequent access to blobs of binary data. It is an example of what most would call a NoSQL column-oriented store; it holds semi-structured values for keys. More on this in the next article; storage is a topic unto itself, as are the relative merits of each platform and what you might use them for.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Apache Cassandra&lt;/b&gt; is a prominent and popular feature of the Hadoop landscape. It originated at Facebook, and in turn has its roots in Amazon's Dynamo project. Architecturally, it has more in common with something like HBase than HDFS. That is, it is not a distributed file system, but is also a NoSQL-style store that specializes in quick access to relatively small pieces of data. In comparison to HBase, Cassandra emphasizes tolerating, for example, network failures. Cassandra is also a column-oriented type of store, and again -- this and more deserves its own discussion, next time.&lt;/p&gt;
&lt;p&gt;Sitting in between some of these storage systems are two separate projects with a similar purpose: &lt;b&gt;Apache Avro&lt;/b&gt; and &lt;b&gt;Apache Thrift&lt;/b&gt;. These are not servers or paradigms but rather serialization systems. They provide an easy way to serialize data types compactly to bytes for storage in HDFS (or, perhaps, other NoSQL stores) and processing in Hadoop. They are analogous to Google's Protocol Buffers. If you are storing or transmitting complex, structured data types in your Big Data system that are not merely primitive types like integers or strings, you will probably enjoy the convenience of letting a system like Avro or Thrift manage the details of moving those objects around for you, rather than writing your own serialization and deserialization.&lt;/p&gt;
&lt;p&gt;Apache Pig and Hive are two projects that layer on top of Hadoop, and provide a higher-level language for using Hadoop's MapReduce library. &lt;b&gt;Apache Pig&lt;/b&gt; provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for. Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a bash or perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.&lt;/p&gt;
&lt;p&gt;If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop". &lt;b&gt;Apache Hive&lt;/b&gt; offers an even more specific and higher-level language, for querying data by running Hadoop jobs, rather than directly scripting step-by-step the operation of several MapReduce jobs on Hadoop. The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running batch-oriented queries over massive data; it's not "real-time" in any sense. Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data across data stored in storage systems mentioned above.&lt;/p&gt;
&lt;p&gt;Higher-level still are two projects that build specific applications on top of Hadoop's MapReduce infrastructure. &lt;b&gt;Apache Mahout&lt;/b&gt; implements machine learning algorithms at large scale, including clustering, classification and collaborative filtering. It provides implementations of complete ready-to-run Hadoop jobs for these techniques (as well as some implementations that do not use Hadoop). Mahout is also an excellent way to extract some value out of that data you've been hoarding, and that new Hadoop cluster you brought online: businesses use it to sell more intelligently to customers, for example. &lt;b&gt;Apache Chukwa&lt;/b&gt; provides a means to collect, store and analyze logs using Hadoop -- enough said!&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Apache Zookeeper&lt;/b&gt; is best described as a coordination server. It's useful when processes across many different machines need to coordinate access to a shared resource. In this sense, think of it as the distributed analog of Java's "synchronized" primitives and concurrency libraries. It can also be used to share small amounts of data for configuration purposes. It is frequently used in systems involving Hadoop, and is used by HBase directly. Zookeeper should probably be used by more distributed systems out there; anytime one component needs to signal another to do something, such as update data, Zookeeper can be useful.&lt;/p&gt;
&lt;p&gt;And finally, &lt;b&gt;Apache Giraph&lt;/b&gt; is an early-stage project in Apache's incubator system that, like Hadoop MapReduce, offers a distributed computing paradigm; it is graph-oriented. (&lt;b&gt;Apache Hama&lt;/b&gt; is a different project which implements the same sort of graph-oriented view of distributed computation.) Imagine a massive network of small nodes, each of which can conceptually run some computation, emit messages to other neighbors in the network, and receive messages from others. Some algorithms are simply much easier to express in a graph-oriented paradigm like this than in MapReduce -- Google's PageRank is an example. Giraph and Hama are more exotic tools, but are especially appropriate and powerful in cases where your data is related to a graph of some kind -- a social network for example. Expressing processes and algorithms to analyze these networks is likely far more natural using these frameworks.&lt;/p&gt;
&lt;p&gt;This concludes a fly-by of the orbit of the Apache Hadoop ecosystem. We didn't even visit some of the interesting smaller asteroids like Apache Vaidya. Hopefully it is clear that only a small slice of Hadoop concerns data storage. But in the follow-on article, we'll take a look at some of the very many "NoSQL" stores out there ready to store your Big Data.&lt;/p&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/bd_1TSgZ9oA" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Sean Owen</dc:creator><pubDate>Tue, 15 Nov 2011 00:00:00 +0000</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/sean-owen/hadoop-universe/</guid><feedburner:origLink>http://www.acunu.com/blogs/sean-owen/hadoop-universe/</feedburner:origLink></item><item><title>CQL Quick Reference Card</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/4XXb6vUNcYo/</link><description>&lt;div&gt;
&lt;p&gt;Following the popularity of Acunu's free distribution of Cassandra CQL Quick Reference cards at &lt;a href="http://www.acunu.com/admin/web/blogpost/49/www.apachecon.com" target="_blank"&gt;ApacheCon&lt;/a&gt;, Acunu is sharing this small wonder with the developer world!&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;p&gt;CQL Quick Reference cards are an ideal companion to any developer or software engineer using or thinking about using the&amp;nbsp;Cassandra Query Language. Written by Cassandra Committer&amp;nbsp;&lt;a href="https://twitter.com/#!/jericevans"&gt;Eric Evans&lt;/a&gt;, there is no better source for quick hand knowledge on CQL. Enjoy!&lt;/p&gt;
&lt;/div&gt;
&lt;div style="text-align: center;"&gt;
&lt;p&gt;&lt;a href="http://media.acunu.com/library/cql_quick_reference_card.pdf"&gt;&lt;img src="http://media.acunu.com/library/cql-quick-reference-card.png" border="0" alt="CQL Quick Reference Card" title="CQL Quick Reference Card" width="50%" style="border: 0px initial initial;" /&gt;&lt;br /&gt;Click to download the pdf version&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/4XXb6vUNcYo" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Acunu</dc:creator><pubDate>Wed, 09 Nov 2011 00:00:00 +0000</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/acunu/cql-quick-reference-cards/</guid><feedburner:origLink>http://www.acunu.com/blogs/acunu/cql-quick-reference-cards/</feedburner:origLink></item><item><title>Data Modelling with Cassandra</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/mD57Pf00f2M/</link><description>&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;p&gt;Denormalisation is essential at scale, and Cassandra's read/write tradeoff is well-adapted for it.&amp;nbsp; In this article we work through an example use case showing how this works in practice.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;I will consider an asymmetric messaging application and go through the steps explaining how such a system might be implemented in Cassandra, and the properties of Cassandra that we can exploit in order to design a performant system.&lt;/p&gt;
&lt;p&gt;Asymmetric messaging is a messaging model popularised most famously by Twitter's follower model, but could represent any publish-subscribe system whereby messages are broadcast from one user to one or more other users, without the sender having to explicitly specify the recipients.&lt;/p&gt;
&lt;h2&gt;Relational Model&lt;/h2&gt;
&lt;p&gt;In our model we have three entities: Users, Edges and Messages. A standard normalised relational model might look something like this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;User(id) &amp;lt;-1--*-&amp;gt; Edge(follower_id, followee_id) &amp;lt;-1--*-&amp;gt; Message(id, user_id, msg)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;An edge is a directional link representing the asymmetric (follower) relationship between two users.&lt;/p&gt;
&lt;p&gt;The operations we will allow are subscribe (add an edge), broadcast (add a message), and read-timeline (list all messages from users that a given user follows). In our relational model, these operations would be implemented as follows:&lt;/p&gt;
&lt;h4&gt;subscribe&lt;/h4&gt;
&lt;p&gt;User A subscribes to user B:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;INSERT INTO Edge VALUES (A, B)&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;broadcast&lt;/h4&gt;
&lt;p&gt;User B broadcasts message M:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;INSERT INTO Message(user_id, msg) VALUES (B, M)&lt;/code&gt;&lt;/p&gt;
&lt;h4&gt;read-timeline&lt;/h4&gt;
&lt;p&gt;User A reads all messages from users to whom he is subscribed:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;SELECT m.user_id, m.msg FROM Message m, Edge e WHERE e.follower_id = 'A' AND m.user_id = e.followee_id&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;We can see that with this schema, we must perform a JOIN on Edge and Message tables in order to retrieve the timeline for a given user.&lt;/p&gt;
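&lt;p&gt;For readers who want to try the three operations, the following is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table and column names follow the relational model above; the column types, the two indexes, the ordering by message id and the sample users are assumptions made for the example.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
import sqlite3

# In-memory database; column types and indexes are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Edge (follower_id TEXT, followee_id TEXT);
    CREATE TABLE Message (id INTEGER PRIMARY KEY, user_id TEXT, msg TEXT);
    CREATE INDEX edge_follower ON Edge (follower_id);
    CREATE INDEX message_user ON Message (user_id);
""")


def subscribe(follower, followee):
    """subscribe: add an edge."""
    conn.execute("INSERT INTO Edge VALUES (?, ?)", (follower, followee))


def broadcast(user, msg):
    """broadcast: add a message."""
    conn.execute("INSERT INTO Message (user_id, msg) VALUES (?, ?)", (user, msg))


def read_timeline(user):
    """read-timeline: join Edge and Message for one follower."""
    rows = conn.execute(
        "SELECT m.user_id, m.msg FROM Message m, Edge e "
        "WHERE e.follower_id = ? AND m.user_id = e.followee_id "
        "ORDER BY m.id",
        (user,),
    )
    return rows.fetchall()


subscribe("A", "B")
subscribe("A", "C")
broadcast("B", "hello from B")
broadcast("C", "hello from C")
broadcast("D", "A does not follow D")
timeline = read_timeline("A")
print(timeline)
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The message from D does not appear in A's timeline because there is no edge from A to D.&lt;/p&gt;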
&lt;p&gt;Performing a table join is an expensive operation, as it requires some amount of random disk IO. Assuming we have indexes on both Edge.follower_id and Message.user_id, the number of users followed by user A is &lt;i&gt;F&lt;/i&gt;, and the average number of messages per followee is &lt;i&gt;M&lt;/i&gt;, we will need to do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;i&gt;O(1)&lt;/i&gt; seeks to look up A in the Edge index and obtain the users he follows&lt;/li&gt;
&lt;li&gt;&lt;i&gt;O(F)&lt;/i&gt; seeks to look up the messages from each followee in the Messages index&lt;/li&gt;
&lt;li&gt;&lt;i&gt;O(FM)&lt;/i&gt; seeks to retrieve the messages&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means that to populate a timeline we will be doing approximately one seek per message on average.&lt;/p&gt;
&lt;p&gt;A disk doesn't do many IO operations per second (IOPS): the ballpark figure we work with is 100 IOPS per disk. At roughly one seek per message, this means that for a modest timeline of 100 messages, we can only read 1 user's timeline per second per disk! On the other hand, disk bandwidth is of the order of 100MBps. If a message is somewhere in the region of 140 bytes (to choose a number completely at random), we can calculate that the upper bound for reading messages should be roughly 750,000 messages (or 7,500 user timelines) per second per disk.&lt;/p&gt;
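&lt;p&gt;The back-of-the-envelope comparison can be reproduced in a few lines of Python. The figures are the ballpark numbers quoted above; reading "100MBps" as 100 MiB per second is an assumption, and it is the reading that lands close to the quoted 750,000 messages per second.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
# Ballpark figures from the text.
IOPS_PER_DISK = 100
MESSAGE_BYTES = 140
TIMELINE_MESSAGES = 100

# Assumption: "100MBps" is read as 100 MiB per second.
BANDWIDTH_BYTES_PER_SEC = 100 * 1024 * 1024

# Seek-bound case: about one seek per message.
seek_bound_timelines = IOPS_PER_DISK / TIMELINE_MESSAGES

# Bandwidth-bound case: messages are read sequentially.
sequential_messages = BANDWIDTH_BYTES_PER_SEC // MESSAGE_BYTES
sequential_timelines = sequential_messages // TIMELINE_MESSAGES

print("seek-bound timelines per second:", seek_bound_timelines)
print("sequential messages per second:", sequential_messages)
print("sequential timelines per second:", sequential_timelines)
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under these assumptions the gap between the two cases is nearly four orders of magnitude, which is the motivation for making the reads sequential.&lt;/p&gt;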
&lt;h3&gt;Denormalisation&lt;/h3&gt;
&lt;p&gt;By denormalising the schema we can make much better use of our disk bandwidth by making our reads sequential. Cassandra allows us to have extremely wide rows (up to 2 billion columns) and we can take advantage of this to design a denormalised schema that gives us much better read performance for the 'read-timeline' operation.&lt;/p&gt;
&lt;h2&gt;Cassandra Model&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;Column Family
    Row key
        Column key

Edges: {
    follower_id: {
        user1: 1,
        user2: 1,
        ...
    }
}

UserMessages: {
    user1: {
        timestamp1: message1,
        timestamp2: message2,
        ...
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the denormalised model, the Users and Messages entities have been joined together to form the UserMessages column-family, where each row-key is a user ID, each column name in that row is a time-stamp, and each column value is a message. This model allows us to query the entire timeline for a user with &lt;i&gt;O(1)&lt;/i&gt; disk seeks to find the user's row, followed by sequential reads of the timeline.&lt;/p&gt;
&lt;p&gt;The cost of denormalisation is duplication of data, and in this case we have duplicated each message by copying it to the timeline of every user who follows the author of that message. This means we incur a write cost when broadcasting a message, since we must insert the same message multiple times. Luckily for us, Cassandra is optimised for high write throughput (writes perform only sequential I/O) and it is this performance profile of Cassandra that allows us to trade some write speed for increased read throughput.&lt;/p&gt;
&lt;p&gt;Another slightly less obvious cost of denormalisation is that we must now read the Edges column family when performing a write, since we must find out which users the message author is followed by. In practice it should be possible to fit this column family in cache even for a large number of users, so this look-up will only rarely hit disk.&lt;/p&gt;
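&lt;p&gt;The write-time fan-out can be sketched with ordinary Python dictionaries standing in for the two column families. This is not Cassandra client code: the names followers, user_messages and clock are hypothetical, the Edges lookup is assumed to be keyed by the author being followed, and a simple counter stands in for real timestamps.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
```python
from collections import defaultdict
from itertools import count

# Hypothetical in-memory stand-ins for the two column families.
# followers[author] holds the users who follow that author (the Edges
# lookup needed at write time); user_messages[user] maps timestamp to message.
followers = defaultdict(set)
user_messages = defaultdict(dict)
clock = count(1)  # stand-in for real timestamps


def subscribe(follower, followee):
    followers[followee].add(follower)


def broadcast(author, msg):
    """Copy the message into the timeline row of every follower."""
    timestamp = next(clock)
    for follower in followers[author]:
        user_messages[follower][timestamp] = msg
    return timestamp


def read_timeline(user):
    """One row lookup, then an ordered scan of that row's columns."""
    row = user_messages[user]
    return [row[ts] for ts in sorted(row)]


subscribe("A", "B")
subscribe("C", "B")
broadcast("B", "first")
broadcast("B", "second")
print(read_timeline("A"))
```
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each broadcast performs one write per follower, while reading a timeline touches a single row. That is the trade of write work for read throughput described above.&lt;/p&gt;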
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;We have seen how to exploit some of the properties of Cassandra (very wide rows and a heavily write-optimised performance profile) to perform denormalisation and greatly improve our read throughput.&lt;/p&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/mD57Pf00f2M" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Sam Overton</dc:creator><pubDate>Mon, 07 Nov 2011 00:00:00 +0000</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/sam-overton/cassandra-data-modelling/</guid><feedburner:origLink>http://www.acunu.com/blogs/sam-overton/cassandra-data-modelling/</feedburner:origLink></item><item><title>Cassandra Drivers Released!</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/phm_V_ResGk/</link><description>&lt;p&gt;Within the&amp;nbsp;&lt;a class="external text" href="http://cassandra.apache.org/" title="http://cassandra.apache.org"&gt;Apache Cassandra Project&lt;/a&gt;, the status of client code has been evolving over time. The default position has been that client code is something to be maintained by third-parties. There were a number of reasons for taking this position, not least of which was a desire to allow innovation: if a standard is needed, better to let it emerge rather than anoint one.&lt;/p&gt;
&lt;p&gt;&lt;a name="Dude.2C_Where.27s_My_Driver.3F"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Dude, Where's My Driver?&amp;nbsp;&lt;/h2&gt;
&lt;p&gt;Database drivers however were kept within the project itself. The reasoning was that the concept of "driver" is meant to constitute the lowest level bits, things like parameter substitution, protocol abstraction, and connection pooling, the elements common to any client library. As they have progressed however, these drivers have evolved into higher-level abstractions, typically adopting a standard for database connectivity in the target platform (JDBC, DB-API 2, etc). This had the effect of putting those of us working on the Cassandra project into the business of maintaining clients, something we'd taken pains to avoid before.&lt;/p&gt;
&lt;p&gt;Many of these drivers may have evolved past the point where it makes sense to maintain them within Cassandra, but they still fall quite short of high-level clients like Hector or Pycassa. And, unlike projects such as Hector or Pycassa, competing implementations don't make as much sense since they are based on standards that leave little room to differentiate. A compromise was needed, something between maintenance inside Cassandra, and a complete disavowal of All Things Client.&lt;/p&gt;
&lt;p&gt;&lt;a name="Apache_Extras"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Apache Extras&lt;/h2&gt;
&lt;p&gt;The compromise that was struck was to spin the drivers off into their own projects hosted on&amp;nbsp;&lt;a class="external text" href="http://code.google.com/a/apache-extras.org/hosting/" title="http://code.google.com/a/apache-extras.org/hosting/"&gt;Apache Extras&lt;/a&gt;. Each of these new projects will have the same code as before, and the same people working on them, but each will have its own project page, bug tracking, and source control as well. Each project is free to choose its own members, and to make its own decisions.&lt;/p&gt;
&lt;p&gt;A set of common conventions (project naming, licensing, etc) and the Apache Extras branding will help to identify drivers with a close association, and a yet-to-be-finalized list of testing and acceptance criteria will be used by the Cassandra project to "certify" qualifying drivers.&lt;/p&gt;
&lt;p&gt;When Cassandra developers and client maintainers need to coordinate, the&amp;nbsp;&lt;a class="external text" href="mailto:client-dev-subscribe@cassandra.apache.org" title="mailto:client-dev-subscribe@cassandra.apache.org"&gt;client-dev@cassandra.apache.org&lt;/a&gt;&amp;nbsp;mailing list will provide the venue, and commits for each of the drivers are copied to&amp;nbsp;&lt;a class="external text" href="mailto:commits-subscribe@cassandra.apache.org" title="mailto:commits-subscribe@cassandra.apache.org"&gt;commits@cassandra.apache.org&lt;/a&gt;&amp;nbsp;to keep everyone in the loop.&lt;/p&gt;
&lt;p&gt;It's been an awkward, and at times frustrating trip to get to this point, but I'm pleased with this new direction, and confident that this brings us closer to ideal.&lt;/p&gt;
&lt;p&gt;&lt;a name="JDBC_and_DBAPI2_Live"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;JDBC and DBAPI2 Live&lt;/h2&gt;
&lt;p&gt;As particular examples of this brave new world, I'm pleased to announce that with the completion of&amp;nbsp;&lt;a class="external text" href="https://issues.apache.org/jira/browse/CASSANDRA-3180" title="https://issues.apache.org/jira/browse/CASSANDRA-3180"&gt;CASSANDRA-3180&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a class="external text" href="https://issues.apache.org/jira/browse/CASSANDRA-3300" title="https://issues.apache.org/jira/browse/CASSANDRA-3300"&gt;CASSANDRA-3300&lt;/a&gt;, the moves of the Python DB-API2 and JDBC drivers to&amp;nbsp;&lt;a class="external text" href="http://code.google.com/a/apache-extras.org/hosting/" title="http://code.google.com/a/apache-extras.org/hosting/"&gt;Apache Extras&lt;/a&gt;&amp;nbsp;are now complete. You can find the code for each &lt;a href="http://code.google.com/a/apache-extras.org/p/cassandra-dbapi2"&gt;here&lt;/a&gt;&amp;nbsp;and &lt;a href="http://code.google.com/a/apache-extras.org/p/cassandra-jdbc"&gt;here&lt;/a&gt;&amp;nbsp;respectively.&lt;/p&gt;
&lt;p&gt;Keep your eye on these projects for updates in the weeks to come, and if you're holding out for Ruby and PHP, keep your eye&amp;nbsp;&lt;a class="external text" href="http://code.google.com/a/apache-extras.org/p/cassandra-ruby" title="http://code.google.com/a/apache-extras.org/p/cassandra-ruby"&gt;here&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a class="external text" href="http://code.google.com/a/apache-extras.org/p/cassandra-pdo" title="http://code.google.com/a/apache-extras.org/p/cassandra-pdo"&gt;here&lt;/a&gt;&amp;nbsp;as well.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;14th October, 2011 | Eric Evans | eric@acunu.com&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/p&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/phm_V_ResGk" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Eric Evans</dc:creator><pubDate>Fri, 14 Oct 2011 00:00:00 +0100</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/eric-evans/cassandra-drivers-released/</guid><feedburner:origLink>http://www.acunu.com/blogs/eric-evans/cassandra-drivers-released/</feedburner:origLink></item><item><title>Big Data news from Strata</title><link>http://feedproxy.google.com/~r/BigDataInsights/~3/tuDXX9AHf3w/</link><description>&lt;p&gt;&lt;img src="http://media.acunu.com/library/strata.png" border="0" alt="Strata" width="200" height="114" /&gt;&lt;/p&gt;
&lt;div class="nH"&gt;
&lt;div class="nH hx"&gt;
&lt;div class="nH"&gt;
&lt;div class="h7  "&gt;
&lt;div class="Bk"&gt;
&lt;div class="G3 G2"&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div class="HprMsc mNrSre"&gt;
&lt;div class="gs"&gt;
&lt;div class="ii gt"&gt;
&lt;div&gt;The Big Data world continues to grow and O'Reilly's "Strata" conference in New York provided plentiful evidence of this. &amp;nbsp;The conference was a full five days: a one-day "Strata Jumpstart" session, a two-day Summit aimed at a business perspective and the final two-day Strata Conference itself.&lt;br /&gt;&lt;br /&gt;I attended the last two days. These were supposed to focus on the "nuts and bolts" of Big Data but such a description fails to do justice to the wide range of people present. &amp;nbsp;There were significant numbers of data scientists, both current and aspiring. In case you are wondering what skills an aspiring data scientist needs, John Rauser's inspirational keynote talk ("What is a Career in Big Data?") listed maths, engineering, writing, scepticism and curiosity. You can see his talk here:&amp;nbsp;&lt;a href="http://www.youtube.com/watch?v=0tuEEnL61HM" target="_blank"&gt;http://www.youtube.com/watch?v=0tuEEnL61HM&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Another talk that seemed to engage the audience and that I enjoyed hugely was Martin Maden's "First, Firster, Firstest" which managed to be compelling, amusing and informative. It covered a range of issues concerned with data storage and management from the Elizabethan era onwards, touching on library classifications, taxonomies and schemas. Who else could get Francis Bacon and NoSQL in the same presentation? &amp;nbsp;And who knew that page rank was invented in the 1930s? &amp;nbsp;&lt;a href="http://www.youtube.com/watch?v=Qv0yF47L8WE" target="_blank"&gt;http://www.youtube.com/watch?v=Qv0yF47L8WE&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;From a technology standpoint, it is clear that Hadoop remains at the top of a lot of people's lists. It has instant name recognition now and lots of commercial take-up beyond Cloudera, the company that first brought commercial support for Hadoop to market. 
The downside of widespread recognition is that the Hadoop marketplace is now really quite crowded and understanding the nuances of what each of the players is offering is a little complex.&lt;br /&gt;&lt;br /&gt;Why the fuss? It's simply that Hadoop helps people solve a whole series of big data analytics problems relatively easily and very cost-effectively. &amp;nbsp;The talk by Abhishek Mehta from Tresata was a great example. Abhishek focused on what Hadoop means for banking: a mere 1-2% of the 10-50 petabytes of data in a typical bank is subject to analysis. Hadoop enables more of this data to be used effectively and, as a result, banks have an opportunity to price assets more effectively, offer real personalisation of services rather than coarse segmentation, and use outliers to inform models rather than stress them, by looking at populations rather than samples.&lt;br /&gt;&lt;br /&gt;Even Tresata would agree that in the real-time area, Hadoop is not the answer. In his talk on the "Big Data Pipeline", Acunu's CEO Tim Moreton talked more generally about how emerging requirements for real-time analytics call for solutions that are distinct from the current generation of Hadoop solutions, which tend to focus on batch analytics. Tim's slides are here:&amp;nbsp;&lt;a href="http://assets.en.oreilly.com/1/event/63/Navigating%20the%20Data%20Pipeline%20Presentation.pdf" target="_blank"&gt;http://assets.en.oreilly.com/1/event/63/Navigating%20the%20Data%20Pipeline%20Presentation.pdf&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There's also an interview with Tim Moreton here:&amp;nbsp;&lt;a href="http://www.youtube.com/watch?v=pXcxG1ItgxM" target="_blank"&gt;http://www.youtube.com/watch?v=pXcxG1ItgxM&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The next Strata Conference is scheduled to start on February 28th in Santa Clara and if it is anything like as good as the New York edition, it should be worth attending. 
You can find out more here:&amp;nbsp;&lt;a href="http://strataconf.com/strata2012" target="_blank"&gt;http://strataconf.com/strata2012&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;img src="http://feeds.feedburner.com/~r/BigDataInsights/~4/tuDXX9AHf3w" height="1" width="1"/&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andy Ormsby</dc:creator><pubDate>Tue, 27 Sep 2011 00:00:00 +0100</pubDate><guid isPermaLink="false">http://www.acunu.com/blogs/andy-ormsby/news-strata/</guid><feedburner:origLink>http://www.acunu.com/blogs/andy-ormsby/news-strata/</feedburner:origLink></item></channel></rss>

