<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Big Data Craft</title>
	
	<link>http://bigdatacraft.com</link>
	<description>Passionate about data...</description>
	<lastBuildDate>Thu, 06 Dec 2012 12:10:56 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/InsightCrew" /><feedburner:info uri="insightcrew" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>InsightCrew</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>Network virtualization for the Cloud: Open vSwitch study</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/cA9M2Unvi-0/441</link>
		<comments>http://bigdatacraft.com/archives/441#comments</comments>
		<pubDate>Tue, 02 Oct 2012 18:14:48 +0000</pubDate>
		<dc:creator>Constantine Peresypkin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=441</guid>
		<description><![CDATA[<p>In the face of the current reality of ten-thousand-node data centers and all the BigData jazz, it seems the network guys were slightly forgotten. We have enough hardware virtualization solutions, but until now the network was left on the outskirts of the cloud hype. Let&#8217;s see what we can use right now and <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/441">Network virtualization for the Cloud: Open vSwitch study</a></span>]]></description>
				<content:encoded><![CDATA[<p>In the face of the current reality of ten-thousand-node data centers and all the BigData jazz, it seems the network guys were slightly forgotten. We have enough hardware virtualization solutions, but until now the network was left on the outskirts of the cloud hype. Let&#8217;s see what we can use right now and whether it will get better in the future.</p>
<p>When people talk about network virtualization nowadays, one name immediately springs to mind: Nicira. They invented <a href="http://www.openflow.org/wp/learnmore/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.openflow.org/wp/learnmore/?referer=');">OpenFlow</a> and <a href="http://openvswitch.org/" target="_blank" onclick="pageTracker._trackPageview('/outgoing/openvswitch.org/?referer=');">Open vSwitch</a> (OVS) and&#8230; were acquired by VMware.</p>
<p>Why Nicira? They essentially designed the current state of network virtualization. OpenFlow is implemented in physical hardware and OVS is used by a lot of people to drive the software network stack in virtualized environments.</p>
<p>But is it any good? Let&#8217;s see. If you open the OpenFlow specification, the idea looks simple: cut the hardware&#8217;s involvement off at the Ethernet level and implement all other features in software. We essentially write a program (a handler) that matches some fields in a packet and acts according to simple rules: forward to a port, drop, or pass to another handler. But then, how do you install these handlers inside the switch? The solution is also not that complicated: you write another, more complex piece of software that runs on something generic (like a PC). This controller chooses handlers for particular flows by issuing commands to the switch. When the switch encounters something it has no handler for, it passes the packet to the controller, which either chooses a new handler for the switch or processes the packet itself.</p>
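<p>To make the handler/controller split concrete, here is a minimal sketch in Python. It is not the OpenFlow wire protocol or any real controller API; the <code>Rule</code>, <code>Switch</code> and <code>simple_controller</code> names, the packet fields and the port numbers are all made up for illustration.</p>

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical names throughout: a toy model of the control loop,
# not the OpenFlow protocol or any real switch/controller API.


@dataclass
class Rule:
    match: dict   # header fields that must be equal, e.g. {"dst": "10.0.0.2"}
    action: str   # "forward:PORT" or "drop"


@dataclass
class Switch:
    controller: Callable                       # called on a table miss
    table: list = field(default_factory=list)  # installed handlers

    def install(self, rule: Rule) -> None:
        """Add a handler chosen by the controller."""
        self.table.append(rule)

    def process(self, packet: dict) -> str:
        # Fast path: the first rule whose fields all match decides the action.
        for rule in self.table:
            if all(packet.get(k) == v for k, v in rule.match.items()):
                return rule.action
        # Miss: pass the packet to the controller, which may install a rule.
        rule = self.controller(packet)
        if rule is None:
            return "handled-by-controller"
        self.install(rule)
        return rule.action


def simple_controller(packet: dict) -> Optional[Rule]:
    """Toy policy: drop telnet, forward known destinations to a port."""
    if packet.get("tcp_dst") == 23:
        return Rule(match={"tcp_dst": 23}, action="drop")
    ports = {"10.0.0.1": 1, "10.0.0.2": 2}
    port = ports.get(packet.get("dst"))
    if port is None:
        return None  # the controller deals with this packet itself
    return Rule(match={"dst": packet["dst"]}, action=f"forward:{port}")


switch = Switch(controller=simple_controller)
first = switch.process({"dst": "10.0.0.2", "tcp_dst": 80})    # miss, rule installed
second = switch.process({"dst": "10.0.0.2", "tcp_dst": 443})  # hit in the table
```

<p>The first packet misses the empty table and goes to the controller, which installs a rule; the second packet for the same destination is handled by the switch alone.</p>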
<p><a href="http://bigdatacraft.com/wp-content/uploads/2012/10/switch.png"><img class="alignnone size-full wp-image-442" style="border: 0px none;" title="Open Flow switch" src="http://bigdatacraft.com/wp-content/uploads/2012/10/switch.png" alt="Open Flow switch" width="670" height="319" /></a></p>
<p>What do we see here? There is an execution platform inside the hardware for running the network handlers, and a controller that chooses the handler for each state. It looks very promising and flexible, and it can be implemented not only in hardware but also purely in software. The same guys did exactly that in OVS, so shall we peek inside? Yes, I answered myself, and downloaded the OVS source.</p>
<p>When I looked inside the code I was a little&#8230; how should I put it&#8230; surprised? The OVS code has everything but the kitchen sink inside: a serializable hash table, a JSON parser, VLAN tagging, several QoS implementations, an STP implementation, a Unix socket server, an RPC-over-SSL server and, the icing on the cake, their own database with a key/value + columnar storage engine. And everything is implemented from scratch (or so it seems).</p>
<p>Ok, they have enough shock value already, but how does this thing work? It turns out that the operation is not that different from the hardware I&#8217;ve described above. It just has a kernel module instead of actual hardware and the flow handlers are just some functions inside the module. It looks like this.</p>
<p><a href="http://bigdatacraft.com/wp-content/uploads/2012/10/vswitch1.png"><img class="alignnone size-full wp-image-445" title="Open vSwitch" src="http://bigdatacraft.com/wp-content/uploads/2012/10/vswitch1.png" alt="Open vSwitch" width="491" height="308" /></a></p>
<p>The daemon uses netlink to talk to the kernel module and install the handlers, the database stores the configuration, and the controller talks to the daemon via OpenFlow or plain JSON.</p>
<p>So, we&#8217;ve got a software stack for the network. Why is it good for virtualization?</p>
<p>Short answer: because everybody uses Linux. And when your hypervisor runs on Linux, why not use some of its capabilities for a nice network boost? But why OpenFlow/OVS?</p>
<p>The OVS docs describe it like this:</p>
<blockquote><p>Open vSwitch is targeted at multi-server virtualization deployments, a landscape for which the previous stack is not well suited. These environments are often characterized by highly dynamic end-points, the maintenance of logical abstractions, and (sometimes) integration with or offloading to special purpose switching hardware.</p></blockquote>
<p>But Linux has always had a good network stack, easily manageable and extensible. What are the advantages? There are some.</p>
<ul>
<li>OVS can migrate the network state together with the bridge port (one connected to a VM instance, for example). You will lose some packets, but the connections may stay intact.</li>
<li>OVS can use metadata to tag frames (VM instance UUID, for example).</li>
<li>OVS can install event triggers inside the daemon to notify the controller of a state change (a VM reboot, for example).</li>
</ul>
<p>Does it justify a new kernel module + new DB engine + new RPC protocols? Maybe not. But you can&#8217;t get these capabilities from any other open source software anyway.</p>
<p>The only question I have right now is why Nicira did not implement <a href="http://tools.ietf.org/html/draft-davie-stt-00" target="_blank" onclick="pageTracker._trackPageview('/outgoing/tools.ietf.org/html/draft-davie-stt-00?referer=');">STT</a> in OVS, yet certainly did so in its proprietary NVP software.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/441/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/441</feedburner:origLink></item>
		<item>
		<title>What does BigData mean?</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/Glm9r33W3Xs/426</link>
		<comments>http://bigdatacraft.com/archives/426#comments</comments>
		<pubDate>Mon, 24 Sep 2012 07:44:39 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=426</guid>
		<description><![CDATA[<p>The full deck is available at http://cci.uncc.edu/sites/cci.uncc.edu/files/media/pdf_files/Stonebreaker-charlotte.pdf</p> ]]></description>
				<content:encoded><![CDATA[<p><iframe style="border: 1px solid #CCC; border-width: 1px 1px 0; margin-bottom: 5px;" src="http://www.slideshare.net/slideshow/embed_code/11690496?rel=0" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" width="427" height="356"></iframe></p>
<p>The full deck is available at <span id="more-426"></span> <a href="http://cci.uncc.edu/sites/cci.uncc.edu/files/media/pdf_files/Stonebreaker-charlotte.pdf" onclick="pageTracker._trackPageview('/outgoing/cci.uncc.edu/sites/cci.uncc.edu/files/media/pdf_files/Stonebreaker-charlotte.pdf?referer=');">http://cci.uncc.edu/sites/cci.uncc.edu/files/media/pdf_files/Stonebreaker-charlotte.pdf</a></p>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/426/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/426</feedburner:origLink></item>
		<item>
		<title>Apache Drill Design Meeting</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/3_n2RhU-lSI/415</link>
		<comments>http://bigdatacraft.com/archives/415#comments</comments>
		<pubDate>Sat, 15 Sep 2012 03:18:36 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Apache Drill]]></category>
		<category><![CDATA[Dremel]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=415</guid>
		<description><![CDATA[<p>The MapR folks invited me to participate in the Apache Drill design meeting. The Meetup site indicates that 60 people participated, which sounds about right.</p> <p>Tomer Shiran started the meeting with an overview of the Apache Drill project. Then I (Camuel here) presented our team&#8217;s view of the Apache Drill architecture. Jason Frantz of MapR continued, touching <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/415">Apache Drill Design Meeting</a></span>]]></description>
				<content:encoded><![CDATA[<p>The MapR folks invited me to participate in the <a href="http://www.meetup.com/Bay-Area-Apache-Drill-User-Group/photos/10701162/158594882/#158655982" onclick="pageTracker._trackPageview('/outgoing/www.meetup.com/Bay-Area-Apache-Drill-User-Group/photos/10701162/158594882/_158655982?referer=');">Apache Drill design meeting</a>. The Meetup site indicates that 60 people participated, which sounds about right.</p>
<p>Tomer Shiran started the meeting with an overview of the Apache Drill project. Then I (Camuel here) presented <a href="http://www.slideshare.net/CamuelGilyadov/apache-drill-14224727" onclick="pageTracker._trackPageview('/outgoing/www.slideshare.net/CamuelGilyadov/apache-drill-14224727?referer=');">our team&#8217;s view of the Apache Drill architecture</a>. Jason Frantz of MapR <a href="http://www.slideshare.net/jasonfrantz/drill-architecture-20120913" onclick="pageTracker._trackPageview('/outgoing/www.slideshare.net/jasonfrantz/drill-architecture-20120913?referer=');">continued</a>, touching on technical aspects in a follow-on discussion. After a pizza break, Julian Hyde presented his <a href="http://www.slideshare.net/julianhyde/optiq-drill2012" onclick="pageTracker._trackPageview('/outgoing/www.slideshare.net/julianhyde/optiq-drill2012?referer=');">view</a> on logical/physical query plan separation and suggested using the <a href="http://www.hydromatic.net/optiq/" onclick="pageTracker._trackPageview('/outgoing/www.hydromatic.net/optiq/?referer=');">Optiq framework</a> for the DrQL optimizer.</p>
<p>Overall, my takeaways are as follows:</p>
<ol>
<li>There is a very healthy interest in interactive querying for BigData.</li>
<li>There was not a single voice calling for vanilla Hadoop to be adapted for this task.</li>
<li>There is a general consensus on a plurality of query languages and a plurality of data formats.</li>
<li>There is a general consensus that the user should always be free to supply a manually authored physical query plan for execution, bypassing the optimizer altogether, as opposed to hardcore hinting.</li>
<li>Except for me, no one tried to challenge the &#8220;common logical query model&#8221; concept. Since Dremel has no real joins, no indexes and only one data source with exactly one possibility &#8211; a single full table scan, I cannot see the justification for the complexity of optimizers and a logical query model. Dremel is an antidote to all this.</li>
</ol>
<p>Thank you, MapR, for the Drill initiative, the great design meeting and the invitation.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/415/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/415</feedburner:origLink></item>
		<item>
		<title>Hadoop on OpenStack Swift: experiments</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/8oYifkiPSW0/406</link>
		<comments>http://bigdatacraft.com/archives/406#comments</comments>
		<pubDate>Sat, 08 Sep 2012 18:52:10 +0000</pubDate>
		<dc:creator>Constantine Peresypkin</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[OpenStack]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=406</guid>
		<description><![CDATA[<p>Some time has passed since our initial post on the Hadoop over OpenStack Swift implementation. A couple of things have changed (Rackspace finally implemented range requests in their CloudFiles library), while others have remained the same (still no built-in support for Hadoop in OpenStack / CloudFiles).</p> <p>We got a lot of feedback and questions regarding the integration <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/406">Hadoop on OpenStack Swift: experiments</a></span>]]></description>
				<content:encoded><![CDATA[<p>Some time has passed since our initial <a href="http://bigdatacraft.com/archives/349">post</a> on the Hadoop over OpenStack Swift implementation. A couple of things have changed (Rackspace finally implemented range requests in their CloudFiles library), while others have remained the same (still no built-in support for Hadoop in OpenStack / CloudFiles).</p>
<p>We got a lot of feedback and questions regarding the integration but did not always have the time or patience to address them properly; sorry for that. But one of our readers, Zheng Xu, did a great job of putting together a <a href="http://www.slideshare.net/xz911jp/2012-0908josugjeff" onclick="pageTracker._trackPageview('/outgoing/www.slideshare.net/xz911jp/2012-0908josugjeff?referer=');">slide deck</a> on the exact procedure.</p>
<p>But there are still some points I need to address regarding the procedure he assembled there. It mostly boils down to CloudFiles: although the current CloudFiles implementation has HTTP range support, our Hadoop code relies on our own range implementation. Therefore I really encourage either using <a href="https://github.com/zerovm/java-cloudfiles" onclick="pageTracker._trackPageview('/outgoing/github.com/zerovm/java-cloudfiles?referer=');">our CloudFiles distribution</a> (with patches) or changing <a href="https://github.com/zerovm/hadoop-common" onclick="pageTracker._trackPageview('/outgoing/github.com/zerovm/hadoop-common?referer=');">our Hadoop code</a> to use the <a href="https://github.com/rackspace/java-cloudfiles" onclick="pageTracker._trackPageview('/outgoing/github.com/rackspace/java-cloudfiles?referer=');">new Rackspace one</a>. Although simple filesystem tasks will work as expected, any MapReduce job that works with big files will fail without correct HTTP range support.</p>
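<p>Why big files in particular? MapReduce cuts a large input into splits, and each task starts reading at its own offset instead of at byte zero, which over HTTP means a ranged GET. A rough sketch of that mapping in Python, assuming fixed-size splits (the <code>split_ranges</code> helper and the sizes are hypothetical, and real Hadoop split computation is more involved):</p>

```python
def split_ranges(file_size: int, split_size: int) -> list:
    """Return the HTTP Range header value each fixed-size split would need."""
    if file_size < 0 or split_size <= 0:
        raise ValueError("file_size must be >= 0 and split_size must be > 0")
    ranges = []
    for start in range(0, file_size, split_size):
        # HTTP byte ranges are inclusive on both ends.
        end = min(start + split_size, file_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


# A 250-byte object cut into 100-byte splits gives three ranged requests.
headers = [{"Range": value} for value in split_ranges(250, 100)]
```

<p>If the client library ignores the <code>Range</code> header, every task gets the object from byte zero, which is harmless for a small file that fits in one split and wrong for everything else.</p>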
<p>I want to thank Zheng Xu for the effort and congratulate him on the success of his small experiment.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/406/feed</wfw:commentRss>
		<slash:comments>12</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/406</feedburner:origLink></item>
		<item>
		<title>Apache Drill Progress</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/4wx0av9gPWA/398</link>
		<comments>http://bigdatacraft.com/archives/398#comments</comments>
		<pubDate>Tue, 04 Sep 2012 14:57:16 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Dremel]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=398</guid>
		<description><![CDATA[<p>We are continuing our efforts to contribute our OpenDremel code to the Apache Drill project and look forward to being active in it right after that.</p> <p>Right now the efforts are being put into our ANTLR-based parser; we want to make it work with the new grammar of the BigQuery language. That should be done within <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/398">Apache Drill Progress</a></span>]]></description>
				<content:encoded><![CDATA[<p>We are continuing our efforts to contribute our OpenDremel code to the Apache Drill project and look forward to being active in it right after that.</p>
<p>Right now the efforts are being put into our ANTLR-based parser; we want to make it work with the new grammar of the BigQuery language. That should be done within a few days, and the parser will be committed to the new Drill repository as the first phase of the OpenDremel-Drill merge.</p>
<p>Next, we plan to refactor and contribute the Semantic Analyzer, which processes the output of the parser into an intermediate form, resolving references and rewriting (flattening) the query into a single full table scan operation. That is expected within a week or two, depending on when the Drill architecture doc is published. We still don&#8217;t know what the schema language/format will be. Will it be Protobuf? Avro? OpenDremel supports Avro right now and has initial support for Protobuf.</p>
<p>The final phase of the OpenDremel &#8211; Drill merge will be the contribution of the code generator, which is based on Apache Velocity templates. We have two sets of templates for now: one is Java-based and executed with the Janino executor, and the second uses C/asm and is executed with the ZeroVM executor.</p>
<p>Everyone who wishes to help is welcome. The OpenDremel code resides in its usual Google Code repo &#8211; http://code.google.com/p/dremel/. Be sure to locate and use the repo combo box in the upper part of the page.</p>
<p>We will probably use the https://github.com/ApacheDrill repo as a staging area, or the Apache git repo directly; it all depends on what is proposed by Ted Dunning &#8211; the Apache Drill Champion.</p>
<p>We also continue to work on our generic execution backend built on top of OpenStack Swift and integrated with ZeroVM. We are contributing to both projects here.</p>
<p>We look ahead to an Apache Drill with pluggable frontends and pluggable backends, so that it would be able to run on top of a toy single-JVM Janino backend, or under YARN management on HDFS with a Janino or ZeroVM backend, or even on a Zwift backend (that&#8217;s how we codenamed the OpenStack Swift + ZeroVM combo).</p>
<p>On the other hand, the frontends will be pluggable too, so in the future support for new languages such as Apache Pig or Apache Hive can be added easily. Another option would be to create a single frontend with pluggable language handlers, which would allow us to embed functionality from other projects such as Apache Mahout or R.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/398/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/398</feedburner:origLink></item>
		<item>
		<title>Apache Drill</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/2ycS3cJLFO0/374</link>
		<comments>http://bigdatacraft.com/archives/374#comments</comments>
		<pubDate>Sat, 25 Aug 2012 23:41:43 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Dremel]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=374</guid>
		<description><![CDATA[<p>We are no longer alone in implementing Google Dremel and BigQuery technology. A proposal was recently made to the Apache Foundation suggesting a similar project. Moreover, Ted Dunning kindly invited us to take part in the project.</p> <p>The project is just starting now, and there is no source code yet and not even a consensus design. So <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/374">Apache Drill</a></span>]]></description>
				<content:encoded><![CDATA[<p>We are no longer alone in implementing Google Dremel and BigQuery technology. A <a href="http://wiki.apache.org/incubator/DrillProposal" onclick="pageTracker._trackPageview('/outgoing/wiki.apache.org/incubator/DrillProposal?referer=');">proposal</a> was recently made to the Apache Foundation suggesting a similar project. Moreover, <a href="http://twitter.com/ted_dunning" onclick="pageTracker._trackPageview('/outgoing/twitter.com/ted_dunning?referer=');">Ted Dunning</a> kindly <a href="https://groups.google.com/forum/?fromgroups=#!topic/opendremel/AFwz__vkUIs" onclick="pageTracker._trackPageview('/outgoing/groups.google.com/forum/?fromgroups=_topic/opendremel/AFwz_vkUIs&amp;referer=');">invited</a> us to take part in the project.</p>
<p>The project is just starting now, and there is no source code yet and not even a consensus design. So we sat together this evening and wrote a proposed design for Apache Drill. We have already been working on a Dremel and BigQuery implementation for about two years. It has been a fascinating journey; we have learned quite a lot and would be more than happy to share our experience and accumulated knowledge.</p>
<p>All our code (OpenDremel/Dazo/ZeroVM) has been under the Apache License from the beginning and uses several Apache technologies, from Avro to Velocity. Apache seems to be the best home for the Drill project, and we are looking forward to contributing to it.</p>
<p><iframe src="http://www.slideshare.net/slideshow/embed_code/14071739" width="597" height="486" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen> </iframe>
<div style="margin-bottom:5px"> <strong> <a href="http://www.slideshare.net/CamuelGilyadov/apache-drill-14071739" title="Apache Drill" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.slideshare.net/CamuelGilyadov/apache-drill-14071739?referer=');">Apache Drill</a> </strong> from <strong><a href="http://www.slideshare.net/CamuelGilyadov" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.slideshare.net/CamuelGilyadov?referer=');">Camuel Gilyadov</a></strong> </div>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/374/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/374</feedburner:origLink></item>
		<item>
		<title>Start-Up Chile</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/v-3QMYeNLRU/354</link>
		<comments>http://bigdatacraft.com/archives/354#comments</comments>
		<pubDate>Sat, 07 Jul 2012 19:45:38 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=354</guid>
		<description><![CDATA[<p>I&#8217;ve been frequently asked about my experiences in the Start-Up Chile program. Having participated in the program for the past half year, I can say that it has been an interesting and fulfilling experience.</p> <p>On top of the provided seed capital you get a supporting framework of mentors and fellow startupists. You can literally &#8220;feel&#8221; the surrounding entrepreneurial spirit. And <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/354">Start-Up Chile</a></span>]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve been frequently asked about my experiences in the Start-Up Chile program. Having participated in the program for the past half year, I can say that it has been an interesting and fulfilling experience.</p>
<p>On top of the provided seed capital you get a supporting framework of mentors and fellow startupists. You can literally &#8220;feel&#8221; the surrounding entrepreneurial spirit. And although I was unlucky in finding peer support for my infrastructure BigData@Cloud idea (most folks were doing consumer-web kinds of startups), I did find the framework highly encouraging.</p>
<p>The provided capital is equity-free, which is especially nice and makes negotiating the next financing round easier. Getting the money is a paperwork-intensive process, but the staff are friendly and helpful.</p>
<p>I found Chileans hospitable and friendly to foreigners. Yet minimal Spanish seems to be mandatory. I found myself speaking Spanish after a few months in Santiago, which was not planned initially.</p>
<p>Santiago is a nice, modern city surrounded by mountains, and pretty safe, I would say. I cannot count how many times locals warned me about how unsafe Santiago really is, but apart from the permanently ongoing strikes/riots in the central part of the city, I never experienced, witnessed or heard about any incident. And I usually work deep into the night and walk extensively before retiring to bed. I lived in Centro but especially enjoyed walking in the north-western part of the town. The underground is quite an efficient way to get around, if a little hot at midday in February, as I remember. I was almost fully consumed by my startup, so I haven&#8217;t had enough time to tour the rest of the country, and I know even Santiago only from walking around guided by the GPS in my Nokia. I really should rent a car one weekend and get out for a couple of days&#8230; In fact, I did spend one weekend in Vi&#241;a del Mar / Valparaiso and found it quite a nice and relaxing place.</p>
<p>The local entrepreneurship and geek community is also thriving, and that is not even counting the very visible Start-Up Chile folks. Go to meetup.com, choose your favorite topic or technology, and I bet you will find a packed Santiago interest group there.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/354/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/354</feedburner:origLink></item>
		<item>
		<title>Apache Hadoop over OpenStack Swift</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/wjJU9owhRAM/349</link>
		<comments>http://bigdatacraft.com/archives/349#comments</comments>
		<pubDate>Thu, 01 Mar 2012 09:59:57 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=349</guid>
		<description><![CDATA[This is a post by Constantine Peresypkin and David Gruzman. Lately we have been working on integrating Hadoop with OpenStack Swift. Hadoop doesn&#8217;t need an introduction, and neither does OpenStack. Swift is an object-storage system and the technology behind RackSpace CloudFiles (and quite a few others, like Korea Telecom object storage, Internap, etc.). Before we go <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/349">Apache Hadoop over OpenStack Swift</a></span>]]></description>
				<content:encoded><![CDATA[<div>This is a post by Constantine Peresypkin and David Gruzman.</div>
<div></div>
<div>Lately we have been working on integrating Hadoop with OpenStack Swift. Hadoop doesn&#8217;t need an introduction, and neither does OpenStack. Swift is an object-storage system and the technology behind RackSpace CloudFiles (and quite a few others, like Korea Telecom object storage, Internap, etc.).</div>
<div>
<div></div>
<div>Before we go into the details of the Hadoop-Swift integration, let&#8217;s get some relevant background:</div>
<div></div>
<div>
<ol>
<li><span style="color: #222222; font-family: arial, sans-serif;">Hadoop already has integration with Amazon S3 and is widely used to crunch S3-stored data. </span><a href="http://wiki.apache.org/hadoop/AmazonS3" onclick="pageTracker._trackPageview('/outgoing/wiki.apache.org/hadoop/AmazonS3?referer=');">http://wiki.apache.org/hadoop/AmazonS3</a></li>
<li>The NameNode is a known SPOF in Hadoop. It would be nice if it could be avoided.</li>
<li>The current S3 integration stages all data as temporary files on local disk before uploading it to S3. That&#8217;s because S3 needs to know the content length in advance; it is one of the required headers.</li>
<li>The current S3 integration also suffers from a 5 GB maximum file size limit, which is slightly annoying.</li>
<li>Hadoop requires seek support, which means that HTTP range support is required if it is run over an object store. S3 supports it.</li>
<li>Append support is optional for Hadoop, but it&#8217;s required for HBase. S3 doesn&#8217;t have any append support, so HBase cannot run over the native S3 integration.</li>
<li><span style="color: #222222; font-family: arial, sans-serif;">While OpenStack Swift is compatible with S3, RackSpace CloudFiles is not. This is because RackSpace CloudFiles disables the S3-compatibility layer in Swift, which prevents existing Swift users from integrating with Hadoop.</span></li>
<li>The only information available on the Internet about Hadoop-Swift integration is that it should work using Apache Whirr. But to the best of our knowledge, that is relevant only to rolling out a Block FileSystem on top of Swift, not a Native FileSystem. In other words, we haven&#8217;t found any solution for processing data that is already stored in RackSpace CloudFiles without costly re-importing.</li>
</ol>
</div>
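<div>Point 5 is worth a small illustration. A seek on an object store is just bookkeeping; the actual work happens on the next read, which asks for a byte range. A minimal Python sketch, assuming a <code>fetch_range(start, end)</code> callable that stands in for an HTTP GET with a <code>Range: bytes=start-end</code> header (<code>RangeReader</code> and <code>fetch_range</code> are hypothetical names, not part of Hadoop or CloudFiles):</div>

```python
import io


class RangeReader(io.RawIOBase):
    """Read-only, seekable view of a remote object, fetched by byte range."""

    def __init__(self, fetch_range, size):
        # fetch_range(start, end) must return bytes start..end inclusive.
        # A real client would issue a GET with "Range: bytes=start-end".
        self._fetch = fetch_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        # Seeking only moves the cursor; no request is sent.
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def readinto(self, buf):
        if self._pos >= self._size:
            return 0  # end of object
        end = min(self._pos + len(buf), self._size) - 1
        data = self._fetch(self._pos, end)
        buf[:len(data)] = data
        self._pos += len(data)
        return len(data)


# An in-memory blob stands in for a stored object.
blob = bytes(range(256))
calls = []


def fetch_range(start, end):
    calls.append((start, end))
    return blob[start:end + 1]


reader = RangeReader(fetch_range, len(blob))
reader.seek(100)
chunk = reader.read(4)  # triggers one ranged fetch for bytes 100..103
```

<div>Without range support the only way to honor the seek would be to download the object from the start and throw away everything before the requested offset.</div>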
<div></div>
<div><span style="color: #222222; font-family: arial, sans-serif;">So, armed with the above information, let&#8217;s examine what we&#8217;ve got here:</span></div>
<div></div>
<div>
<ol>
<li><span style="color: #222222; font-family: arial, sans-serif;">In general, we instrumented Hadoop to run over Swift natively, without resorting to the S3-compatibility layer. This means it works with CloudFiles, which lacks the S3-compatibility layer.</span></li>
<li>The CloudFiles client SDK doesn&#8217;t have support for HTTP range functionality. We hacked it to allow using HTTP range; this is a must for Hadoop to work.</li>
<li>We removed the need for a NameNode, in the same way it is removed in the S3 integration for Amazon.</li>
<li>As opposed to the S3 implementation, we avoided staging files on local disk on the way to and from CloudFiles/Swift. In other words, data is streamed directly between compute node RAM and CloudFiles/Swift.</li>
<li>The data is still processed remotely, though: extensive data shipping takes place between compute nodes and CloudFiles/Swift. As frequent readers of this blog know, we are working on technology that will allow running code snippets directly in Swift. Look here for more details: http://www.zerovm.com. As the next step, we plan to perform predicate-pushdown optimization to process most of the data completely locally inside a ZeroVM-enabled object-storage system.</li>
<li>Support for native Swift large objects is also planned (something that&#8217;s absent in Amazon S3).</li>
<li>We are also working on append support for Swift (this could easily be done through Swift large object support, which uses versioning), so even HBase will work on top of Swift, which is not the case with S3 now.</li>
<li>As is the case with Hadoop on S3, storing BigData in native format on Swift provides options for multi-site replication and CDN.</li>
</ol>
</div>
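<p>The HTTP range requirement mentioned in the list above can be sketched in a few lines. This is only a minimal illustration, not the actual patch to the CloudFiles SDK: the helper names <code>range_header</code> and <code>parse_content_range</code> are hypothetical, and the only thing assumed about the object store is that it honors standard HTTP byte ranges, in which both ends of the range are inclusive.</p>

```python
def range_header(offset, length):
    """Build the HTTP Range header for reading `length` bytes starting at `offset`.

    Hypothetical helper for illustration; not part of any SDK.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends: bytes=first-last
    last = offset + length - 1
    return {"Range": "bytes=%d-%d" % (offset, last)}


def parse_content_range(value):
    """Parse a Content-Range value such as 'bytes 1024-5119/1000000'.

    Returns (first, last, total); total is None when the server sends '*'.
    """
    unit, _, rest = value.partition(" ")
    if unit != "bytes":
        raise ValueError("unsupported range unit: %r" % unit)
    span, _, total = rest.partition("/")
    first, _, last = span.partition("-")
    return int(first), int(last), (None if total == "*" else int(total))


# A seek to byte 1024 followed by a 4096-byte read maps to one ranged GET.
print(range_header(1024, 4096))
print(parse_content_range("bytes 1024-5119/1000000"))
```

<p>Under these assumptions, a seekable input stream only has to remember its current offset and issue one ranged GET per read, which is why range support is enough to satisfy Hadoop&#8217;s seek requirement.</p>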
</div>
<div>I also added a question to Quora on this issue: <a href="http://www.quora.com/Is-it-possible-to-run-Hadoop-to-directly-process-data-stored-in-RackSpace-CloudFiles-Swift" onclick="pageTracker._trackPageview('/outgoing/www.quora.com/Is-it-possible-to-run-Hadoop-to-directly-process-data-stored-in-RackSpace-CloudFiles-Swift?referer=');">http://www.quora.com/Is-it-possible-to-run-Hadoop-to-directly-process-data-stored-in-RackSpace-CloudFiles-Swift</a></div>
<img src="http://feeds.feedburner.com/~r/InsightCrew/~4/wjJU9owhRAM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/349/feed</wfw:commentRss>
		<slash:comments>39</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/349</feedburner:origLink></item>
		<item>
		<title>Futility of “tooling” a proprietary cloud.</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/ymsGLbyiUrA/340</link>
		<comments>http://bigdatacraft.com/archives/340#comments</comments>
		<pubDate>Sun, 12 Feb 2012 02:44:14 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=340</guid>
		<description><![CDATA[<p>I&#8217;ve been pitched by a lot of entrepreneurs trying to make better-than-original &#8220;tooling&#8221; for a proprietary cloud, particularly for AWS. Ain&#8217;t the attempt futile from the beginning? Amazon is smart, innovative and working hard to make its cloud offering comprehensive, and it has a much larger arsenal to outdo anyone who dares to compete on its own turf. <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/340">Futility of &#8220;tooling&#8221; a proprietary cloud.</a></span>]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve been pitched by a lot of entrepreneurs trying to make better-than-original &#8220;tooling&#8221; for a proprietary cloud, particularly for AWS. Ain&#8217;t the attempt futile from the beginning? Amazon is smart, innovative and working hard to make its cloud offering <em>comprehensive</em>, and it has a much larger arsenal to outdo anyone who dares to compete on its own turf. That is their party; the invitation cannot be taken for granted.</p>
<p>Let&#8217;s take NoSQL data stores and DBMS vendors as examples. There are VC-backed companies out there that are exclusively focused on outdoing Amazon at running MySQL/NoSQL on Amazon&#8217;s own cloud (Xeround comes to mind), and many others are also hoping their product will catch fire on EC2.</p>
<p>Well, if single branding and plain convenience are not enough, how about these two exclusive and &#8220;unfair&#8221; competitive advantages in Amazon&#8217;s arsenal:</p>
<ul>
<li>[My unverified assumption is] that DynamoDB has storage integrated into its fabric, whereas all the rest must use slower EBS.</li>
<li>Not only is it integrated, but, <a href="http://aws.amazon.com/dynamodb/" onclick="pageTracker._trackPageview('/outgoing/aws.amazon.com/dynamodb/?referer=');">as announced by Amazon</a>, it uses SSD-backed storage. SSD-backed storage is not available, as of today, to DynamoDB competitors running on AWS, so competitors must continue to use ordinary EBS. That is in fact a double kick: first, the mere fact of using different hardware for a competitive advantage, and second, the announcement itself as a catalyst to trigger migration.</li>
</ul>
<p>So a future EMR may also have integrated storage as well as other hardware optimizations, making Hadoop more efficient on AWS, and good if so. The same goes for RDS and other current and future PaaS-related services.</p>
<p>Do I accuse Amazon of wrongdoing? Of course not! They brought the cloud to Main Street while others were only talking about it. They made large-scale computing affordable to all, continue dropping prices by passing their economies-of-scale savings on to customers, keep optimizing and enhancing their infrastructure constantly, and have been good to their shareholders too. However, like any proprietary and monopolistic platform, they do hinder some outside-of-Amazon innovation. No matter how good they are, we don&#8217;t want only one company in the world doing cloud-infrastructure stuff for the rest. That&#8217;s why OpenStack is so extremely important for the industry. If OpenStack is widely adopted, then infrastructural and &#8220;tooling&#8221; kinds of innovation could go directly into OpenStack, for the greater good and a fairer monetization model for the author.</p>
<img src="http://feeds.feedburner.com/~r/InsightCrew/~4/ymsGLbyiUrA" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/340/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/340</feedburner:origLink></item>
		<item>
		<title>OpenDremel update and Dremel vs. Tenzing</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/X6yOX3PbUd0/327</link>
		<comments>http://bigdatacraft.com/archives/327#comments</comments>
		<pubDate>Wed, 08 Feb 2012 03:34:41 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Dremel]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Sawzall]]></category>
		<category><![CDATA[Tenzing]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=327</guid>
		<description><![CDATA[<p>I haven&#8217;t blogged for the whole of 2011&#8230; I&#8217;m not dead; quite the contrary, we were pretty active with the OpenDremel project in 2011. First, we are renaming it to Dazo to avoid using a trademarked name, and second, we did a good job implementing a secure generic execution engine and integrating it into OpenStack Swift. It also <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/327">OpenDremel update and Dremel vs. Tenzing</a></span>]]></description>
				<content:encoded><![CDATA[<p>I haven&#8217;t blogged for the whole of 2011&#8230; I&#8217;m not dead; quite the contrary, we were pretty active with the OpenDremel project in 2011. First, we are renaming it to <a href="https://github.com/Dazo-org" onclick="pageTracker._trackPageview('/outgoing/github.com/Dazo-org?referer=');">Dazo</a> to avoid using a trademarked name, and second, we did a good job implementing a secure <a href="http://zerovm.org" onclick="pageTracker._trackPageview('/outgoing/zerovm.org?referer=');">generic execution engine</a> and <a href="https://github.com/Dazo-org/swift" onclick="pageTracker._trackPageview('/outgoing/github.com/Dazo-org/swift?referer=');">integrating it</a> into OpenStack Swift. It also turned out that the engine is actually quite a useful virtualization technology in itself, and it could potentially deserve a better fate than being buried as an OpenDremel subcomponent. So we do plan to release it as an <a href="http://zerovm.org" onclick="pageTracker._trackPageview('/outgoing/zerovm.org?referer=');">independent project</a> and are quite busy with that now, so the work on OpenDremel is all but stalled, unfortunately. As for storage infrastructure, we settled on OpenStack Swift; we fell in love with Swift from the day it was released, and now, after we <a href="https://github.com/Dazo-org/swift" onclick="pageTracker._trackPageview('/outgoing/github.com/Dazo-org/swift?referer=');">have integrated</a> ZeroVM into it, we like it even more. So right now we have a fully scalable storage backend with the unique capability to securely run arbitrary native code inside, close to the data. Now what&#8217;s left is to take our old Metaxa Query Compiler and integrate it with that backend, and then after many iterations it would bake into something pretty similar to Google Dremel and BigQuery. 
Even better, it will always process data locally (not sure BigQuery does that now), and it will not be limited to BQL on nested records but will handle any query on any data, with full multi-tenant semantics. That&#8217;s how interesting 2011 was&#8230;</p>
<p>That was the preamble; now back to the main feature:</p>
<p>Google released a <a href="http://research.google.com/pubs/archive/37200.pdf" onclick="pageTracker._trackPageview('/outgoing/research.google.com/pubs/archive/37200.pdf?referer=');">paper</a> on <a href="http://research.google.com/pubs/pub37200.html" onclick="pageTracker._trackPageview('/outgoing/research.google.com/pubs/pub37200.html?referer=');">Tenzing</a> last year at <a href="http://www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf" onclick="pageTracker._trackPageview('/outgoing/www.vldb.org/pvldb/vol4/p1318-chattopadhyay.pdf?referer=');">VLDB</a>. Tenzing is an SQL query system implemented on top of the MapReduce infrastructure; it can be thought of as the Google way to do <a href="en.wikipedia.org/wiki/Apache_Hive">Hive</a> and, as always, it is full of juicy details. There is already a <a href="http://sandeeptata.blogspot.com/2011/10/tenzing-sql-on-mapreduce.html" onclick="pageTracker._trackPageview('/outgoing/sandeeptata.blogspot.com/2011/10/tenzing-sql-on-mapreduce.html?referer=');">quality post</a> published on this and another one <a href="http://nosql.mypopescu.com/post/17118426860/paper-tenzing-a-sql-implementation-on-the-mapreduce" onclick="pageTracker._trackPageview('/outgoing/nosql.mypopescu.com/post/17118426860/paper-tenzing-a-sql-implementation-on-the-mapreduce?referer=');">here</a>. On top of that, my additional takeaways are:</p>
<p>1. It is possible to build an MPP-grade system on top of MapReduce with relatively low latency (10 seconds). However, it would require quite a number of patches to MapReduce. Hive and Hadoop certainly have a lot to learn from Tenzing.</p>
<p>2. Even with Google&#8217;s patched and leaner-than-Hadoop implementation of MapReduce, getting it down to Dremel latencies was not achievable. On the other hand, 10 seconds as a minimal latency is not that bad and is in the same ballpark as Netezza/Greenplum/Aster and other MPP gear.</p>
<p>3. As a general Sawzall vs. Dremel vs. Tenzing comparison, there is a nice YouTube data-warehousing <a href="http://www-conf.slac.stanford.edu/xldb2011/talks/xldb2011_tue_1120_YoutubeDataWarehouse.pdf" onclick="pageTracker._trackPageview('/outgoing/www-conf.slac.stanford.edu/xldb2011/talks/xldb2011_tue_1120_YoutubeDataWarehouse.pdf?referer=');">presentation published</a>. In fact, Dremel beats both of the others on latency, and if not for the limited expressive power of its query language it would end up as the complete winner on all metrics considered there. Sawzall, having an imperative query language, scores highest on the power metric. I guess that when OpenDremel is released it will be a unique combination of low-latency querying with the full expressive power of imperatively-augmented SQL.</p>
<p>4. Tenzing can query MySQL databases as well as many other popular data formats. What we are witnessing here is that query engines are being decoupled from storage engines. 10 years ago this was only the case for the MySQL ecosystem, and anyone who tried the Oracle external table interface knows how friendly past DBMSes were to external data sources. The Dremel columnar encoding component was released internally at Google as a separate ColumnIO storage engine. Then Google <a href="http://code.google.com/p/leveldb/" onclick="pageTracker._trackPageview('/outgoing/code.google.com/p/leveldb/?referer=');">open-sourced</a> their key-value LevelDB engine, a-la Hadoop&#8217;s <a href="http://en.wikipedia.org/wiki/RCFile" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/RCFile?referer=');">RCFiles</a>. So what we can see here is the emergence of multiple storage engines working with multiple query engines, quite an interesting phenomenon.</p>
<p>5. Queries are compiled into native code (with LLVM), and this gave a significant acceleration, by a factor of six to twelve. This means that SQL-to-native-code compilation is a must for high-performance BigData query engines.</p>
<img src="http://feeds.feedburner.com/~r/InsightCrew/~4/X6yOX3PbUd0" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/327/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/327</feedburner:origLink></item>
		<item>
		<title>Upcoming hardware renaissance era:  part #2.</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/pMj0CobddhM/300</link>
		<comments>http://bigdatacraft.com/archives/300#comments</comments>
		<pubDate>Mon, 17 Jan 2011 17:17:39 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=300</guid>
		<description><![CDATA[<p>Some examples of the upcoming hardware renaissance era:</p> <p>1. Virtually all server vendors are pitching modularized data centers by now. MDCs are boxes resembling shipping containers that accommodate a complete virtualized data center inside. With an MDC, one just connects power, network and chilled water and gets access to the cloud in the box. Most MDCs are good to <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/300">Upcoming hardware renaissance era:  part #2.</a></span>]]></description>
				<content:encoded><![CDATA[<p>Some examples of the upcoming hardware renaissance era:</p>
<p>1. Virtually all server vendors are pitching modularized data centers by now. MDCs are boxes resembling shipping containers that accommodate a complete virtualized data center inside. With an MDC, one just connects power, network and chilled water and gets access to the cloud in the box. Most MDCs can be deployed outdoors and have built-in protection against the elements. Of course, all current offerings are based on x86 commodity servers, but here is a hint: once competition moves to comparing whose shipping container can pack more storage and computing power inside and who has better price/performance and energy efficiency, we will see hardware innovation skyrocket.</p>
<p>2. On the processor front&#8230; The ARM architecture has all the ingredients to become the next Intel. If I am not mistaken, ARM processors outnumber x86 10 to 1, with tens of billions of processors shipped. 95% of cellphones and advanced gadgets use ARM. ARM&#8217;s power efficiency puts x86 to shame. However, until now ARM was focused on gadgets and dismissed the data-center market. Not anymore! With the newer Cortex-A15, ARM took aim at x86 on data-center territory. Calxeda has already got ~$50M in venture money for commercializing ARM in the data center. However, ARM is not alone here: Tilera, with its server-vendor partner Quanta, is already shipping a 512-core server in a 2U form factor. Tilera took a lean MIPS processor core and put some 100 of them onto a single die, together with eight 10Gbit Ethernet channels and four DDR3 memory channels. Nvidia has also claimed that it is not a GPU-only vendor anymore and is readying general-purpose processors based on the ARM architecture with an ample amount of GPU muscle inside. That said, Intel and AMD are also far from stagnating and are moving into heterogeneous many-core designs. I think we have never witnessed more innovation in the processor space than now.</p>
<p>3. On the memory front&#8230; Flash is making inroads to claim a place in the memory hierarchy between DRAM and HDD, disrupting both the DRAM market and the high-performance 15K RPM HDD market. I think 15K RPM HDDs and DRAM-based SSD products can already safely be declared dead. The same goes for HDDs smaller than the 2.5-inch form factor. I even think 2.5-inch HDDs are at risk too. Only capacity-optimized HDDs will survive. Even without flash, DRAM has reached such capacity that most datasets fit in RAM completely, and if not in the RAM of a single server, then surely in shared cluster RAM. These solid-state memory advancements in DRAM and flash disrupt the storage market, especially by making high-performance SANs redundant. The only storage tomorrow&#8217;s servers will need is capacity-optimized and energy-optimized storage. That fact, among others, forced EMC to move into computing&#8230; and to provide a complete cloud in the box instead of just storage in the box as it did in the past.</p>
<p>4. Networking&#8230; In my view, networking is the most stagnant hardware market here. InfiniBand is finally moving into the mainstream, and that is good. Or is it? Will it succumb to 10GbE? That remains to be seen. My bet is on InfiniBand due to its architectural superiority. Network virtualization is still on the whiteboards&#8230; unfortunately. So in networking there are no signs of a renaissance yet, but the potential is there.</p>
<img src="http://feeds.feedburner.com/~r/InsightCrew/~4/pMj0CobddhM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/300/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/300</feedburner:origLink></item>
		<item>
		<title>Emerging Proprietary Hardware Renaissance</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/l27bfguHid0/295</link>
		<comments>http://bigdatacraft.com/archives/295#comments</comments>
		<pubDate>Wed, 17 Nov 2010 14:42:53 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=295</guid>
		<description><![CDATA[<p>INTRO I cannot count the number of times I have heard that cloud computing means innovation stagnation in the proprietary hardware business, and that with cloud computing hardware doesn&#8217;t matter anymore and will succumb sooner or later to becoming a boring, razor-thin-margin, oligopolistic commodity industry.</p> <p>GAME OVER FOR FAT MARGINS IN PROPRIETARY HARDWARE? Why do folks think like that? <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/295">Emerging Proprietary Hardware Renaissance</a></span>]]></description>
				<content:encoded><![CDATA[<p>INTRO<br />
I cannot count the number of times I have heard that cloud computing means innovation stagnation in the proprietary hardware business, and that with cloud computing hardware doesn&#8217;t matter anymore and will succumb sooner or later to becoming a boring, razor-thin-margin, oligopolistic commodity industry.</p>
<p>GAME OVER FOR FAT MARGINS IN PROPRIETARY HARDWARE?<br />
Why do folks think like that? Well&#8230; there is one reason that dominates their thinking: hardware products became components and, worst of all, they became well-standardized components. As such, certain low-wage countries can quickly master assembling them in large quantities and win the competition purely on cost, and nothing else seriously matters in the component business. No one cares about a component&#8217;s extended enterprise feature set, premium brand, long list of ISV partnerships, etc. What matters is very well-defined functionality and price, price and price. In fact, this has already happened to low-end gear like entry-level servers and routers. The situation is very different for enterprise IT products. I estimate that if IT costs tripled overnight for most enterprises, it would not matter to their bottom line. So the IT departments of most enterprises are cost-insensitive regarding IT gear. They will not blindly overpay in most cases, but cost is not high on their priority list either. IT for most enterprises is more or less a fixed cost amortized over a very large number of products or services. I guess that if Coca-Cola&#8217;s IT costs were tripled, the price of a single can of Coke would have to rise by less than a cent. It is unlikely that this is life-threatening to their business. Therefore the game was, and largely still is, to market hardware products directly to enterprise IT departments, competing on enterprise feature set rather than price. Now, with the emerging cloud computing paradigm, hardware products are components and are marketed to cloud operators, which are essentially server farmers. They are extremely cost-sensitive, and marketing premium computing gear to them will be as successful as marketing premium booze as fuel to the owner of an alcohol-fueled car. If server costs tripled overnight for a cloud operator, the next morning it would be out of business. 
So without a doubt it is game over for fat margins in the hardware manufacturing/assembly business. What happened to enterprise software 5 years ago is happening to enterprise hardware right now: commoditization.</p>
<p>PROPRIETARY HARDWARE INNOVATION<br />
So, the common thinking goes, in a boring commodity business no one is going to invest in innovation, so no one is going to invest much in proprietary hardware, because no cloud vendor is going to buy premium hardware. The only hope is the private cloud, a freshly invented loophole for continuing to sell premium gear to enterprises. Well, let&#8217;s consider the following situation. Startup XYZ manages to build a proprietary appliance that is essentially a cloud-in-the-box solution and that, through tight internal integration and optimization for one particular task, achieves an order-of-magnitude better price-performance. Let&#8217;s say it is a KV-store appliance. Would cloud operators be interested? I bet they would. From the outside, it doesn&#8217;t matter whether the functionality is backed by Cassandra on generic hardware or by a custom hardware appliance. So a cloud provider quietly rolling out such appliances can compete well on both price and functionality (such as latency) with other cloud providers running a software-based KV store on generic hardware. Another startup, ABC, may produce a computing appliance that can run Ruby on Rails, Java or Python applications an order of magnitude more efficiently than generic hardware, and a cloud provider deploying such appliances from ABC could compete better at serving RoR clients. Yet another startup may build a custom rack-sized box filled with Fermi chips, specifically designed for video processing. I could bring more examples, but the trend is obvious: use a large chunk of dedicated hardware to do one specific task, do it extremely efficiently, and you can have nice margins as a hardware manufacturer. Before the cloud, hardware had to be generic, because a single enterprise server had to be able to run a variety of different workloads. With cloud computing this is no longer the case. A vendor can build a dedicated hardware appliance optimized for one specific workload and serve the whole world with it, raising high barriers for competitors. 
So, despite popular belief, I think cloud computing presents unique opportunities to creative hardware engineers, though not in the premium enterprise-feature-set area as it used to be, but in the areas of extreme efficiency, acceleration and specialization.</p>
<img src="http://feeds.feedburner.com/~r/InsightCrew/~4/l27bfguHid0" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/295/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/295</feedburner:origLink></item>
		<item>
		<title>Two Envelopes Problem: Am I just dumb?</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/txZpb3udBS8/274</link>
		<comments>http://bigdatacraft.com/archives/274#comments</comments>
		<pubDate>Sun, 17 Oct 2010 19:34:02 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=274</guid>
		<description><![CDATA[<p>It seems the recent craze about statistician being the profession of choice in the future is gaining steam: a future where we will be surrounded by quality BigData, capable computers and bug-free open source software, including OpenDremel. Well, the last one I made up&#8230; but the rest seems to be the current situation. Acknowledging this <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/274">Two Envelopes Problem: Am I just dumb?</a></span>]]></description>
				<content:encoded><![CDATA[<p>It seems the recent craze about statistician being the profession of choice in the future is gaining steam: a future where we will be surrounded by quality BigData, capable computers and bug-free open source software, including OpenDremel. Well, the last one I made up&#8230; but the rest seems to be the current situation. Acknowledging this, I was checking the state of open-source statistics software, who the guys behind it are, etc. But that is not today&#8217;s topic; it is the topic of one of my next posts. Today I want to talk about one of the strangest problems/paradoxes I have ever seen on the internet. The story is that I have just come across &#8220;the two envelopes problem&#8221;, or &#8220;paradox&#8221; as some put it. Having worked with math guys for a long time (guys with a math mastery that I&#8217;ll never reach in my life), I immediately recognized the problem, as I was teased with it a few times a long time ago. However, I was never told that it is such a big deal of a problem. Wikipedia lists it under &#8220;Unsolved problems in statistics&#8221;. Huh? I never understood what is so paradoxical or hard or even interesting about it. To me it seemed a high-school-grade problem at most. So I put &#8220;two envelopes problem&#8221; into Google and found tens of blogs trying to explain it and proposing <del>over-engineered </del> over-complicated and long solutions to such a simple problem. 
I have a very strange feeling that I&#8217;m either totally dumb or a genius, and I know I&#8217;m not a genius <img src='http://bigdatacraft.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  In some sources it is mentioned that only Bayesian subjectivists suffer from this; however, in the large majority of other sources it is presented as a universal problem&#8230; Well, enough talking; let&#8217;s dive into the simplest solution on the internet (or I will be embarrassed by someone pointing out my mistake or a similar solution published elsewhere).</p>
<p><strong><em>The problem description for those who never heard about it:</em></strong></p>
<p><strong>You are approached by a guy who shows you two identical envelopes. Both envelopes contain money, and one contains twice as much as the other. You are allowed to pick and keep either one of them for yourself. After you pick one, the guy offers to let you swap envelopes. The question is whether you should swap. For me it is as clear as a sunny day that it doesn&#8217;t matter whether you swap, and it is easily provable by simple math. Somehow most folks (some at Ph.D. level!) get into very hairy calculations that suggest that one should swap, and then even hairier ones showing why one should not. Some mention subjectivity, but most don&#8217;t.</strong></p>
<p><em><strong>The simplest solution on internet (joking&#8230; but seriously I haven&#8217;t found such simple one):</strong></em></p>
<p>Let&#8217;s denote the smaller sum in one of the envelopes as X; the larger sum in the other is therefore 2X. Then the expected value of the current envelope selection, before considering a swap, is 1.5X. How did I get that? Very simply: we have a 0.5 probability of holding the envelope with the larger sum, which is 2X, and a 0.5 probability of holding the envelope with the smaller sum, which is X. So:</p>
<p><code>0.5*2X+0.5*X = X+0.5X=1.5X</code></p>
<p>So far so good&#8230; Let&#8217;s now calculate the expected value if we swap. If we swap, we will have the same 0.5 probability of holding the larger sum and the same 0.5 probability of holding the smaller sum. There is no need to repeat the calculation: you will get exactly the same 1.5X as the expected value, meaning that the swap doesn&#8217;t matter. Or, if time has any value, it doesn&#8217;t make sense to waste it on swapping envelopes.</p>
<p>Do you see it as a hard problem? I bet a 10-year-old will do fine with it, especially if offered some reward.</p>
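<p>The calculation above can also be checked empirically. Below is a small Monte Carlo sketch in Python; the function name <code>simulate</code>, the seed and the uniform distribution used to draw the smaller sum X are assumptions made purely for illustration. Results are expressed in units of X, so both strategies should come out close to 1.5.</p>

```python
import random


def simulate(trials=100000, seed=0):
    """Average payoff, in units of X, for keeping vs. swapping the envelope."""
    rng = random.Random(seed)
    keep_total = 0.0
    swap_total = 0.0
    for _ in range(trials):
        # Smaller sum X; the distribution is an arbitrary assumption.
        x = rng.uniform(1.0, 100.0)
        envelopes = [x, 2 * x]
        rng.shuffle(envelopes)           # we pick one envelope at random
        picked, other = envelopes
        keep_total += picked / x         # payoff if we keep our envelope
        swap_total += other / x          # payoff if we swap
    return keep_total / trials, swap_total / trials


keep_avg, swap_avg = simulate()
print("keep: %.3f X   swap: %.3f X" % (keep_avg, swap_avg))
```

<p>Whatever the seed, the two averages always add up to 3X, because in every trial the two envelopes together hold exactly X + 2X; the simulation just confirms that neither strategy gets more than its half of that.</p>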
<p><em><strong>How come others get lost here?</strong></em></p>
<p>The answer is that some try to apply Bayesian subjectivist probability theory, and then innocent folks follow them and get lost as well.</p>
<p>If you look at the <a href="http://en.wikipedia.org/wiki/Two_envelopes_problem" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Two_envelopes_problem?referer=');">Wikipedia article</a>, for example, you will find a classic wrong solution that is allegedly &#8220;obvious&#8221;, and then a link outside Wikipedia to a &#8220;correct&#8221; solution. The correct solution is a long post with a lot of formulae and usage of Bayes&#8217; theorem that in the end arrives at the correct answer.</p>
<p>Well&#8230; I clearly see a flaw in the solution published on Wikipedia. That solution really looks artificial, but judging by the number of followers it must seem obvious to many. The blunder is in the third line:</p>
<blockquote><p>The other envelope may contain either 2A or A/2</p></blockquote>
<p>By A they denote the sum in the envelope they are holding. The mistake is in &#8220;either 2A or A/2&#8221;; it should be &#8220;either 2A or A&#8221;. Then everything will be OK and no &#8220;paradox&#8221; will emerge in the end. The mistake stems from the fallacy of using the same name for two separate variables that are dependent but not equal, and then repeatedly confusing them because they share a name. Here is a &#8220;patch&#8221; to be applied to the reasoning published on Wikipedia:</p>
<blockquote><p>1. I denote by A the amount in my selected envelope. =&gt; <strong>FINE</strong><br />
2. The probability that A is the smaller amount is 1/2, and that it is the larger amount is also 1/2. =&gt; <strong>FINE</strong><br />
3. The other envelope may contain either 2A or A/2. =&gt; <strong>INCORRECT: the variable A denotes different values in the two cases, so it is highly confusing to write it this way.</strong></p>
<p><strong>Let&#8217;s explicitly consider two cases here instead of the implicit &#8220;either&#8230; or&#8230;&#8221;</strong></p>
<p><strong>In the first case, let&#8217;s assume we are holding the smaller sum; then the other envelope contains 2A.<br />
In the second case, let&#8217;s assume we are holding the larger sum; then the other envelope contains A/2. However, this A is different from the A of the first case, so let&#8217;s write it as _A/2.</strong></p>
<p><strong>Moreover, we know that _A is not just different from A but is exactly twice A, so<br />
_A = 2A<br />
So the expression &#8220;either 2A or A/2&#8221; must be written as &#8220;either 2A or _A/2&#8221;, or, substituting _A = 2A, as &#8220;either 2A or A&#8221;.</strong></p></blockquote>
<p>Then, for calculating the expected value, you also substitute A for A/2 and get the same expected value as before the swap.</p>
<p>===============================</p>
<p>That said, I have seen many people feel so &#8220;enlightened&#8221; after reading a complicated &#8220;correct&#8221; solution that they erroneously argue one should not accept the following offer, thinking it is equivalent to the problem above (not in exactly these words; I rephrased it for clarity):</p>
<p>One guy comes to you and says there are three envelopes. You are allowed to pick one and keep it. One envelope is red and two are white. All three contain money. One of the white envelopes contains twice as much as the red one. The other white one contains half as much as the red one. The white envelopes are identical and there is no way to know which one contains double and which one contains half. The question is which envelope you should choose: the red one or one of the white ones. And the answer is that you should pick one of the white envelopes! In fact, the calculation erroneously applied to the two-envelopes problem is 100% correct for the three-envelopes problem. On average you will win by choosing one of the white envelopes rather than the red one.</p>
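<p>A minimal sketch of the three-envelopes calculation, under the assumption that each white envelope is picked with probability 1/2. The function name is hypothetical; <code>red</code> is the amount in the red envelope.</p>

```python
from fractions import Fraction

def three_envelope_expectations(red):
    """Expected amounts for taking the red envelope vs. a random white one."""
    half = Fraction(1, 2)
    # Here the same variable really is used in both branches:
    # one white envelope holds 2 * red, the other holds red / 2.
    white = half * (2 * red) + half * (red / 2)
    return red, white

red, white = three_envelope_expectations(Fraction(100))
print(red, white, white / red)  # 100 125 5/4
```

<p>A random white envelope is worth 1.25 times the red one on average, which is exactly the figure that the flawed two-envelopes argument produces.</p>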
<img src="http://feeds.feedburner.com/~r/InsightCrew/~4/txZpb3udBS8" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/274/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/274</feedburner:origLink></item>
		<item>
		<title>Debunking common misconceptions in SSD, particularly for analytics</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/6I0ehsl0nx4/252</link>
		<comments>http://bigdatacraft.com/archives/252#comments</comments>
		<pubDate>Wed, 13 Oct 2010 13:45:23 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[BI]]></category>
		<category><![CDATA[Flash Memory]]></category>
		<category><![CDATA[SSD]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=252</guid>
		<description><![CDATA[1. SSD is NOT synonymous with flash memory. <p>First of all, let&#8217;s settle on terms. SSD is best described as the concept of using semiconductor memory as a disk. There are two common cases: DRAM-as-disk and flash-as-disk. Flash memory is a semiconductor technology quite similar to DRAM, just with a slightly different set of trade-offs.</p> <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/252">Debunking common misconceptions in SSD, particularly for analytics</a></span>]]></description>
				<content:encoded><![CDATA[<h2><strong>1. SSD is NOT synonymous with flash memory.</strong></h2>
<p>First of all, let&#8217;s settle on terms. SSD is best described as the concept of using semiconductor memory as a disk. There are two common cases: DRAM-as-disk and flash-as-disk. Flash memory is a semiconductor technology quite similar to DRAM, just with a slightly different set of trade-offs.</p>
<p>Today there are few options for using flash memory in analytics beyond SSD. Nevertheless, this should not suggest that SSD is synonymous with flash memory. Flash memory can be used in products other than SSDs, and an SSD can use non-flash technology, for example DRAM.</p>
<p>So the question is: do we have any option for using flash memory in a form other than flash-as-disk?</p>
<p>FusionIO is the only one, and it has always been bold in claiming that its products are not SSDs but a totally new product category called ioMemory. Usually I subconsciously dismiss such claims as the common practice of term obfuscation. However, in the case of FusionIO I found it to be a rare exception and <em>technically true</em>. At the hardware level there is no disk-related overhead in the FusionIO solution, and in my opinion FusionIO is the closest to the flash-on-motherboard vision among all SSD manufacturers. That said, FusionIO succumbed to implementing a disk-oriented storage layer in software because no other standards covering the flash-as-flash concept are available.</p>
<p>You can find more in-depth coverage of the New-Dynasty SSD versus Legacy SSD issue in <a href="http://www.storagesearch.com/enterprise-ssd.html" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.storagesearch.com/enterprise-ssd.html?referer=');">a recent article by Zsolt Kerekes on StorageSearch.com</a>, although I don&#8217;t 100% agree with his categorization.</p>
<h2><em>2. SSD DOESN&#8217;T provide more throughput than HDD.</em></h2>
<p>The bragging about the performance density of SSD can safely be dismissed. There is no problem in stacking HDDs together. As many as 48 of them can be put in a single 4U server box, providing aggregate throughput of 4GB/sec for a fraction of the SSD price. The same goes for power, vibration, noise, etc. The extent to which these properties are superior to disk is uninteresting and does not justify the associated cost premium.</p>
<p>Further, for any amount of money, HDDs can provide significantly more IO throughput than SSDs from any of today&#8217;s vendors, on any workload: read, write or combined. Not only that, but they will do so with an order of magnitude more capacity for your big data as an additional bonus. However, a few nuances are to be considered:</p>
<ul>
<li>If data is accessed in small random chunks (let&#8217;s say 16KB chunks), then SSD will provide significantly more throughput (maybe by a factor of 100) than disk. Reading in chunks of at least 1MB puts HDD back as the winner in the throughput game.</li>
<li>Flash memory itself has great potential to provide an order of magnitude more throughput than disks. The mechanical &#8220;gramophone&#8221; technology of disks cannot compete in agility with electrons. However, this potential throughput is hopelessly left unexploited by today&#8217;s SSD controllers. How bad is it? Pretty bad: SSD controllers pass on less than 10% of the potential throughput. The reasons include flash-management complexity and cost constraints that lead to small embedded DRAM buffers and computationally weak controllers. The main reason, though, is that there are no standards for a 100x faster disk, nor could legacy software keep up with multi-gigabyte throughputs, so SSD vendors don&#8217;t bother and are obsessed with the laughable idea of bankrupting HDD manufacturers, calling the technology disruptive, which by definition it is not. <strong><em>So we have a much more expensive disk replacement that is only barely more performant, throughput-wise, than a vanilla low-cost HDD array.</em></strong></li>
</ul>
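<p>A rough back-of-the-envelope model of the chunk-size nuance, as a sketch only. The per-disk sequential rate is derived from the 48-disk, 4GB/sec figure above; the 10 ms average access time (seek plus rotational delay) is an assumed, hypothetical value, and the model ignores caching and queuing.</p>

```python
DISKS = 48
ARRAY_SEQ_MB_PER_S = 4000.0   # 4GB/sec aggregate, from the text above
ACCESS_TIME_S = 0.010         # ASSUMPTION: average seek + rotational delay

def hdd_throughput_mb_per_s(chunk_mb, access_time_s=ACCESS_TIME_S):
    """Effective per-disk throughput when every chunk costs one random access."""
    seq_rate = ARRAY_SEQ_MB_PER_S / DISKS
    return chunk_mb / (access_time_s + chunk_mb / seq_rate)

for chunk_mb in (0.016, 1.0, 16.0):
    per_disk = hdd_throughput_mb_per_s(chunk_mb)
    print(f"{chunk_mb:7.3f} MB chunks: {per_disk:6.1f} MB/s per disk, "
          f"{per_disk * DISKS:7.0f} MB/s per array")
```

<p>Under these assumptions, 16KB random reads leave almost all of the disk&#8217;s sequential bandwidth unused, while 1MB chunks already recover roughly half of it.</p>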
<h2><em>3. An array of SSDs DOESN&#8217;T provide a larger number of useful IOPS than an array of disks.</em></h2>
<p>While it is true that one SSD can easily match a disk array in IOPS, this should not suggest that an array of SSDs will provide a larger number of useful IOPS. The reason is prosaic: an array of disks already provides an abundance of IOPS, many times more than enough for any analytic application. So any additional IOPS are not needed, and the astronomical number of IOPS in SSD arrays is a solution looking for a problem in the analytics industry.</p>
<h2>4. SSDs are NOT disruptive to disks.</h2>
<p>Well, if they are, it is not according to Clayton Christensen&#8217;s definition of &#8220;disruptiveness&#8221;. As far as I remember, Christensen defines technology A as disruptive to technology B when all of the following hold true:</p>
<ul>
<li>A is worse than technology B in quality and features</li>
<li>A is cheaper than technology B</li>
<li>A is affordable to a large number of new users to whom technology B is appealing but too costly.</li>
</ul>
<p>Clearly none of the conditions above holds for the SSD-to-disk pair, so I&#8217;m puzzled how one can call SSDs disruptive to disks.</p>
<p>Again, I&#8217;m not claiming that SSD or flash memory is not disruptive to any technology; I&#8217;m just claiming that SSDs are not disruptive to HDDs. In fact, I think flash memory IS disruptive to DRAM: all three conditions above hold for the flash-to-DRAM pair. Also, a pair of directly attached SSDs is highly disruptive to SAN.</p>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;</p>
<p><strong><em>Make no mistake: I&#8217;m a true believer in flash memory as a game-changer for analytics, just not in the form of disk replacement. In my upcoming posts I&#8217;ll explore the ideas where flash memory can make a difference. I know I completely skipped any quantitative proof for the claims above, but&#8230; well&#8230; let&#8217;s leave it for the comment section.</em></strong></p>
<p><strong><em>Also, one of the best treatments of flash memory for analytics (and not coming from a flash vendor) is by Curt Monash on the DBMS2 blog: </em></strong><a href="http://www.dbms2.com/2010/01/31/flash-pcmsolid-state-memory-disk/" onclick="pageTracker._trackPageview('/outgoing/www.dbms2.com/2010/01/31/flash-pcmsolid-state-memory-disk/?referer=');"><strong><em>http://www.dbms2.com/2010/01/31/flash-pcmsolid-state-memory-disk/</em></strong></a></p>
<img src="http://feeds.feedburner.com/~r/InsightCrew/~4/6I0ehsl0nx4" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/252/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/252</feedburner:origLink></item>
		<item>
		<title>Google Percolator: MapReduce Demise?</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/hWd5-0pUAh0/240</link>
		<comments>http://bigdatacraft.com/archives/240#comments</comments>
		<pubDate>Tue, 12 Oct 2010 20:10:02 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Advanced Analytics]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Analytics Patterns]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Percolator]]></category>
		<category><![CDATA[SVLC]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=240</guid>
		<description><![CDATA[<p>Here are my early thoughts after quickly looking into Google Percolator and skimming the paper.</p> <p>Major take-away: massive transactional mutation of a tens-of-petabytes-scale dataset on a thousands-node cluster is possible!</p> <p>MapReduce is still useful for distributed sorts of big data and a few other things; nevertheless, its &#8220;karma&#8221; has suffered a blow. Beforehand you could end any MapReduce dispute by <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/240">Google Percolator: MapReduce Demise?</a></span>]]></description>
				<content:encoded><![CDATA[<p>Here are my early thoughts after quickly looking into <a href="http://research.google.com/pubs/pub36726.html" onclick="pageTracker._trackPageview('/outgoing/research.google.com/pubs/pub36726.html?referer=');">Google Percolator</a> and skimming the <a href="http://research.google.com/pubs/archive/36726.pdf" onclick="pageTracker._trackPageview('/outgoing/research.google.com/pubs/archive/36726.pdf?referer=');">paper</a>.</p>
<p><em><strong>Major take-away:</strong></em> massive transactional mutation of a tens-of-petabytes-scale dataset on a thousands-node cluster is possible!</p>
<p>MapReduce is still useful for distributed sorts of big data and a few other things; nevertheless, its &#8220;karma&#8221; has suffered a blow. Beforehand you could end any MapReduce dispute by saying &#8220;well&#8230; it works for Google&#8221;; nowadays, before you can say it, you will hear &#8220;well&#8230; it didn&#8217;t work for Google&#8221;. MapReduce is particularly criticized for 1) too-long latency, 2) wastefulness, requiring a full rework of the whole tens-of-PB-scale dataset even if only a fraction of it has changed, and 3) <strong><em>inability to support kinda-real-time data processing (meaning processing documents as they are crawled and updating the index accordingly)</em></strong>. In short: <strong><em>welcome to the disillusionment stage of the MapReduce</em></strong> saga. Luckily, Hadoop is not only MapReduce. I&#8217;m convinced Hadoop will thrive and flourish beyond MapReduce, and <strong><em>MapReduce, being an important big data tool, will be widely used where it really makes sense</em></strong> rather than misused or abused in various ways. Aster Data and the remaining MPP startups can relax on the issue a bit.</p>
<p>Probably a topic for another post, but I think MapReduce is best leveraged as an ETL tool.</p>
<p>See also <a href="http://blog.tonybain.com/tony_bain/2010/09/was-stonebraker-right.html" onclick="pageTracker._trackPageview('/outgoing/blog.tonybain.com/tony_bain/2010/09/was-stonebraker-right.html?referer=');">http://blog.tonybain.com/tony_bain/2010/09/was-stonebraker-right.html</a> for another view on the issue. There are a few other posts already published on Percolator, but I haven&#8217;t looked into them yet.</p>
<p>I&#8217;m very happy about my <a href="http://bigdatacraft.com/archives/135">SVLC-hypothesis</a>. I think I knew it for a long time, but somehow only now, after putting it on paper, do I feel that reasoning about different analytics approaches has become easier. It is like having a map instead of visualizing it in your head. So where is Percolator in the context of SVLC? If it is still considered analytics, Percolator is an SVC system, giving up latency for everything else, albeit to a much lesser degree than its predecessor MapReduce. That said, Percolator has a sizable part that is not analytics anymore but rather transaction processing, and transaction processing is not usefully modeled by my SVLC-hypothesis. In summary:<em> <strong>Percolator makes essentially the same trade-off as MapReduce, sacrificing latency for volume-cost-sophistication, but is more temperate, more rounded, less radical</strong></em><strong>.</strong></p>
<p>Unfortunately I haven&#8217;t had enough time to enjoy the <a href="http://research.google.com/pubs/archive/36726.pdf" onclick="pageTracker._trackPageview('/outgoing/research.google.com/pubs/archive/36726.pdf?referer=');">paper</a> as it should be enjoyed, with easy weekend-style reading, so inaccuracies may have crept in:</p>
<ul>
<li><strong><em>Percolator is a big-data, ACID-compliant, transaction-processing, non-relational DBMS.</em></strong></li>
<li><strong><em>Percolator fits most NoSQL definitions and therefore it is a NoSQL system.</em></strong></li>
<li><strong><em>Percolator continuously mutates the dataset <span style="font-weight: normal;"><span style="font-style: normal;">(called data corpus in the paper)</span></span> with full transactional semantics, at sizes of tens of petabytes on thousands of nodes.</em></strong></li>
<li><strong><em>Percolator uses a message-queue style approach for processing crawled data. <span style="font-weight: normal;"><span style="font-style: normal;">Meaning, it processes crawled pages continuously as they arrive, updating the index database transactionally.</span></span></em></strong></li>
<li><strong><span style="font-weight: normal;">BEFORE Percolator: indexing was done in stages taking weeks. All crawled data was accumulated and staged first, then transformed pass after pass into the index. About 100 passes were quoted in the paper, as I remember. When a cycle was completed, a new one was initiated. A latency of a few weeks between content being published and appearing in Google search results was considered too long in the Twitter age, so Google implemented some shortcuts allowing</span><em> preliminary results to show in search before the cycle is completed.</em></strong></li>
<li><strong><em>Percolator doesn&#8217;t have a declarative query language.</em></strong></li>
<li><strong><em>No joins.</em></strong></li>
<li><strong><em>Self-stated ~3% single-node efficiency relative to a state-of-the-art DBMS on a single node. </em><span style="font-weight: normal;">That&#8217;s the price for handling (that is, transactionally mutating) a high-volume dataset&#8230; and relatively cheaply. Kudos to Google for being so open about this and not engaging in term obfuscation. On the other hand, they can afford it&#8230; they don&#8217;t have to sell it tomorrow on the rather competitive NoSQL market <img src='http://bigdatacraft.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </span></strong></li>
<li><strong><em>Thread-per-transaction model. Heavily threaded many-core servers, as I understand.</em></strong></li>
</ul>
<p>Architecturally it reminds me of MoM (Message Oriented Middleware) with transactional queues and guaranteed delivery.</p>
<p><strong><em>Definitely to be continued&#8230;</em></strong></p>
<p><strong>Other Percolator blog posts:</strong></p>
<p><a href="http://www.infoq.com/news/2010/10/google-percolator" onclick="pageTracker._trackPageview('/outgoing/www.infoq.com/news/2010/10/google-percolator?referer=');">http://www.infoq.com/news/2010/10/google-percolator</a><br />
<a href="http://www.theregister.co.uk/2010/09/24/google_percolator" onclick="pageTracker._trackPageview('/outgoing/www.theregister.co.uk/2010/09/24/google_percolator?referer=');">http://www.theregister.co.uk/2010/09/24/google_percolator</a><br />
<a href="http://coolthingoftheday.blogspot.com/2010/10/percolator-question-is-how-does-google.html" onclick="pageTracker._trackPageview('/outgoing/coolthingoftheday.blogspot.com/2010/10/percolator-question-is-how-does-google.html?referer=');">http://coolthingoftheday.blogspot.com/2010/10/percolator-question-is-how-does-google.html</a><br />
<a href="http://www.quora.com/What-is-Google-Percolator" onclick="pageTracker._trackPageview('/outgoing/www.quora.com/What-is-Google-Percolator?referer=');">http://www.quora.com/What-is-Google-Percolator</a></p>
<img src="http://feeds.feedburner.com/~r/InsightCrew/~4/hWd5-0pUAh0" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/240/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/240</feedburner:origLink></item>
		<item>
		<title>How scalable is linux kernel on 48-core machine?</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/rYUrdVB2oNQ/225</link>
		<comments>http://bigdatacraft.com/archives/225#comments</comments>
		<pubDate>Tue, 12 Oct 2010 00:40:25 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=225</guid>
		<description><![CDATA[<p>According to this excellent and comprehensive research, a ~33x speedup (compared to a single core) is possible with some kernel hacking. For example, PostgreSQL running on 48 cores gives ~4x out of the box, and after kernel/PostgreSQL patches are applied it grows to ~33x, assuming IO can keep up, of course.</p> ]]></description>
				<content:encoded><![CDATA[<p>According to this <a href="http://www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf" onclick="pageTracker._trackPageview('/outgoing/www.usenix.org/events/osdi10/tech/full_papers/Boyd-Wickizer.pdf?referer=');">excellent and comprehensive research</a>, a ~33x speedup (compared to a single core) is possible with some kernel hacking. For example, PostgreSQL running on 48 cores gives ~4x out of the box, and after kernel/PostgreSQL patches are applied it grows to ~33x, assuming IO can keep up, of course.</p>
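<p>One possible way to read these numbers is through Amdahl&#8217;s law. This framing is an assumption made here for illustration, not something taken from the linked research, and the function name is hypothetical. The sketch solves speedup = 1 / (s + (1 - s) / cores) for the serial fraction s.</p>

```python
def implied_serial_fraction(speedup, cores):
    """Serial fraction s that Amdahl's law would need to explain a speedup."""
    # From speedup = 1 / (s + (1 - s) / cores):
    #   cores / speedup - 1 = s * (cores - 1)
    return (cores / speedup - 1) / (cores - 1)

for label, speedup in (("out of the box", 4.0), ("after patches", 33.0)):
    s = implied_serial_fraction(speedup, cores=48)
    print(f"{label}: ~{speedup:.0f}x on 48 cores implies s of about {s:.3f}")
```

<p>Read this way, the patches would shrink the effectively serialized share of the work from roughly a quarter to about one percent.</p>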
<img src="http://feeds.feedburner.com/~r/InsightCrew/~4/rYUrdVB2oNQ" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/225/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/225</feedburner:origLink></item>
		<item>
		<title>Is NoSQL a DBMS?</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/3rsH_-2FcvA/215</link>
		<comments>http://bigdatacraft.com/archives/215#comments</comments>
		<pubDate>Tue, 12 Oct 2010 00:14:30 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Terminology]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=215</guid>
		<description><![CDATA[<p>Yes, it is.</p> <p>Proof? &#8211; By definition.</p> <p>But Wikipedia&#8230;&#8230; &#8211; fixed.</p> ]]></description>
				<content:encoded><![CDATA[<p>Yes, it is.</p>
<p>Proof? &#8211; By definition.</p>
<p>But Wikipedia&#8230;&#8230; &#8211; fixed.</p>
<img src="http://feeds.feedburner.com/~r/InsightCrew/~4/3rsH_-2FcvA" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/215/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/215</feedburner:origLink></item>
		<item>
		<title>CAP equivalent for analytics?</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/8-ZAXfI-Yts/135</link>
		<comments>http://bigdatacraft.com/archives/135#comments</comments>
		<pubDate>Sat, 09 Oct 2010 02:46:34 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[CAP]]></category>
		<category><![CDATA[SVLC]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=135</guid>
		<description><![CDATA[<p>The CAP theorem deals with trade-offs in transactional systems. It doesn’t need an introduction, unless of course you have been busy on the moon for the last couple of years, in which case you can easily Google for good intros. Here is a Wikipedia entry on the subject.</p> <p>I was thinking about how I would build an <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/135">CAP equivalent for analytics?</a></span>]]></description>
				<content:encoded><![CDATA[<p>The CAP theorem deals with trade-offs in transactional systems. It doesn’t need an introduction, unless of course you have been busy on the moon for the last couple of years, in which case you can easily Google for good intros. Here is a <a href="http://en.wikipedia.org/wiki/CAP_theorem" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/CAP_theorem?referer=');">Wikipedia entry on the subject</a>.</p>
<p>I was thinking about how I would build an <a href="http://bigdatacraft.com/archives/94">ideal analytics system</a>. The realization came quickly that all “care abouts” cannot be satisfied simultaneously, even given enough development time. Some desirable properties must be sacrificed in favor of others, hence architectural trade-offs are unavoidable in principle. I immediately had déjà vu regarding CAP. So the following is my take on the subject:</p>
<p><strong>SVLC hypothesis regarding architectural trade-offs in analytics</strong></p>
<p>I haven&#8217;t come to a rigorous definition yet, so here is an intuitive one: <strong>Current technology doesn&#8217;t allow implementation of a single analytics system that is SVLC, that is, simultaneously sophisticated, high-volume, low-latency and low-cost. One of these four properties must be sacrificed, and the extent to which it is sacrificed determines the extent to which the other properties can potentially be implemented.</strong></p>
<p><strong>Deep dive for the brave souls</strong></p>
<p>Let’s reiterate the desired system properties first (see <a href="http://bigdatacraft.com/archives/94">ideal analytics system</a>):</p>
<ol>
<li>Deep <em><strong>Sophistication</strong></em><strong> </strong>=&gt; …free-form SQL:2008 with multi-way joins of two or more big tables, sorts of big tables and all the rest of the data heavy lifting.</li>
<li>High <strong><em>Volume</em></strong> =&gt; …handling big data volumes. Let&#8217;s cap it at 1PB for now, for easier thinking.</li>
<li>Low <strong><em>Latency</em></strong> =&gt; …subsecond response time for a query on average. A more concrete description is that latency must be low enough to allow an analyst to work with the system interactively, in a conversational manner.</li>
<li>Low-<strong><em>Cost</em></strong> =&gt; … I&#8217;ll define it as commodity hardware plus software whose cost must not exceed the hardware cost. More rigorously? $1/GB/month for actively queried data is my very rough estimate of low cost.</li>
<li>Multi-form =&gt; any data: relational, serialized objects, text, etc.</li>
<li>Security =&gt; speaks for itself</li>
</ol>
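<p>To get a feel for the scale of the Volume and Cost definitions taken together, here is a small arithmetic sketch. It assumes decimal units (1PB = 1,000,000 GB); the constant and function names are illustrative.</p>

```python
GB_PER_PB = 1_000_000        # ASSUMPTION: decimal units, 1PB = 10^6 GB
COST_PER_GB_MONTH = 1.0      # USD, the rough low-cost estimate above

def monthly_budget_usd(volume_pb):
    """Monthly budget that still counts as low-cost for a given volume."""
    return volume_pb * GB_PER_PB * COST_PER_GB_MONTH

print(monthly_budget_usd(1))       # 1PB of actively queried data, per month
print(12 * monthly_budget_usd(1))  # the same volume over a year
```

<p>So even &#8220;low-cost&#8221; at the 1PB cap means about $1 million per month.</p>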
<p>I found that multi-formness and security don’t interfere with implementing the rest of the properties and can in principle always be implemented in a satisfactory way without major compromises. Some nuances exist though, but I&#8217;ll ignore them for clarity. So removing them, we get the following list:</p>
<p><em>1. </em><strong><em>Sophistication</em></strong> <em>(deep)              =&gt; </em><strong>S</strong></p>
<p><strong><em>2. </em></strong><strong><em>Volume</em></strong> <em>(high)                          =&gt; </em><strong>V</strong><em> <strong> </strong></em></p>
<p><strong><em>3. </em></strong><strong><em>Latency </em></strong><em>(low)                           =&gt; </em><strong>L<em> </em></strong></p>
<p><strong><em>4. </em></strong><strong><em>Cost </em></strong><em>(low)                                  =&gt; </em><strong>C<em> </em></strong></p>
<p>These four are highly inter-related and form a constraint system. Implementing one to its full extent hampers the rest. Let&#8217;s see what trade-offs we have here. Four properties give six potential simple pairwise (two-extreme) trade-offs. Let&#8217;s settle on a geometric tetrahedron to model the architectural trade-off space. The four properties correspond to the four vertices and the six trade-offs correspond to the six edges. We then model a particular trade-off by putting a point on the corresponding edge. So we get something like this:</p>
<p style="text-align: center;"><a href="http://bigdatacraft.com/wp-content/uploads/2010/10/tetrahedron_SVLC.png"><img class="size-full wp-image-179 aligncenter" title="tetrahedron_SVLC" src="http://bigdatacraft.com/wp-content/uploads/2010/10/tetrahedron_SVLC.png" alt="" width="441" height="294" /></a></p>
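<p>The counting behind the tetrahedron model can be checked with a few lines: four properties give six pairwise trade-offs (the edges), and dropping one property at a time gives four three-property systems (the faces). This is only an illustrative sketch; the variable names are arbitrary.</p>

```python
from itertools import combinations

PROPERTIES = "SVLC"  # Sophistication, Volume, Latency, Cost

# Each edge of the tetrahedron is a trade-off between two properties.
edges = ["".join(pair) for pair in combinations(PROPERTIES, 2)]

# Each face keeps three properties and gives up the fourth.
faces = ["".join(triple) for triple in combinations(PROPERTIES, 3)]

print(len(edges), edges)  # 6 ['SV', 'SL', 'SC', 'VL', 'VC', 'LC']
print(len(faces), faces)  # 4 ['SVL', 'SVC', 'SLC', 'VLC']
```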
<p>Okay, so far so good. Now I&#8217;ll try to be a devil&#8217;s advocate and challenge my point that any trade-offs are necessary in the first place. So let&#8217;s review the system denoted as</p>
<p><strong>SVLC</strong>=&gt;<em> high-volume, low-latency, deep analytics, low cost</em></p>
<p>Because it is <em>low-latency</em>, it will need I/O throughput adequate to scan the whole dataset quickly, and since it is <em>high-volume</em> (see above for the quantitative definition), meaning that the dataset is big, it will need a large number of individual nodes in the cluster to provide the required aggregate I/O throughput. The number of machines is further increased by the <em>low-cost</em> requirement, meaning that simpler servers in the mainstream sweet spot must be purchased. The system therefore becomes extremely distributed, with data dispersed all over it. Low-cost networking usually means TCP/IP, which is high-overhead, high-latency and low-throughput. <em>Deep Sophistication <span style="font-style: normal;">analytics </span></em>requires performing complex data-intensive operations such as full sorting of big datasets, joining big tables or just a simple select distinct over big data, which will inevitably have long latencies. Once latency is long enough, the probability of a node failing mid-query becomes non-negligible. The latency increase then becomes self-perpetuating, because a finer grain of intermediate result materialization is required to prevent never-ending query restarts and to provide a kind of resumable query. No solutions for resumable queries are documented other than MapReduce-style intermediate result materialization. This ultimately makes latency batch-class long, violating the <em>low-latency</em> requirement.</p>
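<p>Two steps of this argument can be put into numbers with a rough sketch. All concrete values below (1 GB/s scan rate per node, a 10 second latency budget, a 3-year node MTBF) are hypothetical assumptions chosen only for illustration, and node failures are assumed to be independent and exponentially distributed.</p>

```python
import math

def nodes_needed(volume_gb, node_scan_gb_per_s, latency_budget_s):
    """Nodes required to scan the whole dataset within the latency budget."""
    return math.ceil(volume_gb / (node_scan_gb_per_s * latency_budget_s))

def p_node_failure(nodes, query_hours, node_mtbf_hours):
    """Probability that at least one node fails while the query runs."""
    # ASSUMPTION: independent nodes, exponentially distributed failures.
    return 1.0 - math.exp(-nodes * query_hours / node_mtbf_hours)

MTBF_HOURS = 3 * 365 * 24  # ASSUMPTION: one failure per node every 3 years

print(nodes_needed(1_000_000, 1.0, 10.0))  # 1PB at 1 GB/s per node in 10 s
for hours in (0.01, 1.0, 10.0):
    print(hours, round(p_node_failure(1000, hours, MTBF_HOURS), 4))
```

<p>The point of the sketch is the shape, not the exact values: the failure probability grows with both cluster size and query duration, which is what pushes long queries toward intermediate result materialization.</p>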
<p>I guess my proof lacks the rigor required to be considered seriously by academics; I&#8217;m just an engineer <img src='http://bigdatacraft.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  I would love to see it reworked into something more serious though. I just hope to get the point across and to be of value to engineers and practicing architects.</p>
<p>Anyhow, this is the basis of my hypothesis, showing that it is impossible to achieve full <strong>SVLC </strong>using today&#8217;s technology.</p>
<p>Let&#8217;s consider other cases where we give up something. It is easy to visualize such a trade-off as a 2D plane dissecting the tetrahedron. The three points where three edges are cut correspond to three trade-offs. For simplicity, I&#8217;ll elaborate only on radical trade-offs in this post. Radical trade-offs are those where one extreme is selected on all six trade-off edges, which corresponds to putting each trade-off point on one of the vertices. Most real-world systems make temperate trade-offs, corresponding to a plane that dissects the tetrahedron into sizable parts. Moreover, real-world systems, especially those available from commercial vendors, are toolbox-style systems, meaning that the system consists of a set of tools where each one makes a different set of trade-offs. It is then up to the engineer to choose the right tool for the job from the toolbox. However, the toolbox approach is not a loophole in this hypothesis, because the properties of the different tools don’t add up in the desired way. For example, the simultaneous use of an expensive tool and a low-cost tool is still expensive; the simultaneous use of a low-latency tool and a high-latency tool is still high-latency. Nevertheless, the toolbox approach is the best one for real-world problems, because real-world problems are usually decomposed into a number of subproblems, each of which may require a different tool.</p>
<p>Well&#8230; back to the <strong><em>radical systems</em></strong>&#8230; Let&#8217;s consider all four cases where we completely give up one property to max out the remaining three:</p>
<p style="text-align: center;"><a href="http://bigdatacraft.com/wp-content/uploads/2010/10/tetrahedron_SVL.png"><img class="size-full wp-image-184 aligncenter" title="tetrahedron_SVL" src="http://bigdatacraft.com/wp-content/uploads/2010/10/tetrahedron_SVL.png" alt="" width="437" height="292" /></a></p>
<p><strong>SVL</strong> =&gt; <em>high-volume, low-latency, deep analytics</em> …giving up <strong>C</strong>ost&#8230; seems to be implementable. In its pure form it resembles a classic national-security-style system: subsecond querying of a petabyte-scale dataset with arbitrary joins. A heavily over-provisioned Netezza / Exadata / Greenplum / Aster or other MPP system could do it, I believe. Data is kept in huge RAM or on flash, and huge I/O is available to scan the whole dataset in a matter of seconds. High-speed, low-overhead networking is available, with huge bisection bandwidth capable of shuffling the whole dataset in a matter of seconds. Infiniband/RDMA is probably the best. How bad can <strong>C</strong>ost be here? Well… unhealthy to imagine. Throw in some numbers anyway? I will do some back-of-envelope calculations in my future posts.</p>
<p style="text-align: center;"><a href="http://bigdatacraft.com/wp-content/uploads/2010/10/tetrahedron_SVC.png"><img class="size-full wp-image-187 aligncenter" title="tetrahedron_SVC" src="http://bigdatacraft.com/wp-content/uploads/2010/10/tetrahedron_SVC.png" alt="" width="432" height="288" /></a></p>
<p><strong>SVC</strong> =&gt; <em>high-volume, low-cost, deep analytics</em> …giving up <strong>L</strong>atency&#8230; seems to be implementable; in fact it is MapReduce territory, Hadoop&#8217;s natural habitat. Are ETL systems <strong>SVC</strong>? I think not, because while they have given up <strong>L</strong>atency they haven&#8217;t kept up on <strong>V</strong>olume. How bad is <strong>L</strong>atency? Well… forget interactivity, create a queuing system and get notified when the job is done. If it is too slow, add servers. If some interactive experimentation is needed, use <strong>VLC</strong> first to develop and prove your hypothesis and only then crunch the data with <strong>SVC</strong>. Since cost is involved, I guess Hadoop MapReduce is really the king here. Though if Aster licenses, for example, are comparable to the overall cost of a commodity cluster rather than many multiples of it, then Aster could fit the category nicely. Otherwise it will make a great <strong>SV</strong> system that is suboptimal here (in the context of my model, not in a wider sense!). The <a href="http://www.dbms2.com/2008/01/18/the-great-mapreduce-debate/" onclick="pageTracker._trackPageview('/outgoing/www.dbms2.com/2008/01/18/the-great-mapreduce-debate/?referer=');">great MapReduce debate</a> is not for nothing!</p>
<p style="text-align: center;"><a href="http://bigdatacraft.com/wp-content/uploads/2010/10/tetrahedron_SLC.png"><img class="size-full wp-image-186 aligncenter" title="tetrahedron_SLC" src="http://bigdatacraft.com/wp-content/uploads/2010/10/tetrahedron_SLC.png" alt="" width="432" height="288" /></a></p>
<p><strong>SLC</strong> =&gt; <em>low-latency, low-cost, deep analytics</em> …seems to be implementable in a minute, just start your favorite spreadsheet application <img src='http://bigdatacraft.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  You will be shocked how much data Excel crunches in just a few seconds nowadays. Most traditional BI tools are in this category too. Heck, if not for BigData, the analytics industry would become as exciting as enterprise payroll systems. Though innovation is possible even <a href="http://www.workday.com" onclick="pageTracker._trackPageview('/outgoing/www.workday.com?referer=');">there</a>. Heck, 99% of BI can feasibly be done completely in-memory, often on a single server, and the deployment should be low-risk, low-cost and very rapid if done correctly. Most cloud BI vendors are also in this category. “R-project” is here too. This was Kickfire&#8217;s beloved spot, and it is now the spot of QlikTech, GoodData, PivotLink, etc. So pretty much all BI vendors are here except the MPP heavy-lifters. How badly are <strong>V</strong>olumes limited? Well, with CPU-DRAM bandwidth of 50GB/sec and DRAM sizes of 64GB on common commodity servers, I think crunching a few tens of GB should be well possible in a matter of seconds, if not for implementation sloppiness, and with literally pocket money (the average enterprise’s pocket, not mine&#8230; yet).</p>
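<p>As a rough sanity check of the numbers above, here is a minimal back-of-the-envelope sketch in Python. It assumes a purely bandwidth-bound scan at the 50GB/sec figure quoted in the text; the function name <code>scan_seconds</code> is hypothetical, and a real system will be slower because of the implementation overheads already mentioned.</p>

```python
def scan_seconds(dataset_gb, bandwidth_gb_per_sec=50.0):
    """Lower bound on the time to stream a dataset through the CPU once.

    Assumes the scan is limited only by CPU-DRAM bandwidth (an idealization).
    """
    return dataset_gb / bandwidth_gb_per_sec


# Dataset sizes that fit in the 64GB of DRAM mentioned in the text.
for size_gb in (10, 30, 64):
    print(f"{size_gb} GB: about {scan_seconds(size_gb):.2f} s per full scan")
```

<p>Under this assumption, even a query that needs several passes over a few tens of GB stays within seconds.</p>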
<p style="text-align: center;"><a href="http://bigdatacraft.com/wp-content/uploads/2010/10/tetrahedron_VLC.png"><img class="size-full wp-image-185 aligncenter" title="tetrahedron_VLC" src="http://bigdatacraft.com/wp-content/uploads/2010/10/tetrahedron_VLC.png" alt="" width="432" height="288" /></a></p>
<p><strong>VLC</strong> =&gt; <em>high-volume, low-latency, low-cost</em> …giving up <strong>S</strong>ophistication&#8230; seems to be implementable: that is, doing a simple scan and giving up the <strong>S</strong>ophistication, particularly joins. Dremel and BigQuery seem to follow this approach. How bad is giving up <strong>S</strong>ophistication? Well, it all depends on how pre-joined/nested the dataset is. With normalized schemas, well&#8230; the unavailability of joins makes it pretty much impractical to implement any usable analytics. However, with a star schema and particularly with nested data (with some extensive pre-joins, even if it means some redundancy), this can work wonders for the vast majority of queries, completing them in seconds even on large datasets. However, no pre-join strategy will work for 100% of queries, and functions like COUNT DISTINCT must be approximated when run over a big dataset, as described in the Dremel paper. I&#8217;ll also assign sampling strategies to this category, because sophistication also means accuracy here. One clarification: only joins of several big tables are sacrificed here; joins of a big table with even a large number of small tables are perfectly okay and are done on the fly during the scan. Sorting a big table before it has been reduced to a manageable size is also sacrificed in this approach; however, approximation algorithms can be used for this, and then it will be okay too.</p>
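<p>To make the idea of an approximated COUNT DISTINCT concrete, here is a minimal sketch of one generic technique, a k-minimum-values estimator. It is not necessarily the algorithm Dremel uses; the function names and the choice of SHA-1 as the hash are my assumptions for illustration. It needs a single scan and memory proportional to <code>k</code>, not to the number of distinct values.</p>

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a roughly uniform float between 0 and 1 using SHA-1."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64


def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct values in one pass with O(k) memory."""
    heap = []     # negated hashes, so -heap[0] is the largest hash we keep
    kept = set()  # the same hashes, for duplicate detection
    for value in values:
        h = hash01(value)
        if h in kept:
            continue
        if len(heap) == k:
            # Push the new hash and evict whichever hash is now the largest.
            evicted = -heapq.heappushpop(heap, -h)
            kept.add(h)
            kept.discard(evicted)
        else:
            heapq.heappush(heap, -h)
            kept.add(h)
    if len(heap) != k:
        return len(heap)  # fewer than k distinct hashes seen: exact count
    # The k-th smallest hash sits near k / n, so n is roughly (k - 1) / that hash.
    return round((k - 1) / -heap[0])


sample = [i % 50000 for i in range(200000)]
print("estimated distinct values:", approx_count_distinct(sample))
```

<p>The relative error shrinks roughly as one over the square root of <code>k</code>, which is the accuracy-for-resources trade this category is about.</p>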
<p>Hence the conclusion: <strong><em>only 3 of 4</em></strong> <strong>SVLC </strong>properties can be implemented to the full extent in a <strong><em>single </em></strong>analytic system. The hypothesis goes that any attempt that allegedly violates it either is in fact not a single system or has a latent impairment in one or more properties.</p>
<p><em>[TODO: rewrite] The extended hypothesis for fractional cases:</em></p>
<ul>
<li>Systems/trade-offs may be <strong><em>radical </em></strong>or<strong><em> temperate. </em></strong>A radical trade-off completely gives up one of the four properties of the system. A temperate trade-off gives up a property only fractionally, at the expense of also fractionally giving up other properties.</li>
<li>Most real-world systems are complex. They are a set of tools, where each separate tool is a concrete trade-off. The user of such a system can then use different tools with different trade-offs sequentially or simultaneously. This may seem like a way out of the restraint; however, it is not, because the properties of separate tools don’t add up in the desired way. For example, the simultaneous use of an expensive tool and a low-cost tool is still expensive; the simultaneous use of a low-latency tool and a high-latency tool is still high-latency. Nevertheless, the toolbox approach is the best one for real-world problems, because real-world problems are usually decomposed into a number of sub-problems, each of which may require a different tool.</li>
<li>Most often, the trade-offs of real-world systems are <strong><em>temperate.</em></strong></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/135/feed</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/135</feedburner:origLink></item>
		<item>
		<title>Analytics Patterns</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/gOL3ho6KLIM/96</link>
		<comments>http://bigdatacraft.com/archives/96#comments</comments>
		<pubDate>Fri, 08 Oct 2010 16:22:42 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Advanced Analytics]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Analytics Patterns]]></category>
		<category><![CDATA[Monte Carlo]]></category>
		<category><![CDATA[Terminology]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=96</guid>
		<description><![CDATA[<p>Unsatisfied with my previous post&#8217;s definition of Advanced Analytics, and giving some thought to what the advanced methods in analytics are, I realized that the analytics industry lacks a good analytics pattern catalog: a list of common problems followed by a list of common industry-consensus solutions to them. An equivalent of the GoF design patterns for analytics. The <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/96">Analytics Patterns</a></span>]]></description>
				<content:encoded><![CDATA[<p>Unsatisfied with <a href="http://bigdatacraft.com/archives/62">my previous post</a>&#8217;s definition of Advanced Analytics, and giving some thought to what the advanced methods in analytics are, I realized that the analytics industry lacks a good analytics pattern catalog: a list of common problems followed by a list of common industry-consensus solutions to them. An equivalent of the GoF design patterns for analytics. Each item in the list starts with a brief description of a common recurring analytics problem, continues with an elaboration of the commonly accepted solutions to it, and ends with a mandatory example section illustrating the solution using widely available tools.</p>
<p>Software engineers stole this idea from the real architects (those dealing with concrete structures, not abstract ones <img src="http://bigdatacraft.com/wp-includes/images/smilies/icon_wink.gif" alt=";)" /> ) 15 years ago.  They didn’t avoid an initial short period of mass obsession and abuse of the concept&#8230; who does?  But eventually it worked out quite well for us. I wonder if the analytics industry could leverage this experience and create a catalog of some 25-50 of the most common patterns. Pattern descriptions in the catalog should not exceed a few pages, and the number of patterns should be limited to a few tens, making wide industry adoption feasible.</p>
<p>What do you think? Any ideas? I’ll try to make a first step by dumping patterns from my head right now (it is by no means a finished work):</p>
<p>I&#8217;ll call it <strong><em>analytics patterns</em></strong>:</p>
<p><strong><em>1.</em> </strong><em><a href="http://en.wikipedia.org/wiki/Predictive_analytics" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Predictive_analytics?referer=');"><strong>Predictive Analytics</strong></a></em>. That was the easiest one for me. I got involved in it for the first time some 12 years ago, developing what is now <a href="http://www.oracle.com/demantra/index.html" onclick="pageTracker._trackPageview('/outgoing/www.oracle.com/demantra/index.html?referer=');">http://www.oracle.com/demantra/index.html</a>. The system was used mostly to forecast sales, taking into account an array of causal factors like seasonality, marketing campaigns, historical growth rates, etc. The problem is that a lot of time-based historical data is available and future values must be forecast in the context of that data. The basic mechanism for implementing <strong><em>Predictive Analytics</em></strong> is to find or, less preferably, to develop a suitable mathematical model that closely fits the existing data (but be cautious about <a href="http://en.wikipedia.org/wiki/Overfitting" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Overfitting?referer=');">overfitting</a>), usually <a href="http://en.wikipedia.org/wiki/Time-series" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Time-series?referer=');">time-series</a> data, and then use the model to derive forecast values. In simple terms it is a case of <a href="http://en.wikipedia.org/wiki/Extrapolation" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Extrapolation?referer=');">extrapolation</a>. Correct me if I&#8217;m wrong.
As was the case in the 90s, I&#8217;m pretty sure it is still the case now that exotic hardcore AI approaches like neural networks &amp; genetic programming are best kept exclusively for moonlighting experiments and as material for water-cooler conversation the next morning. With defined deadlines and a limited budget it is best to stick to proven techniques to achieve quick wins. I think the value of working forecasting is self-evident.</p>
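<p>The &#8220;fit a model, then extrapolate&#8221; mechanism can be sketched with the simplest possible model, a straight-line trend fitted by ordinary least squares. This is only an illustration under my own assumptions: the function names and the sales figures are made up, and a real forecasting system would also model seasonality and the other causal factors mentioned above.</p>

```python
def fit_linear_trend(series):
    """Ordinary least-squares fit of y = intercept + slope * t for t = 0, 1, 2, ...

    Needs at least two observations.
    """
    n = len(series)
    mean_t = (n - 1) / 2.0
    mean_y = sum(series) / n
    covariance = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    variance = sum((t - mean_t) ** 2 for t in range(n))
    slope = covariance / variance
    intercept = mean_y - slope * mean_t
    return intercept, slope


def forecast(series, periods_ahead):
    """Extrapolate the fitted trend beyond the end of the history."""
    intercept, slope = fit_linear_trend(series)
    n = len(series)
    return [intercept + slope * (n + i) for i in range(periods_ahead)]


# Hypothetical monthly sales figures, made up for illustration only.
monthly_sales = [100, 104, 109, 115, 118, 124]
print(forecast(monthly_sales, 3))
```

<p>With only two parameters the model can hardly overfit, but it will also miss any seasonality; richer models trade one risk for the other.</p>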
<p><strong><em>2.</em> </strong><em><a href="http://en.wikipedia.org/wiki/Cluster_analysis" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Cluster_analysis?referer=');"><strong>Clustering</strong></a></em>. Well, not the heavy noisy kind in a cold hall <img src='http://bigdatacraft.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  but the statistics sub-discipline better called cluster analysis. The problem here is that a lot of high-dimensional data is available and it is required to discover groups of similar observations, in other words to classify them automatically. It is implemented by searching for correlations and grouping the records according to the discovered correlations. What is it good for? Well, in simple terms it helps to discriminate between different kinds of objects and observe the specific properties of each kind. Without such grouping, one would only be able to observe properties that all objects exhibit or, alternatively, go object by object and observe each in isolation.</p>
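<p>As an illustration of automatic grouping, here is a minimal k-means sketch. Note that k-means groups observations by distance rather than by correlation, so it is one common technique swapped in for illustration, not the only way to implement the pattern; the function name, the starting centroids and the sample points are all my assumptions.</p>

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of floats, starting from the given centroids."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Two visibly separate groups of made-up observations.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

<p>Once the groups exist, the properties of each group (here, its centroid) can be observed separately, which is exactly the benefit described above.</p>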
<p><em>3.</em><strong><em> </em></strong><a href="http://en.wikipedia.org/wiki/Monte_Carlo_simulation" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Monte_Carlo_simulation?referer=');"><strong><em>Risk Analysis</em></strong></a><strong><em> </em></strong>- particularly through <em><strong>Monte-Carlo simulation</strong></em>. It is not called Monte-Carlo because it was invented there <img src='http://bigdatacraft.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  it is called so because of its reliance on random numbers, akin to the Monte-Carlo casinos. Random numbers have proved to be the most effective way to simulate a mathematical model with a large number of free variables. With the advent of computers it became a whole lot easier than using the <a href="http://en.wikipedia.org/wiki/Monte_Carlo_simulation" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Monte_Carlo_simulation?referer=');">book</a>.</p>
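<p>A minimal sketch of Monte-Carlo risk analysis: the total cost of a project is modelled as the sum of three uncertain items, and many random scenarios are drawn to estimate percentiles of the total. The cost items, their ranges and the function names are hypothetical, chosen only to show the mechanics.</p>

```python
import random


def simulate_total_cost(rng):
    """One random scenario: total cost as the sum of three uncertain items."""
    # (low, high, most likely) per cost item; hypothetical numbers.
    items = [(10.0, 20.0, 12.0), (5.0, 15.0, 8.0), (20.0, 50.0, 30.0)]
    return sum(rng.triangular(low, high, mode) for low, high, mode in items)


def cost_percentile(percent, trials=50_000, seed=42):
    """Estimate a percentile of the total cost from many random scenarios."""
    rng = random.Random(seed)
    totals = sorted(simulate_total_cost(rng) for _ in range(trials))
    index = min(trials - 1, int(trials * percent / 100.0))
    return totals[index]


print("median total cost:", round(cost_percentile(50), 1))
print("95th percentile:  ", round(cost_percentile(95), 1))
```

<p>The gap between the median and the 95th percentile is the kind of figure a risk analysis is after: how much worse than the typical case a bad scenario is.</p>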
<p><strong>4. </strong>Given a telecom event stream, run the events through a rules engine to detect and prevent telecom fraud in real time. This is essentially a CEP engine, usually implemented by creating a state machine per rule and running the events through it. A special streaming dialect of SQL is used. A similar scheme can be used for real-time click fraud prevention.<br />
<em><strong>5. </strong></em>Given serialized object data or nested data, allow running ad-hoc interactive queries over it in BigQuery fashion.<br />
<em><strong>6. </strong></em>Given a normalized relational model, allow running any ad-hoc queries. For common joins, create a materialized view to speed them up.<br />
<em><strong>7. </strong></em>Canned reports. I guess they are also good for some cases…….<br />
<em><strong>8. </strong></em>OLAP/Star schema: when to use? ……</p>
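<p>The state-machine-per-rule idea from pattern 4 can be sketched as follows. The rule itself (a SIM swap followed by an international call with no identity verification in between), the event names and the function name are all hypothetical; a real CEP engine would express such rules in its own query language and add time windows.</p>

```python
# Transition table for one hypothetical rule:
# a SIM swap followed by an international call, with no verification in between.
TRANSITIONS = {
    ("idle", "sim_swap"): "suspect",
    ("suspect", "international_call"): "alert",
    ("suspect", "identity_verified"): "idle",
}


def detect_fraud(events):
    """Run (subscriber, event_type) pairs through one state machine per subscriber."""
    states = {}
    alerts = []
    for subscriber, event_type in events:
        current = states.get(subscriber, "idle")
        # Events the rule does not care about leave the state unchanged.
        new_state = TRANSITIONS.get((current, event_type), current)
        if new_state == "alert":
            alerts.append(subscriber)  # a real system would block the call here
            new_state = "idle"
        states[subscriber] = new_state
    return alerts


stream = [
    ("alice", "sim_swap"),
    ("bob", "international_call"),
    ("alice", "local_call"),
    ("alice", "international_call"),
]
print(detect_fraud(stream))
```

<p>Each event costs one dictionary lookup per rule, which is what makes this approach suitable for real-time streams.</p>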
<p>What else?</p>
<p>Of course it is just a first step, and doing it correctly will be a project in itself, most probably in the form of a book. However, as the Chinese proverb goes, “A journey of a thousand miles begins with a single step”.</p>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/96/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/96</feedburner:origLink></item>
		<item>
		<title>Feature list of ultimate BigData analytics</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/k-sU84J4fIQ/94</link>
		<comments>http://bigdatacraft.com/archives/94#comments</comments>
		<pubDate>Fri, 08 Oct 2010 14:58:28 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Advanced Analytics]]></category>
		<category><![CDATA[Analytics]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=94</guid>
		<description><![CDATA[ Volume Scalability =&#62; the solution must handle high volumes of data, meaning the cost must scale linearly in the range of 10GB – 10PB. Latency Scalability =&#62; the solution must be interactive or batch, and cost must scale linearly in the range of 1 msec – 1 week. Sophistication Scalability =&#62; the solution <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/94">Feature list of ultimate BigData analytics</a></span>]]></description>
				<content:encoded><![CDATA[<ul>
<li><strong><em>Volume </em></strong>Scalability =&gt; the solution must handle high volumes of data, meaning the cost must scale linearly in the range of 10GB – 10PB.</li>
<li><strong><em>Latency </em></strong>Scalability =&gt; the solution must be interactive or batch, and cost must scale linearly in the range of 1 msec – 1 week.</li>
<li><strong><em>Sophistication </em></strong>Scalability =&gt; the solution must support anything from simple summing scans to complex multi-way joins and statistics functionality, and the cost must scale linearly in the range from simplistic scans to full-blown SQL:2008/MDX/imperative in-database-analytics/MapReduce. Report/index viewing is not considered analytics at all, and in particular not low-sophistication analytics. Report/index creation is analytics and can be of varying degrees of sophistication. ETL systems are considered independent analytic systems.</li>
<li><em>Security</em> =&gt; any unauthorized access to data must be prevented and, at the same time, in-place data analysis (like predicate evaluation) must be possible and resource-efficient.
<ul>
<li>Keeping data always encrypted and keeping the keys always on the client will not work. It would require shipping all the data to the client, which is a non-starter for big data analytics. So compromises must be made. The issue is especially contentious in a public cloud setting.</li>
<li>If data is stored encrypted and is continuously decrypted in place, for predicate evaluation for example, the keys must be kept in the same place (at least temporarily), which compromises the whole scheme and floors its cost-benefit factor. The cost of decryption is pretty high.</li>
<li>De-identification of all fields may work; random scaling may be applied to numeric fields with subsequent query/result rewrite.</li>
<li>Security-by-obscurity methods and a defense-in-depth approach may have good cost-benefit factors, matching or exceeding the overall security of an in-house approach.</li>
</ul>
</li>
<li><strong>Cost </strong>=&gt; must have a low TCO that scales linearly with dataset size and with the load caused by submitted queries. The breakdown (assuming cloud):
<ul>
<li>Storage component: linear in dataset size. Economies of scale must bring this cost down significantly. Eventually it must be cheaper than on-site storage.</li>
<li>Computing component: linear in load, with infinite intra-query automatic elasticity. Guaranteed elasticity may bear a fixed premium proportional to the guaranteed capacity. Minor failures of a cloud component must not restart long-running queries.</li>
<li>Bandwidth component: FedExing hard drives is by far the cheapest way to upload data, and query results are really small. How much information can a human comprehend instantly, after all?</li>
</ul>
</li>
<li><strong>Multi-form</strong> =&gt;
<ul>
<li>normalized relational</li>
<li>star-schema</li>
<li>cubes</li>
<li>serialized objects / nested data.</li>
<li>text</li>
<li>media</li>
<li>spatial</li>
<li>bio / scientific</li>
<li>topographical</li>
<li>and other data forms must be equally well supported and cross-queried.</li>
</ul>
</li>
</ul>
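<p>The random-scaling idea from the Security item above can be sketched like this: numeric fields are multiplied by a secret factor before upload, the query runs on the scaled values, and the client rewrites the result. The function names and the sample amounts are hypothetical, and this is obscurity, not encryption, since ratios between values remain visible.</p>

```python
import random


def deidentify(amounts, secret_scale):
    """Multiply every value by a secret positive factor before uploading it."""
    return [a * secret_scale for a in amounts]


def rewrite_result(scaled_result, secret_scale):
    """Map a SUM/AVG/MIN/MAX computed on scaled data back to the true value."""
    return scaled_result / secret_scale


# The factor stays on the client; the cloud only ever sees scaled numbers.
secret_scale = random.Random(7).uniform(2.0, 10.0)
amounts = [120.0, 75.5, 310.25]
uploaded = deidentify(amounts, secret_scale)

cloud_sum = sum(uploaded)  # stands in for the query executed in the cloud
print(rewrite_result(cloud_sum, secret_scale))
```

<p>Constants in query predicates would have to be scaled by the same factor before the query is submitted, which is the &#8220;query rewrite&#8221; half of the scheme.</p>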
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/94/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/94</feedburner:origLink></item>
		<item>
		<title>Terminology: Analysis vs. analytics and more…</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/oT9D3M8MjYo/62</link>
		<comments>http://bigdatacraft.com/archives/62#comments</comments>
		<pubDate>Fri, 08 Oct 2010 00:10:39 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[BI]]></category>
		<category><![CDATA[QlikView]]></category>
		<category><![CDATA[Terminology]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=62</guid>
		<description><![CDATA[<p>I see a lot of confusion in the usage of newer terms in analytics. I do confuse them myself occasionally. I find it funny that an industry as serious as analytics tolerates constant renewal of its basic terminology. Yet, I confess, I&#8217;m very guilty of it myself. I do enjoy the freshness and the novelty <span style="color:#777"> . . . &#8594; Read More: <a href="http://bigdatacraft.com/archives/62">Terminology: Analysis vs. analytics and more&#8230;</a></span>]]></description>
				<content:encoded><![CDATA[<p>I see a lot of confusion in the usage of newer terms in analytics. I do confuse them myself occasionally. I find it funny that an industry as serious as analytics tolerates constant renewal of its basic terminology. Yet, I confess, I&#8217;m very guilty of it myself. I do enjoy the freshness and the novelty of newer terms, even while being fully aware that it is fake to a large extent.</p>
<p>In this post I&#8217;ll take a step towards clearing up the confusion around a few of the most basic terms: analysis vs. analytics vs. BI and all their common derivatives.</p>
<p><em><strong>The Spoiler (the quick answer):</strong></em></p>
<p>Analysis is the <em>examination process</em> itself, whereas analytics is the supporting <em>technology and associated tools</em>. BI is quite synonymous with analytics in the IT context. Advanced Analytics, Business Analytics, Data Analytics, Analytics Software, Analytics Technology are almost always marketing pleonasms (redundant expressions) and can be safely substituted by just &#8216;analytics&#8217;. &#8216;Data analysis&#8217; is yet another pleonasm. Compound expressions of these words, such as &#8216;BI Analytic Technology&#8217;, are yet again pleonasms, albeit of higher degree. Some nuances exist, though, and are elaborated in this post.</p>
<p><em><strong>The deep dive for the brave souls:</strong></em></p>
<p>Let’s attempt to properly define the terms and then carefully examine the alleged differences.</p>
<p>Before we dive in, a word of caution: definition by synonyms is wrong. It causes a stack overflow in programmers&#8217; minds. For example “analysis” =&gt; “critical examination” =&gt; “examination” =&gt; “critical inspection” =&gt; “inspection” =&gt; “critical examination” =&gt; “f&#8230;” =&gt; “why don’t I just make myself a cup of coffee?”.</p>
<p>You can check what makes a good definition and the common mistakes in the following&#8230; Well, apparently a quick look did not turn up good material on proper definitions, but for fallacies there is <a href="http://en.wikipedia.org/wiki/Fallacies_of_definition" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Fallacies_of_definition?referer=');">a nice Wikipedia article</a>. If you find a good article on what makes a good definition, drop me a note or a comment; ideally it would include a definition of definition.</p>
<p>Let&#8217;s start&#8230;.</p>
<p><em><strong>What is analysis?</strong></em></p>
<p>Analysis is a pretty old, well-understood term and essentially means “<strong><em>breaking down</em></strong>” or “<strong><em>decomposition</em></strong>”. More accurately: “<strong><em>the process of decomposing a complex entity into simpler components for easier comprehension of that entity</em></strong>”. As a child I did a lot of it to the toys and electronic appliances around me. I challenge you to find a better and more concise definition than mine above (it is a matter of taste, but anyway). Here are some links to save you time:</p>
<p><a href="http://www.google.com/search?q=define:+analysis" target="_blank" onclick="pageTracker._trackPageview('/outgoing/www.google.com/search?q=define_+analysis&amp;referer=');">http://www.google.com/search?q=define:+analysis</a></p>
<p><a href="http://en.wikipedia.org/wiki/Analysis" target="_blank" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Analysis?referer=');">http://en.wikipedia.org/wiki/Analysis</a></p>
<p><a href="http://en.wiktionary.org/wiki/analysis" onclick="pageTracker._trackPageview('/outgoing/en.wiktionary.org/wiki/analysis?referer=');">http://en.wiktionary.org/wiki/analysis</a></p>
<p><a href="http://thesaurus.com/browse/analysis" onclick="pageTracker._trackPageview('/outgoing/thesaurus.com/browse/analysis?referer=');">http://thesaurus.com/browse/analysis</a></p>
<p><strong><em>What is analytics?</em></strong></p>
<p>Analytics is a newer term related to analysis, and looking it up will usually only add to the confusion, since definitions vary, are fuzzy and seem to be context-dependent. Focusing on the IT context, I went through many usage examples and definitions. My verdict is that analytics just means: <strong>the technology and the associated tools for data analysis</strong>.</p>
<p>If so, then &#8216;data analytics technology&#8217; is a doubly redundant (or, more accurately, pleonastic) term, because analytics is a technology by itself and it is obvious that in the IT context only data can be analyzed. Hence the above phrase can be abbreviated to &#8216;analytics&#8217; without any impairment to the meaning. The same goes for &#8216;data analytics tools&#8217;. However, when the IT context is not implied, something like &#8216;data analytics software&#8217; could be appropriate. In this case &#8216;data&#8217; links it to IT and &#8216;software&#8217; further narrows its meaning.</p>
<p><strong><em>Incorrect usage (according to my interpretation):</em></strong></p>
<p>A software company most probably doesn&#8217;t develop “next-gen data-analysis” but “next-gen data-analytics”. By the same token, “cloud computing analysis” means examining the cloud computing concept, not using cloud computing as a tool for doing analysis. In the latter case “cloud analytics” must be used.</p>
<p>An analyst performs in-database analysis or applies in-database analytics to calculate something. However, an analyst doesn&#8217;t perform in-database analytics.</p>
<p>If you look at the terms used by the QlikView folks you will find pretty much all the above terms used interchangeably, including the statement that they &#8220;provide fast, powerful and visual in-memory business analysis”. One may think that they provide business advice for companies in the memory business. Terminology aside, no bashing of QlikView: it is excellent analytics software and one of the very few that just work out of the box.</p>
<p><em><strong>What is analytical?</strong></em></p>
<p>With regard to data, it means that the data was compiled using analysis. With regard to a tool, it means that the tool is intended for analysis.</p>
<p><em><strong>Data Analysis and Data Analytics</strong></em></p>
<p>As already mentioned, in the IT context both are pleonasms, and non-data analysis and non-data analytics are both oxymorons. So why stress data anyway? Mostly there is no reason, and in other cases it is there to hint at the IT context. For example, for bankers it is &#8216;financial analytics&#8217; but for the IT folks in the bank it is &#8216;data analytics&#8217;.</p>
<p><strong><em>What does &#8216;advanced analytics&#8217; hint at, then?</em></strong></p>
<p>Well, I guess it is a way for a vendor to indicate that its analytics is less stagnant than that of its competitors <img src='http://bigdatacraft.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Seriously though, I guess it means, where it is really used to mean anything, that statistical methods such as predictive modeling and clustering are implemented. It also has strong connotations with the <a href="http://www.gartner.com/it/page.jsp?id=1210613" onclick="pageTracker._trackPageview('/outgoing/www.gartner.com/it/page.jsp?id=1210613&amp;referer=');">Gartner press-release</a> naming it the second most promising technology for 2010.</p>
<p><strong><em>What is wrong with just sticking with the older BI term?</em></strong></p>
<p>It is a fashion thing, I guess… who said IT is boring? We could easily challenge the Parisian fashion industry on that. Seriously though, BI is considered a more comprehensive approach, encompassing many aspects and usually cross-departmental, and it is notorious for a high project failure rate. At least that is how younger startups portray it. On the other hand, &#8216;data analytics&#8217; is portrayed as something simpler, more of a &#8216;quick wins&#8217; departmental solution. Something akin to a &#8216;data mart&#8217;. And don&#8217;t ask me what the difference from data marts is. Have I mentioned the fashion thing?</p>
<p>Well, aside from fashion, there are more rational reasons too, of course. A startup pitching BI sounds boring at best, with Microsoft, IBM and Oracle dominating it. It must define a new disruptive category and then dominate it. Those who have read Christensen may remember that no new terms are necessary for disruption. Somehow it is easier to communicate using new terms. I would love to believe that it is not deceiving. In fact, masquerading advanced analytics as something completely distinct may work all the way from investors to the customer&#8217;s CIO, who may find it suspicious that he is purchasing too many BI solutions, whereas purchasing a first “advanced analytics” solution, and early enough, may seem quite smart and a sign that his organization is far from stagnation, especially just after reading the Gartner press-release.</p>
<p><em><strong>UPDATE:</strong></em></p>
<p>Another view on the subject: <a href="http://www.b-eye-network.com/view/13797" onclick="pageTracker._trackPageview('/outgoing/www.b-eye-network.com/view/13797?referer=');">http://www.b-eye-network.com/view/13797</a></p>
<p>Yet another one: <a href="http://blogs.forrester.com/boris_evelson/10-06-07-bi_vs_analytics" onclick="pageTracker._trackPageview('/outgoing/blogs.forrester.com/boris_evelson/10-06-07-bi_vs_analytics?referer=');">http://blogs.forrester.com/boris_evelson/10-06-07-bi_vs_analytics</a></p>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/62/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/62</feedburner:origLink></item>
		<item>
		<title>The story behind this blog</title>
		<link>http://feedproxy.google.com/~r/InsightCrew/~3/t7LT-kAbcpE/113</link>
		<comments>http://bigdatacraft.com/archives/113#comments</comments>
		<pubDate>Fri, 01 Oct 2010 08:47:24 +0000</pubDate>
		<dc:creator>Camuel Gilyadov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://bigdatacraft.com/?p=113</guid>
		<description><![CDATA[<p>The story behind this blog</p> ]]></description>
				<content:encoded><![CDATA[<p><span id="more-113"></span><a href="http://bigdatacraft.com/story" target="_self">The story behind this blog</a></p>
]]></content:encoded>
			<wfw:commentRss>http://bigdatacraft.com/archives/113/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://bigdatacraft.com/archives/113</feedburner:origLink></item>
	</channel>
</rss>
