<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><description>NoSQL Databases and Polyglot Persistence: A Curated Guide</description><title>myNoSQL</title><generator>Tumblr (3.0; @nosql)</generator><link>https://nosql.mypopescu.com/</link><item><title>Autoscaling, welcome to Google Compute Engine</title><description>&lt;a href="http://googlecloudplatform.blogspot.com/2014/11/autoscaling-welcome-to-google-compute.html"&gt;Autoscaling, welcome to Google Compute Engine&lt;/a&gt;: &lt;blockquote&gt;
&lt;p&gt;Autoscaling allows customers to build more cost effective and resilient
applications. Using Compute Engine Autoscaling, you can ensure that exactly
the right number of Compute Engine instances are available at any given time
to handle your application’s workload. This saves you money when your
application’s usage is low, and ensures your application is responsive when
utilization is high.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;Autoscaling&lt;/strong&gt; is the Holy grail of a distributed system. The promise is that the system will be able to adapt, both up and down, to its needs, requirements, and SLAs. In other words, the system delivers the performance demanded of it and maximum availability, at optimal cost.&lt;/p&gt;
&lt;p&gt;The first step in finding this &lt;em&gt;Holy grail&lt;/em&gt; is to be able to describe the needs and requirements and SLAs of the system.&lt;/p&gt;
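&lt;p&gt;To make "describing the needs" concrete, here is a minimal Python sketch of a target-utilization scaling rule. It is not how Compute Engine Autoscaling is implemented; the function name, the percentage inputs, and the instance bounds are all assumptions made for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
def desired_instances(current, observed_pct, target_pct, min_n=1, max_n=10):
    """Return the instance count that would bring utilization near the target.

    current      -- number of instances running now
    observed_pct -- average utilization across them, as an integer percentage
    target_pct   -- the utilization the operator wants, as an integer percentage
    """
    # Integer ceiling division: total load divided by the load one
    # instance should carry at the target utilization.
    wanted = (current * observed_pct + target_pct - 1) // target_pct
    # Clamp to the configured bounds so the group never scales to zero
    # or grows without limit.
    return max(min_n, min(max_n, wanted))
```
&lt;/pre&gt;
&lt;p&gt;With 4 instances at 90% utilization and a 60% target, the rule asks for 6 instances. The target and the bounds are exactly the kind of requirements that have to be written down before any autoscaler can act on them.&lt;/p&gt;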


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/" rel="permalink" style="color:red" target="_blank"&gt;Autoscaling, welcome to Google Compute Engine&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;


&lt;!--quid:7ec3849189058165104d3d6fa708d17dc84c5759--&gt;</description><link>https://nosql.mypopescu.com/post/103465069225</link><guid>https://nosql.mypopescu.com/post/103465069225</guid><pubDate>Mon, 24 Nov 2014 07:41:21 -0800</pubDate><category>distributed systems</category><category>scalability</category></item><item><title>Aurora for MySQL is coming</title><description>&lt;a href="http://smalldatum.blogspot.com/2014/11/aurora-for-mysql-is-coming.html"&gt;Aurora for MySQL is coming&lt;/a&gt;: &lt;p&gt;Mark Callaghan takes a look at: &lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Amazon’s participation in the MySQL community — none&lt;/li&gt;
&lt;li&gt;some of the things said during the presentations: performance seems to be inflated&lt;/li&gt;
&lt;li&gt;compatibility with existing MySQL features and especially the InnoDB engine&lt;/li&gt;
&lt;li&gt;features — very similar to my &lt;a href="http://nosql.mypopescu.com/post/102599302892/amazon-aurora-in-bullet-points" target="_blank"&gt;Amazon Aurora in bullet points&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;What is Aurora? I don’t know and we might never find out. I assume it is a
completely new storage engine rather than a new IO layer under InnoDB.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;!--quid:c0448d27ce09952ddac25a72fa0742f1c026fbbc--&gt;</description><link>https://nosql.mypopescu.com/post/103454088542</link><guid>https://nosql.mypopescu.com/post/103454088542</guid><pubDate>Mon, 24 Nov 2014 03:14:13 -0800</pubDate><category>Aurora</category><category>Amazon</category></item><item><title>Medium uses Neo4j and Go for GoSocial service</title><description>&lt;a href="https://medium.com/medium-eng/how-medium-goes-social-b7dbefa6d413"&gt;Medium uses Neo4j and Go for GoSocial service&lt;/a&gt;: &lt;p&gt;Medium’s social graph stored in Neo4j and exposed through a Go service:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It makes a lot of sense to store social data in a graph database. Medium
users, posts and collections are represented by graph nodes, and the edges
between them describe relationships — users following users, users
recommending posts, or users editing collections, to name a few common
examples. Using a graph database also makes our queries simpler: we don’t
have to do any complicated joins or other query wizardry.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It’s hard to deny that when looking at highly connected data the first answer is &lt;em&gt;almost&lt;/em&gt; always a graph database. Once the amount of stored data grows, you start thinking about how you access that data. In many cases, the predominant access pattern is not traversals.&lt;/p&gt;
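&lt;p&gt;As a rough illustration of the model described in the quote, the Python sketch below keeps typed edges in memory and answers one two-hop question. It is not Neo4j and not Medium's GoSocial code; the class, the relationship names, and the query are assumptions made for the example.&lt;/p&gt;
&lt;pre&gt;
```python
from collections import defaultdict


class SocialGraph:
    """Tiny in-memory property graph: nodes are strings, edges are typed."""

    def __init__(self):
        # (source node, relationship type) -> set of target nodes
        self._out = defaultdict(set)

    def add_edge(self, source, relation, target):
        self._out[(source, relation)].add(target)

    def neighbors(self, node, relation):
        # Single-hop lookup; returns a copy so callers cannot mutate the graph.
        return set(self._out.get((node, relation), ()))

    def recommended_by_followed(self, user):
        """Two-hop traversal: posts recommended by the users `user` follows."""
        posts = set()
        for followed in self.neighbors(user, "FOLLOWS"):
            posts |= self.neighbors(followed, "RECOMMENDS")
        return posts
```
&lt;/pre&gt;
&lt;p&gt;The single-hop &lt;code&gt;neighbors&lt;/code&gt; lookup is the kind of access that any keyed store can serve; only the two-hop query actually needs a traversal.&lt;/p&gt;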




&lt;!--quid:5c1359f638cb3e50aa6f5e37c3fa72f960921ce4--&gt;</description><link>https://nosql.mypopescu.com/post/103451536694</link><guid>https://nosql.mypopescu.com/post/103451536694</guid><pubDate>Mon, 24 Nov 2014 01:44:29 -0800</pubDate><category>Neo4j</category><category>Go</category><category>graphdb</category><category>graph database</category></item><item><title>Stripe's Hadoop tools open sourced</title><description>&lt;a href="https://stripe.com/blog/four-new-hadoop-projects"&gt;Stripe's Hadoop tools open sourced&lt;/a&gt;: &lt;p&gt;Stripe has put on &lt;a href="http://github.com/stripe" rel="external nofollow" target="_blank"&gt;GitHub&lt;/a&gt; 4 Hadoop related projects they’ve developed internally:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;a dashboard for Hadoop jobs&lt;/li&gt;
&lt;li&gt;a Scala framework for distributed learning&lt;/li&gt;
&lt;li&gt;a database for serving data in SequenceFile format&lt;/li&gt;
&lt;li&gt;a collection of command-line utilities.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As a side note, Stripe is using Cloudera Impala with Parquet.&lt;/p&gt;


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/post/103272099222" rel="permalink" style="color:red" target="_blank"&gt;Stripe’s Hadoop tools open sourced&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;</description><link>https://nosql.mypopescu.com/post/103272099222</link><guid>https://nosql.mypopescu.com/post/103272099222</guid><pubDate>Sat, 22 Nov 2014 02:48:00 -0800</pubDate><category>Hadoop</category><category>Impala</category><category>Parquet</category><category>MapReduce</category><category>BigData</category></item><item><title>NoSQL databases, Hadoop, Big Data: Pinned tabs Nov.19th</title><description>&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.b-eye-network.com/blogs/vanderlans/archives/2014/10/querygrid_is_ne.php" title="QueryGrid is New Data Federation Technology by Teradata - Blog: Rick van der Lans - BeyeNETWORK " id="86f5dfdb0aa3ac6aad752e6035044289df0e9dff" rel="external nofollow" target="_blank"&gt;01&lt;/a&gt;&lt;/strong&gt;: Teradata QueryGrid is the technology used to allow querying both Teradata/AsterData and external data stored in Hadoop or Oracle.
&lt;a href="#86f5dfdb0aa3ac6aad752e6035044289df0e9dff" class="ptl" target="_blank"&gt;★&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.marklogic.com/press-releases/marklogic-sets-standard-for-modern-database/" title="MarkLogic Sets the Standard for Modern Database Technology | MarkLogic " id="cf879df76d55dfe5e1917b6d4704d6a2095ca83c" rel="external nofollow" target="_blank"&gt;02&lt;/a&gt;&lt;/strong&gt;: 
MarkLogic 8 will bring a server-side JavaScript engine, an RDF triple store with support for SPARQL 1.1, and bitemporal data management.
&lt;a href="#cf879df76d55dfe5e1917b6d4704d6a2095ca83c" class="ptl" target="_blank"&gt;★&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;I still believe that MarkLogic should position itself as a real-time search solution.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.datastax.com/dev/blog/whats-coming-to-cassandra-in-3-0-improved-hint-storage-and-delivery" title="What’s Coming to Cassandra in 3.0: Improved Hint Storage and Delivery : DataStax " id="d6aad96e77ebb5ba32a6a9422f13c283a9bb46d2" rel="external nofollow" target="_blank"&gt;03&lt;/a&gt;&lt;/strong&gt;: 
For Cassandra 3.0, there’s a completely revamped, and optimized, solution for handling &lt;strong&gt;hinted handoff&lt;/strong&gt; that uses a commit-log-like structure instead of a Cassandra system table (thus avoiding the associated overhead).
&lt;a href="#d6aad96e77ebb5ba32a6a9422f13c283a9bb46d2" class="ptl" target="_blank"&gt;★&lt;/a&gt;&lt;/p&gt;
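&lt;p&gt;For readers unfamiliar with hinted handoff, the Python sketch below shows the general idea with an append-only hint log per unreachable replica. It is a toy model under stated assumptions, not Cassandra's implementation: the &lt;code&gt;Coordinator&lt;/code&gt; class, its fields, and the in-memory lists standing in for log files are all hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
from collections import defaultdict


class Coordinator:
    """Toy write coordinator that keeps hints for unreachable replicas."""

    def __init__(self, replicas):
        self.replicas = replicas           # replica name -> dict acting as its store
        self.down = set()                  # replicas currently unreachable
        self.hint_log = defaultdict(list)  # replica name -> append-only list of hints

    def write(self, key, value):
        for name, store in self.replicas.items():
            if name in self.down:
                # Append the missed write to the log; nothing is read or
                # rewritten, which is what makes a log cheaper than a table.
                self.hint_log[name].append((key, value))
            else:
                store[key] = value

    def replay_hints(self, name):
        """Mark the replica reachable and deliver its hints in write order."""
        self.down.discard(name)
        delivered = 0
        for key, value in self.hint_log.pop(name, []):
            self.replicas[name][key] = value
            delivered += 1
        return delivered
```
&lt;/pre&gt;
&lt;p&gt;Replaying the hints in append order leaves the recovered replica with the same final value per key as the replicas that stayed up.&lt;/p&gt;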
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.pcworld.idg.com.au/article/559848/hp-plugs-vertica-analytics-engine-into-hadoop/" title="HP plugs the Vertica analytics engine into Hadoop - PC World Australia " id="3e3b8702a29791b3981500908808791e495000d9" rel="external nofollow" target="_blank"&gt;04&lt;/a&gt;&lt;/strong&gt;: 
YASH. Yet another SQL-on-Hadoop. This one from HP Vertica.
&lt;a href="#3e3b8702a29791b3981500908808791e495000d9" class="ptl" target="_blank"&gt;★&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="http://www.cmswire.com/cms/big-data/mapr-teradata-ink-deal-bad-timing-for-hortonworks-027253.php" title="MapR, Teradata Ink Deal, Bad Timing for Hortonworks? " id="859c0a35eae584dcdd22d0696ae1ef7bf3c9eda5" rel="external nofollow" target="_blank"&gt;05&lt;/a&gt;&lt;/strong&gt;: 
Teradata and MapR are signing a partnership to collaborate on the integration and co-development of joint products. Some might say this could affect &lt;a href="http://nosql.mypopescu.com/post/103036005105/it-aint-easy-making-money-in-open-source-thoughts-on" target="_blank"&gt;the Hortonworks IPO&lt;/a&gt;.
&lt;a href="#859c0a35eae584dcdd22d0696ae1ef7bf3c9eda5" class="ptl" target="_blank"&gt;★&lt;/a&gt;&lt;/p&gt;


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/post/103110030252" rel="permalink" style="color:red" target="_blank"&gt;NoSQL databases, Hadoop, Big Data: Pinned tabs Nov.19th&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;</description><link>https://nosql.mypopescu.com/post/103110030252</link><guid>https://nosql.mypopescu.com/post/103110030252</guid><pubDate>Thu, 20 Nov 2014 12:41:27 -0800</pubDate><category>Teradata</category><category>MarkLogic</category><category>document database</category></item><item><title>The states and transitions of a Couchbase node</title><description>&lt;p&gt;The different states and the transitions of a Couchbase node in a diagram:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Couchbase node states and transitions" src="https://64.media.tumblr.com/f8e32e25678455fb3300db9c8f682ecb/tumblr_nf9vqrpZjj1qavt6co1_1280.jpg" width="580" height="326"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blog.couchbase.com/lifecycle-node-couchbase-server-demystified-adding-removing-nodes-rebalancing-failover" rel="external nofollow" target="_blank"&gt;This post&lt;/a&gt; describes the states and actions that can trigger the transitions. One interesting aspect is that state changes are not applied immediately and you can &lt;em&gt;commit&lt;/em&gt; multiple such changes at once when satisfied with the new topology.&lt;/p&gt;




&lt;!--quid:c81f93728a8fb611621b0dcfbcca7c0ee88a4b43--&gt;</description><link>https://nosql.mypopescu.com/post/103125141850</link><guid>https://nosql.mypopescu.com/post/103125141850</guid><pubDate>Thu, 20 Nov 2014 07:29:34 -0800</pubDate><category>Couchbase</category><category>key-value store</category><category>document database</category></item><item><title>Can MapReduce Solve Planning Problems?</title><description>&lt;a href="https://www.voxxed.com/blog/2014/11/can-mapreduce-solve-planning-problems-3/"&gt;Can MapReduce Solve Planning Problems?&lt;/a&gt;: &lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Betteridge's_law_of_headlines" rel="external nofollow" target="_blank"&gt;Betteridge’s law of headlines&lt;/a&gt;.&lt;/p&gt;




&lt;!--quid:d2e0f3b8589c10c00ce98d0d4cf0d7c177f43da9--&gt;</description><link>https://nosql.mypopescu.com/post/103116398183</link><guid>https://nosql.mypopescu.com/post/103116398183</guid><pubDate>Thu, 20 Nov 2014 04:01:45 -0800</pubDate><category>MapReduce</category></item><item><title>It Ain’t Easy Making Money in Open Source:  Thoughts on the Hortonworks's IPO Filling</title><description>&lt;a href="http://www.enterpriseirregulars.com/80464/aint-easy-making-money-open-source-thoughts-hortonworks-s-1/"&gt;It Ain’t Easy Making Money in Open Source:  Thoughts on the Hortonworks's IPO Filling&lt;/a&gt;: &lt;p&gt;Dave Kellogg’s in-depth look at Hortonworks’s IPO filing, with a comparison to Red Hat’s model and a definitely interesting hypothesis and conclusion:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;While Hadoop and big data are unarguably huge trends driving the industry
and while the future of Hadoop looks very bright indeed, on reading the
Hortonworks S-1, the reader is drawn to the inexorable conclusion that  it’s
hard to make money in open source, or more crassly, it’s hard to make money
when you give the shit away.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Others:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://nosql.mypopescu.com/post/102949496417/hortonworks-ipo-why-now-or-better-who-will-benefit" target="_blank"&gt;Gartner’s Merv Adrian&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://nosql.mypopescu.com/post/102949958827/hortonworks-filling-for-ipo-the-marketing-of-going" target="_blank"&gt;InfoWorld’s Yves de Montcheuil&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://nosql.mypopescu.com/post/102342581262/game-on-hortonworks-files-for-ipo" target="_blank"&gt;myself&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;!--quid:8566fd5d9b2a4feac20e405ed99c08e1629a62a2--&gt;</description><link>https://nosql.mypopescu.com/post/103036005105</link><guid>https://nosql.mypopescu.com/post/103036005105</guid><pubDate>Wed, 19 Nov 2014 04:25:17 -0800</pubDate><category>Hortonworks</category><category>Hadoop market</category></item><item><title>CouchDB's long road to clustering</title><description>&lt;a href="http://www.infoworld.com/article/2848127/nosql/couchdb-20-counters-mongodb-with-improved-scaling.html"&gt;CouchDB's long road to clustering&lt;/a&gt;: &lt;p&gt;Keyword is &lt;em&gt;partially&lt;/em&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;CouchDB’s long road to clustering can be partially traced to conscious
design decisions and philosophical choices made by CouchDB’s creators. As
Lehnardt explained, “CouchDB has always said no to features that we know
couldn’t be scalable in a cluster or even doable in a cluster. This puts us
in a position to migrate upward seamlessly.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Had this happened two years ago, CouchDB would actually have gotten somewhere.&lt;/p&gt;




&lt;!--quid:1c5b3375f7fbb1aa56528fade1b3f18bbb504c3a--&gt;</description><link>https://nosql.mypopescu.com/post/103034903968</link><guid>https://nosql.mypopescu.com/post/103034903968</guid><pubDate>Wed, 19 Nov 2014 03:52:23 -0800</pubDate><category>CouchDB</category><category>Cloudant</category><category>document database</category></item><item><title>Apache CouchDB 2.0 gets clustering support</title><description>&lt;blockquote&gt;
&lt;p&gt;At ApacheCon Europe 2014, the Apache CouchDB™ project today announced a
Developer Preview release of its CouchDB 2.0 document database. The
Developer Preview release brings all-new clustering technology to the Open
Source NoSQL database, enabling a range of big data capabilities that
include being able to store, replicate, sync, and process large amounts of
data distributed across individual servers, data centers, and geographical
regions in any deployment configuration, including private, hybrid, and
multi-cloud.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’m not sure who wrote &lt;a href="https://blogs.apache.org/foundation/entry/apache_couchdb_adds_clustering_and" rel="external nofollow" target="_blank"&gt;the ASF PR announcement&lt;/a&gt;, but if it were me I would have simply posted “Apache CouchDB 2.0 features clustering support. Finally. &amp;lt;/eom&amp;gt;”&lt;/p&gt;




&lt;!--quid:2caf77fe91db6194f2be893cf3a925e50fdb7b3d--&gt;</description><link>https://nosql.mypopescu.com/post/103034595127</link><guid>https://nosql.mypopescu.com/post/103034595127</guid><pubDate>Wed, 19 Nov 2014 03:43:06 -0800</pubDate><category>CouchDB</category><category>document database</category></item><item><title>The data flow and the massive historical Tweet index</title><description>&lt;a href="https://blog.twitter.com/2014/building-a-complete-tweet-index"&gt;The data flow and the massive historical Tweet index&lt;/a&gt;: &lt;p&gt;We rarely have the opportunity to learn about the &lt;em&gt;almost&lt;/em&gt; complete architecture and data flow for a massive data indexing solution. Twitter’s blog post covers much of their indexing solution, starting with the design goals and getting down to the technical details.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;But our long-standing goal has been to let people search through every Tweet
ever published.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;half a trillion documents&lt;/li&gt;
&lt;li&gt;average latency under 100ms&lt;/li&gt;
&lt;li&gt;(super tuned) SSD used as storage&lt;/li&gt;
&lt;li&gt;4 components: batch data aggregation and preprocess pipeline, inverted index builder, Earlybird shards and roots; &lt;em&gt;what are the Earlybird roots?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;ingestion processes tweets in one-day batches and runs every day; in this process tweets are scored and partitioned&lt;/li&gt;
&lt;li&gt;Hadoop for ETL: ingestion process is run on Hadoop, with the output being stored in HDFS&lt;/li&gt;
&lt;li&gt;Mesos is used to parallelize the inverted index creation; results are stored in HDFS&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;after praising the high parallelism and statelessness of the index builders, some coordination using ZooKeeper is mentioned:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;These inverted index builders can coordinate with each
other by placing locks on ZooKeeper, which ensures that
two builders don’t build the same segment. Using this
approach, we rebuilt inverted indices for nearly half a
trillion Tweets in only about two days (fun fact: our
bottleneck is actually the Hadoop namenode).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the Earlybird shards are the storage of the inverted index partitioned by time and then hash; partitioning by time tiers will allow growing the storage without affecting the current time tiers&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;the Earlybird roots are the endpoint for the client API; they forward requests to the corresponding Earlybird shards, merge results, etc;&lt;/li&gt;
&lt;li&gt;not very sure how Earlybird roots decide what time tiers should not receive a query&lt;/li&gt;
&lt;li&gt;no word on the actual Earlybird storage; can it be &lt;a href="https://blog.twitter.com/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale" rel="external nofollow" target="_blank"&gt;Manhattan&lt;/a&gt;?&lt;/li&gt;
&lt;li&gt;no details about the query processor&lt;/li&gt;
&lt;li&gt;this project started in 2012; the full index was completely built in 2014&lt;/li&gt;
&lt;/ul&gt;
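&lt;p&gt;The notes on shards and roots can be sketched in a few lines of Python. This is only an illustration of "partition by time, then by hash" and of a root merging per-shard results; the tier boundaries, the partition counts, the use of CRC32 as the hash, and both function names are assumptions, since the Twitter post gives none of these specifics.&lt;/p&gt;
&lt;pre&gt;
```python
import heapq
import zlib
from datetime import date

# Hypothetical tier layout: (first day, last day, number of hash partitions).
# Adding a new tier for newer tweets leaves the existing tiers untouched.
TIERS = [
    (date(2006, 3, 21), date(2010, 12, 31), 2),
    (date(2011, 1, 1), date(2014, 12, 31), 4),
]


def shard_for(tweet_id, day):
    """Pick the time tier first, then the hash partition inside that tier."""
    for tier_index, (start, end, partitions) in enumerate(TIERS):
        if day >= start and end >= day:
            # CRC32 is a stand-in hash chosen only because it is deterministic.
            return tier_index, zlib.crc32(str(tweet_id).encode()) % partitions
    raise ValueError("no tier covers this day")


def merge_results(per_shard_hits, limit):
    """Root-style merge of per-shard hit lists.

    Each shard returns (score, tweet_id) pairs sorted by score, highest first.
    """
    merged = heapq.merge(*per_shard_hits, key=lambda hit: -hit[0])
    return [tweet_id for _score, tweet_id in list(merged)[:limit]]
```
&lt;/pre&gt;
&lt;p&gt;A root that knows the date range of a query could skip every tier whose range does not overlap it, which may be one answer to the open question about time tiers above.&lt;/p&gt;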




&lt;!--quid:cf7eed8b8b2e8fd2945c284b07eb5d2de4131e14--&gt;</description><link>https://nosql.mypopescu.com/post/103029869612</link><guid>https://nosql.mypopescu.com/post/103029869612</guid><pubDate>Wed, 19 Nov 2014 00:53:43 -0800</pubDate><category>full text indexing</category></item><item><title>What skills is a recruiting company looking for in a data scientist</title><description>&lt;a href="http://www.burtchworks.com/2014/11/17/must-have-skills-to-become-a-data-scientist/"&gt;What skills is a recruiting company looking for in a data scientist&lt;/a&gt;: &lt;p&gt;For the &lt;em&gt;technical&lt;/em&gt; part the list goes like this: &lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;SAS and/or R&lt;/li&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Hadoop&lt;/li&gt;
&lt;li&gt;SQL&lt;/li&gt;
&lt;li&gt;unstructured data&lt;/li&gt;
&lt;/ol&gt;


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/post/103023903087" rel="permalink" style="color:red" target="_blank"&gt;What skills is a recruiting company looking for in a data scientist&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;</description><link>https://nosql.mypopescu.com/post/103023903087</link><guid>https://nosql.mypopescu.com/post/103023903087</guid><pubDate>Tue, 18 Nov 2014 22:21:27 -0800</pubDate><category>data science</category></item><item><title>Why Couchbase Lite is so strategically important for you?</title><description>&lt;a href="http://www.odbms.org/blog/2014/11/mobile-data-management-interview-bob-wiederhold-2/"&gt;Why Couchbase Lite is so strategically important for you?&lt;/a&gt;: &lt;p&gt;In an interview with Bob Wiederhold&lt;sup id="fnref-2-fn-Widerhold"&gt;&lt;a class="footnote-ref" href="#fn-2-fn-Widerhold" target="_blank"&gt;1&lt;/a&gt;&lt;/sup&gt;, Roberto V. Zicari asks: “why Couchbase Lite is so strategically important?”&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Bob Wiederhold&lt;/em&gt;: First, because the world is going mobile. That is
indisputable. Mobile initiatives top the list of every IT department. As I
said above, if you don’t have a mobile data management offering, you are not
looking at the complete needs of the developer or the enterprise.&lt;/p&gt;
&lt;p&gt;Second, let’s level set on Couchbase Lite. Couchbase Lite is our offering
for an embedded mobile JSON database.&lt;/p&gt;
&lt;p&gt;Our complete mobile offering, Couchbase Mobile, includes Couchbase Server –
for data management in the cloud, and Sync Gateway for synchronization of
data stored on the device with other devices, or the database in the cloud.
Today, because connectivity is unknown, data synchronization challenges
force developers to either choose a total online (data stored in the cloud),
or total offline (data stored on the device) data management strategy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Maybe I’m seeing things from the wrong perspective:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;data syncing between the disconnected device and the central database needs to see very low contention; resolving conflicts on the device would be much harder than having a server component resolve them;&lt;/li&gt;
&lt;li&gt;as far as I can tell, the king of storage on mobile phones is SQLite; I somehow doubt that JSON + map/reduce can beat it;&lt;/li&gt;
&lt;li&gt;while not an expert in iOS services, I think CloudKit already covers the local-to-remote storage sync problem.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What am I missing?&lt;/p&gt;
&lt;div class="footnote"&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id="fn-2-fn-Widerhold"&gt;
&lt;p&gt;Bob Wiederhold is CEO of Couchbase. &lt;a class="footnote-backref" href="#fnref-2-fn-Widerhold" title="Jump back to footnote 1 in the text" target="_blank"&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/post/103022354617" rel="permalink" style="color:red" target="_blank"&gt;Why Couchbase Lite is so strategically important for you?&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;</description><link>https://nosql.mypopescu.com/post/103022354617</link><guid>https://nosql.mypopescu.com/post/103022354617</guid><pubDate>Tue, 18 Nov 2014 21:53:15 -0800</pubDate><category>Couchbase</category><category>key-value store</category><category>document database</category></item><item><title>Hortonwork's filling for IPO: The marketing of going public</title><description>&lt;a href="http://www.infoworld.com/article/2847244/big-data/the-marketing-of-going-public.html"&gt;Hortonwork's filling for IPO: The marketing of going public&lt;/a&gt;: &lt;p&gt;Pretty much the &lt;a href="http://nosql.mypopescu.com/post/102949496417/hortonworks-ipo-why-now-or-better-who-will-benefit" target="_blank"&gt;same perspective on Hortonworks’s filing for IPO&lt;/a&gt; from Yves de Montcheuil (&lt;em&gt;InfoWorld&lt;/em&gt;):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;By filing first among Hadoop distribution vendors, Hortonworks is guaranteed
to get the lion’s share of publicity for the foreseeable future. Any
competitor who follows suit will be perceived as a copycat. And since it’s
unlikely that said competitors can produce a more attractive balance sheet
anyway, they would pretty much be in the same type of criticism.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/post/102949958827" rel="permalink" style="color:red" target="_blank"&gt;Hortonwork’s filling for IPO: The marketing of going public&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;</description><link>https://nosql.mypopescu.com/post/102949958827</link><guid>https://nosql.mypopescu.com/post/102949958827</guid><pubDate>Tue, 18 Nov 2014 02:25:28 -0800</pubDate><category>Hortonworks</category><category>Hadoop market</category></item><item><title>Hortonworks IPO - Why Now? Or better, who will benefit from the IPO</title><description>&lt;a href="http://blogs.gartner.com/merv-adrian/2014/11/17/hortonworks-ipo-why-now/"&gt;Hortonworks IPO - Why Now? Or better, who will benefit from the IPO&lt;/a&gt;: &lt;p&gt;Merv Adrian is looking at 3 possible reasons for &lt;a href="http://nosql.mypopescu.com/post/102342581262/game-on-hortonworks-files-for-ipo" target="_blank"&gt;Hortonworks’s filing for IPO&lt;/a&gt; by switching the &lt;em&gt;why&lt;/em&gt; question to &lt;em&gt;who will benefit&lt;/em&gt; from this IPO.  As for the &lt;em&gt;why now&lt;/em&gt; part, &lt;a href="http://nosql.mypopescu.com/post/102342581262/game-on-hortonworks-files-for-ipo" target="_blank"&gt;the main question I’ve also asked myself&lt;/a&gt;, this seems to be the general answer:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Ultimately, it’s unlikely that Hortonworks will be alone as a public company
for long. MapR told the Wall Street Journal they want to IPO next year, and
they claim to have more customers, high margins and “efficient cash
management.”  Cloudera says they “are not ready yet” though they have lower
rate of losses, and also claim more customers. At the end of the day, the
answer may be rather simple. And again, answering a question with a
question: if not now, when? There may not be a better time.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/post/102949496417" rel="permalink" style="color:red" target="_blank"&gt;Hortonworks IPO - Why Now? Or better, who will benefit from the IPO&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;</description><link>https://nosql.mypopescu.com/post/102949496417</link><guid>https://nosql.mypopescu.com/post/102949496417</guid><pubDate>Tue, 18 Nov 2014 02:07:00 -0800</pubDate><category>Hortonworks</category></item><item><title>Design consideration for Kayos messaging and durable queueing</title><description>&lt;a href="https://github.com/Damienkatz/Kayos-Design/blob/master/kayosdesign.md"&gt;Design consideration for Kayos messaging and durable queueing&lt;/a&gt;: &lt;p&gt;More details about &lt;a href="http://nosql.mypopescu.com/post/102048750227/nosql-databases-hadoop-big-data-pinned-tabs-nov-7th#43b2139d76dede91787422d1aaf6f875ec0ffce4" target="_blank"&gt;Damien Katz’s new message queue project&lt;/a&gt;: it has a name, Kayos, and some goals:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Build a fast, low cost, fault tolerant messaging and queueing system that
offers predictable performance and can take advantage of high end dedicated
hardware as well as unreliable, commodity infrastructure like EC2. We want
to support message de-duplication (newer versions of messages eliminate
older versions) while also maintaining strict consistency (ordered
synchronous delivery), causal consistency (ordered asynchronous delivery)
and eventual consistency (unordered asynchonous delivery).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At the end of the long road ahead, “&lt;em&gt;Shit be awesome yo&lt;/em&gt;”.&lt;/p&gt;
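&lt;p&gt;The de-duplication goal in the quote ("newer versions of messages eliminate older versions") can be pictured with a small Python sketch. It says nothing about how Kayos will actually do it; the message shape &lt;code&gt;(key, version, payload)&lt;/code&gt; and the &lt;code&gt;compact&lt;/code&gt; function are assumptions made for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
def compact(messages):
    """Keep only the newest version of each message key.

    messages -- list of (key, version, payload) tuples in arrival order.
    Survivors keep their original relative order.
    """
    newest = {}  # key -> (highest version seen, index of that message)
    for index, (key, version, _payload) in enumerate(messages):
        # On a version tie the later arrival wins.
        if key not in newest or version >= newest[key][0]:
            newest[key] = (version, index)
    keep = {index for _version, index in newest.values()}
    return [message for i, message in enumerate(messages) if i in keep]
```
&lt;/pre&gt;
&lt;p&gt;Because the survivors stay in arrival order, this kind of compaction does not by itself conflict with ordered delivery.&lt;/p&gt;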




&lt;!--quid:d30466f0a64d0af7d8a4c606cdf2bf90f3361ccd--&gt;</description><link>https://nosql.mypopescu.com/post/102870100290</link><guid>https://nosql.mypopescu.com/post/102870100290</guid><pubDate>Mon, 17 Nov 2014 04:58:23 -0800</pubDate><category>Kayos</category></item><item><title>Kafka and Samza: Distributed stream processing in practice</title><description>&lt;p&gt;Fantastic slide deck from Martin Kleppmann. These 2 screenshots below are a good summary of the talk, but I strongly encourage you to go through the 42 slides. &lt;em&gt;Totally worth the time&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img alt="Kafka and Samza: distributed stream processing in practice" src="https://64.media.tumblr.com/77017ba31fb28c52314e9e29b6d75ea3/tumblr_nf4f6p0NUN1qavt6co1_1280.jpg" width="580" height="432"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt="Kafka and Samza: distributed stream processing in practice" src="https://64.media.tumblr.com/95d4672a7494be3b2921275c6f299388/tumblr_nf4f7uS3PU1qavt6co1_1280.jpg" width="580" height="432"/&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://nosql.mypopescu.com/post/1209550007/nosql-databases-and-the-unix-philosophy" target="_blank"&gt;The parallel between the Unix philosophy and the new (big) data solutions&lt;/a&gt; shows up &lt;a href="http://nosql.mypopescu.com/post/102666392027/what-do-you-have-to-say-for-the-skeptics-of-hadoop-who" rel="external nofollow" target="_blank"&gt;quite frequently&lt;/a&gt;. There’s an inherent extra complexity in big data platforms due to their distributed nature. But for some of these tools the rule of &lt;em&gt;“doing one thing and doing it well”&lt;/em&gt; was relaxed; maybe too relaxed. And in some cases there’s less than optimal openness towards integration.&lt;/p&gt;
&lt;div class="embedded smartembed speakerdeck"&gt;
&lt;script async class="speakerdeck-embed" data-id="d34613904cb2013218e606b8621c13fd" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"&gt;&lt;/script&gt;
&lt;div class="smartembed-ref-speakerdeck"&gt;&lt;a href="https://speakerdeck.com/ept/kafka-and-samza-distributed-stream-processing-in-practice" rel="nofollow external" target="_blank"&gt;Kafka and Samza: Distributed stream processing in practice&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/" rel="permalink" style="color:red" target="_blank"&gt;Kafka and Samza: Distributed stream processing in practice&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;


&lt;!--quid:64e10e083404a467d9093a5cbab2f322aeedf521--&gt;</description><link>https://nosql.mypopescu.com/post/102868229019</link><guid>https://nosql.mypopescu.com/post/102868229019</guid><pubDate>Mon, 17 Nov 2014 04:06:00 -0800</pubDate><category>Kafka</category><category>Samza</category></item><item><title>What do you have to say for the skeptics of Hadoop who think that the ecosystem is getting too complex with too many overlapping projects doing almost similar things?</title><description>&lt;a href="http://www.infoq.com/news/2014/11/hortonworks-enterprise-push"&gt;What do you have to say for the skeptics of Hadoop who think that the ecosystem is getting too complex with too many overlapping projects doing almost similar things?&lt;/a&gt;: &lt;blockquote&gt;
&lt;p&gt;There is a truth to the point of growing complexity of the entire ecosystem
but there is also a misattribution of the complexity that comes with it.&lt;/p&gt;
&lt;p&gt;Unlike many other unified single-stack architectures that came before, the
Hadoop platform is built around individual layers of individual
responsibilities. This is the Unix philosophy; each of these layers is built
in order to perform one thing and one thing well. This not only helps in
delineating responsibilities, but it also helps in a much faster evolution.
Remember that several different open developer communities are working on
each layer. Sometimes, this does mean there are two or more disjoint sets of
developers that work on the same layer, but that’s okay – either each of
those projects carve out their niche or the single best project simply
emerges. In a truly open community, a meritocracy, no single vendor
ultimately decides the best approach.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The other side of the coin is that to get things working you either have to be ready to put a lot of time and money into it or you’ll need to use one of the vendors’ distros. There’s nothing wrong with having vendor distros (polish, automation, testing, and documentation are always welcome), but their raison d’être shouldn’t be just the complexity of the environment. Ideally, setting things up should be possible without too much hassle. But the Linux world proves that the convenience of distros cannot be challenged.&lt;/p&gt;


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/post/102666392027" rel="permalink" style="color:red" target="_blank"&gt;What do you have to say for the skeptics of Hadoop who think that the ecosystem is getting too complex with too many overlapping projects doing almost similar things?&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;</description><link>https://nosql.mypopescu.com/post/102666392027</link><guid>https://nosql.mypopescu.com/post/102666392027</guid><pubDate>Fri, 14 Nov 2014 20:43:43 -0800</pubDate><category>Hadoop</category><category>MapReduce</category><category>BigData</category></item><item><title>Can hard drives' failure be predicted?</title><description>&lt;a href="https://www.backblaze.com/blog/hard-drive-smart-stats/"&gt;Can hard drives' failure be predicted?&lt;/a&gt;: &lt;p&gt;Hardware failure is one of the major causes of system failures and, implicitly, of degraded quality of service. Predicting hardware failures would allow taking proactive measures, thus reducing the chances of downtime.&lt;/p&gt;
&lt;p&gt;Unfortunately, for a large number of hardware components this is not possible. &lt;strong&gt;But&lt;/strong&gt; Backblaze, the company providing a consumer online backup solution, has published results showing that hard drive failure &lt;strong&gt;can be predicted&lt;/strong&gt;, and that by analysing only 5 metrics (out of over 70 available):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;From experience, we have found the following 5 SMART metrics indicate impending disk drive failure:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;SMART 5 – Reallocated_Sector_Count.&lt;/li&gt;
&lt;li&gt;SMART 187 – Reported_Uncorrectable_Errors.&lt;/li&gt;
&lt;li&gt;SMART 188 – Command_Timeout.&lt;/li&gt;
&lt;li&gt;SMART 197 – Current_Pending_Sector_Count.&lt;/li&gt;
&lt;li&gt;SMART 198 – Offline_Uncorrectable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The rest of the post dives into each of these. If other large cluster users—I’m thinking of Amazon, Facebook, Google, Microsoft here—could back these findings, the results could have a significant impact on operating storage. &lt;/p&gt;
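&lt;p&gt;To make the idea concrete, here is a minimal sketch of a check built on the five attributes quoted above. The rule it applies (flag a drive as soon as any of them has a non-zero raw value) is my assumption for illustration, not a threshold stated in the quote, and the function names and the input format (a dict of attribute id to raw value) are hypothetical.&lt;/p&gt;

```python
# The five SMART attributes named in the Backblaze quote.
WATCHED_SMART_IDS = {
    5: "Reallocated_Sector_Count",
    187: "Reported_Uncorrectable_Errors",
    188: "Command_Timeout",
    197: "Current_Pending_Sector_Count",
    198: "Offline_Uncorrectable",
}


def failing_attributes(raw_values):
    """Return the names of watched attributes whose raw value is non-zero.

    raw_values maps a SMART attribute id (int) to its raw value (int).
    Ids outside the watched set are ignored; missing ids count as zero.
    The "non-zero means trouble" rule is an assumption of this sketch.
    """
    return [
        name
        for attr_id, name in sorted(WATCHED_SMART_IDS.items())
        if raw_values.get(attr_id, 0) != 0
    ]


def looks_at_risk(raw_values):
    """True when at least one watched attribute is non-zero."""
    return bool(failing_attributes(raw_values))


if __name__ == "__main__":
    # Made-up sample readings; attribute 9 is present but not watched.
    sample = {5: 0, 9: 12345, 187: 2, 197: 8}
    print(failing_attributes(sample))
    print(looks_at_risk(sample))
```

&lt;p&gt;In practice the raw values would come from whatever tool reads SMART data on your drives; the sketch takes them as plain numbers so that it stays independent of any particular tool.&lt;/p&gt;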


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/post/102623883667" rel="permalink" style="color:red" target="_blank"&gt;Can hard drives’ failure be predicted?&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;</description><link>https://nosql.mypopescu.com/post/102623883667</link><guid>https://nosql.mypopescu.com/post/102623883667</guid><pubDate>Fri, 14 Nov 2014 10:27:00 -0800</pubDate></item><item><title>Amazon Aurora in bullet points</title><description>&lt;ul&gt;
&lt;li&gt;relational database engine&lt;/li&gt;
&lt;li&gt;part of the Amazon Relational Database Service products (i.e. fully managed database)&lt;/li&gt;
&lt;li&gt;MySQL-compatible&lt;/li&gt;
&lt;li&gt;supports migrating data from Amazon RDS MySQL&lt;/li&gt;
&lt;li&gt;auto-scaling storage in 10GB increments and up to 64TB&lt;/li&gt;
&lt;li&gt;uses SSD-powered storage&lt;/li&gt;
&lt;li&gt;automatically replicated across 3 availability zones with 2 replicas per AZ&lt;/li&gt;
&lt;li&gt;replicas share storage with the primary instance&lt;/li&gt;
&lt;li&gt;can have up to 15 replicas improving read throughput&lt;/li&gt;
&lt;li&gt;writes require quorum&lt;/li&gt;
&lt;li&gt;&lt;em&gt;I read this somewhere but cannot find it anymore&lt;/em&gt;: writes: 100k/s, reads: 500k/s&lt;/li&gt;
&lt;li&gt;continuous backups with 1-second granularity point-in-time restoration&lt;/li&gt;
&lt;li&gt;backups go to Amazon S3&lt;/li&gt;
&lt;li&gt;designed for 99.99% availability&lt;/li&gt;
&lt;/ul&gt;
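&lt;p&gt;A couple of the numbers above are easier to see as arithmetic. The sketch below is only an illustration and rests on assumptions: storage is rounded up to whole 10GB increments with at least one increment allocated, 1TB is taken as 1000GB, and "quorum" is read as a simple majority of the 6 copies (3 AZs with 2 each). The list itself does not state the quorum size, and all function names are hypothetical.&lt;/p&gt;

```python
import math

INCREMENT_GB = 10            # from the list: storage grows in 10GB increments
MAX_STORAGE_GB = 64 * 1000   # from the list: up to 64TB (1TB = 1000GB is assumed)
AZ_COUNT = 3                 # from the list: 3 availability zones
COPIES_PER_AZ = 2            # from the list: 2 replicas per AZ


def allocated_storage_gb(data_gb):
    """Round a non-negative data size up to whole increments.

    Assumes at least one increment is always allocated.
    Raises ValueError when the result would exceed the maximum.
    """
    steps = max(1, math.ceil(data_gb / INCREMENT_GB))
    allocated = steps * INCREMENT_GB
    if allocated > MAX_STORAGE_GB:
        raise ValueError("data does not fit in the maximum storage size")
    return allocated


def storage_copies():
    """Total number of copies across all availability zones."""
    return AZ_COUNT * COPIES_PER_AZ


def majority_quorum(copies):
    """Smallest number of copies that forms a strict majority (assumed rule)."""
    return copies // 2 + 1


if __name__ == "__main__":
    print(allocated_storage_gb(25))
    print(storage_copies())
    print(majority_quorum(storage_copies()))
```

&lt;p&gt;Under these assumptions, 25GB of data occupies 30GB of storage, and a write has to be acknowledged by 4 of the 6 copies.&lt;/p&gt;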
&lt;p&gt;The rest of the story can be read in &lt;a href="http://aws.amazon.com/blogs/aws/highly-scalable-mysql-compat-rds-db-engine/" rel="external nofollow" target="_blank"&gt;Jeff Barr’s post&lt;/a&gt;.&lt;/p&gt;


&lt;p class="cc" style="font-style: italic; font-size: 0.9em;"&gt;
Original title and link: &lt;a href="http://nosql.mypopescu.com/post/102599302892" rel="permalink" style="color:red" target="_blank"&gt;Amazon Aurora in bullet points&lt;/a&gt;
(&lt;a href="http://nosql.mypopescu.com" style="display:none;visibility:hidden;" target="_blank"&gt;NoSQL database&lt;/a&gt;©myNoSQL)&lt;/p&gt;</description><link>https://nosql.mypopescu.com/post/102599302892</link><guid>https://nosql.mypopescu.com/post/102599302892</guid><pubDate>Fri, 14 Nov 2014 00:53:43 -0800</pubDate><category>Aurora</category><category>Amazon</category></item></channel></rss>
