myNoSQL

Autoscaling, welcome to Google Compute Engine

Mon, 24 Nov 2014 07:41:21 -0800

Autoscaling, welcome to Google Compute Engine:

Autoscaling allows customers to build more cost effective and resilient applications. Using Compute Engine Autoscaling, you can ensure that exactly the right number of Compute Engine instances are available at any given time to handle your application’s workload. This saves you money when your application’s usage is low, and ensures your application is responsive when utilization is high.

Autoscaling is the Holy Grail of a distributed system. The promise is that the system is able to adapt, both up and down, to its needs, requirements, and SLAs. Basically, the system delivers the performance it is required to provide and maximum availability, both at optimal cost.

The first step in finding this Holy Grail is to be able to describe the needs, requirements, and SLAs of the system.
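
To make "describing the needs" a bit more concrete, here is a minimal sketch of a target-utilization rule. It is not how Compute Engine Autoscaling is implemented; the function name, the integer-percent inputs, and the min/max bounds are all assumptions made for illustration.

```python
def desired_instances(current, observed_pct, target_pct, min_n=1, max_n=10):
    """Instance count that would bring average utilization near the target.

    Utilization values are integer percentages (e.g. 90 for 90%).
    All names and bounds here are hypothetical.
    """
    if current < 1 or target_pct <= 0 or observed_pct < 0:
        raise ValueError("need current >= 1, target_pct > 0, observed_pct >= 0")
    # Total load is current * observed_pct; divide by the target, rounding up.
    raw = -(-(current * observed_pct) // target_pct)
    # Clamp to the configured bounds so the fleet never scales to zero
    # or beyond the budgeted maximum.
    return max(min_n, min(max_n, raw))


# Scale up: 4 instances at 90% with a 60% target need 6 instances.
print(desired_instances(4, 90, 60))
# Scale down: 4 instances at 15% with a 60% target need only 1.
print(desired_instances(4, 15, 60))
```

The target percentage plays the role of the SLA here: without it, the rule has nothing to scale towards.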

Original title and link: Autoscaling, welcome to Google Compute Engine (NoSQL database©myNoSQL)

Aurora for MySQL is coming

Mon, 24 Nov 2014 03:14:13 -0800

Aurora for MySQL is coming:

Mark Callaghan takes a look at:

Amazon’s participation in the MySQL community: none
some of the things said during the presentations: performance seems to be inflated
compatibility with existing MySQL features and especially the InnoDB engine
features: very similar to my Amazon Aurora in bullet points

What is Aurora? I don’t know and we might never find out. I assume it is a completely new storage engine rather than a new IO layer under InnoDB.


Medium uses Neo4j and Go for GoSocial service

Mon, 24 Nov 2014 01:44:29 -0800

Medium uses Neo4j and Go for GoSocial service:

Medium’s social graph stored in Neo4j and exposed through a Go service:

It makes a lot of sense to store social data in a graph database. Medium users, posts and collections are represented by graph nodes, and the edges between them describe relationships — users following users, users recommending posts, or users editing collections, to name a few common examples. Using a graph database also makes our queries simpler: we don’t have to do any complicated joins or other query wizardry.

It’s hard to deny that when looking at highly connected data the first answer is almost always a graph database. Once the amount of data stored grows, you start thinking about how you access that data. In many cases, the predominant answer is not traversals.


Stripe's Hadoop tools open sourced

Sat, 22 Nov 2014 02:48:00 -0800

Stripe's Hadoop tools open sourced:

Stripe has published on GitHub 4 Hadoop-related projects they’ve developed internally:

a dashboard for Hadoop jobs
a Scala framework for distributed learning
a database for serving data in SequenceFile format
a collection of command-line utilities.

As a side note, Stripe is using Cloudera Impala with Parquet.


NoSQL databases, Hadoop, Big Data: Pinned tabs Nov.19th

Thu, 20 Nov 2014 12:41:27 -0800

01: Teradata QueryGrid is the technology used to allow querying both Teradata/AsterData and external data stored in Hadoop or Oracle. ★

02: MarkLogic 8 will bring a server-side JavaScript engine, an RDF triple store engine with support for SPARQL 1.1, and bitemporal data management. ★

I still believe that MarkLogic should position itself as a real-time search solution.

03: For Cassandra 3.0, there’s a completely revamped, and optimized, solution for handling hinted handoff that uses a sort of commit log instead of a Cassandra system table (thus avoiding the associated overhead). ★

04: YASH. Yet another SQL-on-Hadoop. This one from HP Vertica. ★

05: Teradata and MapR are signing a partnership to collaborate on the integration and co-development of joint products. Some might say this could impact Hortonworks’s IPO. ★


The states and transitions of a Couchbase node

Thu, 20 Nov 2014 07:29:34 -0800

The different states and the transitions of a Couchbase node in a diagram:

This post describes the states and actions that can trigger the transitions. One interesting aspect is that state changes are not applied immediately and you can commit multiple such changes at once when satisfied with the new topology.


Can MapReduce Solve Planning Problems?

Thu, 20 Nov 2014 04:01:45 -0800

Can MapReduce Solve Planning Problems?:

Betteridge’s law of headlines.


It Ain’t Easy Making Money in Open Source: Thoughts on the Hortonworks's IPO Filing

Wed, 19 Nov 2014 04:25:17 -0800

It Ain’t Easy Making Money in Open Source: Thoughts on the Hortonworks's IPO Filing:

Dave Kellogg’s in-depth look at Hortonworks’s IPO filing, a comparison with Red Hat’s model, and a definitely interesting hypothesis and conclusion:

While Hadoop and big data are unarguably huge trends driving the industry and while the future of Hadoop looks very bright indeed, on reading the Hortonworks S-1, the reader is drawn to the inexorable conclusion that it’s hard to make money in open source, or more crassly, it’s hard to make money when you give the shit away.



CouchDB's long road to clustering

Wed, 19 Nov 2014 03:52:23 -0800

CouchDB's long road to clustering:

The keyword is “partially”:

CouchDB’s long road to clustering can be partially traced to conscious design decisions and philosophical choices made by CouchDB’s creators. As Lehnardt explained, “CouchDB has always said no to features that we know couldn’t be scalable in a cluster or even doable in a cluster. This puts us in a position to migrate upward seamlessly.”

Had this happened two years ago, CouchDB would actually have been somewhere.


Apache CouchDB 2.0 gets clustering support

Wed, 19 Nov 2014 03:43:06 -0800

At ApacheCon Europe 2014, the Apache CouchDB™ project today announced a Developer Preview release of its CouchDB 2.0 document database. The Developer Preview release brings all-new clustering technology to the Open Source NoSQL database, enabling a range of big data capabilities that include being able to store, replicate, sync, and process large amounts of data distributed across individual servers, data centers, and geographical regions in any deployment configuration, including private, hybrid, and multi-cloud.

I’m not sure who wrote the ASF PR announcement, but if it were me I would have simply posted “Apache CouchDB 2.0 features clustering support. Finally. </eom>”


The data flow and the massive historical Tweet index

Wed, 19 Nov 2014 00:53:43 -0800

The data flow and the massive historical Tweet index:

We rarely have the opportunity to learn about the almost complete architecture and data flow for a massive data indexing solution. Twitter’s blog post covers many details of their indexing solution starting with design goals and getting down to technical

But our long-standing goal has been to let people search through every Tweet ever published.

My notes:

half a trillion documents
average latency under 100ms
(super tuned) SSD used as storage
4 components: batch data aggregation and preprocess pipeline, inverted index builder, Earlybird shards and roots; what are the Earlybird roots?
ingestion runs every day and processes one day of tweets per batch; in this process tweets are scored and partitioned
Hadoop for ETL: ingestion process is run on Hadoop, with the output being stored in HDFS
Mesos is used to parallelize the inverted index creation; results are stored in HDFS
after praising the high parallelism and statelessness of the index builders, some coordination using ZooKeeper is mentioned:

These inverted index builders can coordinate with each other by placing locks on ZooKeeper, which ensures that two builders don’t build the same segment. Using this approach, we rebuilt inverted indices for nearly half a trillion Tweets in only about two days (fun fact: our bottleneck is actually the Hadoop namenode).
the Earlybird shards are the storage of the inverted index partitioned by time and then hash; partitioning by time tiers will allow growing the storage without affecting the current time tiers
the Earlybird roots are the endpoint for the client API; they forward requests to the corresponding Earlybird shards, merge results, etc;
not very sure how Earlybird roots decide what time tiers should not receive a query
no words about the actual Earlybird storage; can it be Manhattan?
no details about the query processor
this project started in 2012; the full index was completely built in 2014
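
To check my own understanding of the "partitioned by time and then hash" part and of what the roots do, here is a minimal sketch. Nothing in it comes from Twitter's code: the tier boundaries, partition counts, function names, and the rule for skipping tiers (only query tiers whose date range overlaps the query window) are all my assumptions.

```python
import hashlib
from datetime import date

# Hypothetical time tiers: (first day, last day, number of hash partitions).
# Adding a new tier at the end leaves the existing tiers untouched.
TIERS = [
    (date(2006, 1, 1), date(2010, 12, 31), 2),
    (date(2011, 1, 1), date(2012, 12, 31), 4),
    (date(2013, 1, 1), date(2014, 12, 31), 8),
]


def shard_for(tweet_id, day):
    """Pick (tier index, hash partition) for a tweet: time first, then hash."""
    for tier, (start, end, partitions) in enumerate(TIERS):
        if start <= day <= end:
            digest = hashlib.md5(str(tweet_id).encode()).digest()
            return tier, int.from_bytes(digest[:4], "big") % partitions
    raise ValueError("no tier covers this day")


def tiers_for_query(since, until):
    """Tiers whose date range overlaps the query window (assumed pruning rule)."""
    return [i for i, (start, end, _) in enumerate(TIERS)
            if start <= until and since <= end]


def root_merge(shard_results, k):
    """What a root might do: merge per-shard (score, tweet_id) hits into a top-k."""
    merged = [hit for hits in shard_results for hit in hits]
    return sorted(merged, reverse=True)[:k]


print(tiers_for_query(date(2012, 6, 1), date(2013, 6, 1)))
print(root_merge([[(0.9, 1), (0.2, 2)], [(0.5, 3)]], k=2))
```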


What skills is a recruiting company looking for in a data scientist

Tue, 18 Nov 2014 22:21:27 -0800

What skills is a recruiting company looking for in a data scientist:

For the technical part the list goes like this:

SAS and/or R
Python
Hadoop
SQL
unstructured data


Why Couchbase Lite is so strategically important for you?

Tue, 18 Nov 2014 21:53:15 -0800

Why Couchbase Lite is so strategically important for you?:

In an interview with Bob Wiederhold¹, Roberto V. Zicari asks: “why Couchbase Lite is so strategically important?”

Bob Wiederhold: First, because the world is going mobile. That is indisputable. Mobile initiatives top the list of every IT department. As I said above, if you don’t have a mobile data management offering, you are not looking at the complete needs of the developer or the enterprise.

Second, let’s level set on Couchbase Lite. Couchbase Lite is our offering for an embedded mobile JSON database.

Our complete mobile offering, Couchbase Mobile, includes Couchbase Server – for data management in the cloud, and Sync Gateway for synchronization of data stored on the device with other devices, or the database in the cloud. Today, because connectivity is unknown, data synchronization challenges force developers to either choose a total online (data stored in the cloud), or total offline (data stored on the device) data management strategy.

Maybe I’m seeing things from the wrong perspective:

the data synching between the disconnected device and the central databases needs to see very low contention; resolving conflicts on the device would be much more difficult than having a server component solving it;
as far as I can tell, the king of storage on mobile phones is SQLite; I somehow doubt that JSON + map/reduce can beat it;
while not an expert in iOS services, I think CloudKit already covers the local-to-remote storage sync problem.

What am I missing?

¹ Bob Wiederhold is CEO of Couchbase.


Hortonworks's filing for IPO: The marketing of going public

Tue, 18 Nov 2014 02:25:28 -0800

Hortonworks's filing for IPO: The marketing of going public:

Pretty much the same perspective on Hortonworks’s filing for IPO from Yves de Montcheuil (InfoWorld):

By filing first among Hadoop distribution vendors, Hortonworks is guaranteed to get the lion’s share of publicity for the foreseeable future. Any competitor who follows suit will be perceived as a copycat. And since it’s unlikely that said competitors can produce a more attractive balance sheet anyway, they would pretty much be in the same type of criticism.

Hortonworks IPO - Why Now? Or better, who will benefit from the IPO

Tue, 18 Nov 2014 02:07:00 -0800

Hortonworks IPO - Why Now? Or better, who will benefit from the IPO:

Merv Adrian is looking at 3 possible reasons for Hortonworks’s filing for IPO by switching the why question to who will benefit from this IPO. As for the why now part, the main question I’ve also asked myself, this seems to be the general answer:

Ultimately, it’s unlikely that Hortonworks will be alone as a public company for long. MapR told the Wall Street Journal they want to IPO next year, and they claim to have more customers, high margins and “efficient cash management.” Cloudera says they “are not ready yet” though they have lower rate of losses, and also claim more customers. At the end of the day, the answer may be rather simple. And again, answering a question with a question: if not now, when? There may not be a better time.

Design consideration for Kayos messaging and durable queueing

Mon, 17 Nov 2014 04:58:23 -0800

Design consideration for Kayos messaging and durable queueing:

More details about Damien Katz’s new message queue project: it has a name, Kayos, and some goals:

Build a fast, low cost, fault tolerant messaging and queueing system that offers predictable performance and can take advantage of high end dedicated hardware as well as unreliable, commodity infrastructure like EC2. We want to support message de-duplication (newer versions of messages eliminate older versions) while also maintaining strict consistency (ordered synchronous delivery), causal consistency (ordered asynchronous delivery) and eventual consistency (unordered asynchonous delivery).
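
The de-duplication goal ("newer versions of messages eliminate older versions") reads a lot like compaction of a keyed log. A minimal sketch of that idea follows; it says nothing about how Kayos actually does it, and the `(key, version, payload)` message shape and the `compact` name are my assumptions.

```python
def compact(log):
    """Keep only the newest version of each key, preserving log order.

    `log` is a list of (key, version, payload) tuples; this message shape
    is hypothetical and used only for illustration.
    """
    latest = {}  # key -> offset of the newest version seen so far
    for offset, (key, version, _payload) in enumerate(log):
        best = latest.get(key)
        if best is None or version >= log[best][1]:
            latest[key] = offset
    # Emit survivors in their original relative order.
    return [log[i] for i in sorted(latest.values())]


log = [("a", 1, "x"), ("b", 1, "y"), ("a", 2, "z")]
print(compact(log))
```

Survivors keep their relative order, which is what makes this compatible with the ordered delivery modes in the quote.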

At the end of the long road ahead, “Shit be awesome yo”.

Kafka and Samza: Distributed stream processing in practice

Mon, 17 Nov 2014 04:06:00 -0800

Fantastic slide deck from Martin Kleppmann. These 2 screenshots below are a good summary of the talk, but I strongly encourage you to go through the 42 slides. Totally worth the time.

The parallel between the Unix philosophy and the new (big) data solutions shows up quite frequently. There’s an inherent extra complexity in big data platforms due to their distributed nature. But for some of these tools the rule of “doing one thing and doing it well” was relaxed; maybe too relaxed. And in some cases there’s less than optimal openness towards integration.

Kafka and Samza: Distributed stream processing in practice

What do you have to say for the skeptics of Hadoop who think that the ecosystem is getting too complex with too many overlapping projects doing almost similar things?

Fri, 14 Nov 2014 20:43:43 -0800

What do you have to say for the skeptics of Hadoop who think that the ecosystem is getting too complex with too many overlapping projects doing almost similar things?:

There is a truth to the point of growing complexity of the entire ecosystem but there is also a misattribution of the complexity that comes with it.

Unlike many other unified single-stack architectures that came before, the Hadoop platform is built around individual layers of individual responsibilities. This is the Unix philosophy; each of these layers is built in order to perform one thing and one thing well. This not only helps in delineating responsibilities, but it also helps in a much faster evolution. Remember that several different open developer communities are working on each layer. Sometimes, this does mean there are two or more disjoint sets of developers that work on the same layer, but that’s okay – either each of those projects carve out their niche or the single best project simply emerges. In a truly open community, a meritocracy, no single vendor ultimately decides the best approach.

The other side of the coin is that to get things working you either have to be ready to put a lot of time and money into it or you’ll need to use one of the vendors’ distros. There’s nothing wrong with having vendor distros (polish, automation, testing, and documentation are always welcome), but their raison d’être shouldn’t just be the environment complexity. Ideally setting things up should be possible without too much hassle. But the Linux world proves that the convenience of distros cannot be challenged.


Can hard drives' failure be predicted?

Fri, 14 Nov 2014 10:27:00 -0800

Can hard drives' failure be predicted?:

Hardware failure is one of the major causes of system failure and, implicitly, of deteriorating quality of service. Predicting hardware failures would allow taking proactive measures, thus reducing the chances of downtime.

Unfortunately, for a large number of hardware components this is not possible. But Backblaze, the company providing a consumer online backup solution, has published some results showing that hard drive failures can be predicted by analysing only 5 metrics (out of over 70 available):

From experience, we have found the following 5 SMART metrics indicate impending disk drive failure:

SMART 5 – Reallocated_Sector_Count.

SMART 187 – Reported_Uncorrectable_Errors.

SMART 188 – Command_Timeout.

SMART 197 – Current_Pending_Sector_Count.

SMART 198 – Offline_Uncorrectable.

The rest of the post dives into each of these. If other large cluster users—I’m thinking of Amazon, Facebook, Google, Microsoft here—could back these findings, the results could have a significant impact on operating storage.
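
As a minimal sketch of how the five metrics above could be turned into a check: the input format (a dict of SMART attribute id to raw value, however it was collected) and the "any raw value above zero is a warning sign" threshold are my assumptions, not something stated in the quote.

```python
# The five SMART attributes named in the quote above.
WATCHED = {
    5: "Reallocated_Sector_Count",
    187: "Reported_Uncorrectable_Errors",
    188: "Command_Timeout",
    197: "Current_Pending_Sector_Count",
    198: "Offline_Uncorrectable",
}


def warning_signs(raw_values):
    """Return the names of watched attributes with a non-zero raw value.

    `raw_values` maps SMART attribute id -> raw value (hypothetical format).
    Attributes missing from the input are treated as zero.
    """
    return [name for attr, name in WATCHED.items()
            if raw_values.get(attr, 0) > 0]


# A drive reporting 3 uncorrectable errors; attribute 9 is not watched.
print(warning_signs({5: 0, 187: 3, 9: 12000}))
```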

Amazon Aurora in bullet points

Fri, 14 Nov 2014 00:53:43 -0800

relational database engine
part of the Amazon Relational Database Service products (i.e. fully managed database)
MySQL-compatible
supports migrating data from Amazon RDS MySQL
auto-scaling storage in 10GB increments and up to 64TB
uses SSD-powered storage
automatically replicated on 3 availability zones with 2 replicas per AZ
replicas share storage with the primary instance
can have up to 15 replicas improving read throughput
writes require quorum
I read this somewhere but cannot find it anymore: writes: 100k/s, reads: 500k/s
continuous backups with 1-second granularity point-in-time restoration
backups go to Amazon S3
designed for 99.99% availability
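
A bit of arithmetic on two of the bullets above, as a minimal sketch. The copy count follows from 3 AZs with 2 replicas each. The quorum size is not in my notes, so a simple majority is assumed, as are the 1TB = 1024GB conversion and the minimum of one 10GB increment. None of the names below are part of any AWS API.

```python
import math

INCREMENT_GB = 10
MAX_GB = 64 * 1024  # 64TB, assuming 1TB = 1024GB

COPIES = 3 * 2  # 3 availability zones x 2 replicas per AZ
ASSUMED_WRITE_QUORUM = COPIES // 2 + 1  # simple majority; an assumption


def allocated_storage_gb(used_gb):
    """Storage allocated for `used_gb` of data, growing in 10GB increments."""
    if used_gb < 0:
        raise ValueError("used_gb must be non-negative")
    # Assume at least one increment is always allocated.
    steps = max(1, math.ceil(used_gb / INCREMENT_GB))
    size = steps * INCREMENT_GB
    if size > MAX_GB:
        raise ValueError("exceeds the 64TB limit")
    return size


def write_committed(acks, quorum=ASSUMED_WRITE_QUORUM):
    """True once enough copies have acknowledged the write."""
    return acks >= quorum


print(allocated_storage_gb(25))
print(write_committed(4), write_committed(3))
```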

The rest of the story can be read in Jeff Barr’s post.