Jesse Yates

Stuck on Kafka

2020-12-21T00:00:00+00:00

Low volume data pipelines in Kafka tend to get stuck; there is more data to process but your consumers aren’t moving forward. And it’s not because you are doing anything wrong in your application. In fact, it’s part of the design!

A quirk in how Kafka manages its consumer groups can - without careful management after investigation into root causes (or just reading this post) - lead to ‘out of order’ commits that appear to cause a consumer group to become “stuck” at an offset. And if there aren’t more messages to clear the clog, chances are you - the trusty data plumbers - are going to be woken up to fix it.

Fortunately, once we dive into understanding _why_ it’s happening, we can understand how to work around the issue. And even better, if you already use Alpakka Kafka you can get the fix for free, just by upgrading to 2.0.4+!

From the outside

When you are processing more than a single Kafka message at a time - common in high throughput Kafka consumer applications - you are likely to see negative direction commits; that is, commits that go ‘back in time’ after another consumer committed forward progress! This can happen even if you are ensuring you have correct ordering of your offsets.

Which seems… impossible. At worst, a consumer should only go back to the latest committed offset and then start committing from there after a rebalance.

Let’s consider a case where you are processing a topic with two partitions, in parallel, on two different consumers: c1 and c2. Their in-process data queues look like this:

| Consumer | Partition Offsets    |
|    c1    | p1-5                 |
|    c2    | p1-5 | p1-6 | p1-7   |

c1 is processing the message at offset 5 in partition 1 (p1) when a rebalance occurs and p1 is moved to c2. For whatever reason, c1 is slow to commit its progress up to p1-5 (maybe you only flush commits every so often, maybe the processing got slow… it could be any number of things). When p1 is rebalanced to c2, c2 will start consuming from the latest committed offset (p1-4) and receive p1-5, p1-6 and p1-7, which it starts processing.

Recall that we are using a ‘high throughput’ application, so our consumers can work on these messages in parallel. A correct processing framework ensures that we don’t commit the progress out of order, even if we are done with the work; that is, work can be done in parallel and even finish early, but it still needs to be committed in order (think a function like Akka’s mapAsync logic). In-order commits ensure that data is always fully processed in the case of failures; in a failure the worst case then is reprocessing the data.
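To make the "finish early, commit in order" idea concrete, here is a minimal sketch of a per-partition tracker. The class and method names are made up for illustration (this is not Alpakka's or Kafka's API); it only shows the bookkeeping, assuming offsets are dense integers.

```python
class InOrderCommitTracker:
    """Tracks out-of-order completions and exposes only the contiguous prefix."""

    def __init__(self, last_committed):
        # Highest offset already committed for this partition.
        self.last_committed = last_committed
        # Offsets that finished processing but are blocked by an earlier one.
        self._done = set()

    def complete(self, offset):
        """Mark one offset as processed; return the highest offset safe to commit."""
        self._done.add(offset)
        # Advance over every finished offset that directly follows the last commit.
        while self.last_committed + 1 in self._done:
            self.last_committed += 1
            self._done.remove(self.last_committed)
        return self.last_committed


tracker = InOrderCommitTracker(last_committed=4)
print(tracker.complete(7))  # 4: offset 7 finished early, but 5 and 6 are in flight
print(tracker.complete(5))  # 5
print(tracker.complete(6))  # 7: the gap is closed, so 6 and 7 are both committable
```

Work can finish in any order, but the committable offset only ever moves forward over a gap-free run.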

The alternative to reprocessing (at-least-once message handling) is to use a transactional processing framework to ensure that you only ever process the messages exactly-once.

However, that has its own overhead - transactions aren’t free - so if you are looking for sheer throughput and velocity, you are often better off paying the occasional small price of reprocessing vs. the consistent tax of transactions.

In our example c2 is comparatively quite speedy, finishes its work and commits progress on p1 up to offset 7. Immediately after that, c1 realizes that it needs to commit its progress and then it commits the work it has completed on p1, up to offset 5. Now the consumer state in the __consumer_offsets topic looks like:

| p1-4 | p1-7 |  p1-5 |

Uh oh! From the view of external monitoring (e.g. Burrow), the consumer looks like it’s 2 offsets behind. If this is a slow moving topic, one that doesn’t get a lot of data, then this consumer could appear “stuck” like this for quite a while. In all likelihood, it’s going to be stuck juuuuust long enough to page someone at 2am.

However, as far as the consumers are concerned, they are doing the right thing and everything is fine. That is, their internally reported lag will be zero, while the externally reported lag will be two. In this case, both are correct!
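A toy model of the two views, assuming the group's position is simply the most recently written commit (last write wins) and using this post's convention that lag is the latest offset minus the committed offset. The function names are illustrative only.

```python
def committed_offset(commit_log):
    """Last write wins: the newest commit record is the group's position."""
    return commit_log[-1]


def external_lag(commit_log, latest_offset):
    """Lag as an outside monitor computes it: latest offset minus committed position."""
    return latest_offset - committed_offset(commit_log)


# Commits for p1 in arrival order: the starting position, c2's commit, c1's late commit.
p1_commits = [4, 7, 5]
print(external_lag(p1_commits, latest_offset=7))      # 2: what the monitor sees
print(external_lag(p1_commits[:2], latest_offset=7))  # 0: what c2 believes
```

Nothing here moved backwards on purpose; the late commit from c1 just happened to be written last.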

c2 is correct in that it has processed all the data from p1 (up to offset 7), so it has no lag. And c1 doesn’t think it’s lagging because it is no longer assigned p1, so it doesn’t report any lag from its internal metrics. But all is not well in the kingdom.

Let’s say another rebalance were to happen right now. The newly assigned consumer would start receiving offsets starting from p1-5, as we see from the lag of two in Burrow. That consumer would then continue to make forward progress up to p1-7 and would then ‘correct’ the lag state.

This problem is only likely to page for these low volume topics/partitions - new data causes a ‘forward’ progress commit and state to recover. However, it can still cause wasted processing of messages that could be non-trivial in high-throughput environments; I’ve seen cases of millions of messages being reprocessed on each rebalance.

From the inside

Source: https://images.pexels.com/photos/3625023/pexels-photo-3625023.jpeg

It certainly seems like this is an issue with Kafka - we shouldn’t be allowed to commit progress for partitions that we are not assigned. But this is a feature of the low-coordination nature of consumer groups.

When a consumer group is created, it gets assigned a broker as the coordinator of the group. This gives the group a central place to manage state that all clients should be able to reach (all clients should be able to reach all brokers or you get really weird stuff happening, but not all clients need to be reachable from other clients in the same consumer group). The coordinator then helps manage which members of the consumer group are assigned which partition. When new members join or leave the group, the coordinator increments an ‘epoch’ and notifies all group members of the epoch change so they know to update their state.

Recall that the group coordinator is only a single broker, but the partitions storing the data are spread across potentially hundreds of Kafka brokers. Even for a small number of consumer groups, it becomes painful to coordinate the state of each consumer group (potentially thousands) across all the brokers. That means that Kafka brokers only care that a consumer is part of the latest epoch.

A corollary of this is that Kafka brokers do not care what a consumer is committing for a topic, as long as it has the correct epoch. That is, a consumer can commit a partition it has not been (and has never been) assigned. Meaning that an ‘up to date’ consumer - one with the correct epoch - can commit progress for any partition it so chooses.

The server-side architecture is designed to allow low coordination to increase the likelihood of low latency and high stability. However, that means a lot of the burden is placed on consumers to do the “right thing”.

Approaching a solution

Source: https://www.humanedgetech.com/expedition/034tait01/images/P6040045.JPG

We know that we should be relying on externally based metrics to monitor our systems; internal metrics are known to lie. However, in this case the external metric can be misleading - the data has been processed, but a rebalance would show the lag.

From experience, attempting to do a correlation between the internal and external metric and then modulating your alerts appropriately is a path fraught with issues. You are more likely than not to foot-gun yourself a number of times, trying to get the right correlation set up; slow reporting of internal metrics, acceptable deltas and window width are just a couple obvious gotchas.

Instead, we should go and fix the root cause - consumers that are not assigned partitions should not be committing to them!

If you are using alpakka-kafka (highly recommended as a Kafka stream processing library), then you should strongly consider upgrading to 2.0.4+ where I added support for not committing unassigned partitions #1123. It solves this problem as part of the framework - yay, no need to change application code, it just works! - and ensures that (a) you never see this issue again and (b) get more sleep.

However, if you have your own home-grown system you will want to add filtering on the commit side to ensure that the consumer is still assigned the partitions. That means needing to track the assignments and correlate them with the state of the stream.

If your application keeps a buffer of data - ensuring you don’t block on reading from Kafka - then keeping track of the assigned partitions might be a double win: you can filter out buffered messages that are no longer assigned to the consumer, avoiding any extra processing at all!
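For a home-grown system, the filtering could look roughly like the sketch below. It assumes your client gives you rebalance callbacks for assigned and revoked partitions; the class and method names are hypothetical, and wiring them to a real client is left out.

```python
class AssignmentTracker:
    """Remembers which partitions this consumer currently owns."""

    def __init__(self):
        self.assigned = set()

    def on_assigned(self, partitions):
        # Call from the client's rebalance "assigned" callback.
        self.assigned |= set(partitions)

    def on_revoked(self, partitions):
        # Call from the client's rebalance "revoked" callback.
        self.assigned -= set(partitions)

    def filter_commits(self, pending):
        """Keep only offsets for partitions we still own. pending: {partition: offset}."""
        return {p: o for p, o in pending.items() if p in self.assigned}

    def filter_buffer(self, messages):
        """Drop buffered (partition, offset) messages for partitions we no longer own."""
        return [m for m in messages if m[0] in self.assigned]


tracker = AssignmentTracker()
tracker.on_assigned(["p1", "p2"])
tracker.on_revoked(["p1"])  # p1 was rebalanced away to another consumer
print(tracker.filter_commits({"p1": 5, "p2": 9}))           # {'p2': 9}
print(tracker.filter_buffer([("p1", 6), ("p2", 10)]))       # [('p2', 10)]
```

The late commit for p1 is dropped before it ever reaches the broker, and the buffered p1 message is never processed.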

Kafka Streams continues to be exposed to this stuck commit problem - any time you are doing grouping, windowing, or in many stateful processing implementations, you get into asynchronous handling of message offsets. You are in a state where work is happening asynchronously to the commit, which can lead to progress being committed ‘backwards’. As far as I know, this has not been addressed in open source.

Hopefully, you have seen the gory horror that is some of the guts of stream processing with Apache Kafka and understand why you might need to add special assignment tracking support to your applications. And if you don’t, at least you have an explanation for why you are getting woken up at 2am.

High Performance Kafka Producers

2020-01-01T00:00:00+00:00

After my Scaling a Kafka Consumer post, it only seemed fair to take a dive into the producer side of the world too. It’s got its own set of problems and tuning fun that we can dive into.

The setup

Let’s assume that you already have a Kafka producer running, but it’s just not quite keeping up with the data flowing through. This already means you are in the 95th percentile of users - generally the default client configurations are more than enough to work.

If you are interested in the internals of the Producer, I recommend taking a look at this talk by Jiangjie Qin at LinkedIn. Not only does it walk you through how the Producer works, it can give you some first pass tuning recommendations. However, I prefer a bit more empirical evaluation based on what the client is telling us - its metrics - that you can take back to decide how to manage your particular use case.

Back to basics

First, you need to understand why your producer is going slow. So the first question we need to ask is, “Is it Kafka or is it me?”

Maybe it’s Kafka. Some things to check to ensure that the cluster is ‘happy’:

network handler idle time
- kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
- generally not below 60%, with average above 80%
request handler idle time
- kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
- generally not below 60%, with the global average consistently above 70%
disks are idle
cpu usage is not maxed out (it shouldn’t be if the above are true)

Unfortunately for us in this convenient - made-up - story, Kafka seems to be idling happily along, so we are left with tuning our client.

The obvious first starting place is ensuring that you have compression.type set. Compression on the producer side is seriously worth considering, especially if you have even a little bit of extra CPU available. Producer-side compression will help Kafka store more data quickly as the broker just writes the data to disk directly out of the socket (and vice versa for the consumer path), making it much more efficient for the whole pipeline if the producer can just handle the compression up front.

If you are running Kafka 2.1+, you should have access to zstd compression. Some tests I’ve seen show a marked improvement on the alternatives - it’s got close to the compression of gzip, but with the CPU overhead of lz4. But your mileage may vary; be sure to test on your data!

That out of the way, the next thing we should check to see is how good our batches are looking. The easiest configuration to tweak here is linger.ms. You can think of this as time-based batching. By increasing our latency, we can then increase our throughput by eliminating the overhead of extra network calls.

For this, we should check out record-queue-time-avg - the average time a batch waits in the send buffer, aka how long it takes to fill a batch. If you are consistently below your linger.ms, then you are filling your batches before the timer expires! So the first simple tweak is to increase linger.ms, trading a little latency for (no surprise!) more throughput (HINT: Kafka defaults to not waiting for batches, leaning towards lower latency producing, at the risk of more RPCs). I find 5ms to be a nice sweet spot.

Back to our toy example, you have set compression and tuned the linger.ms, but you are still not getting the throughput you need.

Going deeper

Once you get further into the weeds, producer configurations start to get more inter-related, with some important non-linear and sometimes unexpected impacts on performance. So it pays to be extra patient and scientific about combinations of different parameters. Remember, we should be continually going back to understanding the root bottleneck while keeping an eye on optimizing the rate of records flowing through the Producer.

The next questions to ask are, how big are your records - as Kafka sees them not as you think they are - and are you making “good” batches?

The size of the batch is determined by the batch.size configuration - the number of bytes after which the producer will send the request to the brokers, regardless of the linger.ms. Requests sent to brokers will contain multiple batches, one for each partition.

So there are a few things we need to check on. How many records are there per batch, and how big are they? Here is where we can start really digging into the kafka.producer MBeans. The batch-size-[avg|max] can give you a good idea of the distribution of the number of bytes per batch. Then record-size-[avg|max] can give you a sense of the size of each record. Divide the two and tada! You have a rough number of records per batch.

Now, you can match this to the batch.size configuration and determine approximately how many records should be flowing through your Producer. You should also sanity check this against the record-send-rate - the number of records per second - reported by your producer.
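The arithmetic is simple enough to script against whatever is scraping your producer metrics. A small sketch, with made-up readings and hypothetical helper names:

```python
def records_per_batch(batch_size_avg, record_size_avg):
    """Rough records per batch: average batch bytes divided by average record bytes."""
    if record_size_avg <= 0:
        raise ValueError("record_size_avg must be positive")
    return batch_size_avg / record_size_avg


def batch_fill_ratio(batch_size_avg, batch_size_config):
    """How full the average batch is relative to the configured batch.size."""
    return batch_size_avg / batch_size_config


# Hypothetical readings: 12,288-byte average batches of 512-byte records,
# with batch.size set to 16384.
print(records_per_batch(12288, 512))   # 24.0
print(batch_fill_ratio(12288, 16384))  # 0.75
```

A fill ratio well below 1 suggests batches are being sent by the linger timer rather than by size.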

<side note>

So if you are struggling to fill your batches with the number of records, the problem now might not even be in your producer! It might actually be upstream in your processing - you did check to ensure that you were Scaling a Kafka Consumer, right? It might be as simple, though, as just increasing the number of client threads, the parallelism, allocated to consuming records and passing them along to the producer. But let’s assume you checked all those things.

</side note>

You might be a bit surprised if you occasionally have very large messages (you did check record-size-max right?), as the max.request.size configuration will limit the maximum size of a request and therefore also inherently limit the number of record batches.

Now, what about the time you are waiting for IO? Check out the io-wait-ratio metrics to see if you really are spending lots of time waiting for IO or doing processing.

Now we need to make sure that your buffer is not getting filled. Here buffer-available-bytes is your friend, allowing you to ensure that your buffer.memory size is not being exhausted by your record sizes and/or batching.

Also make sure to check the bytes per topic metrics.

If you are producing to many different topics, this can affect the quality of the compression as you can’t compress well across topics. In that case, you might need some application changes so that you can more aggressively batch per destination topic, rather than relying on Kafka to just do the right thing. Remember, this is an advanced tactic and you should only consider after benchmarking and confirming other things are not working.

Wrap up

Hopefully this will give you a bit more guidance than just the raw tuning documentation for how to go about removing bottlenecks and getting the performance out of your Producer that you know you should be getting.

A summary of the configurations and metrics to tweak on the client:

compression.type
linger.ms
- record-queue-time-avg, average time a batch waits in the send buffer, aka how long to fill a batch
batch.size
- determine records per batch
- bytes per batch
  - see batch-size-avg, batch-size-max
- records per topic per second
  - see record-send-rate
- check your bytes per topic
max.request.size
- can limit the number and size of batches
- see record-size-max
time spent waiting for IO
- Are you really waiting? see io-wait-ratio
buffer.memory + queued requests
- see buffer-available-bytes
- 32MB default, roughly total memory by producer, bytes allocated to buffer records for sending

Do you have any more suggestions? Drop a note in the comments below!

Vertically scaling Kafka consumers

2019-12-04T00:00:00+00:00

When scaling up Kafka consumers, particularly when dealing with a large number of partitions across a number of topics, you can run into some unexpected bottlenecks. They get even worse when dealing with geographically remote clusters. The defaults will get you surprisingly far, but then you are left basically on your own.

Well, No More! Let’s dive right in.

A real life(ish) example

Let’s say you are mirroring data from an edge Kafka cluster into a central Kafka cluster that will feed your analytics data warehouse. You’ve set up the edge with 100+ partitions for many of the topics you are consuming (because you had the forethought to expect scale and knew partitions are generally pretty cheap - go you!). That means you could easily be mirroring 1000+ partitions into your central Kafka.

Let’s add in that you are mirroring across the country because you are looking for geographic isolation as well as minimizing latency to getting data into a ‘safe’ system. That also means you have an extra 100ms of latency, roughly, for every mirror request you make.

Chances are, this isn’t going to work out of the box. Too bad, so sad. Time to get engineering!

You might see something like

2019-06-28 20:24:43 INFO  [KafkaMirror-7] o.a.k.c.FetchSessionHandler:438 - [Consumer clientId=consumer-1, groupId=jesse.kafka.mirror] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 3: org.apache.kafka.common.errors.DisconnectException.

or, if you turn on debug logging, you might also see

2019-06-27 20:43:06 DEBUG [KafkaMirror-11] o.a.k.c.c.i.Fetcher:244 - [Consumer clientId=consumer-1, groupId=jesse.kafka.mirror] Fetch READ_UNCOMMITTED at offset 26974 for partition source_topic-7 returned fetch data (error=NONE, highWaterMark=26974, lastStableOffset = -1, logStartOffset = 0, abortedTransactions = null, recordsSizeInBytes=0)

What your consumer is really saying is, “I didn’t get a response in the time I expected, so I’m giving up and trying again soonish.”

Here are some quick configurations to check:

default.api.timeout.ms
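- added in version 2.0 of the client: the default timeout for consumer calls that don’t take an explicit timeout (in older versions, pre 2.0, request.timeout.ms controlled all the connection timeouts)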
session.timeout.ms
- how long until your consumer rebalances
- watch the join-rate for all consumers in the group - joining is the first step in rebalancing.
request.timeout.ms
- as of 2.0, how long the consumer will wait for a response
- the logs are a great place to start here, to see if there are lots of failing fetches
- watch the broker metrics:
  - kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec for a gut check of fetch statuses
- watch the client metrics:
  - fetch-latency-avg and fetch-latency-max for latency when getting data
fetch.max.wait.ms
- how long the server will block waiting for data to fill your response
- metrics to watch on the broker:
  - kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec for a gut check of fetch statuses
  - kafka.server:type=DelayedOperationPurgatory,delayedOperation=Fetch,name=PurgatorySize for the number of fetch requests that are waiting, aka ‘stuck in purgatory’
- metrics to watch on the client:
  - fetch-latency-avg and fetch-latency-max for latency when getting data
fetch.min.bytes
- minimum amount of data you want to fill your request
- metrics to watch, both at the consumer level and the topic level
  - fetch-size-avg and fetch-size-max to see your fetch size distribution
  - records-per-request-avg for the number of messages you are getting per request
  - fetch-latency-avg and fetch-latency-max to ensure this is not causing you unexpected latency

(NOTE: all the metrics above are assumed to be client (consumer) side metric MBeans, and have the prefix `kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)`, or with `topic=([-.\w]+)` appended for topic-scoped metrics, unless otherwise noted)
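If you are scraping these MBeans yourself, the name pattern from the note can be turned into a small parser. This is a sketch that assumes the name layout is exactly the prefix above, with an optional trailing topic tag and `\w` (word characters) in the character class; the function name is made up.

```python
import re

# Character class from the note: word characters, dots and dashes.
NAME = r"[-.\w]+"
FETCH_MBEAN = re.compile(
    r"kafka\.consumer:type=consumer-fetch-manager-metrics,"
    rf"client-id=(?P<client>{NAME})(?:,topic=(?P<topic>{NAME}))?$"
)


def parse_fetch_mbean(name):
    """Return (client_id, topic or None) for a fetch-manager MBean name, else None."""
    match = FETCH_MBEAN.match(name)
    return (match["client"], match["topic"]) if match else None


base = "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1"
print(parse_fetch_mbean(base))                          # ('consumer-1', None)
print(parse_fetch_mbean(base + ",topic=source_topic"))  # ('consumer-1', 'source_topic')
```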

These all can interact in interesting ways. For instance, if you tell the server to wait to fill the request, but then have the timeout set short you will have more retries, but potentially better throughput and likely saved bandwidth for those high volume topics/partitions.

Don’t forget that whatever you were using when connecting to a geographically more local source cluster (i.e. not across the country) will probably stop working because now you have an extra 50-100ms of roundtrip latency to contend with. Settings with 50ms timeouts for responses mean you will start to disconnect early all the time :)
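One interaction is easy to sanity check before you deploy: the broker may hold a fetch for up to fetch.max.wait.ms, so the client has to be willing to wait at least that long plus the network round trip. A rough sketch; the helper name and the example numbers are made up, and a real check would cover more settings.

```python
def check_fetch_timeouts(config, round_trip_ms):
    """Return warnings for timeout settings that cannot survive the given round trip."""
    warnings = []
    # The broker may hold a fetch for fetch.max.wait.ms before answering,
    # so the client must wait at least that long plus the network round trip.
    needed = config["fetch.max.wait.ms"] + round_trip_ms
    if config["request.timeout.ms"] <= needed:
        warnings.append(
            f"request.timeout.ms={config['request.timeout.ms']} should exceed {needed}"
        )
    return warnings


tight = {"fetch.max.wait.ms": 500, "request.timeout.ms": 550}
roomy = {"fetch.max.wait.ms": 500, "request.timeout.ms": 30000}
print(check_fetch_timeouts(tight, round_trip_ms=100))  # one warning
print(check_fetch_timeouts(roomy, round_trip_ms=100))  # []
```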

Sadly, there are no explicit things I can tell you that will always work for these settings. Instead, they are a good guide to start reading the documentation and where to start your fiddling.

What next?

So you have tuned your timeouts way up, made sure that you are fetching at least 1 byte…and are still getting these errors in your logs. It’s tough to pinpoint though…you might have 100+ Consumer instances, and because they work as a team, just one bad apple could tip you into perpetual rebalance storms.

Let’s simplify the problem, by turning down the number of instances and chopping out some of these topics we need to mirror. Eventually you will probably get to a set of topics that suddenly starts to work!

Hooray, things are working magically! Maybe it was those tweaks you made to get the topics working? Time to scale it back up…and it’s broken again. Crapola.

(this is EXACTLY what happened to me)

Did you remember to check your GC monitoring? I bet you are going to find that your consumers are Stop-the-World (STW) GCing for near or over your timeouts.

Your one (or two or three) little mirrors are GC’ing themselves to death; every time they disconnect, they generate a bunch more objects, which then add GC pressure. Even if your mirror starts working, it can quickly churn garbage and spiral into a GC hole from which it never recovers. This is even more frustrating as it can look like the mirror is running fine for 10, 20 minutes and then suddenly - BOOM! - it stops working.

I’ve found that using the Java GC options:

-server -XX:+UseParallelGC -XX:ParallelGCThreads=4

is more than sufficient to keep up. It doesn’t use the fancy G1GC, but for a simple Mirror application, you don’t need complex garbage collection - most object is highly transient and the rest are small and very long lived. Actually, a nice fit for the ‘old’ Java GC.

Unbalanced Consumers

This can happen when you are consuming from multiple topics, but the topics don’t have the same number of partitions. A quick reading of the documentation would have you think that it should just evenly assign partitions across all the consumers, and it does…as long as you have the same number of partitions for all topics. As recently as Kafka 2.1+ (latest stable release I’ve tested), as soon as you stop having the same number of partitions the topic with the lowest number of partitions is used to determine the buckets, and then those buckets are distributed across nodes.

For example, say you have two topics, one with 10 partitions and another with 100, and 10 consumer instances. You start getting lots of data coming into the 100 partition topic, so you turn up the number of consumers to 100, expecting to get 90 consumers with 1 partition and 10 consumers to get 2 partitions; one partition on ten instances for each of Topic One and then an even distribution of Topic Two.

This is, unfortunately, not what you see. Instead, you will end up with 10 consumers, each with 11 partitions and 90 consumers sitting idle. That’s the same distribution you had before, but now with extra overhead to manage the idle consumers!

What you need is this configuration:

 partition.assignment.strategy = org.apache.kafka.clients.consumer.RoundRobinAssignor

Now the consumer group will round-robin assign the partitions across the entire consumer group. This will get you back to the distribution you expected, allowing you to nicely balance load and increase your overall throughput!
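To see why round-robin gets you the distribution you expected, here is a simplified model of the idea: line up every topic-partition and deal them out to the sorted consumers in turn. It assumes every consumer subscribes to both topics and is not the client's actual implementation; the function name is illustrative.

```python
from collections import Counter
from itertools import cycle


def round_robin_assign(topics, consumers):
    """Deal every topic-partition out to the sorted consumers in turn."""
    partitions = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    assignment = {c: [] for c in consumers}
    # zip stops once the partitions run out; cycle keeps looping over consumers.
    for partition, consumer in zip(partitions, cycle(sorted(consumers))):
        assignment[consumer].append(partition)
    return assignment


consumers = [f"c{i:03d}" for i in range(100)]
assignment = round_robin_assign({"topic-one": 10, "topic-two": 100}, consumers)
# Histogram of partitions-per-consumer: 90 consumers hold 1, 10 consumers hold 2.
print(Counter(len(parts) for parts in assignment.values()))
```

With 110 partitions and 100 consumers, nobody sits idle and nobody holds more than two partitions.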

Wrap Up

Hopefully, at this point you have all the tools you need to scale up your consumer instances. You know the basic tuning elements to check, have some guidelines to do basic GC tuning and finally the nice back-pocket trick to balance consumer groups when consuming from differently partitioned topics.

New (Open Source!) Tooling: Kafka Keystore Building

2019-11-18T00:00:00+00:00

The easiest way to set up basic authentication with Kafka is to use x509 certificates. However, getting these certificates into a place where they are actually usable by your Kafka client can be frustrating and error prone. That is why we recently released a Kafka Certificates tool to make your life easier.

Most intro guides (example) have you creating your own Certificate Authority (CA) to sign keys. This can work in the small scale, for instance with just one or two clusters and/or clients. However, when you start having proper “corporate infrastructure”, chances are you will want to have a central CA for the entire company that can issue certs. These certs need not even be for Kafka exclusively - x509 certificates can be used to prove individuals’ identity across a range of services.

If you are using python, ruby, golang or any other language backed by librdkafka you can just drop these certificates into the client and move along with your life. Unfortunately, taking a private key, public key, signed certificate and a CA and making Java Keystores out of them is not so straightforward.

The usual process is something along the lines of using the command line to generate a PKCS12 certificate store with all the appropriate keys and certificate chains. Then you would need to create an empty Java Keystore. Finally, you would have to import each and every certificate that you want to add to the keystore. And unfortunately, none of this is really well documented anywhere.

Not a very simple process by any means, and something I personally have messed up a number of times.

That is why we created and open sourced the Kafka Certificates tool. You just pass it:

private key
signed certificate of the private key
the issuing CA’s certificate
the issuing CA’s certificate chain (ca_chain)

And it will generate you a password protected, Java Keystore formatted, keystore and a truststore for use with a Kafka client. It will also dump them to the console as base64 encoded values, which are great for adding directly to, say, Kubernetes configurations.

Internals

There are two different Keystores that need to be created (pardon the overloaded terms, this is standard Java): the keystore and the truststore.

Here, “Keystore” with a capital “K” is the file format. So a truststore is a Keystore-formatted file that holds the certificates and/or keys that the client uses to determine which server certificates to trust.

The keystore stores the client’s private key and the certificate chain for that key back up to the root CA. This allows it to cryptographically prove that it is who it says it is, along with “testimony” all the way back up to the root CA.

The truststore is the opposite - you add all the certificates for authorities that you trust to sign certificates (issuing CAs), so if you get a request you can check to see if its certificate chain cryptographically matches any of the issuing CAs’ certificates in your truststore.

So if we have a Private Key (PK) with a certificate C, and a certificate chain of C1 -> C2 -> Cr, where Cr is the certificate for the root Certificate Authority (CA), then our keystore would look something like:

PK + C -> C1 -> C2 -> Cr

And then to trust any certificate signed by the CA, our truststore would just need

Seems simple right? Too bad Java doesn’t make it easy. Good thing that we did :)

Be sure to check out the Kafka Certificates tool next time you need to build Keystores for a client.

A guide to Kafka Consumer Freshness

2019-11-04T00:00:00+00:00

In my recent talk at Kafka Summit I mentioned that users don’t think in offsets, but rather in amounts of time - minutes, hours - that a consumer is behind. When you say, “We might have a problem, your consumer is consistently 10,000 offsets behind,” it would not be unreasonable to be met with slack-jawed incredulity and/or glassy-eyed stares.

However, users can easily intuit data ‘freshness’. Were you to instead say, “We might have a problem, your consumer is consistently 12 hours behind,” you would quickly have a productive conversation about whether that was actually a problem for their use case, how that might affect downstream processing, etc. and if you are lucky actually turn what looked like a problem into lower operational burden for you!

During my Kafka Summit talk I also mentioned we open sourced a tool - the Consumer Freshness Tracker - that helps you translate from offsets (exposed by Burrow) into the amount of time behind, the “freshness”, of a consumer group.

I’ll explain the logic behind the Freshness Tracker and show how easy it is to run with just a short configuration.

Motivation

As I mentioned, users think in time, not in offsets. When you tell them that their consumer is 1M offsets behind they have no context - is that a lot or a little? What is the latest data you do have? And moreover, how long until it gets better? Offsets are what operators start out using - it is what Kafka exports as part of its metrics and what Burrow gives you as well.

And for a while that might be good enough. You can point users to historic dashboards where they can look up offsets for a topic-partition and map that to a time.

Definitely doable… but kinda lame.

Existing Literature

From the New Relic Kafkapocalypse article, they make the following definition:

Commit Lag is the difference between the Append time and the Commit time of a consumed message. It basically represents how long the message sat in Kafka before your consumer processed it.

With this nice image:

For example, you see that the Commit Lag of message 126, which was appended at 1:09 and processed at 1:11, is 2 seconds.

Append Lag is the difference between the Append time of the latest message and the Append time of the last committed message. As you can see, the Append Lag for this consumer is currently 9 seconds.

Or explained slightly differently,

commit lag = tc - t0
- the time between an event entering and being committed, or the “latency”
append lag = tN - t0
- time between when the message at the Log-End-Offset (LEO) and the latest committed message entered the topic
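Both definitions are plain subtractions over timestamps. A tiny sketch, using made-up function names and the numbers from the quoted example (times in seconds):

```python
def commit_lag(append_time, commit_time):
    """tc - t0: how long a message sat in the topic before being committed."""
    return commit_time - append_time


def append_lag(latest_append_time, last_committed_append_time):
    """tN - t0: append time of the newest message minus that of the last committed one."""
    return latest_append_time - last_committed_append_time


# The committed message was appended at t=9 and committed at t=11;
# the newest message in the partition was appended at t=18.
print(commit_lag(append_time=9, commit_time=11))                        # 2
print(append_lag(latest_append_time=18, last_committed_append_time=9))  # 9
```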

Shortcomings of Commit Lag

Commit Lag at first blush seems like what we need. You are tracking how long an element takes to get committed. However, if your stream gets “stuck” (maybe bad code, maybe downstream problems), you will never hear about an issue because the latest offset never gets committed, so your commit lag cannot be updated yet!

Shortcomings of Append Lag

Append Lag does approximate Freshness when the stream is high volume, because we are (approximately) continually adding and committing to a topic.

However, at low volumes Append Lag is wildly different from freshness.

For instance, if a message enters at t0 and then another event enters 6 hours later, the Append Lag will immediately jump to 6 hrs, as soon as the event joins the queue.

Thus, Append Lag cannot be used ubiquitously for reliable alerting (though there may be value in determining when topics have issues upstream and are not receiving data).

Deriving Freshness

You might be inclined to define Freshness as:

freshness = (current time) - (the most recent committed timestamp)

However, this is also incorrect - a consumer that is not receiving data will have a continually growing freshness.

For example, at t0 the event enters. At t1 it gets committed. If we calculate freshness 5 minutes later, it will actually be

5min + (t1 - t0)

That is, the amount of time since the commit plus the time between the append and commit times. So if we check 10 minutes later, freshness would be

10min + (t1 - t0)

So the freshness is then increasing even though no data is being added.

This is certainly not correct.

Intention of freshness

Going back to our definition, what we really want to know is:

How far behind, in time, is my stream?

To answer that, we then need to calculate the maximum amount of time an event has been in the topic/partition without being committed.

The oldest uncommitted data is always going to be at (latest committed offset + 1); said another way, it is the oldest, uncommitted offset. Therefore, freshness is then

freshness = (current time) - (append time of oldest, uncommitted offset) 
              OR
            0, when no uncommitted offsets

Going back to the diagram above

Here t1 is the append time of the oldest, uncommitted offset.

If it is time (tc + 1) but the topic has not added any more messages, freshness should be 0 (or, no lag).

If it is time (t1 + 1) > tc, so we have added a single message but it has not been committed yet, freshness should be 1 (or, the amount of time that the message appended at t1 has been in the topic without being committed).

If it is time (t1 + 10) > tc, and that single message has still not been committed, freshness should be 10.
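The three cases can be checked with a few lines of Python. This is just a sketch of the definition, not code from any tracker; the function names and the concrete times are assumptions for illustration. The naive version is included to show how it keeps growing on an idle topic.

```python
from typing import Optional


def naive_freshness(now: float, last_committed_append_time: float) -> float:
    """The tempting definition: keeps growing even when nothing is waiting."""
    return now - last_committed_append_time


def freshness(now: float, oldest_uncommitted_append_time: Optional[float]) -> float:
    """Time the oldest uncommitted record has waited; 0 when nothing is uncommitted."""
    if oldest_uncommitted_append_time is None:
        return 0.0
    return now - oldest_uncommitted_append_time


# Hypothetical times: first append, its commit, and the next (uncommitted) append.
t0, tc, t1 = 0.0, 2.0, 100.0

print(freshness(tc + 1, None))         # 0.0  - nothing waiting
print(freshness(t1 + 1, t1))           # 1.0  - one record, waiting 1
print(freshness(t1 + 10, t1))          # 10.0 - same record, waiting 10
print(naive_freshness(tc + 300, t0))   # 302.0 - "lag" on a fully caught-up consumer
```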

Building a Freshness Tracker

There is an existing OSS freshness-like tracker available from Lightbend that does some fancy tricks to avoid copying too much data and to work around the lack of an offset-to-timestamp API in Kafka. Maybe this is for you, but this was (1) not available when we started, and (2) contains a premature optimization around minimizing the data being pulled from Kafka, leading to approximate answers.

Instead, the Consumer Freshness Tracker (CFT) is designed to be stupid simple and follow the Linux tools philosophy of composability. It takes the output of Burrow (allowing Burrow to focus on its job), then does the heavyweight merge with state in Kafka to produce the amount of lag. The algorithm looks like this:

Scrape consumers from Burrow
For each consumer
1. find the log-end-offset (LEO) and latest commit for each partition (as provided by Burrow)
2. If there is no lag for that partition
  1. freshness is 0ms
3. Else
  1. Read the record at the LEO from Kafka
  2. Get the timestamp from the record
  3. Freshness = current time - timestamp
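The steps above can be sketched in a few lines of Python. This is not the CFT's actual code: PartitionStatus stands in for whatever Burrow reports, and read_timestamp_ms is a hypothetical callback that would read one record from Kafka and return its timestamp. The offset passed to it is the one named in the steps as written.

```python
from typing import Callable, Dict, Iterable, NamedTuple, Tuple


class PartitionStatus(NamedTuple):
    """What we assume Burrow tells us about one partition of one consumer."""
    topic: str
    partition: int
    committed_offset: int
    log_end_offset: int


# Hypothetical lookup: (topic, partition, offset) -> record timestamp in ms.
TimestampLookup = Callable[[str, int, int], int]


def freshness_by_partition(
    statuses: Iterable[PartitionStatus],
    read_timestamp_ms: TimestampLookup,
    now_ms: int,
) -> Dict[Tuple[str, int], int]:
    result: Dict[Tuple[str, int], int] = {}
    for status in statuses:
        key = (status.topic, status.partition)
        if status.log_end_offset - status.committed_offset <= 0:
            # No lag: skip the (expensive) read from Kafka entirely.
            result[key] = 0
        else:
            timestamp_ms = read_timestamp_ms(
                status.topic, status.partition, status.log_end_offset
            )
            result[key] = now_ms - timestamp_ms
    return result


# Fake data in place of Burrow and Kafka.
fake_timestamps = {("logs", 0, 42): 1_000}
statuses = [
    PartitionStatus("logs", 0, committed_offset=40, log_end_offset=42),
    PartitionStatus("logs", 1, committed_offset=7, log_end_offset=7),
]
result = freshness_by_partition(
    statuses, lambda t, p, o: fake_timestamps[(t, p, o)], now_ms=6_000
)
print(result)  # {('logs', 0): 5000, ('logs', 1): 0}
```

Keeping the Kafka read behind a callback makes the loop trivial to test, and it makes it obvious that the only expensive step is the record read for lagging partitions.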

Thus, any consumers that Burrow monitors, the CFT also tracks. This gives you one place to configure your white/black lists, helping your monitoring to always stay in-sync.

The use of the LEO for each partition makes sense because it is, by definition, the longest amount of time between the latest committed offset and the oldest message not yet processed. Any newer messages have, by definition, a smaller freshness lag.

Because we also have the amount of time an offset sits in the queue before it is committed (offset time vs commit time), we also report the Commit Lag (from the first section), as a helper metric to understand the latency of an individual record in your stream.

And besides some multi-threading magic - for real, production-proven latency needs - that’s it.

Running a Freshness Tracker

You need to configure two main elements in the HOCON configuration: the Burrow URL and the clusters to query. For example,

burrow:
  url: "http://burrow.example.com"

clusters:
  - name: logs-cluster
    kafka:
      bootstrap.servers: "l1.example.com:9092, l2.example.com:9092, l3.example.com:9092"
  - name: metrics-cluster
    kafka:
      bootstrap.servers: "m1.example.com:9092, m2.example.com:9092, m3.example.com:9092"

Any other clusters defined in Burrow will be ignored and only the consumers under the clusters we have defined here will be monitored. Everything under the kafka section is directly passed into the Kafka Consumer properties, allowing you to set SSL configs or tune the client as needed.

Additional Tuning

There are a couple of additional tuning flags, which can be particularly useful when reading from clusters with very large records or when there are a large number of clusters, and you don’t want to run multiple trackers with weird subsets of clusters or client configs (though this can often be the swiftest solution):

workerThreadCount
- the number of concurrent threads querying Kafka, aka the size of the thread-pool used for polling
numConsumers
- defined per-cluster, the number of Kafka Consumer instances to run. Each LEO pull will only happen on a single consumer instance, so this is the max parallelism intra-cluster that you can expect.

These are also necessary because reading the offset record from Kafka to get the timestamp requires reading the entire record; there is currently no API to just get the key or the timestamp. This can lead to memory and/or latency challenges if you have particularly heavy-weight messages.

However, out of the box, the default configurations are likely more than sufficient.

Happy Tracking

With the Consumer Freshness Tracker you have a small application that will convert your Consumers’ offsets into freshness milliseconds, which is invaluable in providing accurate and useful monitoring for users.

This has been running in production for nearly 1 year, so please do give it a shot in your environment - it just might change your entire mindset on monitoring.

Kafka Upgrade Validation

2019-10-04T00:00:00+00:00

If you attended Kafka Summit, or followed along on Twitter, you probably heard many people mentioning that you really really ought to upgrade your Kafka installation. No surprise, it often will fix many obscure bugs (aka those you are guaranteed to hit at scale), while increasing performance and often times lowering operational costs. However, the big question is, “how can I be sure that this isn’t going to break everything?” Related, is the additional question of, “what about bugs in the new version?”

I’ll explore some of the process I recently went through when doing an upgrade of a somewhat out-of-date Kafka installation to the cutting edge stable release. Hopefully this can serve as a guide for doing your own upgrades, or at least help avoid some of the more common gotchas.

Didn’t someone else check this?

You might be asking, “why should I check for steady state bugs? Isn’t that what the community does before cutting a release?” I would then remind you that:

In theory, theory and practice are the same thing

Yes, there is some degree of validation by the community, but by definition this work is done on a volunteer basis, and is really just best effort. In other communities, I’ve seen releases go out with huge severity bugs that should have been caught by basic validation, but for one reason or another weren’t.

In short, would you trust that your business critical infrastructure is safe based on volunteer work?

Not to say that the wonderful folks supporting the Apache Foundation projects are not often very high caliber, and doing amazing work - they are - but the risk just doesn’t seem worth it to me.

Let’s say though, that you don’t use the vanilla open-source distribution, but some vendor’s distribution. Now you might ask, “but certainly their validation is enough, right?”

And you would be right about many of the edge cases the standard validation might not catch. However, their test suites (hopefully automated!) also have risks in that they cover not necessarily the original code, but whatever patches the vendor has layered on top of the codebase. Now you have all the original code, plus all the patches, to validate, and that validation is itself done with code you also probably don’t deeply know (if at all!).

Vendors are great for adding more trust in the code, as well as finding/fixing bugs that might have crept into the edge releases.

However, there really is no substitute for doing the validation yourself - especially when you have millions of dollars (or more!) in business cost risk on the line.

Why Validate

Validation of a release will help you gain confidence that the bits you are pushing out will be “good”. However, just as important, the validation will also help you gain confidence in the rollout process so that you have confidence not just in the final state, but also in every step along the way.

One of the biggest risks with new code is the risk of new bugs. While lots of work is done to validate the code, there are still substantial risks that are not likely to be covered by others. The most common are those related to your setup and usage:

what your particular upgrade path looks like and how it works
your particular usage (maybe you are using a less-common API and didn’t know it?)
how things work on your particular hardware.

You would probably be surprised by the number of bugs that aren’t found before a stable release. For instance, in upgrading to Kafka 2.2+ from 1.X, there are some major bugs like:

KAFKA-8002 - Replica reassignment to new log dir may not complete if future and current replicas segment files have different base offsets
KAFKA-8069 - Committed offsets get cleaned up right after the coordinator loading them back from __consumer_offsets in broker with old inter-broker protocol version (< 2.2)
KAFKA-8012 - NullPointerException while truncating at high watermark can crash replica fetcher thread
KAFKA-7165 - Error while creating ephemeral at /brokers/ids/BROKER_ID
KAFKA-7557 - truncating logs can potentially block a replica fetcher thread, which indirectly causes the request handler threads to be blocked

These are non-trivial issues that impact two major areas: (1) data loss and (2) consumer offset loss. While data loss is understandably bad, the latter can actually be just as bad. If you have a lot of data retention for certain topics, loss of consumer offsets can cause your consumers to rewind themselves all the way back to the beginning of the topic, essentially crushing your cluster - now the brokers are thrashing your OS caches to support this old read, and also pushing data out as fast as they can. At the same time, if you have processes that don’t expect very old data, this can break downstream components as well. Basically, it can be very very bad.

As well as some more minor things, that might break your workflow:

KIP-272: added API version tag to metrics, which breaks JMX monitoring tools
KIP-225 changed the metric “records.lag” to use tags for topic and partition. The original version with the name format “{topic}-{partition}.records-lag” has been removed.
KAFKA-7373: GetOffsetShell doesn’t work when SSL authentication is enabled

On top of that, there were a number of things that you need to take into account with major behavior changes:

Upgrading each broker can take lots of time as it rewrites the data on disk in the new format. This could leave partitions under-replicated for long periods of time
The default value for ssl.endpoint.identification.algorithm was changed to https, requiring you to set ssl.endpoint.identification.algorithm to an empty string to restore the previous behavior
ZooKeeper hosts are now re-resolved if connection attempt fails. But if your ZooKeeper host names resolve to multiple addresses and some of them are not reachable, then you may need to increase the connection timeout zookeeper.connection.timeout.ms

Hopefully, by this point I’ve convinced you that you need to validate the code you deploy before you deploy to production, even if it is a vendor release.

How to validate

The first step should be to take a look at release notes (duh) for the version you are upgrading to, but also all the intervening versions. These will usually be a good start to make sure you have all the operational changes in place.

Then you should look to the JIRA for issues that are labeled “critical” or “blocker”, particularly for the version to which you are upgrading. It’s up to you to determine if they are “real” issues and, if so, actually severe enough to warrant either your own fork or waiting for another release…or if it’s fine and you can go ahead.

From there, you can then start actually testing a release. For this, you will want to start by spinning up a completely separate cluster. We are going to be hammering on it.

Tools

There are many tools available out there that can be used to validate and test Kafka. For instance, a couple of Google searches yield:

Kafka Monitor - https://github.com/linkedin/kafka-monitor/wiki/Design-Overview
ducktape - https://github.com/confluentinc/ducktape
Jepsen - https://aphyr.com/posts/293-jepsen-kafka
Pepperbox - templating + generating messages - http://pepperbox.gslab.com
Blockade - docker-based network partition - https://blockade.readthedocs.io/en/latest/
Gatling + kafka plugin - https://github.com/mnogu/gatling-kafka
Kafka core ProducerPerformance - https://github.com/kafka-dev/kafka/blob/master/core/src/main/scala/kafka/tools/ProducerPerformance.scala

But what you really need to do is find the simplest possible tool that will help you test the scenarios you are concerned about.

Personally, I’ve found Kafka Monitor to be the most versatile tool, since automated failures, restarts, etc. seemed to be well covered in Confluent’s existing test suite. We just really need to check how the consumers/producers view state in Kafka and that we are hitting our performance expectations, but don’t need hooks into a month long running chaos suite.

KM is great in that it covers performance SLAs & data loss checks out of the box, and by tracking the consumer commit rate you can also check for consumer offsets being dropped.

The one thing I would have liked to see in Kafka Monitor is a consumer that you can turn on/off with an external REST call. This would be helpful for ensuring that, in the face of consumer/broker restarts, no more than a couple of offsets are being dropped. However, this is a relatively minor risk - as long as all the offsets weren’t being dropped, a couple of messages being replayed is not a big deal.

Methodology

If we want to understand how the new cluster will perform and operate, we need to start by baselining your existing installation. Start by standing up a small test cluster - minimum of 3 nodes, running hardware matching your production cluster - and deploying your existing version.

Then try and push as much data through as you can - produce and consume - with a single instance of the Kafka Monitor. We will call this the “continuous” instance.

Now, we are going to stand up 2 other KM instances:

stop/start (SS) - this instance will be bounced regularly, but retain its offsets in Kafka.
- key configuration: enable.auto.commit = true, ensures that the consumer picks up where it left off
stop/restart (SR) - this instance is also bounced, but will restart from the beginning of retention.

The single producer/consumer instance provides the data that all the consumers will use, and also validates the ‘steady state’ flow. The SS consumer’s key use is ensuring that consumer offsets are not lost. The SR consumer ensures that data is not lost.

Though we have this handful of consumers, the actual work will all be done by hand.

We will start by deploying the new code to the brokers and then upgrading them one-by-one. With each broker restart we will also restart the SS and SR consumers. Ideally, you don’t restart them at the same point in the broker restart each time. For instance, maybe right after you trigger the restart, or right after the broker has come back up.

There will be a number of restarts to bring the cluster up to the fully latest version. With Kafka you need a round of rolling restarts for each of:

running the new software
updating the inter-broker protocol version (inter.broker.protocol.version)
updating the message format version (log.message.format.version)

This gives us plenty of opportunity to check for data or offset loss via our consumers.

Validations

So as we progress with this validation process, what do we want to check for?

No data loss

No consumer should show any data loss. This is actually a nice metric that KM exposes, and it is essentially based on validating that a “linked list” like structure is correctly linked for each partition.
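To give a feel for what such a check looks like, here is a minimal sketch in Python. It assumes every produced message carries a per-partition sequence number, so each message implicitly "links" to its predecessor; that payload layout and the function name are assumptions for illustration, not a description of KM's internals.

```python
from typing import Iterable, Optional


def count_lost(sequence_numbers: Iterable[int]) -> int:
    """Count gaps in the sequence numbers consumed from a single partition."""
    lost = 0
    next_expected: Optional[int] = None
    for seq in sequence_numbers:
        if next_expected is None or seq == next_expected:
            # First message seen, or the chain continues unbroken.
            next_expected = seq + 1
        elif seq > next_expected:
            # The chain skipped ahead: everything in between never arrived.
            lost += seq - next_expected
            next_expected = seq + 1
        # seq < next_expected: a replay after a restart, which is not a loss.
    return lost


print(count_lost([0, 1, 2, 3]))        # 0
print(count_lost([0, 1, 4, 5]))        # 2
print(count_lost([0, 1, 2, 1, 2, 3]))  # 0
```

Treating replays as harmless matches the at-least-once behavior you expect when bouncing consumers; only forward gaps count as loss.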

Consumer offsets are not lost

When restarting the SR consumer, it should take about as long to go from the beginning of time, as every previous restart. For this, you will need to graph the offset commit-rate and compare it to previous restart steps.

However, when restarting the SS consumer, it explicitly should not go back to the beginning of the partition, but instead pick up where it left off. This is reflected as a roughly steady-state offset commit-rate, with a minor spike possible as it catches up to the producer.

Performance

The SR consumer not only checks for data loss, but also allows us to validate the “top speed” of consumption - it is trying to pull data as fast as it can from the beginning of the topic. This allows us to get a handle on the comparative performance loss while progressing through each stage of the upgrade.

Additionally, our single producer should also be monitored to track its throughput throughout the upgrade process. It is expected to have slight hiccups when brokers restart, but at no point should the producer fail (be continuously unable to connect - indicative of an API compatibility bug), instead just needing to wait until the broker is ready to take writes again.

Gotcha

To give yourself a reasonable window of replay and validation, I’ve found it’s necessary to have retention set to around 10 hours. This allows a wide enough window to validate the SR consumer’s replay rate, while not keeping around so much that each step takes too long. That said, YMMV - that just seemed to be a nice number for our disks, network, etc.

Additionally, for this small three-node cluster, you want to ensure you set at least the following configs:

acks = all
min.in.sync.replicas = 2

Otherwise bouncing consumers can make it look like you are losing data, when in fact that is normal business operation of the restart.

Approved Version

I’ve found that Confluent’s 2.2.1-cp1 is quite stable and has back-ported patches to avoid the critical issues I’ve found when reviewing the stable Kafka releases. On top of that, the performance boosts, particularly over the 1.X and 0.10 lines, are quite nice, as is the solid JBOD support (making our lives much(!) better when dealing with the all too common disk failures).

On a small, 3 node cluster, running reasonably decent - but still commodity - hardware you could see as little as a 5% slowdown in producing and consuming during an upgrade.

Given that an upgrade will likely take you about 15min total per broker (assuming reasonably large volumes of data, 5min per restart and 3 restarts per step), you can then calculate approximately the amount of lag build-up in the process.

But, you won’t take my word for it, right?

From Git Noob to Wizard in 5 minutes

2019-04-28T00:00:00+00:00

Looking to improve your efficiency with Git? Learn the secrets to go from novice to master to wizard. Not only that, but it can make life significantly easier and faster - every day.

Source

Basics

Simple git aliasing is the easy way to get started with short cuts. They even integrate into the git auto-completions + suggestions, so if you misspell a shortcut it will likely recommend the right thing!

Here’s some things that I have in my ~/.gitconfig

[alias]
  b = branch
  patch = apply
  spatch = apply --summary
  st = status
  # Commit all changes to 'tracked' files with the given message
  amit = commit -a -m
  # Fix the current commit, adding any changes for 'tracked' files
  amend = commit -a --amend

  # Rebase help
  ##############
  abort = rebase --abort
  continue = rebase --continue
  skip = rebase --skip
  cp = cherry-pick

  # commands to list commits
  ##########################
  # simple log printing
  glog = log --pretty
  # simple list
  ls = log --pretty=format:"%C(yellow)%h%Cred%d\\ %Creset%s%Cblue\\ [%cn]" --decorate
  # exact dates
  ll = log --pretty=format:"%C(yellow)%h\\ %ad%Cred%d\\ %Creset%s%Cblue\\ [%cn]" --decorate --date=short

  # branch manipulation
  ####################
  trunk = checkout trunk
  master = checkout master

These are all things you might find in your standard set of suggested shortcuts anywhere around the interwebs.

Shelling out

Once you are starting to get used to shortcuts in git you will likely run into things that are more complicated than just a single command. This is where shelling out becomes useful. You can alias a git command to a series of shell commands.

Often, I just chain git commands together, to save myself the typing, i.e. for common workflows.

[alias]
...
  # aggressively cleanup any files or changes
  purge = "!sh -c 'git clean -f; git checkout -- .' -"

  # If you forget this is git, and not maven
  #####################################
  generate-sources = !mvn clean generate-sources
  test = !mvn clean test
  # checkout a branch and then re-generate mvn sources. Created before golang was a thing :-/
  go = "!sh -c 'git checkout $1 && mvn clean generate-sources' -"

  #redirect gitk stderr to /dev/null b/c it is dumping lines like: 2012-08-02 21:14:49.246 Wish[33464:707] CFURLCopyResourcePropertyForKey failed because it was passed this URL which has no scheme:
  k = !gitk --all 2>/dev/null

A small aside on that last alias - k. It helps with logging from gitk, a simple UI that I find very helpful when trying to visualize all the branches and their locations. Some prefer to do this with fancier command-line versions of git log, but for my money it’s hard to beat the simple navigability of gitk.

Branch in the command prompt

This is a super useful, easy addition to your command prompt that dramatically improves your life, especially if you have multiple git repos. It takes your command prompt from

jyates@home$

to

jyates@home (git-wizardry)$

It’s pretty simple to add too. At the end of your ~/.bashrc you can just include:

# Print out the current branch name, if we are in a git repo. Takes the last
# error code as a parameter, and then returns that same error code, so that you
# can continue to have a correct $? output
function parse_git_branch () {
  git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/ (\1)/'
  return $1
}

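# Escape the $ so the branch is looked up each time the prompt is drawn,
# rather than once when ~/.bashrc is sourced
PS1="${PS1}\$(parse_git_branch \$?)"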

If you are in a git repo, it shows which branch you are on. If you aren’t, it doesn’t show anything. Pretty neat.

Getting fancy

For years I’ve been wanting to switch between branches like directories and trim branches that get merged. Without further ado, here are the additions to your ~/.gitconfig

[alias]
  # not only switch branches, but store the branch I was on
  co = "!git rev-parse --abbrev-ref HEAD > ~/.git_current_branch/${PWD##*/} && git checkout"
  # go to the last stored branch
  cd = "! sh -c 'cat ~/.git_current_branch/${PWD##*/} | xargs git co'"
  # delete the last stored branch
  dlast = "!git b -d $(cat ~/.git_current_branch/${PWD##*/})"

Don’t forget to create the ~/.git_current_branch directory, otherwise these commands will break.

Ok, so … what does that all mean and why should you care? This set of aliases often comes up when I am switching between branches and working on features. For instance, my workflow - similar to the standard git branching model - is something like:

(master) $ git co working-branch
... write code
(working-branch) $ git amit "A super cool feature"; git push origin
... code review, merged code
(working-branch) $ git cd
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
(master) $ git pull origin
... pulling changes
(master) $ git dlast
Deleted branch working-branch (was 31de864).

A simple flow, but something I do multiple times a day and helps keep my workspace nice and tidy.

Shelling out with auto-completion

Shelling out in git commands means that git can’t easily figure out which alias command should be recommended. Fortunately, git has hooks for bash functions (at least in newer versions) to find the root. Basically, it takes the command you enter and applies its recommendation function using functions that start with _git_.

Maybe this is easier with an example. Let’s ensure that our custom co function autocompletes like the normal checkout function:

# Wrapper git functions for auto completion
###########################################
function _git_co() {
  _git_checkout
}

As long as these functions are sourced, ideally as part of your ~/.bashrc, then they get picked up and correctly auto-completed.

Other helpful commands

I also like to keep track of my progress and things to do in my git history. To that end, I like to have these functions in my .bashrc

# Git Functions
###############
#add todo for git
todo(){
  git commit --allow-empty -m "TODO: $*"
}

#add epic todo for git
epic(){
  git commit --allow-empty -m "[EPIC]: $*"
}

Hopefully you found some of these commands useful and they will help you save time and effort every single day!

Just Right Parallelism in Akka Streams

2019-04-07T00:00:00+00:00

Reliably scaling and managing streaming ingest - particularly when dealing with IoT - is a challenging problem. Not only do you have to be low latency, correct and handle high volumes, you also get huge messages and bursty devices. On top of that, firmware developers have their own goals and are not optimizing for ease of ingest, so you have to deal with many many different data formats. What is an engineer to do?

I’ve come to find the combination of Akka Streams and the akka-streams-kafka library a powerful one that solves many of my problems, while giving you release valves to easily do custom things when you need to. You probably haven’t heard of Akka Streams - it’s a streaming framework built on top of the rock solid Akka actor framework. That also means it is stable, reliable and battle proven. It also has some commercial support, if you are into that kind of thing.

Akka Streams is built following the Reactive Manifesto - it is designed with non-blocking back-pressuring so your apps run lightning fast. You are really only limited by your slowest step (allowing you to approach the limits of Amdahl’s Law). The API is similar to many common ETL frameworks; you stream a set of messages and have primitives to filter, groupBy, reduce, foldLeft, batch, etc., as well as develop your own custom processing stages.

If you are interested in some of the basics of using Akka Streams, I’d suggest checking out my friend Colin Breck’s blog where he looks at some of the core components, how you can quickly compose them together and then how you can easily add in parallelism.

We are going to pick up from Colin’s posts and look at how you can take that easy parallelism and shoot yourself in the foot. :)

First, let’s set up a simple flow from a Kafka topic, through some custom logic (which could include sending to another topic, writing to some database, or anything else you could want), that then commits our progress back to Kafka.

object App {

  def main(args: Array[String]): Unit = {
    // setup the consumer to read from Kafka
    val conf = ConfigFactory.load()
    val appConf = conf.getConfig("my-app")
    val topic = appConf.getString("source-topic")
    val destTopic = appConf.getString("dest-topic")
    val control = Consumer.committableSource(consumerSettings(conf), Subscriptions.topics(topic))
      .via(downstream(appConf))
      // batch commits so we flush either every 1000 records or 1 minute
      .toMat(Committer.sink(new CommitterSettings(1000, 1.minute, 1)))(Keep.both)
      .mapMaterializedValue(DrainingControl.apply)
      .run()
  }

  def consumerSettings(conf: Config): ConsumerSettings[String, Array[Byte]] ={
    ConsumerSettings.create(conf.getConfig("akka.kafka.consumer"),
      new StringDeserializer(), new ByteArrayDeserializer())
      .withBootstrapServers("localhost:9092")
      .withGroupId("group1")
      .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
  }

  def downstream(conf: Config): Flow[CommittableMessage[String, Array[Byte]], CommittableOffset, Any] = ...
}

Note that this flow has at-least once guarantees. We could fail after doing the destination step, but before committing. Thus, our downstream needs to be able to handle the potential repeats (yes, it is definitely 100% going to happen, especially at scale).

The interesting work is in that pesky downstream method.

Parsing a record and sending it downstream

In the IoT space, it’s very common to not get one record per message, but rather a bunch of records - it’s generally much more efficient to send over the wire, saves space with compression, etc. Even if you have very well formatted, easy to work with devices sending you JSON, then you (a) live a charmed life, and (b) are still gonna need to unpack that message.

Let’s assume we want to parse messages with configurable parallelism (gotta use those cores!) and each message will parse into an iterator, making our lives simpler when we want to support parsing other data types.

Keeping this in akka-streams, there are a bunch of primitives that can make this a rather straightforward translation.

 type Message = Tuple2[Map[String, Object], CommittableMessage[String, Array[Byte]]]

 def downstream(conf: Config):
  Flow[CommittableMessage[String, Array[Byte]], CommittableOffset, Any] = {
	Flow[CommittableMessage[String, Array[Byte]]].mapAsync(conf.getInt("parser-parallelism")) { msg =>
      // generate the iterator from the record
      Future((msg, parse(msg.record.value())))
    }.map(tuple => {
     // make sure the last message in the iterator is the committable one
     // we don't want to commit before its fully processed!
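      // Message is (parsed event, committable message): parsed events carry no committable message
      val iter = tuple._2.map { m => (m, null) }
      // the trailing marker carries only the original committable message
      val end = Iterator.single((null, tuple._1))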
      (iter ++ end).asInstanceOf[Iterator[Message]]
    })
    // flatten that iterator back out to the stream
    .mapConcat[Message](toIterable)
    // send to our downstream destination, e.g. the database
    .map(event => {
      if (event._1 != null) {
        sendDownstream(event._1)
      }
      event
    })
    // just grab back out our original, committable event
    .filter(event => event._2 != null)
    .map(_._2)
    // just pass the offset to commit back, which is handled by caller
    .map(_.committableOffset)
  }

  def toIterable[A](elements: Iterator[A]): Iterable[A] = new Iterable[A] {
    override def iterator: Iterator[A] = elements
  }

You can find the full code for this example here

And that could take you pretty far - maybe indefinitely - if your stream isn’t too high volume or you just handle small JSON blobs.

So where does this fall over?

The key insight is that the mapAsync only applies to the creation of the iterator. With parsing smaller JSON you can materialize that stream entirely in memory at once and get great parallelism because we are just sending materialized elements downstream.

That mapConcat does not execute in parallel - each iterator is going to be extracted in series, so we are going to be fundamentally limited in our throughput.

Handling big messages

For more complex parsers or big blobs, you will want to produce each event in a streaming fashion. We can take almost the same model as above, but actually do all that work inside the mapAsync with another Stream instance. This gets us the parallelism we thought we were getting.

As a bonus, we also get to process the messages out of order, while leveraging mapAsync to ensure that we continue to commit in order (mapAsync ensures ordering of results). That means the impact of p90+ size messages - those unusually large ones - is dramatically reduced.

That is, a random big message does not block the whole stream from making progress. We will still not commit any of the messages behind it until the big message is processed, but then they will all commit at once.

Now our downstream handling can actually be quite succinct and lightning fast.

  def downstream(conf: Config):
  Flow[CommittableMessage[String, Array[Byte]], CommittableOffset, Any] = {
    Flow[CommittableMessage[String, Array[Byte]]].mapAsync(conf.getInt("parser-parallelism")) { msg =>
      Source.fromIterator(() => parse(msg.record.value()))
        .via(sendDownstream)
        .runFold(msg.committableOffset)((offset, _) => offset)
    }
  }

  def sendDownstream: Flow[Map[String, Object], Any, Any] = { ... }

You can find the full code here

Here we are changing our sendDownstream definition to a Flow - actually a much easier approach to read! Now we get the expected parallelism when parsing a record and sending it downstream, ensuring that big records don’t block the flow.

Not only that, now we continue to use the Streams primitives in a composable way, ensuring that the cost of restructuring is small, testing is cohesive and that future readers are not context switching (not to be underestimated!).

Unfortunately, our implementation does hide complexity around handling exceptions - do you fail the stream if the Iterator creation throws an exception? What if the Iterator throws an exception when getting the next record? That is all left as an exercise for the reader, and is highly dependent on what guarantees you want to offer users.

Further Implications

Now, there is a trade-off to make above: the amount of parallelism. Because we need to keep ordering (so we don’t incorrectly mark messages committed), the stream throughput is inherently limited by the slowest message to parse - assuming that you aren’t already blocking somewhere else. Thus, increasing the parallelism can increase your average throughput; you are trading CPU cycles for increased throughput. However, by increasing parallelism you could see context-switching costs actually leading to higher average latency per record.

That said, when viewed from outside the processor, you could actually be decreasing latency while increasing throughput: small records that would otherwise queue up behind a large record are already parsed by the time it completes, so the stream suddenly skips forward quickly.

As an example, let’s assume we are using a mapAsync parallelism of 4 (my-app.parser-parallelism in our example configuration). Then we start processing 4 records in parallel.

For illustration, let’s assume the first record is the largest. While record (1) is parsing, records (2), (3), and (4) are also being parsed and flowing downstream. Akka Streams is buffering their output - the CommittableOffset - until record (1) is complete, ensuring that we get correct ordering. Eventually, record (1) completes, and then immediately after records (2), (3), and (4) are seen to complete. Thus, it can appear that their processing time is approximately zero.
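To make the four-record example concrete, here is a small Python model of an ordered stage with bounded parallelism. It is a simplified sketch, not Akka's implementation: it assumes upstream always has a record ready, that a slot frees up only when the record at the head of the line has been emitted, and that durations are in arbitrary time units. The name `emit_times` is made up for this illustration.

```python
def emit_times(durations, parallelism):
    """Time at which each record's result leaves an ordered, bounded-parallelism stage."""
    emitted = []
    for i, duration in enumerate(durations):
        # A slot only frees up once the record `parallelism` places ahead was emitted.
        start = emitted[i - parallelism] if i >= parallelism else 0
        finish = start + duration
        # Results leave in input order, so a record also waits for its predecessor.
        previous = emitted[i - 1] if i > 0 else 0
        emitted.append(max(finish, previous))
    return emitted


# One big record followed by three small ones, as in the example above.
with_four_slots = emit_times([10, 1, 1, 1], parallelism=4)
one_at_a_time = emit_times([10, 1, 1, 1], parallelism=1)
```

With four slots, all four results are emitted at time 10: the small records finished long ago but were held back behind record (1), so from the outside they look like they took no time at all. With a single slot, they trail out at 11, 12 and 13.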

That is also why it’s important to have metrics intra-stream as well, so you can understand the performance of your parser/downstream logic, as well as your ingest engine. This becomes even more important when building out a streaming platform, where the parser is no longer under your control and you need to export an understanding of the stream performance.

Downstream pressure

Not only do you have to consider the tradeoffs in throughput, but also the effect on the downstream components. Since this is all running on the JVM you could easily hit a GC pause that causes the Kafka Consumer Group to rebalance. This means that your processor now has to rewind and reprocess the same messages over again, which can mean lots of repeat events sent downstream. In particularly bursty streams, you could easily see the same records parsed 10+ times. So now you are wasting CPU, memory and I/O.

You need to consider whether the latency requirements are really necessary and whether you can tolerate these occasional repeats (your mileage may vary here - everyone’s data is different). It could actually be better to parse just one record at a time because the restart effort is very large or you can only tolerate limited pressure on your downstream.

The tradeoff is that you are inherently limiting your throughput in favor of avoiding repeats.

Note that in this case, you are actually better off just flattening your stream into a map and a mapConcat stage. The overhead of the mapAsync parallelism is going to just be wasteful (You can read more about managing parallelism here).

Managing large messages

In “big data” there is inherently the implication that the long-tail is just part of life. These ‘big messages’ that mess with your throughput (and potentially cause lots of repeats) will be normal.

After you quantify the frequency and effect of these messages, you then have to decide what to do with them. While you could adjust down parallelism, as we talked about above, maybe your latency requirements or parsing profile mean that is untenable.

An option is to run two different consumer groups: one that handles the small messages that parse and play together “nicely”, and a second that handles the big messages that mess everything up. This means you can then build two very different tuning profiles to deal with each group independently. Also, these big messages are no longer blocking your small messages, so you can likely reduce your average latency dramatically for both small and large messages.

Wrap up

Akka Streams combined with the akka-streams-kafka library provides an incredibly powerful set of primitives that can be combined to provide a lightning fast streaming ingest platform. As with any powerful tool, there are sharp edges that you can cut yourself on. However, you can get surprisingly good performance out of the box - a testament to akka-streams. If you are looking to wring out more performance or have a unique use case, you need to have a deeper understanding. Here we have seen how we can compose some basic primitives together to not only wring extra performance out of our stream, but also handle some of the unique properties of IoT message handling.

Rather than wiping out, we can tame that long tail and surf the wave of big data.

Partial Multi-module Maven builds for pull requests

2019-03-09T00:00:00+00:00

As your Maven projects get larger it can take a non-trivial amount of time to complete a build, particularly if you are running each module in sequence due to code limitations. In this case, you should think about only building the code that changed and its downstream dependents.

The short answer to this is to use some of the multi-module flags.

Let’s assume that earlier in our build process we found one module from which there was a change made; let’s call the module ‘foo’.

Then, we want to build all the modules that ‘foo’ depends on, up to ‘foo’ itself. This ensures that we generate the correct binaries for the module (that all its dependencies are up-to-date when building later). Theoretically, you could skip this step and rely on the already deployed binaries, but that depends on your branching model.

# make the module and (-am) all its dependencies
$ mvn -pl foo -am clean install -DskipTests

After that we want to build + test the module and all its dependent modules.

# run the tests for our module and all the modules that depend on it.
$ mvn -pl foo -amd clean verify

Maven offers many different flags to work with multi-module projects that are worth looking into!
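If an earlier build step hands you more than one changed module, the same two commands still apply, since -pl accepts a comma-separated list. A minimal Python sketch of assembling them is shown here; the function name `partial_build_commands` is hypothetical, and how you discover the changed modules is outside its scope.

```python
def partial_build_commands(changed_modules):
    """Build the two mvn invocations used above for a set of changed modules."""
    # -pl takes a comma-separated list; de-duplicate and sort for stable output.
    modules = ",".join(sorted(set(changed_modules)))
    return [
        # 1. install the changed modules and everything they depend on (-am)
        ["mvn", "-pl", modules, "-am", "clean", "install", "-DskipTests"],
        # 2. test the changed modules and everything that depends on them (-amd)
        ["mvn", "-pl", modules, "-amd", "clean", "verify"],
    ]


commands = partial_build_commands(["foo"])
```

Each entry is an argument list that could be handed to something like `subprocess.run`, which avoids any shell quoting issues with module names.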

This can be a great process for when you are running builds for each Pull Request and want to make the feedback very quick for developers. Without special tools you can easily have developers waiting 10s of minutes to hours for a build to run all the tests, even if they are only changing one small module in a corner of the code.

While you may be good at identifying changes to modules, Maven is not necessarily good at managing dependencies. You can often get surprising transitive dependencies and end up essentially building through luck (not a sustainable pattern!). So, a word of caution - you definitely want to still run all the tests at least before going to production, but probably also before going to staging. This will help cut down on failures before it is too late.

The ideal thing would be to move to a build tool like Bazel, which can actually identify the changes made and correctly build the things you need - right out of the box! Unfortunately we don’t always have the time (or opportunity) to do the better thing and are stuck with the tools we have.

Now, how do we find the changes, say in Jenkins, and maybe support it in a multi-language environment? Well, that’s a post for another time…

HDFS Block Metrics - Missing vs Corrupt

2019-03-03T00:00:00+00:00

You are starting to move away from your Hadoop vendor - it was great for getting started, but you want to control your own destiny, reap huge cost savings or institute advanced management. Once you start managing your own Hadoop cluster there are many metrics you will need to start collecting and monitoring.

Two of the most important metrics you have to monitor to ensure your HDFS cluster is happy are “MissingBlocks” and “CorruptBlocks”.

TL;DR: Corrupt Blocks have at least one copy from which the file can be repaired, while Missing Blocks have no more “good” copies available.

First, where do these metrics come from?

Every NameNode exposes its statistics over JMX and even has its own small query language built in! The missing and corrupt blocks can be found at http://mynamenode.host:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem:

 {
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=FSNamesystem",
    "modelerType" : "FSNamesystem",
    "tag.Context" : "dfs",
    "tag.HAState" : "active",
    "tag.Hostname" : "mynamenode.host",
    "MissingBlocks" : 0,                     <------------
    "CorruptBlocks" : 10,                    <------------
    "MissingReplOneBlocks" : 0,
    "ExpiredHeartbeats" : 86,
    "TransactionsSinceLastCheckpoint" : 1354471,
    "TransactionsSinceLastLogRoll" : 159071,
    "LastWrittenTransactionId" : 14338027132,
    "LastCheckpointTime" : 1547515152261,
    "UnderReplicatedBlocks" : 0,
    ...
  } ]
}

Missing and Corrupt both sound pretty bad; when should you sound the alarm?

Let’s look at an example to understand the difference. The example above reported 10 corrupt blocks, but 0 missing blocks.

You would think if you ran fsck that it would return 10 corrupt files.

$ hdfs fsck -list-corruptfileblocks
Connecting to namenode via http://mynamenode.host:50070/fsck?ugi=yarn&listcorruptfileblocks=1&path=%2F
The filesystem under path '/' has 0 CORRUPT files

Huh. But the metrics say we have corrupt blocks!

If you dig into HDFS-8533 you will see that some blocks can be reported as bad by a DataNode but not actually be bad. Likely, the NameNode knows about the corruption and is actively working to copy a non-corrupt version of the file.

Chances are good, especially in large clusters, that you are going to see corrupt blocks from time to time. And it turns out that users aren’t even going to notice it’s an issue (you are alerting on symptoms not causes, right?). So it’s probably only something to warn about if it’s a persistent issue or really gets out of hand.

Instead, MissingBlocks are the real danger. Missing blocks can happen when all replicas of a block in the file are corrupted or all replicas go missing (i.e. don’t take down more than replication factor - 1 datanodes at a time, so 2 with the default replication factor of 3). This is definitely something to alert on - if a user queries for that file they will get an error back. If this becomes a common issue you need to know ahead of time to maintain the quality of your data platform, rather than waiting for blocks to not be found, since the next missing file could be that mission critical one.
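As a rough illustration of that policy, the sketch below classifies a NameNode JMX response like the one shown earlier. It is only a starting point under stated assumptions: the function name `block_health` and the three level names are made up here, the sample payload is trimmed down from the example above, and a real check would likely only warn on corrupt blocks that persist over time.

```python
import json

FS_BEAN = "Hadoop:service=NameNode,name=FSNamesystem"


def block_health(jmx_payload):
    """Map the FSNamesystem block counters to a coarse alert level."""
    beans = json.loads(jmx_payload)["beans"]
    fs = next(bean for bean in beans if bean.get("name") == FS_BEAN)
    if fs["MissingBlocks"] > 0:
        return "alert"  # no good replica left: readers will see errors
    if fs["CorruptBlocks"] > 0:
        return "warn"   # usually repaired from a healthy replica
    return "ok"


# Trimmed-down version of the JMX response shown earlier.
SAMPLE = json.dumps({"beans": [{"name": FS_BEAN, "MissingBlocks": 0, "CorruptBlocks": 10}]})
level = block_health(SAMPLE)
```

For the example cluster, with 10 corrupt and 0 missing blocks, this yields a warning rather than a page.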

But doesn’t CDH/CDP/etc. handle this for me?

Sure does! But then your alerting is scattered around your infrastructure, making it hard to manage. With poor embedded source control, these tools often end up giving you relatively small bang for a non-trivial amount of bucks (especially for large clusters).

While great when getting started, these tools often end up hamstringing you. By ripping out these tools you can accelerate your development with more confidence and at lower costs.

Bonus recommendation: prometheus exporter

If you are already using Prometheus for collecting metrics you can export JMX metrics from your NameNode and DataNodes with a simple javaagent - the Prometheus JMX exporter. It’s enabled with a single java command line argument that points to the jar, the Prometheus scrape port and the configuration location. For instance, for a NameNode it would make sense to expose Prometheus metrics on a port close to the client port, so the command line option would look something like

-javaagent:/opt/prometheus/jmx_exporter.jar=50076:/etc/hadoop/prometheus/jmx_exporter/namenode.yml

where your prometheus config would translate JMX into prometheus metrics with a config like:

---
lowercaseOutputName: true
lowercaseOutputLabelNames: true
whitelistObjectNames: ["Hadoop:*", "java.lang:*"]
rules:
  - pattern: Hadoop<service=NameNode, name=Rpc(Detailed)?ActivityForPort8020><>(\w+)
    name: hadoop_namenode_client_rpc_$2

  - pattern: Hadoop<service=NameNode, name=Rpc(Detailed)?ActivityForPort8022><>(\w+)
    name: hadoop_namenode_service_rpc_$2

  - pattern: Hadoop<service=NameNode, name=(\w+)><>(\w+)
    name: hadoop_namenode_$1_$2

Check out some of the other example configurations for some more ideas.
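To get a feel for how the rules above turn a bean attribute into a metric name, here is a small Python approximation. It is a sketch under assumptions, not the exporter's code: it assumes the rules are tried in order with the first match winning, that the matched string looks like `domain<properties><>attribute`, and that `$1`/`$2` refer to capture groups. The helper name `metric_name` is hypothetical.

```python
import re

# (pattern, name) pairs copied from the config above, in the same order.
RULES = [
    (r"Hadoop<service=NameNode, name=Rpc(Detailed)?ActivityForPort8020><>(\w+)",
     "hadoop_namenode_client_rpc_$2"),
    (r"Hadoop<service=NameNode, name=Rpc(Detailed)?ActivityForPort8022><>(\w+)",
     "hadoop_namenode_service_rpc_$2"),
    (r"Hadoop<service=NameNode, name=(\w+)><>(\w+)",
     "hadoop_namenode_$1_$2"),
]


def metric_name(bean_attribute):
    """Return the metric name from the first matching rule, or None."""
    for pattern, template in RULES:
        match = re.search(pattern, bean_attribute)
        if match:
            # Substitute $N with capture group N (empty if the group did not match).
            name = re.sub(r"\$(\d+)",
                          lambda ref: match.group(int(ref.group(1))) or "",
                          template)
            return name.lower()  # mirrors lowercaseOutputName: true
    return None


missing = metric_name("Hadoop<service=NameNode, name=FSNamesystem><>MissingBlocks")
```

Note that the order matters: the RPC beans would also match the last, catch-all rule, so the more specific port-based rules have to come first.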

Dockerizing Jenkins Maven builds

2018-09-16T00:00:00+00:00

Many legacy build pipelines leverage Jenkins. If you get lucky, you will at least find the time to move to a Jenkinsfile - the same power as Jenkins, but now actually codified, rather than fragile point and click.

As apps start to move to containers, you will probably want to also run your builds inside Docker containers. This has the advantage that the build environment can match the development environment, which can match the production environment.

Standard Use

The Jenkins pipeline documentation implies you can just drop in a pipeline by specifying the ‘agent’ as a specific docker image.

Jenkinsfile (Declarative Pipeline)
pipeline {
    agent {
        docker {
            // use an image that actually ships Maven
            image 'maven:3-alpine'
            args '-v $HOME/.m2:/root/.m2'
        }
    }
    stages {
        stage('Test') {
            steps {
                sh 'mvn -B test'
            }
        }
    }
}

In theory, this will allow you to mount your maven cache inside the container to help speed up the build (rather than re-downloading the internet each time). There are myriad options for leveraging containers to improve your build, from sidecar containers to using a Dockerfile to specify the build environment and more. It’s worth reading the docs to get started (if you haven’t already)!

Unfortunately, to get reasonable performance and solid isolation in your builds, it’s a little more complicated than the standard guides readily describe. For many people (particularly those with multi-language monoliths) you will want to have different environments and multiple build steps. This is where you want to leverage standard software practices and try to be DRY. Additionally, this also makes it easy to update constants from a single place when changing your build.

Let’s start with a maven project, where we want to keep the cache around for multiple build runs:

DOCKER_MAVEN_IMAGE = 'maven:3.5.2-jdk-8-alpine'
DOCKER_MAVEN_ARGS = '-v $HOME/.m2:/root/.m2'

pipeline {
  // no global agent; each stage declares its own container
  agent none
  stages {
    stage('Test') {
      agent {
        docker {
          image DOCKER_MAVEN_IMAGE
          args DOCKER_MAVEN_ARGS
        }
      }
      steps {
        sh 'mvn test'
      }
    }
  }
}

Cool, that’s pretty simple. But what if we want to also build, run tests, etc. on pull requests?

Managing Multibranch Pipelines

That’s where you need to have a more complex multi-branch pipeline. This generally happens when you want to build a master and a release branch and run tests off of pull request branches. In this case the standard maven cache mounting args will likely cause an issue if you are running more than one build per server. Remember, the same cache is mounted into all the build containers!

Best case, you merge a PR that shouldn’t have been because it got an artifact from the master build. Worst case, you release broken code from a PR to production (so pretty bad).

To avoid this, you will need to mount each branch to a separate directory. This ensures that your master and release branches continue to build quickly and correctly, as well as subsequent PR builds (unfortunately the first build of any branch will need to re-download all the necessary dependencies).

DOCKER_MAVEN_IMAGE = 'maven:3.5.2-jdk-8-alpine'
// Bind workspace m2 repo to not download internet too many times. New builds will have to download jars once, but should have minimal thrash for later runs. We don't bind $HOME/.m2 to ensure independence across builds
DOCKER_MAVEN_ARGS = '-v $HOME/.m2/builds/$BRANCH_NAME:/root/.m2'
...

Additionally, to ensure that we actually have the ability to write files, we also need to run the container as the root user, which maps to the local root user. If you always know the jenkins user id, you can specify this user (and skip the next step), but in complex environments that have evolved over time this is rarely easy or possible. Thus, our argline evolves to add a -u flag for the docker command.

DOCKER_MAVEN_ARGS = '-v $HOME/.m2/builds/$BRANCH_NAME:/root/.m2 -u 0:0'

Cleaning up builds

After a build you probably have a cleanup stage. This ensures that the next build executes correctly and without any leftover state. This can be a problem as the image runs as the root user, and because of the way containers are run the files generated will also be root owned, so they will not be able to be deleted by the Jenkins process!

To get around this you should make sure to run a maven clean command after your stage. The end result of all this tuning would look like:

DOCKER_MAVEN_IMAGE = 'maven:3.5.2-jdk-8-alpine'
DOCKER_MAVEN_ARGS = '-v $HOME/.m2/builds/$BRANCH_NAME:/root/.m2 -u 0:0'
pipeline {
  // no global agent; each stage declares its own container
  agent none
  stages {
    stage('Test') {
      agent {
        docker {
          image DOCKER_MAVEN_IMAGE
          args DOCKER_MAVEN_ARGS
        }
      }
      steps {
        sh 'mvn test'
        // remove the root-owned generated artifacts so Jenkins can clean up the workspace
        sh 'mvn clean'
      }
    }
  }
}

Bonus Tip: Loading the Shared Library

You can get even more DRY by leveraging a shared library, where you can abstract a lot of these commands into something simpler.

Sometimes when running a multi-branch pipeline you can get into cases where the wrong library gets loaded or Jenkins can’t find a Git SHA (particularly if you are using BitBucket!). In that case, you can maintain your build stability by adding a loading phase for the library at the beginning of the build. That could make the build look as simple as:

LIBRARY = "my-lib"

pipeline {
  stages{
    stage('load') {
      steps {
        script {  library(LIBRARY).my.path }
      }
    }
    stage('test') {
      steps {
        script {  library(LIBRARY).my.path.Maven().test() }
      }
    }
  }
}

A bonus advantage of this approach is that you can easily test new shared library versions by changing the constant to point to your test branch, like LIBRARY = "my-lib@branch" and quickly iterate.

Starting at Tesla

2017-07-30T00:00:00+00:00

I’m excited to announce that I’ll be starting a new job on Monday… at Tesla Motors! They have a mission that is incredibly exciting - nothing less than trying to save the world. With lots of opportunity and potential for impact, I can’t wait to get started.

Since they are, ironically, still early in their “data journey”, I’ll be working with car data initially - on the only team looking at ‘scalable’ systems - and then looking to expand our use into things like manufacturing, Superchargers, solar panels, etc. At this point, the sky’s the limit.

It was a rather longer process than I expected to land at Tesla. I wanted to make sure that the next job I took was a good fit and somewhere I would be able to thrive. In the end I had

curated list of 30 potential companies
15 informational meetings
7 interviews

By the end of it all, I was quite tired! The reason for such a wide net was that I didn’t really know what I wanted. So, I talked to the full range of companies, from the 10 person startup created 3 months ago, all the way up to the standard behemoths of the Bay Area.

After grinding for two years on Fineo I realized that I needed somewhere with a bit more stability, but still early enough to shape the story and doing something tangibly “good” for the world.

There were a couple of startups that were really exciting, but in the end I couldn’t be confident of my success; I didn’t want to burn out in 6 months - a bad ending for the company and myself. With a wedding coming up and lots to do, more craziness was exactly the opposite of what I needed.

Looking then at “big” companies in the area, they all fit a certain scale necessary for “big data” - much smaller and it wouldn’t make sense, much larger and they are going to be established. Tesla fit nicely in that middle ground where I wasn’t fighting culture, but still had lots of things to do. Combined with a freaking awesome mission, I had to get on board.

My biggest concern is having a commute to Palo Alto. They have a bus (that leaves really early!) and shuttles from Caltrain, so it should be OK. And then I also plan to regularly ride my bike down - at 40 miles it’s enough to count as a good workout!

By no means was this an easy decision, but it is one that I’m happy about in the end. It was funny to realize that towards the end of my process, I found myself continually trying to justify Tesla over other companies. Fortunately, with a helpful recruiter (yeah, I wouldn’t expect to ever say that) they met all my concerns such that a justification was easy.

I’m excited to say that I’m joining Tesla to do big data and (hopefully!) help save the world.

Oh, and if you are interested in that kind of stuff - we are hiring :)

Hard isn't Valuable: Looking back on Fineo

2017-06-05T00:00:00+00:00

I’ve decided it’s time to wrap up Fineo. I took a shot for a while (nearly two years!), but I’m way past my original time deadline to get traction and well out of (allocated) money. I’ve spent the last few weeks writing up some of the interesting architecture/design work I did, so at least there is some decent record. At the same time, I’ve been reading a bunch to understand where and how I went awry.

Emotional Impact

Quitting Fineo felt a bit like getting broken up with as a teenager. After spending an inordinate amount of time together, Fineo became a large part of my identity (not a good idea). Once it’s over, it still takes a while to really accept that it happened - that it’s over - along with the necessary obsessing over (and over and over) what happened, why it didn’t work out, etc. At the end of it, it wasn’t necessarily something I regret doing, but it’s taken a bit of time to get used to the idea that I’m not going to be working on this thing that was a huge part of my life for nearly two years.

It’s been an emotional roller coaster, not to sound too cliché, filled with stress, depression, flow and yes, a couple wins. At the same time, it’s been great pressure to expand my comfort zone and learn oodles of things I didn’t even consider before. I’ve said for a while that even if Fineo failed (predestination?), it would still have been worthwhile.

Business Path and Pitfalls

With any startup, there are inevitably challenges and mistakes made. Without a business person along for the ride, there were probably more than most. However, looking back at some of the core missteps I made in starting Fineo, there are three over-arching personal challenges that made failure inevitable:

hubris
impatience
loneliness

compounded by a core mistake of conflating hard with valuable, and not spending enough time talking to a range of customers (Read “What customers want”, it makes this super obvious).

Oh, and running out of money - that didn’t help :)

Beginning: Streaming SQL + Analytics

Starting a company was something I’d always wanted to try and had saved money for years for just that reason. By summer of 2015 I felt I had a very comfortable amount and was in a position where I had relatively few responsibilities. Startups looked challenging and I’ve got a track record of finding the hardest thing that captures my interest and doing that - the more potential for pain, the better; harder must be better, right? Right?!

I looked around at the burgeoning IoT market and thought,

That’s the next big data challenge. Surely what we were doing at Salesforce is applicable out there. Managing and leveraging all the data is certainly going to be hard.

So, I started working on an idea I had been percolating to build a fast SQL data analytics tool for streaming and scalable data. Certainly seemed challenging and a whole lot more interesting than what I was working on previously.

Talking to Customers…Errr, right.

I didn’t go off completely without validation. I’d talked to a handful of folks and had some initial interest in a solution to “big data for IoT” - the buzziest of words. A couple of people mentioned needing feedback in the 10-100ms range (particularly from server logs, but that generalizes to devices, right?) and challenges in analyzing data at scale. At the same time, the partners I thought would come with me decided to stay at their current positions (can’t fault them!), leaving me to strike out on my own.

“No matter”, I thought, “I can do this on my own. I can hustle. I can slog through the crap. I can sell this great idea. And I can code like crazy.”

Yeah, right.

Nope, nope, just… no.

So I spent that first month at home grinding out this super cool data tool. Coding 10, 12 hours a day. Finally, it got to the point I could show it and it looked like any time-series backend product… it wasn’t very exciting to look at. So, I spent some time hooking it up to an existing frontend UI and… it looked like just another time series database tool, and didn’t articulate why it was so much better.

But, I was finally starting to go the right direction - showing things to people, figuring out what people are actually struggling with. And it turns out that people almost never needed what I had built (5ms latency SQL-based analytics). I’d had one internal case from Salesforce, and my knowledge that it was interesting and novel technology, to base my work on; turns out, solution looking for a problem.

However, I’d still only shown it to a handful of people. This was not nearly enough to get a sense of customer needs (or desired outcomes) or have a base of people to whom I could sell the product (which should be those very people to whom I talked that had the problem).

Streaming Database Platform: Business Analysis

Basically, we were competing with a large amount of non-consumption. The need among major corporations - generally leaders in the open source data space - for a scalable, SQL stream tool did not emerge until nearly a year later, and even then it wasn’t a common issue. For most, there wasn’t enough pain to justify a complete switch, while at the same time many tools emerged to help keep that original system alive.

I also didn’t focus on the IoT market - the platform background blinded me to just making a product that was tailor suited to the IoT/device needs. The emerging IoT market is still in the very nascent stages, where the winners are generally those who have an integrated, high-performing solution, rather than componentized architecture.

On the left, you generally are going to have the integrated companies that can move fast and deliver lots of end user value. On the right side, you have the componentized companies that are good at delivering incremental value powered by increasingly better components.

With the fully integrated stack, companies can more quickly deliver a product that directly solves the customer problem. As they become more componentized, the integrated stack slows them down as they cannot compete across the standardized ‘metrics of value’ for the customer, forcing the architecture to become more componentized with stricter interfaces between components. This allows companies to more quickly replace components with higher performing ones. This is also the point where the main product becomes commoditized, and the value gets driven into the component makers (e.g. an analytics/database layer).

Fineo was positioning itself as a better component in a space where an integrated product enables IoT companies to succeed. At the same time, the component market in which we played was crowded - each tool/component option provided a modicum of differentiation, where Fineo’s either (a) did not make it clear why you would want it, or (b) wasn’t useful enough. Again, I was focusing too much on the high level differentiation between databases, rather than solving for the focused IoT case (but, this also might not exist).

OK, back to the story so we can see how this gap in understanding continued to plague Fineo.

Pivot 1: Enterprise NextSQL

I’d been building enterprise big data for years, so certainly that is something worthwhile (or so the thinking went). Back to the code cave to build an architecture to execute SQL at scale (leveraging my recently gained knowledge, if not direct work) across online and offline data stores, while providing the flexibility of NoSQL - tada: NextSQL! (or metalytics, as I later came to call it).

Then I proceeded to spin a story around how IoT needed the flexibility of NoSQL, but still wanted the same SQL interface and obviously required the big data scale.

Taking that idea on the road a little bit, I’d talked to tens of companies and found some interest.

Only one had sustained any interest past the next week.

None bought.

However, I was out talking to people and getting some feedback. And I really don’t like talking to people I don’t know, so it all felt useful.

I was still missing that core component of an understanding of the problems facing these companies right now. Most of them had a database that worked pretty well for them (generally an RDBMS, like Postgres or MySQL) and needed to focus on getting the core device sold. I was frankly scared to go to bigger companies - sales was outside my comfort zone, though I did make some initial attempts - that would be able to actually leverage what we were solving.

And even worse, I knew that was a key weakness - sales, business - and didn’t spend a majority of my time looking for someone to help fill that ability. I mean, it seemed like it was going OK and, come on, I’m certainly smart enough to do that work too…right?

Interlude: Contract Work

During the fall and winter of 2015 my father had health issues, so I was splitting much of my time between the business and trying to help him out (no mean feat, while living across the country). In the spring I took some contract work for a company using Apache HBase and interested in the SQL layer on top, Apache Phoenix - both projects I’d been working on for the last six years as a core contributor.

This was a good reset, providing a more rigorous schedule and helped refill the coffers a bit. At the same time, I found my first ‘real’ customer through that work, interested because he was in the big data industry and understood the niche I was looking to fill.

Validation! So maybe I was building the right thing!

Now, I just needed to get back to work and finish the damn platform.

Building the Platform and Growing a Team

About this time I started to realize that I couldn’t manage it all on my own. Splitting my time between business and coding wasn’t working out. And quite frankly, I knew I was crap at sales and marketing, and needed some help.

So, I started to plumb my network for folks that could help. I could put together an ‘advisory group’, but couldn’t convince anyone to come on full time, though always with the tantalizing caveat of “sure, when you raise.” But, at least now I had a story around help from experienced folk in various fields that I didn’t know a damn thing about.

At the same time, the platform was starting to come together. I could read and write to it, we had solid testing infrastructure with comprehensive coverage and prod-like deployments. It was everything that I would want from how a ‘real’ system should be run.

However, we were still set up for a very “high touch” integration and lacked an easy way for customers to get started. That meant lots of talking to people and manual sales on a technical basis, since there still wasn’t a good visualization component.

There was some more interest from a couple of companies and a second company committed to being a beta customer.

Recruiting

Around the same time, a year into the company, I had grown quite lonely. An introvert by nature, I was at a point where I was craving more social interaction, but was still hamstrung by my timidity around meeting and talking to new people. I’d taken up drinking socially much more frequently (makes it easier to just chat!) and personally noticed that it reached a somewhat concerning point, but shrugged it off in passing jokes.

Imagine my excitement when a contact reached out and was interested in joining me! Finally, someone else who (a) gets it, and (b) has time to work. With a seemingly complementary skill set and a channel to folks that also might be interested and have some time, I was pretty excited.

Finally, the momentum was picking up. I had mornings full of meetings with potential new folks and had more things to manage. And busier is better, right!?

However, it was not a match meant to be. I ended up spending most of my time worried about what my potential co-founder was doing, how to best use them. All my suggestions of things to work on were met with positive responses, but things didn’t seem to be progressing; I was more and more busy, but less and less felt like it got done.

Cue lots of insomnia and mini-panic attacks worrying about making things work. I’d never had lots of issues with that and had started taking some drugs to help get to sleep regularly. Not a good situation for anyone.

In the end, I had to end the relationship and move on. Partially, so I could sleep and go back to manageable increasing anxiety, and in part so I could re-focus on building the product and getting customers.

Pivot 1b: Hosted Timeseries Database

We also started positioning ourselves as a hosted time-series database built for the enterprise (e.g. high availability, reliability, etc.). This is an even more limited market than the generic database market, with even smaller differentiation points. At the same time, there are huge switching costs for customers between databases (moving the historical data while transitioning new data access). This compounds with the fact that many of our prospective customers - small IoT startups - were still on those traditional RDBMS systems that were working well enough. So we had to offer a dramatically lower cost (hard) and better performance (nope, not yet).

At the same time, we also had a smaller range of features than a traditional database - we cut out certain capabilities to enable the wider scale. However, this made it even harder to transition from the existing infrastructure.

Here, we could have focused on tools to enable transitioning databases, or on the wider ‘big data’ market where we could win on price with a smaller SQL feature set (i.e. traditional low-cost disruption).

Pivot 2: Hosted IoT Platform

At this point, I came up to my original, arbitrary deadline of January 2017 to get funding. But, winter is a bad time to raise, so I pushed that deadline out further. I also knew that my pitch was hurting because (a) I didn’t have enough validation (i.e. traction) and (b) I had no co-founder. We were also well past the funding hype of 2015, so turning a profit started looking increasingly important too.

A great piece of advice I got was to turn the challenges around and look at it as a distribution problem - make it a numbers game to make it at least seem like we had more traction.

Well, many of the people I talked to about what I was building were developers and they got it, understanding why it was interesting and novel and hard. Sounds like exactly the kinds of people to attract as users. Now, rather than going out and talking to a bunch of IoT developers for what exactly they need, I figured that as a developer myself I could certainly know what would work.

It was back to the code cave, this time to build out a sign-up system for users and a UI dashboard so people could actually see and touch what I was building. If you can’t touch it, it doesn’t really exist.

The initial UI looked good and it was interesting to work with a new technology (even if it was wildly frustrating at times), so my technologist side was satisfied.

At the same time, I also started doing some more ‘sales-y’ work: scraping together a list of a couple thousand contacts to start cold calling. And I had some initial interest with that (50% open rates!), but it was well outside my wheelhouse and I found it hard to summon the mental effort to continually pursue it.

Developer IoT Platform: Business Analysis

Leveraging the big data expertise and our no-operations architecture model, we could compete well with the wider hosted database market (but that’s pretty crowded) or focus on the general IoT Platforms. The integrated platform worked when competing against Amazon’s tool focus (making an integrated product goes against their ethos) and had the added advantage of many people using AWS by default (lowering the switching costs).

However, we were still looking at a relatively modular play, in an integrated market (we stopped at the edge of the cloud). We looked to move further up the stack (where there is higher value) with some basic dash-boarding capabilities, but that’s not necessarily something we could turn on overnight. The bet was around providing the core interfaces and DB capability suited to a market that didn’t want to deal with scaling and evolving data (which people, unfortunately, rarely think about deeply upfront), while maintaining a familiar interface (i.e. SQL).

But there is also a core problem in a data Platform-as-a-Service (PaaS) - few companies are willing to give their data to a startup. They can lose it, are probably likely to go down and might not even be around in 6 months.

The only people that can take that risk are startups themselves, but they are rarely going to need the scale benefits that are so core to what Fineo offered. So, why trust a startup when I could easily run MongoDB or Postgres for much of my data; at a bigger scale, I could turn to plenty of big companies (Samsung, AT&T, Google) that provide time-series database-as-a-service offerings.

I’m not sure how the proliferation of the ‘modular integrated’ IoT platforms will fare. These are things like AT&T or GE that purport to provide the core features you need for an IoT application in a quickly composable way (separate from AWS which enables all the things). It feels like everyone is racing to provide the shovels, while there are relatively few people actually “digging for gold”. It might be that we end up with a market that quickly moves to a componentized model because the value of the components is so high. Or it might be that the various cloud providers enable some entrenchment and can provide the integrated capabilities for a while.

Places Fineo Could Go

There are a couple of obvious things I could have pursued in the current climate:

Partner with an IoT gateway/platform as the time-series database component
- A more integrated experience for the IoT company that is focused on delivering value for their customer. We do have some of the best tech for this, if I do say so myself.
Provide a service for managing device data once it passes through the AWS IoT Gateway. Solve the ‘what now?’ problem.
- AWS still has lots of little caveats across a huge range of potential services.

But I’m getting tired and stopped having fun a long time ago.

Retrospective

What would I have done differently? A whole hell of a lot, but the core of it comes down to the fact that hard isn’t necessarily valuable - customers are the determinant of what’s valuable.

Basically, I did everything completely backwards. Just wanting to start a company and having a couple of indicators you are on the right path isn’t enough. Here’s the order I’d try next time:

1. Apply tech to dramatically lower the cost of solving a problem
2. Find companies/users who have the problem
3. Get them to sign a letter of intent for your solution. More is better.
4. Get a cofounder.
   a. Helpful if they are complementary skill-wise, but most importantly someone to completely trust
5. Raise money
6. Quit regular job

What went right?

I was looking at an emerging, disruptive technology (IoT) which was increasing the availability of data to people that previously didn’t have it (i.e. new market disruption). With our no-ops approach we could come in at an even lower cost than the existing ‘incumbents’ in the market (i.e. low-cost disruption), which also provided us with a strong competitive advantage in that we could implement changes to the platform almost as fast as we could write them. At the same time, we also had a strong technical advantage from my Big Data/Open Source background that enabled us to approach bigger data volumes than most of our competitors.

However, with an almost pathological avoidance of deep/wide customer conversations and a fundamental misunderstanding of the state of the industry, combined with a heaping amount of hubris, it was always an uphill battle.

In the end

Nearly two years into Fineo, I’ve run out of network contacts for potential co-founders and, frankly, am tired and pessimistic about the prospects for another IoT platform. The struggle to call it quits has been rough: I’ve spent much of the last 4 months in a deep depression (rivaled only by one or two other episodes in my, admittedly short, life), but am still convinced that what I was doing was novel and interesting.

I’ve always been fairly successful at anything I’ve tried (middle-class white, male privilege helps a lot), so deciding to quit has been mentally hard to grasp - I’ve powered through a marathon on broken legs, sent myself to the hospital to finish an Ironman; certainly, this isn’t the line for me, is it?

Just like after a breakup, I’m struggling to find something that really excites me. The experience has kindled an increasingly entrepreneurial nature; there isn’t a week that goes by when I’m not bugging my fiance with another ‘great’ idea. But, right now I need to get a real job to recover and provide some stability.

Unfortunately, generally only “lame” companies are using cool technology (i.e. advertising, sales, etc.), and vice versa - all the ‘save the world’ companies are using conventional software stacks.

But if you are working on something that fits the ‘worthwhile’ and ‘cool tech’ stacks, I’d love to hear about it!

What’s next?

For myself, I’m resolving to be more humble, more patient and more outgoing. The business side of things is definitely fun and something I’m going to be pursuing and reading about more, but I doubt I’ll move too far from the keyboard yet.

Starting Fineo has helped me realize - truly, deeply know - how much I don’t know and can’t do alone. In fact, I suck pretty hard at parts of this startup thing (though I’ve gotten better at some bits). And I now get that change, regardless of context, takes a certain amount of time, of buy-in, of pure hustle and sometimes you just need to beat your head against it.

Am I happy with how this all turned out? No, not really. Would I do a lot of it differently? You bet your ass. Do I regret having tried? Nope. At least, not most of the time. Would I do it again? Yeah, I think so. Soon? Maybe not.

I couldn’t have gotten even half way through any of this if it weren’t for the incomparable support (emotional and business-wise), understanding and patience of my fiance, Megan - a brilliant product manager and transcendent baker in her own right. Thank you.

Building Up Fineo's Continuous Integration with Jenkins

2017-05-22T00:00:00+00:00

Getting a robust continuous-integration (CI) suite was an early priority at Fineo. By spending some upfront time getting good infrastructure in place we could move dramatically faster down the road; with a distributed, micro/nano-service based architecture, in-depth testing across the stack is a must.

In the beginning, we started out with just a bunch of unit and simple integration tests kicked off by Jenkins. We added in a suite of local end-to-end tests using resources as close to production as we could get. However, running in AWS means there isn’t always a good (or available) analog for services, so you have to test in the cloud.

We devised a basic set of ‘customer-like’ use cases and leveraged AWS Cloudformation templates to replicate a production environment against which we could run the life-like tests. With this production-like suite we could confidently and automatically deploy to production - there were no other tests to run that would give us more confidence.

Then we started to get fancy.

Debugging test runs was still a pain, involving manually parsing through kilobytes of logs and ugly messages delineating stages, like:

---- DEPLOY Started ---
** Stream Processing Deploy started **
...
... < lots of text >
...
** Stream Processing Deploy COMPLETED **
--- DEPLOY COMPLETE ---

We were also storing all the jobs as chains of scripts - Bash, Ruby and Python - which starts to get unwieldy very quickly.

Enter Jenkins Pipelines - a Groovy DSL for job definition and a good looking UI for tracking steps. Still a bit rough around the edges, but a huge step up from chaining bash scripts and reading reams of terminal output.

Soon after discovering Pipelines, we started to slowly migrate jobs over to the Pipeline framework. Some of the jobs took tens of minutes to run - the perfect opportunity to work on the conversion of the next job while waiting on a long-running one.

Building to End-To-End Testing

The first stage in our build process is local unit and integration tests, for each component. Most of the infrastructure is Java based, and built by Maven, so this is as simple as mvn clean install. Often this will trigger downstream jobs to also build (they depend on the project that changed). There are a couple of ‘sink’ jobs - jobs that don’t trigger a downstream job - that we monitor to kick off the local end-to-end testing.
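As a rough illustration of the ‘sink’ job idea, here is a small Python sketch (not the actual Jenkins configuration). The job names and the shape of the `downstream` mapping are hypothetical; the point is only that a sink is any job with no downstream jobs of its own.

```python
from typing import Dict, List


def find_sink_jobs(downstream: Dict[str, List[str]]) -> List[str]:
    """Return the jobs that trigger no downstream job, sorted by name."""
    jobs = set(downstream)
    # Jobs that only ever appear as a downstream target are still jobs.
    for targets in downstream.values():
        jobs.update(targets)
    return sorted(job for job in jobs if not downstream.get(job))


# Hypothetical job graph: each key triggers the jobs in its list.
EXAMPLE_JOBS = {
    "common": ["ingest", "query"],
    "ingest": ["stream-processing"],
    "query": [],
}

# "query" and "stream-processing" trigger nothing, so they are the sinks
# whose completion would kick off the local end-to-end tests.
SINKS = find_sink_jobs(EXAMPLE_JOBS)
```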

The local end-to-end testing stands up local versions of all the AWS services we leverage - DynamoDB, mock Lambda, Spark - and runs a set of tests that exercise the system in cases as close to the ‘real world’ as we can find, checking things like ingesting, reading, schema management and batch processing.

Each test is its own stage in the Jenkins Pipeline, allowing us to easily see progress in the test suite.

Testing in the Cloud

Supposing the end-to-end testing completes successfully, we then kick off the production-like testing pipeline against AWS. This has two main phases: tenant-specific deployment and any-tenant deployment. The tenant-specific phase validates that only a specific API Key can be used to access the data, while the non-specific phase merely ensures you have a valid API Key.
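The difference between the two checks can be sketched in a few lines of Python. This is only an illustration under assumed names (`is_authorized`, `valid_keys`, `tenant_key` are all hypothetical), not how the Fineo API actually performs authorization.

```python
from typing import Optional, Set


def is_authorized(api_key: str, valid_keys: Set[str],
                  tenant_key: Optional[str] = None) -> bool:
    """Any-tenant mode needs a valid key; tenant-specific mode needs *the* key."""
    if api_key not in valid_keys:
        return False
    if tenant_key is None:
        # Any-tenant deployment: a valid API Key is enough.
        return True
    # Tenant-specific deployment: only that tenant's key may read the data.
    return api_key == tenant_key
```

With `valid_keys = {"key-a", "key-b"}`, the call `is_authorized("key-b", valid_keys)` passes, while `is_authorized("key-b", valid_keys, tenant_key="key-a")` is rejected.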

In both the tenant-specified and non-specific cases, we do a complete deployment of the software stack, just like in production. The only difference to production is the name of components and where they write data. A huge boon (and really the best/only way to do this on AWS) is AWS Cloudformation, which allows you to provide templates of your resources and semi-automatically upgrade them as the templates change.

At the end of the pipeline we request user input to verify the component(s) to deploy. This helps ensure we don’t have a lot of thrash in the infrastructure - code changes regularly. The input step is more a concern about too much automation, too early; once manual approval no longer scales, we expect this step to be automated as well. In fact, as we mentioned above, there is no more surety we could have about the code because the suite captures the core user test cases.

Cloudformation is great, as long as you stop trying to manually manage the Cloudformation instantiated resources (just don’t, it **will break things**). Templates ensure you get the same thing every time and don’t have snowflakes. Our templates are mostly composed of pointers to resources, so the first phase of the deployment requires pushing the resources into S3, then updating the templates and finally actually instantiating the Cloudformation “stack” (i.e. set of resources).
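The ordering of those three phases can be sketched in Python. This is a minimal sketch, assuming the upload and stack-creation steps are passed in as plain callables (`upload` and `create_stack` are hypothetical stand-ins, not AWS SDK calls), and that a template is just a mapping whose values may name an artifact.

```python
from typing import Callable, Dict


def deploy_stack(
    artifacts: Dict[str, bytes],
    template: Dict[str, str],
    upload: Callable[[str, bytes], str],
    create_stack: Callable[[Dict[str, str]], str],
) -> str:
    """Run the three phases in order and return whatever create_stack returns."""
    # Phase 1: push every artifact to storage and remember where it landed.
    locations = {name: upload(name, data) for name, data in artifacts.items()}
    # Phase 2: rewrite template values that point at an artifact name.
    resolved = {key: locations.get(value, value) for key, value in template.items()}
    # Phase 3: instantiate the stack from the resolved template.
    return create_stack(resolved)
```

Passing the steps in as callables keeps the sketch testable with fakes; a real deployment would plug in the actual S3 upload and stack creation.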

Templating the Templates

In generating a production-like test environment, we needed a way to reproduce production. We were already using Cloudformation to ensure we had declarative infrastructure that could be deployed onto any AWS Region quickly, so it was natural to look at reusing the templates for testing. Since the Cloudformation template files are just text (JSON or YAML), we used a templating library to ‘template the [Cloudformation] template’ to generate both the test Cloudformation template(s) and the production templates.

Specifically, we used the Liquid templating language, as most of the test infrastructure is also Ruby. This ended up being a huge time saver as we could reuse a lot of common definitions (e.g. S3 bucket locations or API Gateway monitoring options) and have a common way to define test infrastructure (e.g. all S3 test paths start with /test).

We could then easily combine a set of declarative template and property files, deployment specific variables, and test transformations into one cohesive whole that was succinct and readable.

Here’s a snippet of one of our Liquid-Cloudformation Templates:

It was so useful we also used Liquid templating in generating the API Gateway definition files. It was invaluable for things like creating a common response language/protocol across multiple APIs (and you don’t just jam all the endpoints into a single API, right?).
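To make the ‘template the template’ idea concrete, here is a minimal Python sketch. It uses the standard library’s `string.Template` as a stand-in for Liquid, so it is not our Ruby/Liquid code; the resource names, the `example-` bucket prefix and the variable names are all made up for illustration.

```python
import json
from string import Template

# Hypothetical Cloudformation-like fragment shared by test and production.
TEMPLATE = Template("""{
  "Resources": {
    "IngestBucket": {
      "Type": "AWS::S3::Bucket",
      "Properties": {"BucketName": "${bucket_prefix}-ingest"}
    }
  },
  "Outputs": {
    "DataPath": {"Value": "s3://${bucket_prefix}-ingest${path_prefix}/data"}
  }
}""")


def render(environment: str) -> dict:
    """Render the shared template for 'test' or 'prod' and parse it as JSON."""
    variables = {
        "bucket_prefix": "example-" + environment,
        # Test paths start with /test; production paths get no extra prefix.
        "path_prefix": "/test" if environment == "test" else "",
    }
    return json.loads(TEMPLATE.substitute(variables))
```

Parsing the rendered text with `json.loads` is a cheap sanity check that the substitution still produced a well-formed document before it goes anywhere near AWS.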

Drinking Champagne

Using our generated templates, a full production-like deployment can then be stress tested across a variety of real-life use-cases. Since we are a time-series platform, and it’s important to drink your own champagne, we capture the amount of time it takes to perform each of the tests and feed that data back into our platform, as well as the CPU/Memory consumption of our Jenkins Server.

If the time/resources for any component are wildly different from the historical values (avg, 75%, 90%, 95%, 99% - you are monitoring percentiles, right?) we fail the test run, roll back the test infrastructure and alert the folks whose code changes were involved; performance is just as important as correctness for us.
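A check along these lines can be sketched in Python with the standard `statistics` module. This is a simplified sketch, not our actual check: it only looks at the 95th percentile, only flags slowdowns, and the `tolerance` multiplier is an assumed knob rather than a value we used.

```python
import statistics
from typing import List


def is_regression(history: List[float], latest: float,
                  tolerance: float = 1.5) -> bool:
    """Flag a run slower than `tolerance` times the historical 95th percentile."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    # It needs at least two historical data points.
    p95 = statistics.quantiles(history, n=100, method="inclusive")[94]
    return latest > p95 * tolerance


# Hypothetical durations (seconds) of previous runs of one test stage.
HISTORY = [10, 11, 12, 10, 11, 13, 12, 11, 10, 12]
```

With this history, a 30 second run would fail the build while a 12 second run would pass. The same comparison could be repeated per percentile and per resource (CPU, memory) if a single threshold turns out to be too coarse.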

Each AWS end-to-end deployment is a Jenkins Pipeline job, which is itself kicked off by the full End-to-End testing Pipeline job, while the end-to-end deployment cleanup is a traditional Jenkins ‘freestyle project’. Eventually, we plan to move over the cleanup job to a Pipeline job, but until then we can easily run a heterogeneous set of jobs without issue.

Summary

Jenkins Pipelines enable code-based definitions of jobs and can be incrementally added to existing installations. And they are way better than chained scripts, once you get used to the DSL. By leveraging Cloudformation and some “templating of templates”, you can create production-like deployments for comprehensive integration testing. With these tools, there is really no reason not to set up an integrated, automated continuous integration tool that automatically promotes changes into production with high confidence.

Want to learn more about the Fineo architecture? Check out the next (and final) post in the series: Harder isn’t Valuable: Looking Back on Fineo

Interlude: Self-Improving Architecture and Design

2017-05-18T00:00:00+00:00

A short break from the Fineo architecture. Recently I got to thinking about self-improving systems. Specifically, I liked the idea of a self-improving system where the actors in and around the system are incentivized to continually improve the system. This makes sense in the context of a business or company, but can this be extended to a software code base? I want the architecture of the system to dictate that the lowest cost/lowest energy action to take within the system is also the best choice for the system.

This exploration was sparked by a recent episode of the Exponent podcast (specifically: exponent #110 - 10:45), which discussed the conditions around how democracy in the 1930’s was incentivized to continue to improve democracy; it was in the best interest of the politicians (actors within the system) to continue to safeguard or improve the core system of democracy in the face of new technology.

The concern was over the ability to maintain democracy when considering the rules for partitioning radio air space so it couldn’t be monopolized. A monopoly on the airwaves in a region would allow one entity to control the flow of information, effectively hamstringing the democratic processes. With the advent of the changes - new tech in radio - politicians not only preserved the state of the system, but actually sought to improve it, and doing so was also the easiest path for them (otherwise they would have been voted out of office, or so they thought).

The system was seen to break down when partitioning the TV frequencies. Instead of information flow being the metric of success, profits became the considered metric. With the shift from the underlying metric of value to a proxy, we see a shift in outcomes, to preserving the structures for existing companies to drive more profits.

For many, capitalism and American Democracy are, if not one and the same, at least closely interwoven. On the micro-scale, the goal of capitalism for a single individual is to reap large profits. However, at the macro-scale, capitalism is about maintaining a large number of competitors wherein the best solution for customers gets the rewards, further encouraging innovation and growth to capture customer dollars. “Natural” capitalism is then a self-improving system where the reward of profits drives the creation of enterprises (1).

Self-Improving systems across industries

Self-improving systems are not a new idea and are seen across many industries. In education, some have proposed designing the school system to grow leadership to drive more leadership across the different facets of the system.

To many folks, the first thought of a ‘self improving system’ is machine learning. In many applications, the goal is to develop an algorithm/system that, by its nature, learns increasingly good solutions, even if we don’t fully understand it (2).

There is also an interesting example from designing predictive and reactive architecture management, where they develop a feedback system that looks to optimize a set of goals based on a set of input metrics. In this case it’s more about how can we design the system to improve itself while running (e.g. feedback loop for real-life against an objective success function).

Self-Improving tech organizations

When building organizations and software, changes are often most easily made in ways that degrade the system - “just a X to do the job” - until you end up with a big ball of spaghetti. Unless you are explicitly managing the entropy of the system, it approaches chaos.

In fact, looking at entropy as a general theory for dissecting this problem, there might be evidence that tending towards cohesion (away from chaos) is impossible without the constant application of outside pressures/forces.

Think of a child: if not told to clean up their room, the toys will often end up strewn across the floor. As you grow older, however, you learn the value in developing a system of organization and returning things to their original place. Again, the application of outside force (your effort) moves the room from chaos to order because the long-term energy savings are worth the short-term expenditure (aesthetics aside).

Looking at that example, we can see a separate organizational structure at play that drove the right choice for the system even though it was more effort: the conditioning from your parents. They taught you (imposing a mental framework) to seek organization because the long-term energy cost is lower: it’s easier to consistently find the thing you want.

There may be hints of this in holacracy, most well known as the organizational structure at Zappos, where the structure of the organization pushes the responsibility for changes to the ‘edge’ - to the people most in tune with the local requirements - where it can do the locally optimal thing. The understanding of the most optimal choice then comes from the ‘top’ of the company, which sets the strategic goals that inform the tactical choices made at the edge.

For an organization focused on innovation, the strategic goals should then be set around what outcomes the customer is trying to achieve and how well those outcomes are satisfied. Then it’s a question of just making what your customers want. However, if you leave the innovation up to a small number of individuals - e.g. a think tank - you are inevitably going to miss lots of opportunity.

In fact, that is exactly what Ben Thompson in that episode of Exponent mentioned above gets at with the inevitable stagnation of a centrally planned economy; there are just too many possible routes and options (essentially infinite), making it impossible for a small group to optimize. Instead, we should allow a general market to determine the best choices via awarding dollars to the success. Given that people will continually pursue things in their own personal best interest, the challenge then is to make risk/downside relatively small while encouraging a large upside.

Balancing Risk/Reward with Strategic Goals

Within a single company, of course you cannot pursue every possible idea for innovation someone within the company dreams up. They have to be evaluated against those strategic goals that were derived from what the customer wants; essentially asking, “how well is this going to satisfy the customer?” Fortunately, this also provides a framework within which folks can generate new ideas (often a boon to the ideation process - adding constraints helps creativity). Within a holacracy, people at the edge are then empowered to pursue these goals, providing them lots of potential reward (intangible, like seeing your idea become real, or tangible like bonuses, promotions and a rising stock value).

The risk for failure of a change is minimal for the implementors as management has to accept that they no longer have ‘control’ over the tactical things that occur. However, folks still need to be penalized for things not within the strategic vision of satisfying the customer. If an initiative fails, it then needs to become a lesson in understanding what failed (i.e. the five whys), be it internal processes, market changes or imperfect understanding of the customer needs. In fact, these failures should be embraced because they provide the basis for a better approach.

And really that all comes back to a lot of what Lean Startup gets at with attempting to understand the customer, trying, failing and iterating. Keep in mind, following this process at an established company does not mean that you are ‘safe’ from disruption. Instead, it’s merely that you avail yourself of more sources of targeted innovation. A disruptive innovation could be used to grow into new markets, while a ‘sustaining’ innovation increases the value of your product for customers and helps grow market share.

Software Architecture

There is an interesting question as to whether a self-improving/continually improving codebase can be developed. From the systems example above, the state of the running system can be improved, but can the continual development continue to improve the system? In short, can you make a codebase wherein the developer’s best choice is to make the codebase better?

A simple approach might be tying job success metrics to code base improvements, i.e. ‘scout points’ for leaving it cleaner than you found it. It’s then the organization and not the code that forces the improvement. This could drive innovation around improving the architecture such that the code is easier to maintain and understand (arguably more important, in all but a few situations, than performance).

Amazon seems to have a hint here as well with maintaining separate service groups where the goal is zero meetings, instead using well defined APIs to communicate. This architecture lets teams innovate internally, with minimal friction, while driving towards pleasing customers. Because codebases are physically separated they are less likely to become ‘infected’ with outside influence (and generally an argument for modularization).

However, if we look to startups, the key goal for the software stack is to get to something we can use to validate the hypothesis for pleasing the customer as fast as possible. Then, code quality has historically taken a back seat to ‘done’. Naturally, this can degrade quickly into a nearly un-maintainable mess, which has to painstakingly - module-by-module - be rebuilt into something manageable.

Maybe in taking a note from the designing predictive and reactive architecture management we can build in core systems that look at potential costs of change, or even look at trends in development, to determine what facet of implementation you should focus upon. Naturally, the granularity here needs to be managed - you wouldn’t want to thrash between refactoring and changes every day. Again, it comes back to what metrics of success are used and having the right gauges to understand the influence on those metrics.

Wrap Up

The idea of a system within which its actors are driven to improve the system is powerful. The actors must then understand the metrics of success for the system so they can adjust their actions towards those goals. In a startup, this could be number of new users or the amount of customer churn. For a democracy, this could be focusing on the availability of information. For capitalism, this could be amount of risk/reward for a new venture.

Of course, many of these systems intertwine, so it’s then incumbent upon everyone to think about (a) what the goal of the system is and, (b) what metrics are used to achieve those goals. With that understanding, it becomes much more likely for folks to come up with the ‘right’ ideas and understand what makes those ideas correct.

To paraphrase from Ben Thompson, anything taken to its extreme inherently becomes no longer that thing. For instance, democracy to its extreme is mob rule. However, when looking at the extreme of the goals of democracy we see fair governing of a people based on their needs and desires (something that inherently doesn’t look like mob rule).

So maybe it comes down to keeping in mind the right metrics of success, not the existing system within which we are operating. And making those end goals understood across the organization.

Maybe? Still working on this one…

Notes

(1) This is not to say we need to be in an Ayn Rand fantasy world. Having a social safety net is good, and having regulation around business practices to prevent harm (e.g. not polluting or having safe working conditions) is good.

(2) Please forgive the overly simplified view of ML. I know, it’s a lot more complicated.

Handling Errors in Fineo

2017-05-17T00:00:00+00:00

Passing pipeline processing errors back to the user was not originally built into the Fineo platform (a big oversight). However, we managed to add support for it over only a couple of weeks. Moreover, we were able to make it feel basically seamless with the existing platform.

When writing to Fineo, you can get an error if the ingest buffer is temporarily full. However, you may have written a bad event and that won’t be detected until the event is processed a short time later. These errors can be seen in a special errors.stream table, but don’t touch any part of the ‘core’ data storage layers.

The overall Fineo platform looks like this:

Capturing Errors

An error can occur at each stage in the event processing pipeline (currently only two parts: raw and staged). The event, context and error messages are captured and sent to a special AWS Firehose, while the event is marked ‘successfully processed’ in the parent buffer (avoiding attempted reprocessing).

Firehose will periodically export data to an S3 bucket, partitioned by year, month, day and hour.

When evaluating possible options for serving this data to users - AWS Athena, ElasticSearch - I decided to serve it back through the main Fineo API and query engine. That meant extending the core to support this new data source, but we already had a SQL query engine that supported a partitioned, S3 storage mechanism; it seemed a lot easier than standing up a new service and learning a new set of infrastructure (and securing that too!).

Reading Errors

In the ‘regular’ Fineo infrastructure, the Tenant Key is part of the directory hierarchy. However, using Firehose for error capture meant that now the only place the Tenant Key is stored is the event itself.

Our errors.stream table can only point to a generic S3 ‘errors’ directory, requiring us to filter on the Tenant Key stored in the event itself (ensuring tenants cannot access other tenants’ errors). This ended up being only a slight shift in how we were already managing queries, as the tenant key was already being surfaced as a projected field from every event on read. Thus, it came down to a relatively simple matter of some query translation and data source redirection.

Queries get translated quite simply.

> SELECT * FROM errors.stream

turns into

> SELECT * FROM errors.stream WHERE api_key = 'the_user_api_key'

for all requests.
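As a rough illustration of that translation, here is a minimal Python sketch. It is not Fineo's implementation (the multi-tenancy post describes the rewrite happening on a parsed query, not on strings); `add_tenant_filter` is a hypothetical helper that only copes with simple `SELECT ... FROM ... [WHERE ...]` statements, and the `api_key` column name is taken from the example above.

```python
import re


def add_tenant_filter(sql: str, api_key: str) -> str:
    """Force a tenant filter onto a simple SELECT statement (illustrative only)."""
    # Refuse anything that could break out of the quoted literal.
    if not re.fullmatch(r"[A-Za-z0-9_]+", api_key):
        raise ValueError("unexpected characters in API key")

    tenant_clause = f"api_key = '{api_key}'"
    stripped = sql.strip().rstrip(";")

    match = re.search(r"\bWHERE\b", stripped, flags=re.IGNORECASE)
    if match is None:
        # No existing predicate: just append the tenant filter.
        return f"{stripped} WHERE {tenant_clause}"

    # Existing predicate: AND it with the tenant filter, keeping the
    # user's conditions grouped so an OR cannot widen the result.
    head = stripped[:match.start()]
    predicate = stripped[match.end():].strip()
    return f"{head}WHERE {tenant_clause} AND ({predicate})"


print(add_tenant_filter("SELECT * FROM errors.stream", "the_user_api_key"))
```

Queries with ORDER BY, GROUP BY, subqueries and so on need a real SQL parser, which is why string rewriting like this is only useful as a mental model.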

Fineo uses Apache Drill under the hood, which natively supports reading JSON. It has the caveat that the JSON elements must not be comma-separated, but that actually works to our advantage as we don’t need to worry about separating error events when sending them to Firehose. Drill also supports directory-based partitioning, allowing us to very efficiently zero in on a user’s requested search time-range.

We then can easily read JSON data with the S3 storage engine with a hierarchy like:

/errors
 /stream
  /raw   <-- 'raw' stream processing stage
   /malformed  <-- the type of error in the stage
   /commit
    /year
     /month
      /day
       /hour
        some-firehose-dump.json.gz
  /storage  <-- 'storage' stream processing stage
   ...

and quickly zero in on the type of error or when it occurred, and can surface those as fields in the error event row.

Timestamp	Stage	Type	Message	Event
149504314500	raw	malformed	Event missing timestamp	{“metrictype”: “metric”, “value”: 1}
149504314502	staged	commit	Underlying server rate exceeded. Retrying.	{“metrictype”: “metric”, “value”: 1, “timestamp”: 149504314502}
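To make the 'zero in on a time range' idea concrete, here is a small Python sketch that lists the hourly directories a query would need to touch. The directory names follow the hierarchy above; the zero-padded `YYYY/MM/DD/HH` layout in UTC and the helper name `error_prefixes` are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def error_prefixes(stage, error_type, start, end):
    """List the hourly directories covering [start, end] for one stage and error type."""
    hour = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while hour <= end:
        # Assumed layout: zero-padded year/month/day/hour under the error type.
        prefixes.append(f"/errors/stream/{stage}/{error_type}/{hour:%Y/%m/%d/%H}")
        hour += timedelta(hours=1)
    return prefixes


start = datetime(2017, 5, 17, 22, 30, tzinfo=timezone.utc)
end = datetime(2017, 5, 18, 0, 10, tzinfo=timezone.utc)
for prefix in error_prefixes("raw", "malformed", start, end):
    print(prefix)
```

Only three directories are listed for this range, so everything outside them never has to be read.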

The ‘zoomed in’ view of the error query architecture then looks like:

Informing Users

Now that we had a way to surface the errors, the question arose of how to make it easy for users to get that information. Yes, they could query the errors.stream table in a SQL tool, but that kind of sucks (it creates a lot of friction).

The last bit of work came in adding support for the error table to the Fineo Web App - a simple UI for users to view any errors they had.

While we support simple SQL queries sent to a REST endpoint, I didn’t want to overwhelm the web server with lots of requests or paging. However, because reading errors is inherently sequential in time, I could easily page through the results based on the last received timestamp and the requested error time range. This meant we didn’t have to leave open a socket connection or cursor in the database.
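The paging scheme can be sketched in a few lines of Python. This is an illustration, not the web app's code: `fetch_page` stands in for a stateless request to the query endpoint (here it just filters an in-memory list), and the `timestamp` field name comes from the error table above. Note the simplification that events sharing a timestamp across a page boundary would be skipped; a real implementation would need a tie-breaker.

```python
def fetch_page(rows, range_start, range_end, last_seen=None, page_size=2):
    """Stand-in for one stateless query: rows after last_seen, oldest first."""
    lower = range_start if last_seen is None else last_seen + 1
    matching = [r for r in rows if lower <= r["timestamp"] <= range_end]
    matching.sort(key=lambda r: r["timestamp"])
    return matching[:page_size]


def all_errors(rows, range_start, range_end, page_size=2):
    """Walk the whole time range one page at a time, keeping only a timestamp as state."""
    last_seen = None
    while True:
        page = fetch_page(rows, range_start, range_end, last_seen, page_size)
        if not page:
            return
        yield from page
        # The only thing carried between requests is the last timestamp seen.
        last_seen = page[-1]["timestamp"]


errors = [{"timestamp": t} for t in (5, 1, 3, 9, 7)]
print([e["timestamp"] for e in all_errors(errors, 1, 8)])
```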

Summary

When you have a hammer, everything looks like a nail, but sometimes they actually are nails. With Fineo error management, we had a very clean integration with our existing infrastructure that let us implement a complete error analysis solution in under two weeks, from inception to deployment. And best of all? It fit well within our current users’ mental model and solved their issues.

Want to learn more about the Fineo architecture? Check out the next post in the series: Building a Continuous Integration Pipeline with Jenkins on AWS.

Supporting Schema Evolution and Addition in Fineo

2017-05-15T00:00:00+00:00

Fineo’s architecture is designed to help people go faster, while having to do less by leveraging our NextSQL system. At the surface, it’s not that much different from the Lambda Architecture - a realtime serving layer and an offline batch processing step to reorganize data for offline analytics. However, we apply some “magic” to gracefully manage schema and provide a single interface for fast answers to realtime and offline queries.

In one of my very early posts about Fineo I talked about how users had one month to formalize schema or some data would be lost. But that kind of sucks - you should never have to get rid of data.

So, I spent some time digging into the particulars of our query execution engine (Apache Drill) and our batch processing engine (Apache Spark) to enable reading data against typed columnar rows and schema-less JSON based rows.

The batch processing step looks like this:

Get all the data from periodic S3 dumps from the ingest pipeline (orchestrated by Amazon Firehose)
Get all the known schemas for possible tenants
Group by tenant
Sub-group by schema-less and schema-ful
Apply schema to known rows
Write each group to own keyed directory hierarchy
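The grouping steps above can be sketched in plain Python. The real job runs on Apache Spark, so this is only an illustration of the logic; the event shape (`tenant`, `metric`, `timestamp`, `fields`) and the `schemas` lookup are assumptions made for the example.

```python
from collections import defaultdict


def split_event(event, known_fields):
    """Split one event's fields into a schema-ful part and a schema-less part."""
    base = {"timestamp": event["timestamp"]}
    known = {k: v for k, v in event["fields"].items() if k in known_fields}
    unknown = {k: v for k, v in event["fields"].items() if k not in known_fields}
    return {**base, **known}, {**base, **unknown}


def group_events(events, schemas):
    """Group rows by (tenant, metric, kind), where kind is schema-ful or schema-less."""
    groups = defaultdict(list)
    for event in events:
        tenant, metric = event["tenant"], event["metric"]
        known_fields = schemas.get((tenant, metric), set())
        known, unknown = split_event(event, known_fields)
        # Only emit a row if it carries something besides the timestamp.
        if len(known) > 1:
            groups[(tenant, metric, "schema-ful")].append(known)
        if len(unknown) > 1:
            groups[(tenant, metric, "schema-less")].append(unknown)
    return dict(groups)


events = [
    {"tenant": "t1", "metric": "cpu", "timestamp": 1, "fields": {"load": 0.5, "lod": 0.7}},
    {"tenant": "t1", "metric": "cpu", "timestamp": 2, "fields": {"load": 0.6}},
]
schemas = {("t1", "cpu"): {"load"}}
for key, rows in group_events(events, schemas).items():
    print(key, rows)
```

Each resulting group would then be written to its own directory: the schema-ful rows in a columnar format, the schema-less rows as JSON.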

Schema-ful rows can be written in an optimized, columnar format that is highly suitable to analytics. The schema-less columns (columns for which we have data, but no official schema) still need to be readable, but cannot necessarily be optimized because we don’t know its type; here we default to just storing the data in raw JSON format. Both types of data still require a timestamp and are partitioned based on that timestamp to enable quick point and range lookups across a time range.

Schema-ful Rows

The schema-ful data are those columns of rows that have a known name. Internally, once a column is ‘schema-fied’ it gets a ‘canonical name’ that we use across the platform. However, we still store data based on the incoming data name as it allows users to have more granular, on-demand history. Thus, a query for a column is a one-to-many lookup of column name -> canonical name -> column ‘aliases’ (or all the possible names for the column), which is internally used to generate the query.

Since we know the ‘type’ of each column/field in each event, we can store them in a format suitable to analytics. We chose Apache Parquet as a fast, easy to use and widely integrated columnar format. We then further optimized lookups by storing all events in a directory hierarchy suitable for time-series queries, something like:

 /tenant id
  /metric
   /year
    /month
     /day

Unfortunately, there is not a lot of interoperability support between Apache Spark (batch processing engine) and Apache Drill (read engine). This means that partitioned data written with Spark will not be readable by Apache Drill (this is because Spark writes partitions as key=value and Drill reads partitions as a nested hierarchy of value1,value2, etc.). That means we also have to extract the components of the timestamp into the sub-partitions we want, construct those directories manually and write the Parquet data into the output directory.

Basically, we manually create the time-based partitioning. A bit hacky, but it works.
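A small Python sketch of the difference between the two layouts, assuming millisecond timestamps in UTC and un-padded date components (neither is spelled out in this post). Both function names are hypothetical.

```python
from datetime import datetime, timezone


def _utc(timestamp_ms):
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)


def drill_partition_dir(tenant_id, metric, timestamp_ms):
    """Nested value-only directories, the layout we build by hand."""
    ts = _utc(timestamp_ms)
    return f"/{tenant_id}/{metric}/{ts.year}/{ts.month}/{ts.day}"


def spark_partition_dir(tenant_id, metric, timestamp_ms):
    """The key=value layout Spark produces when it partitions the output itself."""
    ts = _utc(timestamp_ms)
    return f"/{tenant_id}/{metric}/year={ts.year}/month={ts.month}/day={ts.day}"


event_time_ms = 1494849600000  # 2017-05-15 12:00:00 UTC
print(drill_partition_dir("t1", "cpu", event_time_ms))
print(spark_partition_dir("t1", "cpu", event_time_ms))
```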

Schema-less Columns

Each well-formatted row is bound to a specific ‘metric type’ (think user-visible table), but schema-less columns are those columns that have not been formally added to the schema. We know the event is bound to a particular metric type (it’s required to be a valid row), but we don’t know how that field fits in with the schema - what is its type, is it an alias of an existing column?

These kinds of fields occur when you have changing data and don’t update the schema, either out of laziness or from a misspelling (e.g. a ‘fat finger’ mistake). In many systems, this unknown column will be a surprise and can either break your pipeline or get thrown out - bad outcomes either way. Generally, this requires going upstream to fix the mistake at the source (often painful and takes a long time), or requires lots of special casing code in your ETL codebase.

Fineo is built to handle this sort of change gracefully and without interruption.

These schema-less columns are grouped together in the batch processing step and written out to their own tenant-grouped, time-partitioned sub-directory:

/json
 /tenant id
  /metric
   /year
    /month
     /day

There is a slight trick with JSON storage that binary data needs to be base64 encoded to support storage as JSON and then auto-translated back in the read pipeline. Not terribly efficient, but it then works for all data types (oh, the joys of platform!).

This allows us to easily query either the JSON and/or Parquet formatted data. At the same time, we can also periodically re-process the JSON data when there is a schema change to generate the columnar format, significantly speeding up access to that data. There is a bit of fiddling to ensure that the switch is atomic (or near enough), but that is left as an exercise to the reader.
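For the base64 trick mentioned above, a round trip looks roughly like this in Python. How the read pipeline knows which fields to decode is not described in this post, so the explicit `binary_fields` list is an assumption made for the sketch.

```python
import base64
import json


def to_json_row(row):
    """Serialize a row to JSON, base64-encoding any binary values."""
    encoded = {}
    binary_fields = []
    for name, value in row.items():
        if isinstance(value, (bytes, bytearray)):
            encoded[name] = base64.b64encode(value).decode("ascii")
            binary_fields.append(name)
        else:
            encoded[name] = value
    return json.dumps(encoded), binary_fields


def from_json_row(text, binary_fields):
    """Parse a JSON row and translate the named fields back to bytes."""
    row = json.loads(text)
    for name in binary_fields:
        row[name] = base64.b64decode(row[name])
    return row


original = {"timestamp": 1, "payload": b"\x00\x01\xff"}
text, binary_fields = to_json_row(original)
print(text)
print(from_json_row(text, binary_fields) == original)
```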

Summary

Moving data to an offline, columnar storage allows us to efficiently support analytics queries. At the same ‘time’, time-partitioning allows fast answers on a very wide range of events because we can quickly pinpoint the range of data to access. But you can get that with a ton of existing systems (Hive, Kudu, etc.). The magic lies in how to support a fast-changing enterprise and dataset with features like column aliasing and renaming, and late/async-binding schema, so you can go faster, while doing less.

Want to learn more about the Fineo architecture? Check out the next post in the series: Error Handling and User Notification.

Scaling Out Fineo

2017-05-12T00:00:00+00:00

A deeper look into how Fineo manages its seamless scalability across the multi-layer architecture. By enabling each layer to scale independently and leaning on existing, fully-managed services we can enable wildly scalable infrastructure without notably increasing operations effort (and often decreasing it).

I had a couple of questions come up from the Fineo ingest post on how we make things scalable, what happens when throughput limits are hit, etc. that I thought would be interesting to explore. For a refresher, here is the high-level view of Fineo’s architecture:

Scalable API

At the top layer, we leverage AWS API Gateway to manage the external REST endpoints and user/device authentication. It’s transparently scalable and integrates well with IAM credentials (device auth) and AWS Cognito (user auth); it’s our ‘hard shell’. For a relatively low amount of requests (like those seen in a startup), API Gateway is very cost effective at $3.50 per million API calls received + data transfer costs, saving us the burden of writing, running and managing another service. Eventually, we will need to move off of API Gateway, both from a cost perspective and to get around its 30 second request/response timeouts.

Managing buffers

The core of the Fineo write architecture is a series of AWS Kinesis streams, aka fully managed, 24hr data buffers. Each buffer shard has a write limit of 1MB/sec and read limit of 2MB/sec and costs $0.015/hr and $0.014 per 1M ‘put’ requests. Again, for a startup this is significantly cheaper and easier than trying to run Apache Kafka (the open source equivalent) on our own. The question is then, “what do we do when demand increases?”

The writes for each tenant are hashed into the Kinesis shards based on the tenant id and the timestamp, approximating a uniform distribution of the events across all the shards. Then, we just need to ensure that capacity stays ahead of the demand. Writes per Fineo user are limited to 200 events/sec, so as long as events stay below 5KB each, we can approximately allocate one Kinesis shard per tenant.
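A Python sketch of the hashing and the capacity arithmetic. Kinesis maps a partition key to a shard by taking the MD5 of the key as a 128-bit integer; the key format (`tenant:timestamp`), the assumption that shards split the hash space evenly, and treating 1MB as 1000KB are all choices made for this example.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for(tenant_id, timestamp_ms, shard_count):
    """Pick a shard index, assuming shards split the hash space evenly."""
    partition_key = f"{tenant_id}:{timestamp_ms}"  # hypothetical key format
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


def max_event_kb(shard_write_kb_per_sec=1000, events_per_sec=200):
    """Largest average event size that keeps one tenant within one shard."""
    return shard_write_kb_per_sec / events_per_sec


# Including the timestamp spreads a single tenant's events over all shards.
counts = [0, 0, 0, 0]
for ts in range(10_000):
    counts[shard_for("n111", ts, 4)] += 1
print(counts)
print(max_event_kb())
```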

However, this misses a couple of things.

First, we support multi-put requests, making it relatively easy to go above 1MB/sec for many use cases. At the same time, many users aren’t always going to be using the full capacity, so by keeping 1 shard per user, we are wasting capacity (and money). Finally, we still need the ability to scale up the number of shards to support demand spikes.

Enter the AWS Kinesis Autoscaling Util. A tool and standalone service that manages the amount of shard capacity based on the monitored PUT and GET rates.

As long as we scale up when PUTs exceed a fixed percent (e.g. 50% of capacity), we can quickly respond to user demand shifts, while remaining cost effective. It’s not 100% perfect as very fast demand bursts will overwhelm the system, but it captures more than 80% of our needs.
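The threshold rule can be sketched as a small decision function in Python. This is not the autoscaling util's actual logic; the 50% scale-up threshold comes from the text, while the 20% scale-down threshold and the halving step are assumptions added to make the example complete.

```python
import math

SHARD_WRITE_BYTES_PER_SEC = 1_000_000  # 1MB/sec write limit per shard


def desired_shards(current_shards, put_bytes_per_sec, scale_up_at=0.5, scale_down_at=0.2):
    """Return the shard count to move to, given the observed PUT rate."""
    capacity = current_shards * SHARD_WRITE_BYTES_PER_SEC
    utilization = put_bytes_per_sec / capacity
    if utilization > scale_up_at:
        # Add enough shards to bring utilization back down to the threshold.
        return math.ceil(put_bytes_per_sec / (SHARD_WRITE_BYTES_PER_SEC * scale_up_at))
    if utilization < scale_down_at and current_shards > 1:
        # Assumed policy: shrink gently by halving, never below one shard.
        return max(1, math.ceil(current_shards / 2))
    return current_shards


print(desired_shards(2, 1_500_000))  # over 50% of capacity: scale up
print(desired_shards(2, 600_000))    # comfortable: stay put
print(desired_shards(4, 400_000))    # mostly idle: scale down
```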

Stream Processing

All of our stream processing is handled via AWS Lambda. Again, it has the same properties of being auto-scalable and fully managed, while remaining cost effective at relatively small scales. It starts to make more sense to move off of Lambda as the number of events increases, instead moving those functions into a standalone EC2 instance that leverages the AWS Kinesis client library to access the streams.

Fast, easy, simple to test. Check!

Storage

We have two main storage engines - S3 and DynamoDB. Both are fully managed, scalable, etc. etc. - all the things we are also looking for above.

Firehose Streams to S3

S3 is fed by AWS Firehose, which lets us buffer data for between 60 seconds and 900 seconds (15 minutes), and by default supports 2,000 transactions/second, 5,000 records/second, and 5 MB/second. This handles much of our early scale and can be scaled up either by opening an AWS case or managing a set of Firehoses and distributing the writes across them. We then periodically batch process the Firehosed records into a partitioned, columnar format for use with client reads.

Because the Firehose copies are done at every stage of the stream processing, we have between 4 and 6 copies of the data at all times, across the Firehoses themselves, S3 and DynamoDB. This makes it very unlikely to lose data and easy to recover because it’s already formatted in the stream processing layout.

DynamoDB Storage

We spent a bit of time thinking about the DynamoDB schema to ensure that it’s going to be reasonably scalable and avoid ‘hot spots’ (see DynamoDB for time series post for more details). Assuming that we have relatively uniform writes and reads, the remaining overhead is then to just ensure that our DynamoDB shard allocation is appropriate to our workload.

We leverage time-range grouped tables to partition events, allowing us to quickly ‘age-off’ older data (by deleting tables) and economically adjust the allocated capacity for each table as needed. The key assumption here is that more recently written data is also the most frequently accessed data. It’s easy to go overboard and do something like a table per day, but at a limit of 256 tables, we can quickly run out of tables for production and test environments. At the same time, too few tables means allocating extra capacity to data that is rarely accessed, effectively wasting money.

We settled on a table per week of event time range, so everything with a timestamp in week 1 goes into table 1, everything in week 2 goes into table 2, etc.

Now, just like with Kinesis, we need to be able to turn up and down the capacity of the cluster. There are a couple of tools to do this: dynamic DynamoDB, that runs a server, or one of many lambda implementations. I’m partial to lambda functions for ease of deployment, but really it’s up to what fits your deployment model. The only kicker is that DynamoDB can only be scaled down four times a day, so you have to be a little judicious in allocating capacity. Since we rely on lambda functions and an idempotent write model, we can support retries and being a little bit slower to scale, saving us money but at the cost of a slightly higher latency for users.

Schema Storage

The backing for our schema store is DynamoDB, which allows us the same scale, flexibility and operations overhead (or lack thereof) as the core data. Currently, we just directly read the data from DynamoDB and rely on machine local caches to minimize lookups for older schema. This does create a bit of interdependence in the architecture in that we have to be careful when upgrading the schema tools to ensure we remain backwards compatible on the storage layer.

We will eventually move to a more fully managed, internal ‘schema service’. Nothing needs to change from the external schema user perspective, as we are just swapping out an implementation. The fully managed service lets us scale and cache more aggressively, but means there is another thing we need to manage, deploy, etc. that didn’t seem worth it with the small size of the team (relying instead on our high communication bandwidth to manage the coupling in the deployment/code).

Query Execution

We have two main components to query execution: a query server and an Apache Drill cluster. The query server runs as a simple AWS Elastic Beanstalk Java application. Because it is essentially stateless, we can transparently scale up and down the number of servers behind the load balancer based on user demand using standard AWS rules. Upcoming work includes adding client pinning to servers so we can support larger queries that take multiple round-trips.

Apache Drill similarly supports a dynamically scalable cluster. As resource demands grow, we can just add another node to the cluster to pick up the extra work. We trust to decent AWS network architecture to avoid major data locality issues (all the data is stored in DynamoDB and S3 anyways, so it’s not going to be truly local regardless). Similarly, as work drops below a given level, you can decommission nodes. This is more of a manual process, or driven by a custom watcher, and can be done via a separate monitoring server or in AWS Lambda, just like with Kinesis and DynamoDB.

Wrap Up

At Fineo we designed for scalability from the beginning, while still remaining cost effective. By thinking about not only how we are going to scale now, but also what that is going to cost and how to support the same (or better) scalability down the road at a lower cost, we can move at startup speed and cost. Separating out the architecture into different layers and ensuring that they are independently scalable means that we have no bottlenecks. Since we are a small shop, it was imperative that we cut down on operations work, so we could focus on building new features and growing the business. In leveraging fully-managed services and auto-scalability monitors we not only freed up our time, but run a better service at a lower cost.

Want to learn more about the Fineo architecture? Check out the next post in the series: Supporting Schema Evolution and Addition.

Using DynamoDB for Time Series Data

2017-05-10T00:00:00+00:00

Time is the major component of IoT data storage. You have to be able to quickly traverse time when doing any useful operation on IoT data (in essence, IoT data is just a bunch of events over time).

At Fineo we selected DynamoDB as our near-line data storage (able to answer queries about the recent history with a few million rows very quickly). Like any data store, DynamoDB has its own quirks. The naive, and commonly recommended, implementation of DynamoDB/Cassandra for IoT data is to make the timestamp part of the key component (but not the leading component, avoiding hot-spotting). Managing aging off data is generally done by maintaining tables for a specific chunk of time and deleting them when they are too old.

Not unexpectedly, the naive recommendation hides some complexity.

Overlapping Timestamps

At Fineo we manage timestamps to the millisecond. However, this can be a problem for users that have better than millisecond resolution or have multiple events per timestamp. Because we are using DynamoDB as our row store, we can only store one ‘event’ per row and we have a schema like:

Hash Key	Range Key
API Key, Table	Timestamp (ms)

This leads us to the problem of how to disambiguate events at the same timestamp per tenant, even if they have completely separate fields. If we were using something like Apache HBase, we could just have multiple versions per row and move on with our lives. Instead, we implemented a similar system with DynamoDB’s Map functionality.

Leveraging Maps

Each write that comes in is given a unique hash based on the data and timestamp. The hash isn’t a complete UUID though - we want to be able to support idempotent writes in cases of failures in our ingest pipeline. Instead, we get an id that is ‘unique enough’.

For each row (Api Key, Table | Timestamp), we then have a list of ids. Each field in the incoming event gets converted into a map of id to value. Thus, to read an event from a row, you would first get the list of ids, then ask for that value for each ID in the map.

For example, suppose you had an api key ‘n111’ and a table ‘a_table’, with two writes to the timestamp ‘1’, the row in the table would look like:

Column	Value
apikey, table (hash key)	n111,a_table
timestamp (range key)	1
ids	[1234,abc11]
field1	{1234: “a”, abc11: “b” }

Where 1234 and abc11 are the generated ‘unique enough’ IDs for the two events.
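Here is a minimal in-memory Python sketch of that layout, using a plain dict as the 'row'. The id scheme (a truncated SHA-256 over the timestamp and data) is an assumption standing in for whatever 'unique enough' hash is actually used; the point is that the same write always produces the same id, so replays are harmless.

```python
import hashlib
import json


def event_id(timestamp, fields):
    """Derive a deterministic, 'unique enough' id from the timestamp and data."""
    payload = json.dumps({"ts": timestamp, "fields": fields}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]


def write_event(row, timestamp, fields):
    """Add an event to the row; re-writing the same event is a no-op."""
    eid = event_id(timestamp, fields)
    if eid in row["ids"]:
        return eid  # idempotent replay
    row["ids"].append(eid)
    for name, value in fields.items():
        # Each field column is a map of event id -> value.
        row.setdefault(name, {})[eid] = value
    return eid


def read_events(row):
    """Rebuild the individual events: get the ids, then look each one up per field."""
    events = []
    for eid in row["ids"]:
        events.append({
            name: values[eid]
            for name, values in row.items()
            if isinstance(values, dict) and eid in values
        })
    return events


row = {"apikey_table": "n111,a_table", "timestamp": 1, "ids": []}
write_event(row, 1, {"field1": "a"})
write_event(row, 1, {"field1": "b"})
write_event(row, 1, {"field1": "a"})  # retry of the first write
print(read_events(row))
```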

Drawbacks

There are two major drawbacks in using this map-style layout:

DynamoDB has a max of 250 elements per map
Optimize for single or multiple events per timestamp, but not both

The first is a hard limit and something that we can’t change without a significant change to the architecture. Fortunately, this more than fulfills our current client requirements.

The second comes from how DynamoDB handles writes. If we assume that there is generally only one event per timestamp, we can craft a request that creates the id list and column map immediately. If that fails, we could then attempt to do an addition to the column maps and id list.

Alternatively, we could attempt to update the column map and id lists, but if these lists don’t exist, DynamoDB will throw an error back. Then we need to go and create the maps/list for the row with the new value.

Another valid approach would be to assume only one event per timestamp, and then rewrite the data if there are multiple events, but that leads to two issues:

handling consistency when doing the rewrite (what happens if there is a failure?)
multiple data formats on read, increasing the complexity

In the end, we decided to pursue a map-first approach. However, there is still the trade-off of expecting new timestamps or duplicate repeats; heuristics like “if it’s within the last 5 seconds, assume it’s new” can help, but this is only a guess at best (depending on your data).

Either write approach can be encoded into a state machine with very little complexity, but you must choose one or the other. On the roadmap is allowing users to tell us which type of data is stored in their table and then take the appropriate write path.
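The two write paths can be sketched against a tiny in-memory stand-in for the table. `FakeTable`, `RowExists` and `RowMissing` are hypothetical names that only mimic the behaviour described above (a create that fails if the row exists, an update that fails if it does not); they are not DynamoDB API calls. Each function returns how many requests it needed, which is the cost being traded off.

```python
class RowExists(Exception):
    pass


class RowMissing(Exception):
    pass


class FakeTable:
    """In-memory stand-in: create fails on existing rows, append fails on missing ones."""

    def __init__(self):
        self.rows = {}

    def create(self, key, eid, fields):
        if key in self.rows:
            raise RowExists(key)
        self.rows[key] = {"ids": [eid], "fields": {n: {eid: v} for n, v in fields.items()}}

    def append(self, key, eid, fields):
        if key not in self.rows:
            raise RowMissing(key)
        row = self.rows[key]
        if eid not in row["ids"]:
            row["ids"].append(eid)
        for name, value in fields.items():
            row["fields"].setdefault(name, {})[eid] = value


def write_create_first(table, key, eid, fields):
    """Optimized for one event per timestamp."""
    try:
        table.create(key, eid, fields)
        return 1
    except RowExists:
        table.append(key, eid, fields)
        return 2


def write_append_first(table, key, eid, fields):
    """Optimized for many events per timestamp."""
    try:
        table.append(key, eid, fields)
        return 1
    except RowMissing:
        table.create(key, eid, fields)
        return 2


table = FakeTable()
key = ("n111,a_table", 1)
print(write_create_first(table, key, "1234", {"field1": "a"}))   # new row: 1 request
print(write_create_first(table, key, "abc11", {"field1": "b"}))  # existing row: 2 requests
```

A production version would also need to loop, since another writer can create the row between the failed request and the fallback.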

Handling Time-to-Live

Our schema ensures that data for a tenant and logical table are stored sequentially. DynamoDB push-down operators (filter, scan ranges, etc.) allow us to quickly access time-based slices of that data on a per-tenant basis (e.g. we can go to the correct section because we know the hash key and the general range key).

However, DynamoDB can be expensive to store data that is rarely accessed. A common pattern is for data older than a certain date to be ‘cold’ - rarely accessed. It would be nice if the database automatically handled ‘aging off’ data older than a certain time, but the canonical mechanism for DynamoDB is generally to create tables that apply to a certain time range and then delete them when the table is no longer necessary.

But what about data in the past that you only recently found out about?

It’s kind of weird but, unfortunately, not uncommon in many industries. For example, with smart cars, you can have a car offline for months at a time and then suddenly get a connection and upload a bunch of historical data. This data is both old and new, ostensibly making it even more interesting than just being new.

To that end, we group tables both by event timestamp and actual write time. Since tables are the level of granularity for throughput tuning, and there is a limit of 256 tables per region, we decided to go with a weekly grouping for event timestamps and monthly for actual write times. This also fit well with our expectation of the rate data goes ‘cold’.

Event Time Prefix

We want to make it as fast as possible to determine the ‘correct’ tables to read, while still grouping data by ‘warmth’. Since DynamoDB table names are returned in sorted order when scanning, and allow prefix filters, we went with a relatively human unreadable prefix of [start unix timestamp]_[end unix timestamp], allowing the read/write mechanisms to quickly identify all tables applicable to a given time range with a highly specific scan.

Write Time Grouping

Then we added on a description of the more easy to read month and year the data was written. This allows us to find all the tables for which data was written a while ago (and thus, likely to be old), and delete them when we are ready. Because the deletion process is out of any critical path, and indeed happens asynchronously, we don’t have to be concerned with finding the table as quickly as possible. Instead, we can add the month/year data as a suffix to the event time range.

We can easily find the tables to delete once they are a few months old and unlikely to be accessed (and whose data can still be served in our analytics organized offline store), while not accidentally removing data that is ‘new and old’.

Resulting Table Names

This gives us a table name schema of:

[start unix timestamp]_[end unix timestamp]_[write month]_[write year]

which generates names like:

1491696000000_1492300799000_4_2017
1492300800000_1492905599000_4_2017
1492300800000_1492905599000_3_2017

A reasonable compromise between machine and human readable, while maintaining fast access for users.
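A Python sketch that reproduces names of this shape. Two details are inferred from the example names rather than stated anywhere: weeks start on Sunday at 00:00 UTC, and the end timestamp is the last whole second of the week expressed in milliseconds. Both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def table_name(event_ms, write_time):
    """Build [start]_[end]_[write month]_[write year] for an event and its write time."""
    event = datetime.fromtimestamp(event_ms / 1000, tz=timezone.utc)
    # weekday() is Monday=0..Sunday=6; step back to the most recent Sunday (assumed week start).
    days_since_sunday = (event.weekday() + 1) % 7
    start = (event - timedelta(days=days_since_sunday)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    end = start + timedelta(days=7) - timedelta(seconds=1)
    start_ms = int(start.timestamp()) * 1000
    end_ms = int(end.timestamp()) * 1000
    return f"{start_ms}_{end_ms}_{write_time.month}_{write_time.year}"


def parse_table_name(name):
    """Split a table name back into its event range and write month/year."""
    start_ms, end_ms, month, year = (int(part) for part in name.split("_"))
    return {"start_ms": start_ms, "end_ms": end_ms, "write_month": month, "write_year": year}


event_ms = 1491955200000  # 2017-04-12 00:00:00 UTC
written = datetime(2017, 4, 12, tzinfo=timezone.utc)
name = table_name(event_ms, written)
print(name)
print(parse_table_name(name))
```

The same event arriving a month late would land in a table with the same time range prefix but a different write month suffix, which is what keeps 'new and old' data from being aged off early.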

Summary

Since DynamoDB wasn’t designed for time-series data, you have to check your expected data against the core capabilities, and in our case orchestrate some non-trivial gymnastics. On the whole DynamoDB is really nice to work with and I think Database as a Service (DaaS) is the right way for 99% of companies to manage their data; just give me an interface and a couple of knobs, don’t bother me with the details.

However, in a timestamp-oriented environment, features of databases like Apache HBase (e.g. row TTL) start to become more desirable, even if you have to pay an ingest throughput cost for full consistency. At the same time, events will likely have a lot of commonality and you can start to save a lot of disk-space with a “real” event database (which could make reads faster too).

That said, managing IoT and time-series data is entirely feasible with Dynamo. There is a trade-off between cost, operations overhead, risk and complexity that has to be considered for every organization. For Fineo, it was worth offloading the operations and risk, for a bit more engineering complexity and base bit-for-dollar cost.

Or you could just use Fineo for your IoT data storage and analytics, and save the engineering pain :)

Want to learn more about the Fineo architecture? Check out the next post in the series: Scaling Out Fineo.

Multi-tenant SQL Security In-Depth

2017-05-08T00:00:00+00:00

Multi-tenancy is an abstraction for a big, hard group of problems that touches on security, scalability, resource consumption and quality of service. Generally attempting to back-fit multi-tenancy is, at best, hacky and less than satisfying; at worst, it’s a recipe for disaster.

With Fineo, we designed for multi-tenancy from the start. Part of that comes from my background at Salesforce, where multi-tenancy was baked into everything we did. The other part comes from our SaaS business model and the desire to scale users super-linearly to costs (so profit increases with the number of users).

My biggest concern was making sure that user data was completely inaccessible to other tenants, while remaining co-located. Concurrently, access had to be fast on a per-user basis. Finally, I also wanted to ensure that we could easily fork a single tenant environment or migrate a group of users with zero downtime.

Confused Deputy Problem

Ensuring that a user is authenticated on the ‘hard shell’ of a system is a relatively easy problem handled in standard web architectures with things like LDAP or one of many user authentication tools. However, it is crucial to ensure these credentials, or some form of them, are passed through a multi-tenant application to ensure that sub-layers cannot inadvertently allow a user access (through bugs or malicious use) to unapproved data. This is known as the confused deputy problem, or as Wikipedia puts it:

A confused deputy is a computer program that is innocently fooled by some other party into misusing its authority.

The unintentional release of data can occur maliciously or accidentally through a bug, but I wanted to ensure that at every level user data was segregated and required information from the level above to provide access, avoiding any leakage.

In all its dirty glory, here’s the entire read architecture, with our security broken out into layers.

Let’s step through the precautions at each layer.

Layer 0: API Gateway

At the very edge, Fineo uses the AWS API Gateway to handle all of our authentication and simple access control. Each device has an access/secret key pair used to sign requests, while user credentials (username/password) are managed via AWS Cognito.

Our simple ‘hard shell’. Yes, hard shells are known to not be viable in the cloud, but it does enforce a minimum effort to attack. Each layer below is also controlled with AWS IAM controls to ensure only the specific services can make requests.

This outer layer ensures that we don’t have unauthorized access. We plan to move to an internal authentication service as a cost saving measure, but for now, AWS provides a quick and easy way to get going fast.

Layer 1: REST Service

All SQL read requests are passed from the API Gateway and sent to a simple REST server. The REST server is a glorified proxy for the underlying distributed query planning and execution engine.

The REST server also ensures that all requests must come tagged with an API Key for the user (this is baked into our JDBC client driver as well).

After extracting the API Key we take a preliminary pass at parsing the SQL request (via Apache Calcite). From there, we inject a WHERE clause into the request, enforcing that the API Key from the request matches the Tenant Key in the data. This ensures that users cannot access other users’ data, at a query level.

Layer 2: Query Planning/Execution

The query planning and execution layer takes the SQL request and breaks it down into its component parts, determines the optimal way to execute the query and then distributes the work to a group of workers that each process a chunk of the data, before passing it up the execution tree and eventually to our proxy server, and finally back to the user.

Fineo’s core execution engine is Apache Drill, but with a somewhat invasive ‘FineoTable’ layer that translates a user query into an execution across several different data stores. In the query plan generation, our custom query rules ensure that there is an API Key filter in the query (inserted from Layer 1) before the query can complete the planning phase (see the translating SQL blog post for more).

Additionally, all table metadata is translated from the multi-tenant schema store and only returned on a per-tenant basis. This ensures that a tenant cannot even accidentally see the tables of another tenant.

Layer 3: Data Storage

Fineo transparently leverages two different data storage layers to enable both low-latency, row-oriented queries (what happened in the last 5 seconds?) as well as deep, cross cutting analytics and ad-hoc data science (what’s the average number of users with at least 10 interactions in the last 5 years?). The disparity in the type of query also dictates a disparity in the type of storage, if you want to ensure fast answers.

We use Amazon DynamoDB, with a tenant and timestamp oriented schema, to handle the low-latency queries. Analytic style queries are handled via Apache Parquet columnar-formatted files, stored in S3.

Our goal was to build the simplest system we could, that supported a broad range of use cases.

3a: DynamoDB

We chose to use DynamoDB because it was a fully-managed data store (saving huge operations overhead), wildly scalable, row oriented and supported a good amount of operator pushdown.

As a time-series oriented service, we still had to do a bunch of work around aging off older data (time-range tables) and managing data recency with write-time tables (see Using DynamoDB for Time Series Data for more info).

The trick then is figuring out the correct schema to ensure that tenants are separated and that access is fast and not impacted by other tenants. The schema we came up with was not ground breaking:

Hash Key	Range Key
API Key, Logical Table Id	Timestamp

To access any data, you must provide the tenant API Key and the logical table id (from our schema service, itself organized per-tenant) before being able to read a row.
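To make the access pattern concrete, here is a small Python sketch that uses an in-memory dictionary as a stand-in for the DynamoDB table. The helper names are hypothetical, and joining the two parts of the hash key with an underscore is an assumption.

```python
from typing import Dict, List

# In-memory stand-in for a DynamoDB table: hash key -> {range key -> row}.
Table = Dict[str, Dict[int, dict]]


def hash_key(api_key: str, logical_table_id: str) -> str:
    """Compose the hash key from the tenant API Key and logical table id."""
    if not api_key or not logical_table_id:
        raise ValueError("both the API Key and the logical table id are required")
    # Assumption: the two parts are joined with an underscore.
    return f"{api_key}_{logical_table_id}"


def put_row(table: Table, api_key: str, logical_table_id: str,
            timestamp: int, row: dict) -> None:
    """Store a row under the tenant/table hash key, ranged by timestamp."""
    table.setdefault(hash_key(api_key, logical_table_id), {})[timestamp] = row


def read_range(table: Table, api_key: str, logical_table_id: str,
               start: int, end: int) -> List[dict]:
    """Read one tenant's rows for one logical table, ordered by timestamp."""
    rows = table.get(hash_key(api_key, logical_table_id), {})
    return [rows[ts] for ts in sorted(rows) if start <= ts <= end]


table: Table = {}
put_row(table, "tenant-a", "thermostats", 100, {"temp": 21})
put_row(table, "tenant-a", "thermostats", 200, {"temp": 22})
put_row(table, "tenant-b", "thermostats", 150, {"temp": 99})
print(read_range(table, "tenant-a", "thermostats", 0, 300))
```

Every read goes through `hash_key`, so there is no code path in this sketch that scans across tenants.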

This schema also ensures that access to different logical tables for the user (i.e. one for each of their products, like ‘temperature sensors’ or ‘vacuums’) is fast, since user/table data is:

co-located
ordered by timestamp (our ‘primary’ key - it’s a timeseries platform).

The chances of two tenants accessing the same dynamo instance are very slight and even less of a concern given a stable auto-capacity monitoring and management layer (running out of read capacity? automatically turn up the server capacity!).

3b: Amazon S3

Fineo optimizes for access to recent data by leveraging a row store. However, we also wanted to support analytics, data science, and ad-hoc queries; queries that can span huge swathes of the data and tend to be more columnar based. We turned to Apache Parquet files stored in Amazon S3, grouped by tenant and date. This gave us a folder hierarchy like:

  /tenant api key
   /year
    /month
     /day
      /hour

This allows our query execution to easily avoid data for other tenants (remember the API Key filter injection?), and also to prune the potential directories to search down to a very specific range. Using Parquet’s columnar format allows us to quickly and easily answer roll-up style queries.
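The directory pruning can be sketched in a few lines of Python. This is only an illustration of the idea, not the Drill pruning rules; `hourly_prefixes` is a hypothetical helper and the zero-padded path segments are an assumption about how the folders are named.

```python
from datetime import datetime, timedelta
from typing import List


def hourly_prefixes(tenant_api_key: str, start: datetime, end: datetime) -> List[str]:
    """List the hour-level directories that can hold data in [start, end)."""
    # Round the start down to the top of its hour so partial hours are included.
    current = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < end:
        # Assumption: /tenant/year/month/day/hour with zero-padded segments.
        prefixes.append(f"/{tenant_api_key}/{current:%Y/%m/%d/%H}")
        current += timedelta(hours=1)
    return prefixes


for prefix in hourly_prefixes("some-tenant-id",
                              datetime(2017, 5, 2, 22, 30),
                              datetime(2017, 5, 3, 1, 0)):
    print(prefix)
```

Only directories under the tenant's own prefix are ever generated, so other tenants' files are never candidates for the scan.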

S3 storage is also notably cheaper than running an online database and provides us even more flexibility in speed of storage by enabling Glacier support.

We could have added in another layer of protection using tenant-specific access control and encryption, but given that DynamoDB doesn’t support server-side encryption, it didn’t seem worth the effort (and there was no pressing user need for it!).

Per Tenant Deployment

A single tenant deployment can be necessary when the tenant has specific requirements around security (e.g. custom keys) or data commingling, among many other potential reasons. We built this into the architecture from the start and ensure that every single pre-production test run also executes against a single-tenant architecture (exact same calls, different backend), even if we didn’t have that requirement yet. It also helped when developing a sandbox for local testing: we could stand up the whole infrastructure, just like production.

The core separation for a single tenant came in the REST layer (Layer 1). Here, the tenant’s API Key was required to match one and only one API Key (rather than just matching the one assigned to that user) bound to the server’s deployment. If the API Key doesn’t match, then the request is rejected. Otherwise, we use the same separation described above in query execution and data storage, but on physically separate resources.

Zero Downtime Migration

Tenants might need to be migrated to other servers if they have a dramatically different workload than other customers or groups of customers, both to help balance access and to diversify risk.

Leaving data in place

When a tenant is being migrated, but is happy to remain in the same AWS Availability Zone, we can just spin up a new instance of the query execution engine and point the dynamo access to the old tables and the new per-tenant prefix. Then, as the data from the old tables naturally ages off, we eventually will only be accessing the new, tenant specific tables.

Note that S3 doesn’t actually need to be ‘migrated’ because it is already tenant separated and provides no control over location.

Physical Migration

If we need to move a tenant to another location, for instance because they want even lower latency access, it’s a similar activity to a standard migration. We start with a bulk copy of the S3 data to the new region. All new writes get sent to the new tables, while reads will be done from both the old and new tables. However, because these are geographically distributed, this is a notably painful solution - AWS egress data costs hurt and latency will be generally very bad.

However, we can backfill the ‘new’ tables from the S3 files that overlap the time-range that is not aged off. At the same time, we can do copies of the current table in the background to ‘catch up’ any data that has not been converted to long term-storage. This enables us to minimize the window of ‘slowness’ for the user during the migration.

At the end, we asynchronously turn off access to the ‘old’ table and return to fast, single-location access for the migrated tenant, all with zero downtime.

Wrap Up

When designing for multi-tenancy, it’s best to build it in from the beginning. It’s easy to set up per-tenant instances/infrastructure, but that approach tends to lose when considering the bottom line. Instead, you need to carefully consider how to logically separate data while preserving a high quality of service, allowing for eventual migration and preventing unauthorized access - even unintentionally.

Want to learn more about the Fineo architecture? Check out the next post in the series: Using DynamoDB for Time Series Data.

Translating SQL queries for schema on NoSQL

2017-05-05T00:00:00+00:00

Fineo uses a novel semi-schemaful approach to unlock the potential of NoSQL data stores, while simultaneously enabling ‘metalytics’ queries by providing an engine that seamlessly supports everything from nearline, operational queries (e.g. low latency, small scale) to deep, ad-hoc analytics. Primarily powered by Apache Drill and a complex set of query plan steps, we can find the optimal representation of the data to answer a query and push down work to the edge, making answers fast.

Drill’s planning engine is built on Apache Calcite - a generic SQL planner and in-memory execution engine. Fineo adds a custom ‘storage adapter’ that translates queries into a series of transformations, that eventually generates a limited number of plans; the plans vary in the underlying storage they query. For instance, one plan could query all the DynamoDB tables, while another queries one DynamoDB table and a number of S3 directories.

Because we are focusing on the time-series domain, we just need to find the best way to get the data for the user’s query over a given time range. Each storage engine provides information about the ‘hardness’ of the query, which is surfaced into the query planner and then leveraged to find the lowest ‘cost’ plan - the one that answers the user’s query as fast as possible.

We force the query optimization process into a set of stages with a set of custom ‘marker’ relations that can only be handled by the Fineo rules, helping ensure the process remains understandable. Each stage handles some aspect of the query generation, be it translating the user’s schema into queries for underlying storage or finding the right storage components to support the specified time range.

Stage 1: Managing Schema

The execution of each query starts with a request to the schema store for the available schemas for the tenant. We can thus match up the user’s expected fields to the query fields, translating the query into something more extensive in the underlying store to encapsulate all the fields. The schema also gets passed down to the edge operators so they can understand the fields that are coming back from the raw storage.

A user query can start as:

> SELECT temp from MY_TABLE

and then get translated at the database interface layer to something like (we also enforce the tenant id for all results)

> SELECT temp, temperature, tmp FROM FINEO_TABLE WHERE tenant_id = 'some-tenant-id'

But really, we start with a query plan for the Fineo table that looks like:

LogicalSort
  LogicalFilter
    LogicalProject
      FineoRecombinatorMarkerRel
        LogicalTableScan
        LogicalTableScan

The FineoRecombinatorMarkerRel acts as the first gate of translation and the two logical table scans are for the potential read of DynamoDB and S3. This marker will also eventually get translated into a physical “Recombinator” with the job of recombining the underlying data fields into a coherent, user-facing representation.

Stage 2: Translating Table Types

The next stage translates the logical table types to actual queries on the underlying engine reads. Here, we inject tenant, metric and time range filters based on the query. We also expand the original query fields into the full range of potential field names.

LogicalSort
  LogicalProject
    FineoRecombinatorMarkerRel
      LogicalFilter
        LogicalTableScan
      DynamoRowFieldExpanderRel
        LogicalTableScan

We can also inject casts for known fields to the correct type from the underlying store. If the fields are already the known type, the cast has no performance penalty, but it eases the translation of fields that previously did not have a type (and were thus stored as strings).

Stage 3: Logical Planning

Now we actually get into translating logical table scans into a scan of the DynamoDB table(s) and/or the S3 files. Standard Drill rules also attempt to prune the directories or tables to read, based on the timestamp/time-range requested by the user by ‘pushing down’ these filters into the respective scans.

The Fineo rules also generate a set of plans to query an overlapping range of data between DynamoDB and S3, from which the lowest cost plan survives to the next stage. DynamoDB tables cover weekly chunks of data and are removed after a few months; S3 partitions down to the day granularity, but covers all history. We lazily update the S3 storage (the data is already available in DynamoDB), allowing us to merely consult a water-mark for the latest S3 translation and generate potential query plans from there.

For instance, consider that we have the following DynamoDB tables:

2017-04-16_2017-04-22
2017-04-23_2017-04-29
2017-04-30_2017-05-06

and a watermark at 2017-05-02 (May 2nd, 2017). For a query that does not include a timestamp, we could then generate plans like (intermediate steps removed):

## Plan 1
 FineoRecominator
  DynamoTableScan(2017-04-16_2017-04-22)
  DynamoTableScan(2017-04-23_2017-04-29)
  DynamoTableScan(2017-04-30_2017-05-06)
 FineoRecombinator
  ParquetScan(s3://data.fneo.io/stream/parquet/tenant_id/2017-01-01,..., s3://data.fineo.io/stream/parquet/tenant_id/2017-04-15)

## Plan 2
 FineoRecominator
  DynamoTableScan(2017-04-23_2017-04-29)
  DynamoTableScan(2017-04-30_2017-05-06)
 FineoRecombinator
  ParquetScan(s3://data.fneo.io/stream/parquet/tenant_id/2017-01-01,..., s3://data.fineo.io/stream/parquet/tenant_id/2017-04-22)

## Plan 3
 FineoRecominator
  DynamoTableScan(2017-04-30_2017-05-06)
 FineoRecombinator
  ParquetScan(s3://data.fneo.io/stream/parquet/tenant_id/2017-01-01,..., s3://data.fineo.io/stream/parquet/tenant_id/2017-04-29)

## Plan 4
 FineoRecominator
  DynamoTableScan(2017-04-30_2017-05-06)
 FineoRecombinator
  ParquetScan(s3://data.fneo.io/stream/parquet/tenant_id/2017-01-01,..., s3://data.fineo.io/stream/parquet/tenant_id/2017-04-29)

## Plan 5
 FineoRecominator
  DynamoTableScan(2017-04-30_2017-05-06)
 FineoRecombinator
  ParquetScan(s3://data.fneo.io/stream/parquet/tenant_id/2017-01-01,..., s3://data.fineo.io/stream/parquet/tenant_id/2017-05-02)

Each plan uses progressively more of the underlying S3 storage fields, rather than the DynamoDB tables, up to the water-mark.
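One way to enumerate candidates like these is sketched below in Python. It is not the Fineo rule set: `Plan`, `parse_table` and `candidate_plans` are hypothetical names, and the sketch assumes that S3 may take over either at a weekly table boundary or at the water-mark itself.

```python
from datetime import date, timedelta
from typing import List, NamedTuple, Tuple


class Plan(NamedTuple):
    dynamo_tables: Tuple[str, ...]  # weekly tables still read from DynamoDB
    s3_through: date                # last day read from the Parquet files in S3


def parse_table(name: str) -> Tuple[date, date]:
    """Split a 'start_end' table name into its first and last day."""
    start, end = name.split("_")
    return date.fromisoformat(start), date.fromisoformat(end)


def candidate_plans(tables: List[str], watermark: date) -> List[Plan]:
    """Enumerate DynamoDB/S3 splits, using progressively more of S3."""
    # Assumption: S3 can take over at any weekly table boundary that has
    # already been converted, or at the water-mark itself.
    boundaries = {watermark}
    for name in tables:
        start, _ = parse_table(name)
        day_before = start - timedelta(days=1)
        if day_before <= watermark:
            boundaries.add(day_before)

    plans = []
    for boundary in sorted(boundaries):
        # Keep every DynamoDB table that holds data newer than the S3 boundary.
        remaining = tuple(n for n in tables if parse_table(n)[1] > boundary)
        plans.append(Plan(remaining, boundary))
    return plans


tables = ["2017-04-16_2017-04-22", "2017-04-23_2017-04-29", "2017-04-30_2017-05-06"]
for plan in candidate_plans(tables, watermark=date(2017, 5, 2)):
    print(plan.s3_through, plan.dynamo_tables)
```

Each candidate would then be handed to the cost model; the sketch only covers enumeration, not costing.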

Similarly, for a bounded time range we could prune more of the parquet scan down or even eliminate it. Additionally, we can also push down the time-range into the DynamoDB request, generating very specific query timeranges for the tenant, further limiting the amount of data that is read.

In this stage we also have some ‘push down’ rules that allow any user projections (e.g. SELECT field1 ... would become a Projection relation on field1) or filters to pass through our Recombinator and into the underlying scan, again helping to limit the amount of data necessary to fulfill the user’s request.

All of this translation is managed by a custom rule that executes against an injected ‘rel’ that can only be removed by the rule; there is only one valid path to process the query plan to the next ‘stage’ and it has to go through each custom rule.

We rely on Drill to select the lowest cost plan, based on the expected cost of each plan in CPU, memory, and network use.

Stage 4: Physical Planning & Execution

Finally, the logical plan is converted into a physical execution plan; this plan can be pushed down to the Drill worker nodes and executed in a tree. Drill is very good about minimizing the time to a result by generating optimized code for each query and ubiquitously leveraging zero-copy buffers.

The main work for Fineo is in the FineoRecombinator. When we execute the plan we turn the dynamic column results for each underlying table into a coherent set of columns for the user, based on the schema we loaded at the start of the query planning. For instance, dynamo might return a row with the values:

Column	Value
temp	null
temperature	24
tmp	null
timestamp	149377839700
tenantid_metricid	tid_mid

and we auto-magically translate it to:

Column	Value
temperature	24
timestamp	149377839700
tenant_id	tid
metric_id	mid

based on selecting the first non-null value for the user-visible columns. We ensure that all rows have a tenant id, metric id and timestamp, but rely on an ‘upstream’ filter to match the tenant and metric id, as well as filtering out any errant rows outside the requested time range.
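The row translation above can be sketched as a small Python function. This is an illustration rather than the FineoRecombinator itself; `recombine` is a hypothetical name, and splitting `tenantid_metricid` on the first underscore assumes tenant ids never contain one.

```python
from typing import Dict, List, Optional


def recombine(row: Dict[str, Optional[object]],
              aliases: Dict[str, List[str]]) -> Dict[str, object]:
    """Collapse stored columns into user-visible columns; first non-null wins."""
    # Assumption: tenant ids contain no underscore, so the first one separates them.
    tenant_id, metric_id = str(row["tenantid_metricid"]).split("_", 1)

    result: Dict[str, object] = {}
    for visible_name, stored_names in aliases.items():
        for stored in stored_names:
            value = row.get(stored)
            if value is not None:
                result[visible_name] = value
                break  # later aliases are ignored once a value is found

    result["timestamp"] = row["timestamp"]
    result["tenant_id"] = tenant_id
    result["metric_id"] = metric_id
    return result


dynamo_row = {
    "temp": None,
    "temperature": 24,
    "tmp": None,
    "timestamp": 149377839700,
    "tenantid_metricid": "tid_mid",
}
schema_aliases = {"temperature": ["temp", "temperature", "tmp"]}
print(recombine(dynamo_row, schema_aliases))
```

A user-visible column with no non-null alias is simply left out of the result in this sketch.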

Summary

Fineo provides a novel layer of schema flexibility that helps users unlock the power of NoSQL data stores - the flexibility and speed of development. Free to change the data model or absorb mistakes, users can focus on deriving value from that data, rather than trying to clean it and put it into the right place. Simultaneously, Fineo also enables an unheard of range of queries in a single API because we can dynamically select the optimal representation and push down multiple components to make queries return blazingly fast.

Want to learn more about the Fineo architecture? Check out the next post in the series: multi-tenant SQL security in-depth.

Implementing Dynamic Schema At Scale

2017-05-02T00:00:00+00:00

A rehash of Dynamic Schema at Scale, covering how we implemented reading semi-schematized data from a NoSQL store. I touch on query translation and on how schema updates are processed.

One of the more innovative things we developed at Fineo is our ‘No ETL’ infrastructure. Much of this came from the idea at Stitch Fix that engineers shouldn’t write ETL, but should instead build the core building blocks that enable ETL for the people who care (e.g. data scientists). The Fineo toolkit enables our customers to evolve their schemas at the push of a button, throwing out much of the traditional ETL grunt work. Beyond just eliminating basic transformations, Fineo also enables storing and querying data without any sort of schema! And that means you can move faster than ever before.

Continuing the ‘behind the curtain’ look at Fineo infrastructure I’m going to dive into how we make dynamic schema possible across our stack.

Traditional Challenges

Schema changes in traditional RDBMS installations are non-trivial operations, fraught with peril and often closely managed. Even worse, they can often gate work across multiple teams as you wait for a central data store to be upgraded. On top of that, if you are changing the aspect of any fields (e.g. exchanging fahrenheit for celsius or sending “Temperature” rather than “temp”) you have to layer on additional ETL steps before the data can even get into the database. Compound that with any downstream data stores that are maintained to achieve orthogonal goals, like analytics vs. low-latency access to recent data.

Basically, it’s a giant pain in the ass.

And it’s not until you have done all this work (which can happen frequently in a fast moving company) that you can even start to look at the data you would send.

Fineo: Easy Ingest, No ETL

With Fineo, you have to do almost none of that work. Schema is defined with a straightforward API, or in our web app, and can use the exact same data definitions as you would send to the API.

// Example definition
class Event {
    public long timestamp;
    public String metrictype;
    public int value;
}

// Skeleton command line request for creating the definition
$ java -jar fineo-tools.jar create-schema --class Event

If you don’t define fields before you send them, our backing store (based in NoSQL) transparently stores the data so you can access it later.

And then it gets really cool.

You don’t even need to do any of that schema management until you have time to get around to it. Instead, you can start querying for fields immediately, just by knowing the names and what type you expect it to be. From there, you can use the whole world of SQL to slice and dice data, so you can quickly and easily get running.

At some point later, you can ‘formalize’ your schema by defining the expected fields, types and aliases. Formalizing the schema will dramatically speed up any analytics and enable auto-complete queries. However, you can also continue using the alias names for different fields that you had developed before formalizing the schema, so the queries you are already running will continue to run just fine.

Avro All Around

The root of our schema management process uses Apache Avro. Avro is great - it has schema evolution, field aliasing and self-describing serialization. Sounds like we are done! Just use Avro everywhere.

Ummmm, not so fast.

To start with, you need a way to keep track of all different schemas for each customer. Enter the avro schema repo, based on some work Jay Kreps did in AVRO-1124.

The Avro Schema Repo (ASR) is a REST-based service that lets you manage the evolution of schema for a logical ‘subject’, a collection of mutually compatible schemas (the changing schema of the ‘thing’ you are managing). The ASR comes with a couple of nice default database adapters - Zookeeper and a local FS (if you poke around, there is also a JDBC-based backend on github).

At Fineo we strive for ‘NoOps’, choosing instead to rely on hosted services to minimize overhead and automate as much as possible. To that end, we wrote a custom schema store backend on top of DynamoDB. AWS also has a relational database service (RDS), but it is managed by the VM, rather than by the request as with Dynamo, leading to more ops than we really wanted. The trade-off with Dynamo is that we will incur 2-3x write overhead to ensure that we get consistent results [1]. Fortunately, schema doesn’t need to be blazing fast - the write pipeline is asynchronous.

We are already using Dynamo for our near-realtime store and it has nice NoSQL properties that let us really leverage dynamic field names, so it was a pretty easy choice. Fortunately, ASR has a pretty lightweight requirement on the database adapter and already has a caching strategy, so this was pretty easy to implement (especially using Dynamo’s simple ORM tools).

We hope to open source the DynamoDB adapter for ASR soon. Keep an eye out!

Scan-time, Party Time

At Fineo, we wanted to make it as easy as possible to push any and all data into the database and make it instantly queryable. If we are allowing any sort of data going in, we just need a way to understand what we have. More specifically, we need a way to translate a single column lookup into a lookup for any ‘alias’ of a column.

Part of the power of using a NoSQL store is that we can just stuff in fields without having to touch any extensive DB DDL tools (though our schema management really is “DDL as Metadata”). Since we know the field names, we can then later just query what we expect is in there, and have the database tell us what actually is there.

When reading the data back, we push the expected schema down the processing tree to the very edge node. There we process each row to convert each stored column into the expected column. This way, we can leverage the flexibility of the NoSQL store, but with the usability (e.g. sanity) of a schema.

A user query can start as:

> SELECT temp from MY_TABLE

and then get translated at the database interface layer to something like:

> SELECT temp, temperature, tmp FROM DYNAMO_TABLE WHERE TENANT_ID = 'some-tenant-id'

And then we select the column that actually has some data for each row (first available wins) to fill the contents of the temp field to return to the user.
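The query expansion step can be sketched in Python as follows. It only mirrors the example above and is not how the database interface layer is actually written; `expand_select` is a hypothetical helper, and the alias map is assumed to come from the schema store.

```python
from typing import Dict, List


def expand_select(field: str, tenant_id: str, aliases: Dict[str, List[str]],
                  table: str = "DYNAMO_TABLE") -> str:
    """Expand one requested field into a lookup over every known alias."""
    # The requested name comes first, followed by any other names it is known by.
    names = [field] + [a for a in aliases.get(field, []) if a != field]
    # Naive quoting for illustration only; real code would bind the tenant id.
    return f"SELECT {', '.join(names)} FROM {table} WHERE TENANT_ID = '{tenant_id}'"


known_aliases = {"temp": ["temperature", "tmp"]}
print(expand_select("temp", "some-tenant-id", known_aliases))
```

A field with no known aliases is passed through unchanged, which is what lets unformalized fields be queried by name.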

Our Dynamo extension of the ASR has support for tracking unknown field names and potential types. When we receive events that have ‘unknown fields’ we update the unknown fields list for that Metric type in Dynamo and then write the unknown fields into columns by the customer specified name as simple strings. When customers query for fields that have not been formalized they have to provide the expected type of the field by casting it (or just accepting a simple string representation, the most common of denominators).

When we know the field/type, we can pre-cast the fields from the weak-typing done in DynamoDB into the ‘real’ types the user expects.
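As a minimal sketch of that pre-cast, assuming values are stored as strings and that the type names below are the ones in use (both the `CASTS` table and `cast_field` are hypothetical):

```python
from typing import Callable, Dict, Optional

# Assumed type names; each maps to a converter from the stored string.
CASTS: Dict[str, Callable[[str], object]] = {
    "string": str,
    "int": int,
    "long": int,
    "float": float,
    "double": float,
    "boolean": lambda raw: raw.strip().lower() == "true",
}


def cast_field(raw: Optional[str], expected_type: str = "string") -> object:
    """Convert a stored string into the type the user (or schema) expects."""
    if raw is None:
        return None  # missing values stay missing, whatever the type
    try:
        convert = CASTS[expected_type]
    except KeyError:
        raise ValueError(f"unsupported type: {expected_type}") from None
    return convert(raw)


print(cast_field("15", "int"), cast_field("4.5", "double"), cast_field("15"))
```

Defaulting to `string` matches the fallback of accepting the simple string representation when no type is given.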

Avro Schema Repository Access

In our drive to achieve ‘zero-ops’ we throw out the ASR REST layer and just query DynamoDB directly using the ASR api[2]. It still provides all the caching you would want, but saves us a network hop. Naturally, the trade-off is that we need to be very careful with how we evolve the schema and access patterns, but as a small shop with high visibility into the code effects, we made the choice to simplify ops over later complexity.

Keep in mind that the schema repository has two touch points, (1) the ingest pipeline, where we track new schema and apply existing schema to incoming events, and (2) the external-facing web server, which needs to understand schema to serve reads and for admins to manage the schema.

Since these are stateless services, we could deploy them via ElasticBeanstalk as containers and even replace the direct DynamoDB access with the REST endpoint with minimal changes. For now, we just use AWS Lambda to handle the scalability and availability of the schema service.

Exposing User Schema

All schema is exposed to users via simple REST requests, managed through an AWS API Gateway (for authentication and authorization) and then sent to a schema AWS Lambda function. This ensures that all schema access is scalable and also easily tested.

We could also use the same API internally, and probably should as we grow, but for now it’s cheaper and faster (both in development and request time) to directly access the underlying schema store. This means a bit more coordination on the testing side to ensure that there are no breaking changes to how schema is stored, since each lambda function, the ingest processing and the SQL query execution all work on the same underlying schema data store.

Managing multiple entities

As a multi-tenant platform we naturally have to manage multiple customers. Each tenant (customer) is assigned an Id - a tenantID (did you guess it?) - which is then used to lookup the possible schemas for that customer, each assigned a schemaID (I know you didn’t guess that one).

Remember how we mentioned that you can rename things on the fly?

Well, that means we can’t actually store ‘real’ names, but instead have to use aliases. These aliases are stored alongside the customer schema so we can manage those aliases directly as part of the schema. So the schema for a thing is an instance of a schema.

Thus the schema for a given ‘object’ (event) is a combination of the tenantID + schema ID + schema alias(es). We have schema for describing a tenant + its known schemas (Metadata) and then each schema has its own schema (Metric). Then for a given type of ‘thing’ for a tenant, we store instances of the metadata and each metric. This leads to a schema repository that looks like:

subject id	schema
_fineo-metadata	Metadata.schema
_fineo-metric	Metric.schema
data production inc.	Metadata.instance
n1	Metric.instance
n2222222	Metric.instance

Ok, that is going to take some explaining. The Metadata.schema and Metric.schema are actually the following Avro schemas[3]:

 record Metadata {
    string canonicalName;
    union {null, map<array<string>>} canonicalNamesToAliases = null;
  }

  record Metric{
    Metadata metadata;
    string metricSchema;
  }

These schemas are then used to understand the Metadata and Metric instances we get per customer. Going back to our example of DPI above, your first level Metadata instance will look something like:

{
  "canonicalName": "n1",
  "canonicalNamesToAliases": {
    "n2222222": ["machine1", "machine1b", "machine1c"]
  }
}

So the customerId is n1. This customer only has a single schema, with the canonical name n2222222. However, we might get multiple different device name types that are really the same “thing”. This is useful when you have devices from different manufacturers that produce different metrics, but are really the same thing.

From there, you can also lookup the schema instance for the device n2222222 (which to the client looks like they are looking up machine1 or machine1b or machine1c). That will give you something like:

  {
    "metadata": {
      "canonicalName": "n2222222",
      "canonicalNamesToAliases" : {
        "f1": ["field1"],
        "f2": ["field2", "field2b"] 
      }
    },
    "metricSchema": "\\ some encoded avro schema based on a BaseRecord \\"
  }
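Resolving a customer-facing alias back to its canonical name is the same operation at both levels (machines within a tenant, fields within a metric). A minimal Python sketch, with `canonical_name` as a hypothetical helper over the JSON shapes shown above:

```python
from typing import Optional


def canonical_name(metadata: dict, alias: str) -> Optional[str]:
    """Find the canonical name that a customer-specified alias maps to."""
    # The map is nullable in the Metadata schema, so treat null as empty.
    mapping = metadata.get("canonicalNamesToAliases") or {}
    for canonical, aliases in mapping.items():
        if alias in aliases:
            return canonical
    return None  # unknown alias: nothing has been formalized under this name


tenant_metadata = {
    "canonicalName": "n1",
    "canonicalNamesToAliases": {
        "n2222222": ["machine1", "machine1b", "machine1c"],
    },
}
metric_metadata = {
    "canonicalName": "n2222222",
    "canonicalNamesToAliases": {
        "f1": ["field1"],
        "f2": ["field2", "field2b"],
    },
}
print(canonical_name(tenant_metadata, "machine1b"))
print(canonical_name(metric_metadata, "field2b"))
```

The linear scan keeps the sketch short; an inverted alias-to-canonical map would be the obvious optimization if lookups were hot.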

We have a known set of fields that are included in every record, comprising a BaseRecord and its BaseFields:

  record BaseRecord {
    BaseFields baseFields;
  }

  record BaseFields{
    string aliasName;
    long timestamp;
    map<string> unknown_fields;
  }

For now, it’s enough to understand that this is the basic building block of an ‘object’ schema. Shortly, we will discuss how it’s actually used.

Wait. What are you keeping track of?

A logical machine (e.g. a thermostat) is actually an instance of a Metric, that has an instance of its own metadata to map canonical field names, keep track of its own name and then store the schema for the actual customer record. This allows us to evolve how we define a generic schema for a tenant or metric, as well as evolving how the schema for a given ‘thing’ looks.

Using schema to define schema… followed by a big pile of turtles to the bottom :)

Note that Avro’s standard aliasing only applies to records, which means that every field becomes its own record instance, which quickly gets to be a pain to manage and also prevents easy alias logic reuse. I’m not saying you couldn’t do it, it just gets to be a pain (left as an exercise to the reader).

Building schema - modern DDL

The schema for a given field is programmatically built through our DDL APIs. By using an instance of the Metadata we can dynamically rename fields without actually changing any underlying data. Eventually, we also want to do dynamic type conversion and lazy ETL.

Each Metric instance’s metricSchema (I know, the naming is a touch confusing - I’m open to suggestions) is actually an extension of the BaseRecord. Each event in the platform is expected to have a couple things when it reaches the ‘write’ ingest processing stage:

a timestamp
a customer specified alias (which we remap to a canonical name)
some number of unknown fields

Going back to our example of DPI, they brought a new machine online which has a couple of metrics: temperature, pressure and gallons. After connecting it to the platform, we will end up with a record that looks like:

  {
    "source": "new machine",
    "timestamp": "January 12, 2015 10:12:15",
    "temperature": "15",
    "pressure": "4",
    "gallons": "5"
  }

Which gets remapped via the Fineo ingest pipeline to a simple BaseRecord instance:

  {
    "alias": "new machine",
    "timestamp": "1421057535000",
    "unknown_fields": {
      "temperature": "15",
      "pressure": "4",
      "gallons": "5"
    }
  }
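The remapping can be sketched in Python as below. This is not the ingest pipeline's code; `to_base_record` is a hypothetical name, and treating the device timestamp as UTC is an assumption (it is the interpretation that reproduces the epoch value in the example).

```python
from datetime import datetime, timezone


def to_base_record(event: dict) -> dict:
    """Remap a raw device event into the alias/timestamp/unknown_fields shape."""
    fields = dict(event)  # copy so the caller's event is left untouched
    alias = fields.pop("source")
    parsed = datetime.strptime(fields.pop("timestamp"), "%B %d, %Y %H:%M:%S")
    # Assumption: device timestamps are UTC, which matches the example value.
    millis = int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
    return {
        "alias": alias,
        "timestamp": str(millis),
        # Everything else is kept, as a string, until a schema is formalized.
        "unknown_fields": {name: str(value) for name, value in fields.items()},
    }


raw_event = {
    "source": "new machine",
    "timestamp": "January 12, 2015 10:12:15",
    "temperature": "15",
    "pressure": "4",
    "gallons": "5",
}
print(to_base_record(raw_event))
```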

The unknown_fields then get stored as simple strings in DynamoDB, which we can read later (through some gymnastics) without having defined any schema or types. At some point later, a user goes in and formalizes the schema to the types that we talked about. The ‘extended BaseRecord’ and machine Metric instance then looks like:

{
   "metadata": {
      "canonicalName": "n2222222",
      "canonicalNamesToAliases": {
        "f1": ["temperature"],
        "f2": ["pressure"],
        "f3": ["gallons"]
      }
    },
    "metricSchema" :
      "record: BaseRecord {
        BaseFields baseFields;
        int f1;
        long f2;
        int f3;
      }"
}

Since we are backing everything by Avro, we can cache schema until we find it is out of date, and only then request a new one. Further, by storing all the fields by tenant and schema, we have a very high throughput, multi-tenant access that probably doesn’t need much of a cache, which backed by DynamoDB gives us highly scalable schema evolution.

Future Work

While this gets you pretty far, we do see some things that we think customers would find helpful:

field type prediction
dynamic field typing, so you can change the type of data
advanced sanitization and transformation
missing field alerts

Want to learn more about the Fineo architecture? Check out the next post in the series: translating SQL queries.

Notes

1. Dynamo Schema Repo

We can actually be a bit lazier here and not read/write with full consistency, instead relying on the mutually compatible evolutionary nature of Avro schema. We should be able to step through old versions to read data from data serialized with an older schema. In fact, we can keep track of which schema number (schema-id) the data was written with and just use that schema to deserialize data.

2. Elastic Beanstalk

We could actually use AWS Elastic Beanstalk to do a lot of the ops for us in deploying the web service. However, it is still another moving part. It gets us nice separation and the ability to evolve schema, but those seemed like minor gains right now compared to the overhead of running another service. Of course, as Fineo grows this will not always be the case and the advantage of using a more SOA style architecture will be increasingly compelling.

3. Metric Fields

We also have the ability to ‘hide’ fields associated with a machine. This allows us to do ‘soft deletes’ of the data and then garbage collect them as part of our standard age off process.

4. Realtime

For some definitions of realtime. Currently our ingest pipeline takes less than 1 minute, though we have extensions that allow querying on data within tens of milliseconds of ingest. Talk to Jesse if that is something you are interested in.

[Fineo ingest pipeline]:

Scaling up for an IoT World

2017-05-01T00:00:00+00:00

With Fineo’s Beta availability (link), I thought it would be interesting to look at how Fineo actually supports IoT-scale ingest and eliminates the need for traditional pipelines and the maintenance of several data stores. The transfer and conversion of data between these data stores (known as an Extract-Transform-Load, or ETL, process) tend to be very manual and fragile, making them a constant pain point. By eliminating the core ETL processes, and instead driving them into the core of the platform, Fineo frees people from the burden of data cleanliness and management, allowing them to focus on their business.

The advent of the Internet of Things (IoT) means almost every industry is generating several orders of magnitude more data than they have ever seen. ‘Traditional’ web companies are the only place to come close to this scale of data. Unfortunately, the standard Big Data tools tend to be unwieldy and capital intensive (even though they run on “commodity” hardware). While many companies recognize the potential of Big Data, few can actualize it because, in many industries, it is difficult to find experts to manage these distributed systems (i.e. it’s hard to convince engineers that counting bolts is interesting).

Fineo is a SaaS Big Data platform designed from the ground up for the brave, connected world in which we now find ourselves. Beyond completely elastic scalability with enterprise grade tooling, we are also looking to change how people manage their data with our No ETL tools.

Access

You can write in two modes: streaming or batch, each of which has a similar, though independent, pricing model. This makes it very simple to scale - everything just works as you get more devices and data.

All reads - analytics, ad-hoc queries, daily operations - are handled by a standard JDBC driver (ODBC coming soon!). That means you can just plug it into your favorite analytics tools and everything just works. Or, if you are in the homebrew camp, you can easily roll your queries with standard SQL.

(No) ETL & Late-Binding Schema

Traditional ETL is widely considered a painful, thankless process that is necessary for achieving business objectives by providing low latency access to data, deep analytics and ad-hoc data science. Fineo’s No ETL tools make it easier than ever to iterate and manage a heterogeneous device environment. You no longer need to worry about simple things like renaming database columns (and managing the transformation of data from legacy devices) or completely changing a column’s type (e.g. Celsius vs. Fahrenheit data when changing device components).

What would be a full time job for several engineers completely disappears in the Fineo framework, while simultaneously replacing the need for multiple data stores with a single, unified API.

In the future, we want to automate the entire ETL process. That means your Data Scientists can focus on insights, not being Data Janitors. This would be things like type clustering via Machine Learning, so new devices/events are instantly accessible and intelligible, so you can focus on using that data.

Behind the (Ingest) Curtain

Originally, the Fineo platform was built on entirely open source components enabling public or private cloud deployment. Our Beta will only be available on AWS - talk to us if you are interested in other/private cloud deployments - and to help move more quickly we carefully selected SaaS based replacements for some of the services. This allows us to run a nearly completely “NoOps” platform and focus on providing the truly innovative Fineo components.

Leveraging AWS

We leverage a host of AWS services for a couple of reasons:

as we scale up, cost scales with us
operational burdens are nearly zero.

Without further ado, here is the entire streaming ingest pipeline [2].

Basically, it’s a light stream processing layer over a standard lambda architecture. Pretty simple, right? There are some subtle elements of this architecture that give us some pretty fantastic abilities when building for ‘enterprise grade’ infrastructure.

Outside In

On the edge sits the AWS API Gateway. It’s a powerful tool that lets us easily define APIs and then interact with backend AWS services or our own API endpoints. It also provides very strong, fine grained authentication and authorization services, making it a great basis for the user-visible side of things.

From there, we process the events in a series of ‘stages’ backed by Kinesis streams (essentially large, distributed, durable queues). We archive the results of each stage for backups and subsequently build multiple representations of the data for fast queries.

Making One Size Fit All

One single database/system rarely supports all the use cases; low latency is almost always at odds with high throughput. This is exactly why we leverage multiple data representations, so we can pick the right one for the query and mash up multiple sources for an optimal representation of the data under query.

The common ‘web’ case mostly cares about the most recent data, the events that occurred in the last day or week, and a fairly small volume: on the order of millions of events. For this case, we leverage DynamoDB as our ‘nearline’ data store. It provides fast access to row-level data and scales dynamically with customer data needs.

We also have a secondary representation that is well suited to supporting Data Scientists and general analytics: a shredded columnar format (via Parquet) combined with the cutting edge read capabilities in Apache Drill and Spark to make deep, ad-hoc analytics blazingly fast. When leveraged with our No ETL tools, Data Scientists can now investigate their data more quickly and easily than ever to derive insights that help drive deep understanding and decision making.

What’s really exciting is that from the outside, it all just looks like SQL! But instead of querying across a minute, you can query across a day, month or year and get blazingly fast answers.

The Stream Processing Pipeline

Kinesis acts as a core buffer for managing each stage of the stream processing pipeline. Each stage is implemented as an AWS Lambda function. The first stage processes the raw events into an Avro schema that we understand or kicks them out to an error stream. The valid records are then sent on to two places: the raw archive and the ‘Staged’ Kinesis stream.
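
As a rough illustration of that first stage, here is a small Python sketch that parses raw JSON events and splits them into schematized records and errors. This is not the real Lambda: plain lists stand in for the Kinesis and Firehose streams, and the `metrictype` key and the minimal validation rules are assumptions made up for the example.

```python
import json


def raw_stage(raw_events):
    """Split raw JSON events into schematized records and errors."""
    staged, errors = [], []
    for raw in raw_events:
        try:
            event = json.loads(raw)
            if not isinstance(event, dict):
                raise ValueError("event must be a JSON object")
            # Hypothetical rule: every event needs a metric type and a timestamp.
            record = {
                "alias": event.pop("metrictype"),
                "timestamp": int(event.pop("timestamp")),
                # Everything else is kept as a string until a schema exists.
                "unknown_fields": {k: str(v) for k, v in event.items()},
            }
            staged.append(record)  # would go to the archive and 'Staged' stream
        except (ValueError, KeyError, TypeError) as err:
            errors.append({"raw": raw, "error": repr(err)})  # error stream
    return staged, errors


events = [
    '{"metrictype": "new machine", "timestamp": 1421057535000, "temperature": 15}',
    "not json",
    '{"temperature": 15}',
]
staged, errors = raw_stage(events)
print(len(staged), len(errors))
```

Keeping the bad events (with the raw payload intact) rather than dropping them is what makes it possible to notify the customer or replay them later.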

These schematized records are then processed by the ‘Staging’ Lambda function. Similar to above, we Firehose the incoming events (the schematized records) and error records to S3. The actual “work” of the stage is writing to DynamoDB, so we can serve near-line queries. At this point, you could query the data through our standard JDBC driver. The archived stream is also the data source for our batch transformations that enable fast-restore backups and our deep analytics tools.

Batch Transformations

The S3 “staged archive” location is processed periodically with an EMR Spark cluster to do a few things:

deduplicate records
extract schema changes
format records for read
build a fast-restore backup

The key part of this job is transforming events that have a known schema into a highly optimized, columnar format, which enables the blazingly fast speed for ad-hoc analytics. We also process the columns without schema so we can still read them in an unoptimized, ‘flat’ JSON format, but they lack some of the speed optimization of known data types. If we don’t recognize some of the data types, we will notify you so you can integrate them into the schema or fix the error.
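
The real job runs on Spark, but the core of the first and third steps can be sketched in a few lines of plain Python. This is only an illustration: the record layout, the choice of (alias, timestamp) as the deduplication key, and the function names are assumptions, not the actual job.

```python
def dedupe(records):
    """Keep the first record seen for each (alias, timestamp) key."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["alias"], rec["timestamp"])  # assumed dedup key
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def split_by_schema(records, known_fields):
    """Return (columnar, flat): columns for known fields, rows for the rest."""
    columnar = {name: [] for name in known_fields}
    flat = []
    for rec in records:
        for name in known_fields:
            columnar[name].append(rec["fields"].get(name))
        extras = {k: v for k, v in rec["fields"].items() if k not in known_fields}
        if extras:  # fields without schema stay as 'flat' rows
            flat.append({"alias": rec["alias"], "timestamp": rec["timestamp"], **extras})
    return columnar, flat


records = [
    {"alias": "m1", "timestamp": 1, "fields": {"temperature": 15, "rpm": "900"}},
    {"alias": "m1", "timestamp": 1, "fields": {"temperature": 15, "rpm": "900"}},
    {"alias": "m1", "timestamp": 2, "fields": {"temperature": 16}},
]
unique = dedupe(records)
columnar, flat = split_by_schema(unique, ["temperature"])
print(columnar)
print(flat)
```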

Since all the data is present in DynamoDB already, we can be a bit lazy about doing the batch transformations - taking days or weeks. This gives us a lot of flexibility around things like cost optimization, retries and extensive testing.

Pipeline Replayability Wins

Since each stage is stored in a new Kinesis stream (roughly equivalent to a Kafka topic), we have extensive replay abilities. Each Kinesis shard comes with 1MB/sec ingest and 2MB/sec reads. This gives us the ability to dark launch a completely parallel set of resources (Lambdas, S3 files, etc.) at every stage, giving us deep confidence when rolling out a new release.
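
The arithmetic behind that is worth spelling out. The sketch below uses the per-shard numbers quoted above and assumes that every consumer reads the full stream and that all consumers share the per-shard read limit; the function name and the example rates are made up.

```python
import math

SHARD_WRITE_MB_PER_SEC = 1.0  # per-shard ingest limit quoted above
SHARD_READ_MB_PER_SEC = 2.0   # per-shard read limit quoted above


def shards_needed(ingest_mb_per_sec, consumers):
    """Shards required so writers and all consumers fit in the per-shard limits."""
    for_writes = math.ceil(ingest_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    # Every consumer re-reads the full stream, so read load = ingest * consumers.
    for_reads = math.ceil(ingest_mb_per_sec * consumers / SHARD_READ_MB_PER_SEC)
    return max(for_writes, for_reads, 1)


print(shards_needed(3.5, 1))  # production consumer only
print(shards_needed(3.5, 2))  # production plus one dark-launched consumer
print(shards_needed(3.5, 3))  # a third consumer needs more shards
```

Because reads get twice the bandwidth of writes, a second (dark launched) consumer fits on the same shards as the production one; only a third reader forces the stream to grow.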

As mentioned above, we also leverage Firehose at each stage. On one hand, we get backups of each stage with the exact data. This allows us to recover from downstream processing errors (e.g. the raw -> schema transformation has a bug) or act as several sets of backups. On the other, we now also have a complete record of events that we can use as another level of testing for new code. Rather than relying on Kinesis, we can replay the events directly, ensuring that we can exactly mirror customer workloads in testing (hugely valuable for an enterprise environment).

Each stage can also see two main types of error - ingest/customer errors from bad data and commit/processing errors. For each error type we write them to a different Firehose stream. This lets us then tie in AWS notifications to alert when we get an error (as an S3 file). This can either be a notification directly to the customer - e.g. bad data - or waking up the Ops team in the middle of the night. Because the errors are archived into S3, we also can allow users to use Drill to query the errors with SQL.
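
A minimal sketch of that split, with plain lists standing in for the two Firehose streams. The `BadDataError` exception, the `run_stage` wrapper and the example handler are hypothetical names used only for illustration.

```python
class BadDataError(Exception):
    """Raised when a customer's event cannot be parsed or validated."""


def run_stage(records, handler):
    """Apply handler to each record, sorting failures into two error streams."""
    ok, customer_errors, processing_errors = [], [], []
    for record in records:
        try:
            ok.append(handler(record))
        except BadDataError as err:
            # Bad input: something the customer can be notified about.
            customer_errors.append({"record": record, "error": str(err)})
        except Exception as err:
            # Anything else is our problem: wake up the Ops team.
            processing_errors.append({"record": record, "error": str(err)})
    return ok, customer_errors, processing_errors


def handler(record):
    if "timestamp" not in record:
        raise BadDataError("missing timestamp")
    if record.get("simulate_outage"):
        raise RuntimeError("downstream write failed")
    return record["timestamp"]


ok, customer_errors, processing_errors = run_stage(
    [{"timestamp": 1}, {"value": 2}, {"timestamp": 3, "simulate_outage": True}],
    handler,
)
print(ok, len(customer_errors), len(processing_errors))
```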

OSS or SaaS

In the above architecture, you could replace Kinesis with Kafka, S3 with HDFS, Firehose with a number of open source batch engines, Lambda with Storm (or Flink or Samza), and DynamoDB with an open source NoSQL database (e.g. Cassandra, HBase, etc.). Besides a few quirks, a heap of operational overhead, and the non-trivial overhead of running the servers for a small startup, it’s a straightforward switch. We have the added advantage of being able to easily calculate the exact costs per tenant and can pass the costs directly onto users (so we never need to worry about running a cost-deficit).

However, as experts in distributed systems with a pedigree in Open Source, we can quickly shift to a completely OSS stack to either run in private clouds or to help drive down costs later [1]. In fact, most of this will not be new for many folks at web companies. However, it’s often difficult to manage all these services, and combining them all into a cohesive whole is certainly not trivial.

Wrap Up

As a SaaS provider, Fineo gives you all the great things you would want - flexible ingest pipelines, fast, IoT-centric storage and enterprise grade tools - without all the overhead of actually running it yourself.

Fineo really shines in three places:

SQL everywhere
Universal, low latency queries
Dynamic schema at scale

The first two are pretty cool. Being able to use SQL everywhere means quick adoption across the company and natural, powerful query semantics. This power is accessible through the web application, a JDBC driver or programmatically through our web API.

Our cutting edge dynamic schema support brings the flexibility of NoSQL into a manageable framework with coherent schema changes and evolution. It helps customers move quickly without breaking things and quickly recover from mistakes, without losing information.

Really good ideas never seem to be uniquely developed - also true of quite a few bad ones - and such seems to be the case here. Our ingest pipeline looks a heck of a lot like Netflix’s and our DynamoDB schema looks similar to a common IoT style use case. However, we have some twists that make Fineo eminently attractive: SQL access, enterprise security and availability, low latency query and dynamic schema.

Want to learn more about the Fineo architecture? Check out the next post in the series: implementing dynamic schema.

Notes

1-costs

With economies of scale it can be much cheaper to run your own services, rather than leveraging SaaS. You are paying a premium for someone else to deal with managing the service - keeping it up, running quickly, etc - so you can focus on your business. In fact, this is the same logic for why you would want to use Fineo in the first place; we handle all the glue and management so you can focus on using the data.

2-batch

The batch mode is very similar, but also supports ingest via S3 files or larger batches (up to 10MB right now) of events. It’s preferable if you are cost sensitive (it can be 10x or more cheaper) and can tolerate some lag between ingest and being able to read the data. Be on the lookout for a follow up post on how we manage the batch process!

An investment thesis

2017-04-27T00:00:00+00:00

What’s new and interesting? What’s worth focusing on? It’s worth taking a step back and looking at the larger picture. It helps make sure you are doing the right thing, for the right reasons. Doing that every day means nothing gets done, but too infrequently means getting lost in the weeds and, personally, losing a sense of purpose for what you’re working on.

Recently, I stepped back to look at the themes, industries and technology that I’m excited about; things that I think are important & useful, if not just flat out cool.

General Trends

Looking across technology and industries, there are a couple of general trends that I think are going to drive a lot of changes and innovations. Call it my investment thesis.

Leap-frogging technology
- e.g. cell phones over landlines in Africa
More data > more math
Environmental disaster is pending/here
Information overload is coming/here
Increasing productivity
Increasing abstraction and automated fixes/recovery

In fact, I think everyone (with the flexibility to choose) should develop their own investment thesis, just as much as any venture capitalist. You could even argue that refining this thesis is more important for individuals, as you must make a bigger bet (40-60hrs/week of your time) on a single company.

Finding the right trend and the right industry can often be a case of a rising tide lifting all ships.

Exciting Emerging Technology

Emerging tech, here, is that technology just starting to see general use in the mainstream, rather than just emerging from the laboratory. In no particular order, some of the cool potential tech:

Wireless power
self-driving cars
holograms
Augmented Reality/Heads Up Display
Ubiquitous internet connectivity
EM Drive
AI/ML

These are probably the top of any nerd’s sci-fi wish list, but now it’s starting to get to a point where it feels tangible; you can almost taste it. Independently, these technologies can enable some really interesting things: ubiquitous monitoring, increasing free time/quality of life and, in the extreme, saving the world.

However, raw technology is not interesting without context and usage within industries.

Important Industries

There are a handful of industries poised to break out given some of the above emerging technologies. At the same time, advancing them is crucial to the advancement of society (not to be too dramatic):

Space
- exploration
- mining
- colonization
Alternative Energy (Schwarzenegger has a great view on this)
Health Care
Education

Companies doing things like decreasing supply chain costs, increasing marketing quality or driving sales are distinctly less interesting. They are helping to drive consumerism and line people’s pockets; inherently unsatisfying when considering the number and scale of problems we are facing.

Founding Fineo

And a lot of the above is why I started Fineo. As we get or pass data saturation there is a growing need for the tools to harness it. Unfortunately, many of the existing tools are wildly hard to use, requiring tens of people and 10’s to 100’s of thousands of dollars to really use. These tools should be there to help us do the real job - fixing the environment, going to Mars, etc. - and then get the hell out of the way.

As we look to being an increasingly connected (and monitored) world, every company has to become a data driven company. Moreover, they have to become a big data company to make a big impact; the value is in the data and being able to leverage it.

It’s so easy to get caught in the hype of IoT - probably why there are hundreds of IoT platforms - and miss seeing the end goal of making people’s lives better. As a technologist deep down, I am still attracted to the ‘new shiny’, but now (I hope) tempered by the question of ‘but why?’.

So, no, I don’t need an internet connected juicer. And probably no one else does either.

Neuralink

The latest startup from Musk deserves its own little section, if only because (as any of my friends will tell you) having a brain-computer interface has long been a dream of mine (and likely many other nerds). The vision for a company driving the innovations to enable that is terribly exciting and potentially game changing for a variety of reasons.

That said, we are still a ways away from anything huge. Call it 10 years from sparking a new industry. For folks with the time and money to wait that long, there are few other endeavors I could recommend pursuing. Remember, Tesla started in 2003 when one automaker had a hybrid car and an electric car hadn’t been seen in nearly a decade (GM’s EV1). Fast forward 12 years and every automaker is worried about the rise of electric cars, starting their own production models and nearly everyone has a hybrid.

Alexa as an API makes smart homes a reality

2016-03-30T00:00:00+00:00

For years, hobbyists have been hacking their homes to create smarter parts that respond to their every whim. Smart homes were thrust even further into the public consciousness with the first Iron Man (2008) movie. Suddenly, everyone wanted self tinting, weather informing windows or a home assistant smart enough to do their job for them (or at least turn up the music).

In the last six years, we have come a long way to fulfilling that vision, with smart thermostats, lights, blinds, the list goes on. However, each of these things operates on its own, requiring you to learn a new interface, a new set of commands, not to mention the sometimes onerous installation process. This has left the true automated homes still out of reach and weak semblances in their place, accessible only to those people willing to brave the cutting edge.

It should not be this hard.

Alexa and Amazon Echo

One of the most interesting new pieces of technology that has mostly slipped under the public’s radar is the Amazon Echo - a remarkably smart tube you put in the middle of your living space. With some of the best voice recognition and AI we have seen, certainly rivaling Google Now and far surpassing Apple’s Siri, it could be the fulcrum around which the Internet of Things (IoT) revolution pivots to enable smart homes for the masses. A recent Exponent podcast called out the Echo as being a brilliant play by Amazon to become the hub of the smart home, the driver for everything else, by selling this at-cost little column that has just enough functionality, but is positioned as a tool, an enabler, of the rest of the ecosystem (hint: much of AWS is built around the same idea).

Right now the Echo, queried with the name “Alexa”, is still somewhat limited in functionality. But at $180, it’s a pretty compelling piece of tech for most nerds. If you are already invested in some other smart home devices and know about IFTTT, you can add voice commands to Alexa that control things like your Nest or Hue lights; it takes a little work, but it can be done by those willing to google a little.

Other Home Hubs

This play for the “smart home hub” is not new - Google is trying to pull it off with the Nest and Apple seemingly with the AppleTV. However, both fall short for their own reasons.

Nest

The Nest is a cool device and many people now have them, but the additional ecosystem they are building is around more devices connected to the Nest and driven from your phone. For things like their camera, this makes complete sense - you will monitor it from your phone when you aren’t at home.

But what is the one place you aren’t nearly as likely to have your phone? At home.

So then you are back to non-on-demand actions - presets and learned functions for your devices, rather than easy, at-will control. Don’t get me wrong, it is very much the right thing for devices to learn what they should do, rather than having you tell them. However, humans are notoriously capricious, and need a way to change it right now.

Google could turn this around pretty fast by embedding a microphone and Google Now into their thermostats. Now, you have many of the same capabilities you get with Alexa, but it’s also integrated into the rest of the ecosystem Nest has been building up. I wonder though if Google can make this happen - they are very focused on search, which is not at all what you are doing with voice commands; instead, you ask for what you want.

AppleTV

The AppleTV is a weird product. It’s meant to be a home entertainment hub, but doesn’t play well with anyone else, and then has problems when you are interrupted while AirPlaying. It inherently is neglecting the new paradigm of multiple screen entertainment. For example, when watching a sportsball game, people are also live tweeting the game and checking Facebook and messaging with their friends (I know, kids these days).

Further, you have to go through iTunes, which quite honestly, sucks. I only use it as a last resort to get to content. In fact, I’m much more likely to torrent something than I am to use iTunes because it is hard to figure out where things are, even if I would gladly pay for it. This is a scary proposition for something that wants to be the new hub of entertainment.

Remotes suck

Remotes were fine when we just wanted to change channels. They were ok when we also wanted to watch VHS movies. Once we started trying to navigate DVD menus, remotes started to suck. Then we got smart TVs and remotes really and truly started to suck.

Apple hasn’t really done anything to fix that. The AppleTV remote adds a touch pad, but we are still pointing a thing at the TV and trying to drive a screen feet away from us.

I’m convinced that screens more than a couple of feet from our faces are inherently harder to use; just think about trying to navigate a mirror of your desktop when you hook up to a presentation… it feels a hundred times harder and it’s exactly the same as the screen you use every day!

Recently, Vizio had an opportunity to redefine their TV experience and decided to completely ditch the remote. Instead, the remote is a standard Android tablet they package with the TV and you control everything on the TV through Chromecast.

Yes, yes, oh god yes

Chromecast is brilliant. You do all your searching on your phone or on your computer and then get the Chromecast to go to the same URL and stream from there (in most cases, but you can also stream your exact screen over the WiFi). The local screen is the ideal place for search, rather than trying to wave a remote at a screen.

Chromecast then becomes a much more natural implementation of the entertainment center and the phone as our means to find that entertainment.

And what do we have with us all the time? Our phones.

… except in the home.

Vizio gets around this by getting us to develop new habits around replacing the tablet ‘remote control’ with some pretty smart psychological hints.

Smart watches to the rescue

The only thing that we are less likely to take off than our phone is the smart watch. Right now, they are kinda useless devices. Yes, you can see/dismiss/auto-respond to texts and take calls on speaker and track your activity, but that is only on the nicest ones. I’ve found most people are interested mostly in the health tracking information and often find the rest somewhat annoying. Lastly, many of them are ugly; there are a couple that look OK, but nothing that is truly good.

Right now, smart watch battery life is a bit weak, so we are still taking them off regularly to charge. However, we are still in the early days and will only see that improve. Further, some people are working on truly wireless charging (so you can be anywhere in the home and charge your phone), so we will soon see smart watches worn just like regular watches, especially as they become as aesthetically pleasing.

Alexa as an API

Amazon is an API company and the Echo is very little more than a light frontend for the Alexa API. What is interesting is that Amazon is currently making Alexa available to developers, so it can be embedded in devices.

Wait a second. Couldn’t we put those same voice recognition smarts into our watches? Now we don’t need to be anywhere near the Echo - in fact, it will always be on us - to drive all of our home automation. Search will still reign supreme on the screen, but the watch will soon be the remote control for the rest of our life.

And suddenly that smart watch starts to make a whole lot more sense… and let’s be honest, Amazon was never much of a hardware company.

Closing the loop

We can use our voice to get what we want and a screen to find the things we didn’t know we wanted.

So, what’s left?

Well, we still have all these different ‘smart’ devices connected. When each one comes on, it has to slowly learn your habits and is driven from a device - right now, your phone, later on demand by your voice-over-smart-watch. Wouldn’t it make more sense if there was some central hub into which all these devices connected? Or, so you don’t need to take over the world, a common protocol for exchanging information, which could lead to a marketplace of hubs.

Then when you buy those smart blinds, they can find out from your thermostat when you get up and can lower the shades when the thermostat finds it’s getting too hot (rather than turning on the AC).

All the manufacturers need to do is implement the protocol and you pick the hub to which the device connects. From there, we can enable manufacturers to monitor their devices as much as you desire (not at all, minimal functioning, full data) so they can do proactive maintenance, make suggestions and build better products.

To quote William Gibson, “the future is here, it’s just unevenly distributed”.

Dynamic, Lazy Schema at Scale

2016-03-09T00:00:00+00:00

Schema management is some of the most painful database work and anything you can do to make it easier can dramatically reduce an enterprise’s iteration interval. At Fineo we are focused on delivering a scalable, enterprise grade time-series platform. We do lots of the expected enterprise-y things - backups, end-to-end security, auditing, etc - and some things that enable us to iterate quickly (like in my Fineo ingest pipeline post). However, today I’m going to talk about how we enable customers to have completely dynamic and lazy schema.

Dynamic, lazy schema means that at any point you can:

change the names of fields
group multiple physical names into a single field
query data before you have defined the schema

When it does come time to formalize your schema (up to a month after data has been written), Fineo will make suggestions about what type we think the data is and if it might actually just be an alias for another logical type, all based around the queries you have made on the un-schema’d data.

Let’s take a look at an example where dynamic, lazy schema can be really useful.

Data Production Inc.

Suppose you work at Data Production Inc. (DPI) and are tasked with onboarding a new production line. You have a lot of machines to connect and then want to quickly analyze how the line is running so you can tweak it quickly. Then let’s suppose you have a couple of manufacturers of the same type of machine in your line - each has a slightly different name for the same kind of metrics, and some metrics come from one machine but not another.

In the traditional RDBMS world, this problem can be a huge pain all by itself. You have to figure out all the different possible fields you will receive. Then you need to manually write the mapping to a known name (normalizing field names), manage empty values and retest the pipeline multiple times until you are sure you caught all the bugs.

Basically your standard, pain-in-the-ass ingest work. Only after you have done all of this massaging can you even begin to look at your line and determine how it’s running.

Fineo = Easy Ingest

With Fineo, you have to do almost none of that work. In fact, once you point your machines at our ingest endpoints, we will automagically tell you all the different fields. From there, you can point-and-click your way to the schema you want. Field name normalization (aliasing) is a simple drag-and-drop. Empty values are automatically handled by our NoSQL backend.

And then it gets really cool.

You have up to a month to formalize your schema. Formalizing the schema makes it so we can auto-complete queries (from the UI) and dramatically speed up any analytics. However, you can also continue using the alias names for different fields that you had developed before formalizing the schema, so the queries you are already running will continue to run just fine.

Schema Management Internals

In the Fineo ingest pipeline post, it looked like we only had one touch point with the schema store and that it was stand alone. That was a simplification of what it really looks like:

Ok, that really isn’t too much more, but those simple boxes hide a host of complexity.

Avro All Around

Not so fast.

To start with, you need a way to keep track of all different schemas for each customer. Enter the avro schema repo, based on some work Jay Kreps did in AVRO-1124. The Avro Schema Repo (ASR) is a REST-based service that lets you manage the evolution of schema for a logical ‘subject’, a collection of mutually compatible schemas (the changing schema of the ‘thing’ you are managing). The ASR comes with a couple of nice default database adapters - zookeeper and a local FS. If you poke around, there is also a JDBC-based backend.

At Fineo we try for ‘zero-ops’, choosing instead to rely on hosted services to get us running with minimum overhead and automating everything else. To that end, we wrote a custom schema store backend on top of DynamoDB. AWS also has a relational database service (RDS), but it is managed by the machine, rather than by the request as with Dynamo, leading to more ops than we really wanted. The tradeoff with Dynamo is that we will incur 2-3x write overhead to ensure that we get consistent results [1].

We hope to open source the DynamoDB adapter for ASR soon. Keep an eye out!

Avro Schema Repository Access

Continuing with zero-ops, we can actually throw out the REST layer and just query DynamoDB directly using the ASR API [2]. It still provides all the caching you would want, but saves us a network hop. Naturally, the trade-off is that we need to be very careful with how we evolve the schema and access patterns, but as a small shop with high visibility into the code effects, we made the choice to simplify ops over later complexity.

Since these are stateless services, we can deploy them as need be and even replace the direct DynamoDB access with the REST endpoint with minimal code changes (the client now talks to Also talking directly to our Dynamo endpoint gives us the ability to read and use previously unknown fields (discussed later).

Managing multiple entities

As a multi-tenant platform we naturally have to manage multiple customers. Each customer is assigned an ID - a tenantID (did you guess it?) - which is then used to look up the possible schemas for that customer, each assigned a schemaID (I know you didn’t guess that one). Remember how we mentioned that you can rename things on the fly? Well, that means we can’t actually store ‘real’ names, but instead have to use aliases. These aliases are stored alongside the customer schema so we can manage those aliases directly as part of the schema. So the schema for a thing is an instance of a schema.

Let me say that again - the schema for a given ‘object’ for a customer is actually a combination of the tenantID + schema ID + schema alias(es). We have a schema for describing a tenant + its known schemas (Metadata) and then each schema has its own schema (Metric). Then for a given type of ‘thing’ for a given company, we store instances of the metadata and each metric. This leads to a schema repository that looks like:

subject id	schema
_fineo-metadata	Metadata.schema
_fineo-metric	Metric.schema
data production inc.	Metadata.instance
n1	Metric.instance
n2222222	Metric.instance

Ok, that is going to take some explaining. The Metadata.schema and Metric.schema are actually the following Avro schemas[3]:

 record Metadata {
    string canonicalName;
    union {null, map<array<string>>} canonicalNamesToAliases = null;
  }

  record Metric{
    Metadata metadata;
    string metricSchema;
  }

{
  "canonicalName": "n1",
  "canonicalNamesToAliases": {
    "n2222222": ["machine1", "machine1b", "machine1c"]
  }
}

  {
    "metadata": {
      "canonicalName": "n2222222",
      "canonicalNamesToAliases" : {
        "f1": ["field1"],
        "f2": ["field2", "field2b"] 
      }
    },
    "metricSchema": "\\ some encoded avro schema based on a BaseRecord \\"
  }
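
To make the alias maps concrete, here is a minimal Python sketch of how a customer-supplied name could be resolved to its canonical name by inverting canonicalNamesToAliases. The function names are hypothetical and are not part of the ASR or the Fineo code; only the field names come from the instances above.

```python
def build_alias_index(metadata):
    """Invert canonicalNamesToAliases into an alias -> canonical name lookup."""
    index = {}
    # The map is nullable in the Metadata schema, so treat None as empty.
    for canonical, aliases in (metadata.get("canonicalNamesToAliases") or {}).items():
        for alias in aliases:
            index[alias] = canonical
    return index


def resolve(metadata, customer_name):
    """Return the canonical name for a customer-supplied name, or None if unknown."""
    return build_alias_index(metadata).get(customer_name)


# Metadata of the Metric instance shown above.
metric_metadata = {
    "canonicalName": "n2222222",
    "canonicalNamesToAliases": {
        "f1": ["field1"],
        "f2": ["field2", "field2b"],
    },
}

print(resolve(metric_metadata, "field2b"))  # f2
print(resolve(metric_metadata, "field9"))   # None
```

Renaming a field then only means adding another alias to the list; the stored data never has to change.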

We have a known set of fields that are included in every record, comprising a BaseRecord and its BaseFields:

  record BaseRecord {
    BaseFields baseFields;
  }

  record BaseFields{
    string aliasName;
    long timestamp;
    map<string> unknown_fields;
  }

For now, it’s enough to understand that this is the basic building block of an ‘object’ schema. Shortly, we will discuss how it’s actually used.

Wait. What are you keeping track of?

A machine is actually an instance of a Metric, that has an instance of its own metadata to map canonical field names, keep track of its own name and then store the schema for the actual customer record. This allows us to evolve how we define a generic schema for a tenant or metric, as well as evolving how the schema for a given ‘thing’ looks.

Using schema to define schema… and then a big pile of turtles at the bottom :)

Note that we can’t use Avro’s standard aliasing because it only applies to records, which means that every field becomes its own record instance, which quickly gets to be a pain to manage and also prevents easy alias logic reuse. I’m not saying you couldn’t do it, it just gets to be a pain (left as an exercise to the reader).

Building schema - modern DDL

The schema for a given field is programmatically built based on what the customer sends us. By using an instance of the Metadata we can dynamically rename fields without actually changing any underlying data. Eventually, we also want to do dynamic type conversion and lazy ETL.

a timestamp
a customer specified alias (which we remap to a canonical name)
some number of unknown fields

  {
    "source": "new machine",
    "timestamp": "January 12, 2015 10:12:15",
    "temperature": "15",
    "pressure": "4",
    "gallons": "5"
  }

Which gets remapped via the Fineo ingest pipeline to a simple BaseRecord instance:

  {
    "alias": "new machine",
    "timestamp": "1421057535000",
    "unknown_fields": {
      "temperature": "15",
      "pressure": "4",
      "gallons": "5"
    }
  }
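
As a rough illustration of that remapping step, here is a minimal Python sketch. The function name and the assumption that un-zoned timestamps are UTC are mine, not the pipeline’s; the timestamp format is simply the one used in the example event.

```python
from datetime import datetime, timezone

# Assumed input format, taken from the example event; real events may differ.
TIMESTAMP_FORMAT = "%B %d, %Y %H:%M:%S"


def to_base_record(event):
    """Remap a raw customer event into the BaseRecord-style layout shown above."""
    fields = dict(event)  # copy so the caller's event is left untouched
    alias = fields.pop("source")
    parsed = datetime.strptime(fields.pop("timestamp"), TIMESTAMP_FORMAT)
    # Assumption: timestamps without a zone are interpreted as UTC.
    millis = int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
    return {
        "alias": alias,
        "timestamp": str(millis),
        # Everything we have no schema for stays a plain string.
        "unknown_fields": {name: str(value) for name, value in fields.items()},
    }


event = {
    "source": "new machine",
    "timestamp": "January 12, 2015 10:12:15",
    "temperature": "15",
    "pressure": "4",
    "gallons": "5",
}
record = to_base_record(event)
print(record["timestamp"])  # 1421057535000
```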

The unknown_fields then get stored as simple strings in DynamoDB, which we can read later (through some smart gymnastics) without having defined any schema or types. At some point later, an admin goes in and formalizes the schema to types that we talked about. The ‘extended BaseRecord’ and machine Metric instance then look like:

{
   "metadata": {
      "canonicalName": "n2222222",
      "canonicalNamesToAliases": {
        "f1": ["temperature"],
        "f2": ["pressure"],
        "f3": ["gallons"]
      }
    },
    "metricSchema" :
      "record eBaseRecord {
        BaseFields baseFields;
        int f1;
        long f2;
        int f3;
      }"
}

Lazy schema - not your grandmother’s…schema

Part of the power of using a NoSQL store is that we can just stuff in fields without having to touch any DB DDL tools (though our schema management really is just DDL). Since we know the field names, we can then later just query what we expect is in there, and have the database tell us what actually is there.

Our Dynamo extension of the ASR also has support for tracking unknown field names and potential types. When we receive events that have ‘unknown fields’ we update the unknown fields list for that Metric type in Dynamo and then write the unknown fields into columns by the customer-specified name as simple strings. When customers query for fields that have not been formalized they have to provide the expected type of the field. We use this expected type to parse the field and read it into our query engine, but also keep track of the requested type alongside the unknown name.

Thus, without scanning a single row, we know if the fields the customer is requesting could be present. We can also use this type information to suggest to the admin - who does the schema formalization - what type(s) probably describe the field. This makes it easy for admins to formalize the schema from the way they already query the data. We could later, as part of our ingest pipeline, also do some simple field parsing on unknown fields to attempt to identify what types they could be.
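
A minimal in-memory Python sketch of that bookkeeping is below. The class and method names are hypothetical, and a plain dict stands in for the Dynamo-backed store; it only illustrates the idea of tracking unknown names at write time and requested types at query time.

```python
from collections import Counter


class UnknownFieldTracker:
    """Tracks unknown field names per metric and the types queries ask for."""

    def __init__(self):
        # metric id -> field name -> Counter of requested type names
        self._fields = {}

    def record_write(self, metric_id, unknown_fields):
        """Remember the names of unknown fields seen on ingest."""
        known = self._fields.setdefault(metric_id, {})
        for name in unknown_fields:
            known.setdefault(name, Counter())

    def could_exist(self, metric_id, field):
        """Answer from the tracked names alone, without scanning any rows."""
        return field in self._fields.get(metric_id, {})

    def record_query(self, metric_id, field, expected_type):
        """Note the type a customer asked for; False if the field was never seen."""
        if not self.could_exist(metric_id, field):
            return False
        self._fields[metric_id][field][expected_type] += 1
        return True

    def suggest_types(self, metric_id, field):
        """Requested types for a field, most frequently requested first."""
        counts = self._fields.get(metric_id, {}).get(field, Counter())
        return [type_name for type_name, _ in counts.most_common()]


tracker = UnknownFieldTracker()
tracker.record_write("n2222222", {"temperature": "15", "pressure": "4"})
tracker.record_query("n2222222", "temperature", "int")
tracker.record_query("n2222222", "temperature", "int")
tracker.record_query("n2222222", "temperature", "long")
print(tracker.suggest_types("n2222222", "temperature"))  # ['int', 'long']
print(tracker.could_exist("n2222222", "humidity"))       # False
```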

Nearline to Offline Query

DynamoDB, and other row stores, act really nicely as a near-line data store. You can write data pretty quickly and don’t have to do a lot of expensive work to read relatively large swaths of it back again for smallish analytics (millions of rows).

However, once you come to doing large analytics over a wide time range (10s of millions of rows), these tools start to fall down and more batch-oriented computation over columnar stores starts to look a lot better.

Enter Redshift - a columnar store well-suited to doing analytic style queries.

Unfortunately, Redshift isn’t completely dynamic, so we need to have some handle on the types and fields going into it. Thus, we eventually - generally about every month - require that you finally get around to formalizing the schema so we can finish the ingest portion with a large Spark ETL job that does the final step to convert the schematized records from the ingest pipeline into Redshift-ready data.

Naturally, we don’t want to completely rewrite the Dynamo data when customers formalize the schema - that quickly becomes cost and time prohibitive. Instead, we keep around the old names (remember that alias field?) and query based on the normalized name we generate during schema formalization and the old alias name, in case we have fields that were written with the old, pre-formalization name. We keep different tables for different time ranges (similar to, though more manual than, doing a TTL in HBase) and age off old tables, eventually letting us limit the fields we query to just the normalized field. Since we know when the data was written - everything has a timestamp - and when the schema was formalized, we can be very specific about which field name we expect.
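
The name selection can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: writes switch from the alias names to the normalized name exactly at the formalization time. The function name and parameters are hypothetical.

```python
def field_names_to_query(canonical, aliases, formalized_at, range_start, range_end):
    """Pick the column names that may hold a field for rows written in
    [range_start, range_end). All times are epoch milliseconds.

    Assumption: rows written before formalization use the customer's alias
    names, rows written at or after it use the normalized (canonical) name.
    """
    names = []
    if range_end > formalized_at:
        names.append(canonical)   # some rows were written post-formalization
    if range_start < formalized_at:
        names.extend(aliases)     # some rows still carry the old names
    return names


# Schema formalized at t=1000: f1 is the normalized name for "temperature".
print(field_names_to_query("f1", ["temperature"], 1000, 0, 500))      # ['temperature']
print(field_names_to_query("f1", ["temperature"], 1000, 500, 1500))   # ['f1', 'temperature']
print(field_names_to_query("f1", ["temperature"], 1000, 1500, 2000))  # ['f1']
```

Once the tables holding pre-formalization rows have aged off, every query falls into the last case and only the normalized name is read.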

Enterprise-y extensions

Dynamic schema management and lazy evolution give users lots of power to manage their data. At Fineo we take security very seriously - every event is monitored and is auditable. Schema changes create their own ‘schema change event’ (which itself has its own schema). So do queries - on dynamic and known fields. Now you can see exactly what data came in and who changed what when. And you can do it all in SQL, so you know it’s easy.

We also leverage industry-standard, fine-grained, role-based access control. This lets you choose who can write data, make queries, create and trigger alerts and formalize schema.

In conclusion…

As a customer of Fineo you can write almost any data you want, whenever you want, at pretty much whatever rate you want. We trust Amazon to handle whatever load you can throw at it (they’ve gotten pretty good) and then load it into our query platform in realtime[4]. You can then immediately query it, without having someone figure out the types ahead of time or complain when the wrong fields are sent.

Future Work

While this gets you pretty far, we do see some things that we think customers would find helpful:

field type prediction
dynamic field typing, so you can change the type of data
advanced sanitization and transformation
missing field alerts

Please let me know in the comments or email me if there is anything else you would want to see!

Fineo is also selecting its early beta customers so please reach out if you are interested in getting involved in our upcoming rollout.

Notes

1. Dynamo Schema Repo

2. Elastic Beanstalk

Ok, we could actually use AWS Elastic Beanstalk to do a lot of the ops for us in deploying the web service. However, it is still another moving part. It gets us nice separation and the ability to evolve schema, but those seemed like minor gains right now compared to the overhead of running another service. Of course, as Fineo grows this will not always be the case and the advantage of using a more SOA style architecture will be increasingly compelling.

3. Metric Fields

We also have the ability to ‘hide’ fields associated with a machine. This allows us to do ‘soft deletes’ of the data and then garbage collect them as part of our standard age off process.

4. Realtime

Fineo Internals - Simpsons Did It

2016-02-28T00:00:00+00:00

I’d like to talk a bit about the AWS-focused ingest pipeline that we developed at Fineo. Not too ironically, it’s very similar to the pipeline that Netflix discussed in a recent article, highlighted by the wonderful Hadoop Weekly. This was almost a classic case of “the Simpsons did it”.

Now, as with all Simpsons instances, the key comes in the differentiators. Our pipeline is very similar to the one at Netflix, but is also leveraged to enable real Enterprise SaaS requirements: end-to-end encryption, backups, and validation. Additionally, our design allows for easy, rapid prototyping and deployment of new components.

We leverage a host of AWS services for a couple of reasons:

as we scale up, cost scales with us
operational burdens are nearly zero

Instead of storing data in Kafka, we leverage Kinesis, which has very similar semantics. Kinesis also integrates with a variety of end points - web APIs and Amazon’s new IoT service, which we look to adopt soon.

A series of AWS Lambda functions then process the records off the raw ingest Kinesis stream. The first converts the raw record into a schema that we understand or kicks it out to an error stream. The ‘valid’ records are then sent on to two places: an archive (used for backup and scalable replay) and the ‘Staged’ Kinesis stream. The Staged stream is then processed by the ‘Staging Ingest’ Lambda function. Similar to above, error records are kicked to Firehose, along with another archive. Additionally, this stage also writes to DynamoDB, so it can serve near-line queries. Because each event is unique we don’t have to worry too much about Dynamo’s eventual consistency, though we can turn up the consistency as needed (e.g. for historical corrections).
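
The shape of a single stage can be sketched in plain Python, with no AWS calls. Everything here is illustrative: the function names are hypothetical, the parser is a stand-in for the real schema conversion, and in the actual pipeline the two returned lists would be written to the next Kinesis stream plus the archive, and to the error Firehose stream, respectively.

```python
import json


def process_raw_batch(raw_records, parse):
    """One stage: convert each raw record, splitting valid from error output."""
    valid, errors = [], []
    for raw in raw_records:
        try:
            valid.append(parse(raw))
        except ValueError as exc:
            # Keep the original payload so the record can be replayed later.
            errors.append({"record": raw, "error": str(exc)})
    return valid, errors


def parse_json_event(raw):
    """Hypothetical parser: require a JSON object that carries a timestamp."""
    event = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(event, dict) or "timestamp" not in event:
        raise ValueError("missing timestamp")
    return event


batch = ['{"timestamp": "1421057535000", "temperature": "15"}', "not json", '{"a": 1}']
valid, errors = process_raw_batch(batch, parse_json_event)
print(len(valid), len(errors))  # 1 2
```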

The endpoint S3 “staged archive” location is then processed with an EMR Spark cluster to do a few things:

deduplicate records
extract schema changes
format records for ingest into Redshift
archive raw records to S3 Glacier (nearline backup)

From there we periodically bulk load into Redshift from the output S3 files after processing via EMR. Note, we can be lazy about this since the data is already served from the nearline storage. The schema changes get sent to the customer to validate so we can formalize the schema for records. Note, we already store the records, before formalizing the schema, in Dynamo. With some slight smarts we can query the records back out again, without knowing their types or ‘official name’ (more on this in a follow up blog post).

And that’s the whole pipeline! So what does all that buy us?

Rapid development and ease of reading from the Kinesis streams, without impacting customers
Continuous, staged backup
Long SLAs on Redshift ingest

Note, we can just point our ingest pipeline at an S3 file and just as easily handle batch processing records - handy for more “traditional” companies that do bulk exports.

Firehose Benefits

Firehose has a couple of key benefits. First, it acts as a low operational overhead backup system for relatively little cost. S3 is highly durable (99.999999999% durability) and also has built-in encryption, hitting many of our core requirements.

Since we use Firehose at each stage, we also get infinite replay for each stage. This is necessary when Kinesis only keeps events for a certain time, but also useful to handle cases of data corruption issues from a given stage - we can just deploy a new version and replay from the previous stage’s archive. It’s also nice if we want to do more extensive testing.

Each stage can also see two main types of error - ingest/customer errors from bad data and commit/processing errors. We write each error type to a different Firehose stream. This lets us then tie in AWS Notifications to alert when we get an error (as an S3 file). This can either be a notification directly to the customer or waking up the Ops team in the middle of the night.

Pro Tip: the default Firehose limit is only 5 streams. With 2 stages, each with an archive and two different error streams, you already exceed that limit. It’s possible to combine your error streams and then do some post processing in EMR to separate the components… or you can just request a limit increase - Amazon is pretty responsive :) Just make sure you plan for production and dev!
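
The arithmetic behind that tip, as a tiny Python sketch. The helper is hypothetical, and counting production and dev against the same limit is an assumption (it only holds if both run under one account and region).

```python
DEFAULT_FIREHOSE_STREAM_LIMIT = 5  # the default limit quoted in the tip above


def streams_needed(stages, archives_per_stage=1, error_streams_per_stage=2, environments=1):
    """Count the Firehose streams a staged pipeline needs."""
    return stages * (archives_per_stage + error_streams_per_stage) * environments


one_env = streams_needed(stages=2)
prod_and_dev = streams_needed(stages=2, environments=2)
print(one_env, one_env > DEFAULT_FIREHOSE_STREAM_LIMIT)  # 6 True
print(prod_and_dev)  # 12
```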

With basically no operations, Firehose is an incredibly useful tool we have leveraged in a couple of ways to make our infrastructure both highly fault tolerant and highly testable. There are a couple of open source projects that can do the equivalent work of Firehose - batching up writes and dumping to a DFS (i.e. HDFS); Firehose is nice in that you don’t need to run any of your own infrastructure.

The Pitch

As a SaaS provider Fineo gives you all these great things you would want with a flexible ingest pipeline and fast, IoT centric storage, without all the overhead of actually running it yourself. While a Netflix-style pipeline may not be presented to you directly as a customer, you get the rapid development, testing and iteration of a staged, streaming architecture.

Beyond the usual time series monitoring services we are also foremost an enterprise company. With the push of a button you can encrypt your data from end-to-end. And access control? We provide fine-grained, role based access control.

Beyond the standard enterprise-y features, Fineo really shines in three places:

SQL everywhere
Low latency query and alert
Dynamic schema at scale

The first two are pretty cool. Being able to use SQL everywhere means quick adoption across the company and natural, powerful query semantics. On the Fineo Data Platform we take it one step further: ad hoc analytics can be turned into a real-time monitoring alert with the push of a button. Then, if that alert goes off, you can do deep investigation with the same SQL tools. This power is accessible through the web application, a JDBC driver or programmatically through our web API.

Built on cutting edge stream processing technology we can respond to queries on the stream in milliseconds. Then a fast, scalable KeyValue store enables your near-line analytics. Finally, we also store data in a scalable columnar store which allows you to do complex analytics blazingly fast.

Some of the most interesting work in Fineo’s platform is around schema management. Traditionally, you would have to define a schema before you can query your data. This is an extra hurdle to data integration and red tape you don’t need. From the ground up we are built to be multi-tenant, meaning we have a richer key-space than proposed by dtrapezoid.

Fineo also enables you to send and query data immediately, as long as you know what you are looking for. We will quickly notice when you have new events (that EMR job I mentioned above) and alert you so you can either handle it as an error, merge it into your current schema, or create a new event type that you want to monitor. Since we know what fields you sent in each event and how you have been querying it, we will suggest fields and their types.

Even better, you no longer have to be concerned about the same field having multiple different names. We can dynamically map two (or more) different field names into the same logical name. The only reason you need to approve schema changes is so we can speed up your queries. Until you specify types we have to treat everything as strings and do matching and conversion from there.
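
As a sketch of what ‘matching and conversion’ from strings can look like at read time, here is a small Python example. The function, the converter table and the type names are all hypothetical; the real query engine is not shown in this post.

```python
# Hypothetical mapping from a requested type name to a string parser.
CONVERTERS = {"int": int, "long": int, "double": float, "string": str}


def read_field(row, names, expected_type):
    """Read one logical field that may be stored under any of several names.

    `row` maps stored column names to string values; `names` lists every
    name that maps to the same logical field, in order of preference.
    """
    convert = CONVERTERS[expected_type]
    for name in names:
        if name in row:
            return convert(row[name])
    return None  # the field is simply not present in this row


# Two rows that call the same logical field by different names.
rows = [{"temperature": "15"}, {"temp": "17"}, {"pressure": "4"}]
print([read_field(r, ["temperature", "temp"], "int") for r in rows])  # [15, 17, None]
```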

In another post I’ll talk about how we actually go about doing dynamic schema at scale.

Wrap Up

I don’t think the Simpsons did this one.

Choosing Hadoop Deployment tooling

2015-10-26T00:00:00+00:00

Picking the right tools for deployment can be tricky and have long-lasting effects on your organization. Over at Fineo we have the luxury of doing everything from scratch. This means no concern with legacy tools, monitoring, or just cruft. Instead, we have the opportunity to make all new legacy code :)

At Fineo, we manage a hadoop-focused stack, with multiple layers of dependencies. On top of that, we are providing a multi-tenant, hosted service, so we need to be able to deploy, redeploy, and fluctuate capacity at a moment’s notice. Since we are a lean shop, we wanted to write as little code as possible to get up and running as soon as we can. Naturally, we turned to open source!

Fortunately, the great folks at Hortonworks have been doing a lot of the work we need already with Apache Ambari - the ability to deploy, configure, orchestrate and manage a stack of services, where most of the services we need (a hadoop stack) are already written for us. Now, all that’s left is adding the custom services we need, tuning and actual deployment.

Ambari is great for getting bits on boxes, but doesn’t really help us manage our server fleet when running in a cloud-native environment. We want to be able to run Fineo across multiple data centers and on various different public clouds, depending on where our customers are already running. Suppose you were going to build this all from scratch; you would probably use Docker to manage containers, Swarm to manage the fleet of Docker containers, and something like Consul for discovery and configuration services.

Enter Cloudbreak. Docker/Swarm/Consul on Ambari with plugins for all the existing major cloud providers. Pretty cool (yay open source!).

Now we have a hadoop-focused stack built on the best in class technology, all without writing a single line of code. That’s all about to change.

Continuous deployment as Life

Key to getting anything ready for production is giving developers the tools to replicate a production-like environment locally and test without too much overhead. To do this, I’ve set up a series of Jenkins jobs linked to maven builds and Vagrant virtual machines. A single check-in causes a cascade that builds a brand-new RPM of our custom component(s) (I talk about that here), then triggers a build to recreate a VM for hosting those RPMs, alongside a specified version of the HDP and HDP-UTILS rpm repositories¹. From there, you can run a virtual machine to do a from-scratch install of the whole stack via Ambari.

That same VM that is used for RPM hosting, with a couple of environment variable changes, can also be used to deploy those RPMs to our S3 repository, which is used for hosting our production, staging and testing environments in the cloud. Oh, and of course you can trigger this via a Jenkins job too.

Vagrant is great in that it gives you exactly the same environment every time you run it, so we leverage it to also create a Vagrant VM of the Cloudbreak stack (which is actually a VM hosting a Docker instance, which spins up multiple VMs… it’s turtles all the way down). With that VM we bundle developer credentials so you can then deploy using the existing RPMs or the ones that you pushed to S3.

This means developers can easily go from raw source code to a tested, built RPM which they can use to test in a production-like deployment, all on their local machine, a mix of jenkins + local machine, a mix of local + cloud or entirely in the cloud.

Pros/Cons

Let’s go to the high-level list of what I like and don’t like about Ambari and Cloudbreak. In the end, the issues were not insurmountable and we trust in the power of a strong open source community to remedy many of these issues.

Ambari - Pro

REST API for everything
nice looking UI + extension framework
‘stack’ based approach, with inheritance, so you don’t need to copy-paste everything.
Not just Hortonworks stack - BigTop also has its own ‘validated’ version
common components that you can ‘just use’ when creating your own stack, which covers most of the services you care about

Ambari - Cons

Many of these issues are due to the relative immaturity of the platform and would probably not be as big of an issue if we had decided on Cloudera Manager, but the community support for Ambari is pretty strong, and growing, while the issues are minor at best - RTFS and everything is ok! In the end, they were not insurmountable and let us get up and running relatively quickly.

terrible documentation. The starter project gets you a basic set of scripts deployed, but skips deploying an RPM package. There is no real documentation on other options or usages.
Error messages are obtuse - spent days debugging a messed up configuration issue
The code is hard to follow, so debugging is a bit of a pain. Probably my lack of experience with the codebase, more than anything though. Would be nice to have any kind of component diagram.
UI still immature - installation steps are somewhat opaque (which can be good sometimes too)
stack inheritance can make things really hard to understand, if you get more than 2 layers deep
you can’t do cross-stack inheritance
still very HDP oriented

Cloudbreak

I’ll be honest, we still haven’t done very much with Cloudbreak - it’s still very early days. However, the platform does everything that I would have designed already, so I can’t complain too much.

Pros

All the ‘cool’ technology of next-gen devops - Docker, Swarm, Consul
Great looking web UI
Nice shell and exposed API for our own tooling
Pluggable cloud APIs with existing support for all the major clouds
Recent acquisition by Hortonworks indicates continued support and development (maybe apache project??)
Great existing support on forums, mailing lists (they already rolled their own AMIs for an easy start!)

Cons

Local deployment guide could use a little bit of work
Needs some tricks to run in Vagrant (though this is really a Docker bug)
Requires a bunch of microservices, which can make setup/management a bit of a pain for newbies

Like I said, this is still early days - I’m sure other things will come up as we use Cloudbreak more. However, great support and signalling for increased development and support via Hortonworks, combined with the open source nature of the project, bodes very well for Cloudbreak.

Summary

Devops - deployment, orchestration, management - is hard, particularly for the hadoop stack. Luckily, people have been doing a lot of open source work on this already, which lets us jump-start our own process with just a bit of elbow grease and source code reading. Ambari is a great platform for deploying, configuring and managing your hadoop cluster - that’s months of development time saved!

As a green-field, ‘cloud-native’, multi-tenant platform we have to be able to deploy to multiple clouds and datacenters, as well as respond dynamically to shifting load requirements. Enter Cloudbreak - cloud-native deployment built on Docker and Swarm (that’s another 6 months of hacking saved, but with a tested, relatively ‘solid’ platform) that manages clusters through Ambari.

Fineo is still young, so we haven’t seen all the rough edges that Ambari and Cloudbreak will have, but we are jumping in head first, trusting to a strong open source community to move these projects along quickly. There is a sharp learning curve, but they have proven to be a powerful springboard to get our hadoop-based stack running.

Combined with Vagrant and a few continuous integration jobs, we built a full development pipeline that is amenable to local testing, cloud-testing or production deployment. With the invaluable APIs exposed by Cloudbreak and Ambari, we can build our completely automated infrastructure and a true CI environment - a true rarity in the ‘big data’ world!

At Fineo we believe that the Internet of Things means that every company has to become a big data company. Unfortunately, big data tools are still hard to use (even with all these great tools!) - Fineo brings big data to enable IoT for the rest of the world, backed by a world-class platform built on tools like Cloudbreak and Ambari.

If you are interested in solving these kinds of problems or bringing big data and IoT to the world, drop me a line.

Footnotes

#1 Speaking to the power of open source, at the recent Ambari Hackathon, the folks from Pivotal made an extension to allow additional repositories. This will let us then separate out the external dependencies (which we host locally, for full control, management, access, etc) and internal service RPMs.

Credits

Turtles all the way down: Lynn Blog: June 2010

Did some prettyfying

2015-10-02T00:00:00+00:00

Updating the layout/look and feel of the site a little bit. Please let me know if things no longer work for you!

Building RPMs from Maven projects on OSX

2015-08-21T00:00:00+00:00

Packaging software is a necessary evil, and for enterprise software RPMs even more so, but you might as well find out how to manage it when you really want to just work all from your Mac.

Starting Out

The quick and easy solution would be to leverage the rpm (and rpmbuild) tool, which can be installed via homebrew, and the rpm-maven-plugin. Unfortunately, this is in no way recommended as a way to build packages for production. Further, I’ve found that any package I built could not be installed on a CentOS box, even though it was built with no specific architecture; when the RPM is packaged, the source operating system is imprinted on the RPM, preventing it from being installed.

Preparing for production

I still recommend using the rpm-maven-plugin - it’s pretty convenient and works mostly as expected, nothing crazy to point out here (beyond a decently documented maven component!).

To build packages for production, you really need to build the software in an environment matching the one on which you will run. This ensures correct library linking and native bits. During my time at Salesforce, we were actually building on a completely different OS than we were running and had only gotten lucky that the same C-library was used on both the build and production operating systems (specifically, we were building snappy). We only found the issue when we upgraded production and suddenly our packages no longer worked as expected! Lesson learned.

Fortunately, there is a solution for this, and it’s come quite a long way in the last few years: Vagrant!

Originally, I wasn’t even going to post this because it was so very simple to spin up a VM - a mere two hours from conception to working instance. However, there were some subtleties that are worth pointing out.

My target OS is CentOS 6.5.3 - the latest commonly available CentOS release - as I’m trying to build for RHEL, but don’t want to spend the money (at least right now) on a RHEL subscription.

Build requirements

The build I was working with required a few elements:

rpmbuild
maven 3
java 8
protobuf 2.6.1
custom forks of open source libraries

rpmbuild is a standard yum package, but from there the components become increasingly complicated: maven needs to be downloaded from a release tarball, java8 requires a special cookie on download and protobuf requires a full build from source (as of writing, only 2.3.1 is readily available).

As it’s just me, I’ve yet to establish an accessible maven repository, so my custom forks are instead pulled from the local .m2 repository. As such, I’ve also linked my local .m2 repository to the VM’s path. The upside to this is that we don’t need to rebuild or redownload any jars when we want to build our project, instead relying on the ones stored on the host OS (it saves space, time and bandwidth). Remember, this is fine because compiled java jars are OS agnostic, so we can leverage the same jars regardless of whence they originated.

Here are the scripts I used:

# -*- mode: ruby -*-
# vi: set ft=ruby :
VAGRANTFILE_API_VERSION = "2" 

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  # Default provider VirtualBox
  config.vm.box = "CentOS-6.5-x86_64"
  config.vm.box_url = "https://github.com/2creatives/vagrant-centos/releases/download/v6.5.3/centos65-x86_64-20140116.box"

  config.vm.provider "virtualbox" do |vb|
      vb.memory = 2048
      vb.customize ["modifyvm", :id, "--ioapic", "on", "--cpus", 2]
  end 
  config.vm.provider "parallels" do |prl|
      prl.customize ["set", :id, "--memsize", 1024]
      prl.customize ["set", :id, "--cpus", 2]
  end 

  # add the maven repository as a synced directory. Saves from rebuilding custom dependencies and downloading the internet again
  config.vm.synced_folder "~/.m2/", "/home/vagrant/.m2"

  config.vm.provision "shell", path: "provision.sh"
end

and the provisioning script:

#!/bin/bash
# Provisioning file for the Vagrant-based VM for building RPMs
# This will install all the necessary software to build the rpm

set -o nounset

function install {
  rpm -qa | grep -q $1
  if [ $? -ne 0 ]; then
    echo "Installing $1 ..."
    sudo yum install -y $1
  fi  
}

function add_to_path {
  echo "export PATH=${1}:\$PATH" >> ~vagrant/.bashrc
}

function install_mvn {
  # Only install if this Maven version is not already unpacked under /opt
  ls -1 /opt/apache-maven-$1 &> /dev/null
  if [ $? -ne 0 ]; then
    echo "Installing Apache Maven $1"
    cd /tmp &&
    wget -q http://archive.apache.org/dist/maven/binaries/apache-maven-$1-bin.tar.gz &&
    cd /opt &&
    sudo tar -xzf /tmp/apache-maven-$1-bin.tar.gz

    add_to_path "/opt/apache-maven-$1/bin/"
  fi  
}

function install_protobuf {
  if [ $# -eq 0 ]; then
   echo "No value given for protobuf installation!"
   exit 1;
  fi  
  echo "Installing Google Protocol Buffers $1"
  protobuf=protobuf-$1
  cd /tmp &&
  echo "-> downloading..." &&
  wget -q https://github.com/google/protobuf/releases/download/v$1/$protobuf.tar.gz &&
  sudo tar -xzf /tmp/$protobuf.tar.gz &&
  cd $protobuf &&
  echo "-> configuring..." &&
  ./configure &> protobuf-configure.log &&
  echo "-> building..." &&
  make &>  protobuf-make.log &&
  echo "-> installing..." &&
  sudo make install &> protobuf-install.log
  
  add_to_path "/usr/local/bin/"
  echo "... done!"
}

install rpm-build
install wget

echo "Installing java8..."
JAVA_RPM=jdk-8u60-linux-x64.rpm
wget -q --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u60-b27/$JAVA_RPM
install $JAVA_RPM

install_mvn 3.0.4

# protoc
install gcc-c++
install_protobuf 2.6.1


echo "Provisioning completed."

Sorry for the blog thrash

2015-08-20T00:00:00+00:00

Don’t know if you noticed, but there may have been a bunch of RSS thrash from updates to my maven shade post. Jekyll had updated some dependencies, which made my local jekyll instance’s output not even nearly match the actual, leaving me to test ‘in production’… admittedly not great for a blog + RSS.

Anyways, it should all be working smoothly now (post coming soon to help verify that). Also today, I should be pushing a highlighting-fixed version of the maven-shade post - let’s hope it’s the last.

Using Maven Shade to Run Multiple Versions in a JVM

2015-08-17T00:00:00+00:00

The default java build tool is still, unfortunately, Maven - despite some great work in things like Gradle and Groovy (lotta ‘g’ names, weird) - because it can do everything you could possibly need and then some. Unfortunately, as many know, it can be particularly obtuse. For now, let’s talk about using the maven-shade-plugin to build a custom artifact that allows you to run two different versions of the same library in the same JVM.

Yup, it’s a little bit of a weird case, but more common than you would expect; I’ve found it traditionally comes up when running a web server and integrating with established java libraries (e.g. dropwizard/ratpack and Hadoop or Calcite, often due to older versions of Guava).

In this case, I was running a ratpack front-end and leveraging camel-netty4-http to receive messages from my stream processors. The split was made because Camel provided a quick and dirty internal-facing endpoint with a lot of ‘nice’ tooling around send/receive pipelines, tracing, etc., while at the same time Ratpack was picked for the client-facing work since it has a lot of UI-facing niceties (easy separation of static assets, built-in websocket and server-sent event support) and is streaming/async native, allowing for minimal overhead for the client interactions.

At some point, Camel will probably be replaced with Ratpack, but to enable running both at the same time (long term viability testing, etc) there is a fundamental mismatch - both libraries leverage different versions of netty!

To resolve this, one of the things the maven-shade-plugin does is allow you to rebundle libraries under a different namespace, which resolves classpath clashes. Then the module that rebundles that jar can be used as a drop-in replacement… with some caveats.

Rebundling Camel Netty4

Let’s start with a simple pom that shades the primary dependency and the transitive dependencies that we care about.

We include all the dependencies in the shaded jar, but only shade the netty parts. This lets us make the shaded jar a drop-in replacement for all the camel libraries and their dependencies.

Now, in the module where you actually care about running both libraries you would do:

<dependencies>
    <!-- Camel as an abstraction for interacting with the webserver -->
    <dependency>
      <groupId>com.jyates</groupId>
      <artifactId>camel-netty4-http-shaded</artifactId>
    </dependency>
...
</dependencies>

Caveats

The plugin only supports including dependencies that are compile or runtime scoped. Unfortunately, this means when you depend on this module (well, the output shaded jar) you will also pull in all the transitive dependencies… which means you end up with the same classpath conflict we tried to avoid originally! Ideally, we would want to have them at ‘provided’ scope, but alas, the maven-shade-plugin does not include dependencies outside of compile/runtime (yeah, you could fork the plugin code and make it so, but… that seems like too much effort).

Ok, you can get around it by bundling the exact jars that you want in your runtime application and never running the two components together in the same JVM while testing. However, that is pretty unsatisfying and will likely end up with a lot of runtime debugging.

Managing transitive dependencies

The natural thing to do now is just exclude the dependent artifacts from the dependency. However, that is a lot of dependencies to exclude and it’s easy to miss one, which leads to hard-to-debug classpath issues. As of maven3 (really, you are still using maven2? sorry, it’s manual for you), you can do glob exclusions:

<dependencies>
    <!-- Camel as an abstraction for interacting with the webserver -->
    <dependency>
      <groupId>com.jyates</groupId>
      <artifactId>camel-netty4-http-shaded</artifactId>
      <exclusions>
        <exclusion>
          <groupId>*</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
...
</dependencies>

Fortunately, this actually does everything we want - it excludes all the transitive dependencies and lets us drop-in replace them with the custom, shaded jar we built.

Only downside? Your build gets some nasty warning messages:

[WARNING] 'dependencies.dependency.exclusions.exclusion.groupId' for com.jyates:camel-netty4-http-shaded:jar with value '*' does not match a valid id pattern. @ line 70, column 20
[WARNING] 'dependencies.dependency.exclusions.exclusion.artifactId' for com.jyates:camel-netty4-http-shaded:jar with value '*' does not match a valid id pattern. @ line 71, column 23

Oh well, at least everything works.

Generally, you won’t need to worry about these kinds of transitive issues if you are building a framework that runs external code and want to support arbitrary user code (I learned a lot from looking at the Apache Storm pom.xml), but if you want to do some crazy stuff like running two different Netty web servers in the same JVM? Well, now you are covered.

Happy shading!

How to setup your tri bike for a fast, smooth race

2015-08-13T00:00:00+00:00

This is a fairly in-depth writeup on how I set up my tri bike, in all its gory detail, for Ironman Cozumel. There are some notes in there about how I would modify it for HIM races too. I wish I had kept the links to the research that led me to make these decisions, but c’est la vie, eh? First and foremost though, you should get a solid bike fit. If you are constantly sitting up, any aero gains you make from spending money on gear are pretty much going to be completely negated. But then again, I’m just a nerdy tri guy, so take all of this with a grain of salt.

I think all told, this setup ended up running about $500. It’s going to seem like I’m paid by XLab, but I promise, that’s not the case - they just make the stuff that I want :)

Ok, let’s take it from the front to the back.

Cockpit

The stuff:

I like the torpedo for a few reasons.

Zero, it’s really easy to refill and drop in Nuun tablets. Especially when considering you are probably pulling bottles from the aid stations, then filling up the holder and then adding your powder/tablets/etc. No messing around with screwing/unscrewing anything. But that is table stakes to me.

First, it’s aerodynamically invisible if you have it positioned correctly - too far forward or back and it’s less so, but it should be hidden by your arms from the side view. If you put the drinking straw under the notch, it’s hidden from the wind. Lots of others leave the straw up and that creates a major point of drag.

Second, the velcro lets you attach up to two Salt Sticks to the bars without any extra junk; I use one (as in the second picture, though it is missing the characteristic bright red cover). The bonus is that it’s already at hand, so no fumbling around, and it’s also aerodynamically neutral, since it’s within the profile of the bike.

Third, it also fits a regular water bottle just as easily as the torpedo water bottle. Not ideal for racing, but makes life much easier for just spinning at home or lazy training rides.

Lastly, the torpedo has the option for a bike computer mount. Note, it only works with the smaller bike computers, so I had to buy a separate bar mount. Still, a very compact profile on the bike and all together works really great.

Downsides?

I went through a straw in about 7 months. But it could be I was chomping down harder than most people. It was $12 to replace it, NBD.

Mid frame

The stuff:

Dark Speed Works - Speedpack 483D

Keeping it really simple here. Just the top tube bag. I really like Dark Speed Works b/c it integrates into the top tube ferrings pretty cleanly (well, had to swap one of their screws for one of the down tube screws, but besides that, it was fine). It’s a decent size bag - fit most of my food for IM, with just a few bars/gels that I had to stick in my jersey. I don’t know of any product big enough to fit all the food one would need for an IM, but it’s plenty for HIM.

Otherwise, I don’t mess around with any bottles on the rest of the frame. Research shows a lot of different things: sometimes just the seat tube is best, sometimes none, etc. Basically, I trusted that Cervélo knew what they were doing and built the frame to be aero - any addition starts detracting from there. Now, there are a few bikes that have integrated water bottles already - that is the only time I would put a water bottle on the frame, because it’s designed with that in mind.

Backside

The stuff:

XLab Turbo Wing
XLab Gorilla cages
XLab mini bag
XLab Nanoinflator
XLab Sonic Nut - attaches inflator to cages

Ok, that’s a lot of crap in a very small amount of space. Unfortunately, this is where XLab kinda gets you on random crap (e.g. sonic nut), but you can find most of it on eBay for reasonable prices.

It’s all built up around the Turbo Wing. Now, right behind the rider is the most aerodynamic place to stick a bunch of stuff. It’s kind of a dead zone as your body is blocking a lot of the wind. Now, as you get further back, this becomes less true, so be wary of hanging too much stuff (I push it a little bit with the tool bag).

Gorilla cages are excellent and I’ve only ever had a bottle eject twice, both on ridiculously bumpy roads and never during a race. Two bottles lets you store enough water for a good long ride (including the torpedo in the front) and to not sweat it during races. It also gives you the option to swap a bottle for a tool kit (see below).

I do something a little unusual, at least from the standard directions, and attach both the inflator and the mini-bag. Generally, it’s one or the other. This gives me the peace of mind to be able to fix 2 flats and not worry about it. The inflator is great, just practice using it before race day (I didn’t and regretted it at Santa Cruz 70.3 last year… DNF’ed). Since the inflator/CO2 sits in line with the bike and lower, it’s out of the wind.

In the minibag I’m storing a tube, levers, patch kit and bike tool. It’s all come in handy and I wouldn’t race without those things. A nice alternative, especially for shorter races, is using the mini-pod. It stores all your fix-your-bike crap and doesn’t suffer aerodynamically.

Lastly, I also attach an extra tube to the wing, under the seat, with a reusable zip tie (so it’s easy to take on/off). This gives me the second tube, ready to go, if I need it. This was a late addition, but was extra peace of mind with relatively little weight.

Wheelset

I generally subscribe to the “go slow, to go fast” model of training. In this case, it means training with a freaking heavy wheelset (generally whatever comes stock on the bike) and a set of Gatorskins, which are slower tires, but you will almost never get a flat; I’ve only ever had two flats on them, both on crappy city roads. They are worth considering for races where the course is not as well groomed (i.e. Santa Cruz 70.3, which I DNF’ed after 3 flat tires… yeah). To that end as well, I tend to prefer clinchers - they are much easier to change in case of a flat, though they might not be quite as comfortable or crisp a ride.

Come race time, I recommend riding the deepest rim wheels you can handle. If it’s windy and you aren’t comfortable on the wheelset and are constantly sitting up to control the bike, you are going to lose any aerodynamic gains from the wheel.

Rear wheel

At Cozumel they don’t let you ride a full disc wheel because it is so windy, so I opted for a HED Jet 9 - it’s a decently deep rim for a reasonable price. However, doing it again I would spend more money for a nicer Zipp or Enve rear wheel; HED wheels just aren’t as reliable as the higher end wheel sets (plus they are a little heavier). However, the Jet 9 is just as aerodynamic. Your call.

Front wheel

In the front I ran a Zipp Firecrest 404. Zipp just makes a solid product - never had any issues - and has a great customer service department.

Race Tires

For longer races, e.g. half or full ironman distance, I would recommend Continental GP 4000S II tires. Pretty fast spinning but still really good puncture resistance; the five minutes you spend changing your tube because of a flat is going to negate whatever speed you might gain from a faster tire (unless you are in the .01% who is winning races, but then, you probably aren’t reading this :). Just use the GP 4000s, it’s what everyone else is doing.

Wrap up

I tend toward redundancies to ensure, as much as possible, that I can at least finish any race I start. So yes, that does mean that I carry a bit of extra stuff, which weighs more, so maybe it negates the gains… I don’t know. Really, unless you get into a wind tunnel and really measure stuff, it’s hard to say how much any of this really helps. Hey, maybe the biggest gain is shaving your legs rather than spending money on any of this stuff. But at the very least, it’s certainly entertaining.

Hopefully this helps in some small way. See ya on the road!

Dev Tip- Using Gradle without hating it

2015-06-18T00:00:00+00:00

Gradle is starting to become mature enough to be used as a ‘real deal’ build system. However, when trying to build with gradle there can be some easy idioms to help you up the learning curve.

The build script is groovy code

Honestly, this is one of my favorite features of gradle - it makes it incredibly powerful. That said, it can be a bit of a pain if you aren’t willing to spend some considerable amount of time learning groovy; instead, you will probably just end up doing an immense amount of googling to finally cobble together something that works.

A snippet that I found particularly useful when working with hadoop, but extensible to other dependencies, is maintaining a list of basic dependencies and then iterating over them as needed.

 
def getHadoopDependency(componentName) {
    "org.apache.hadoop:${componentName}:$hadoopVersion"
}

This will build you the full dependency name of the hadoop component with the right name.

You can store the dependencies you will need across projects in a List

List standardHadoopComponentNames = ['hadoop-client', 'hadoop-common', 'hadoop-hdfs']

and then use that later to include the dependencies fairly easily:

project('myproject') {
    dependencies {
...
        for (component in standardHadoopComponentNames) {
            compile(getHadoopDependency(component)) {
                exclude group: "org.slf4j", module: "slf4j-log4j12"
            }
        }
    }
}

This includes each of the ‘standard’ hadoop components from above as compile-time dependencies and excludes the slf4j-log4j12 dependency from each of the components(1).

I still haven’t found a clean way to not have to include all these lines in all the projects (i.e. an inheritance, like you would expect with Maven). However, this is a very simple idiom that becomes very expressive and powerful. It works great for a single project, but can get frustrating with multiple.

Testing independently

If you are starting to build up a lot of integration tests that are long and likely conflict on the same JVM, you probably want to add a new project and fork each test:

project('myproject:it') {
    test {
        // fork a new jvm for each test class
        forkEvery = 1
    }
}

Different Scala versions

This was a tip I picked up from browsing the Kafka and Samza gradle builds.

Suppose you want to include Scala in your project, but want to support running against a couple of different versions. In your top-level build.gradle file you would just add:

apply from: file('gradle/dependency-versions.gradle')
apply from: file("gradle/dependency-versions-scala-" + scalaVersion + ".gradle")
apply from: file('gradle/wrapper.gradle')

This lets the developer configure which version of scala to use, either through the config files for a default or on the command-line for dynamic setting. Then in dependency-versions.gradle you would have all the versions of dependencies (like the section in a maven pom)

// All the versions of the dependencies, in one place
ext {
    gradleVersion = "2.3"
...
// You can also include lists of dependencies here too!
 jacksonModules = ["jackson-annotations", "jackson-databind"]
...
}

Note though, we don’t include the scala version in this set of properties - instead it goes into the top-level gradle.properties. Here’s one of mine:

group=com.salesforce.my-project
version=1.0.1-SNAPSHOT
scalaVersion=2.10
org.gradle.jvmargs="-XX:MaxPermSize=512m"
systemProp.file.encoding=utf-8

By default, our build will then look at the dependency-versions-scala-2.10.gradle file:

ext {
    scalaTestModuleVersion = "scalatest_2.10"
    scalaTestVersion = "1.9.2"
    scalaLibVersion = "2.10.4"
    // Extra options for the compiler:
    // -feature: Give detailed warnings about language feature use (rather than just 'there were 4 warnings')
    // -language:implicitConversions: Allow the use of implicit conversions without warning or library import
    // -language:reflectiveCalls: Allow the automatic use of reflection to access fields without warning or library import
    scalaOptions = "-feature -language:implicitConversions -language:reflectiveCalls"
}

where we can set all the scala properties we need when adding scala to our standard build cycle:

// all projects assumed to have scala, but you can add this just to a specific scala project too.
allprojects{
    // For all scala compilation, add extra compiler options, taken from version-specific
    // dependency-versions-scala file applied above.
    tasks.withType(ScalaCompile) {
        scalaCompileOptions.additionalParameters = [scalaOptions]
    }
}

plugins.withType(ScalaPlugin) {
    //source jar should also contain scala source:
    srcJar.from sourceSets.main.scala

    task scaladocJar(type: Jar) {
        classifier = 'scaladoc'
        from '../LICENSE'
        from scaladoc
    }

    //documentation task should also trigger building scala doc jar
    docsJar.dependsOn scaladocJar

    artifacts {
        archives scaladocJar
    }
}

Leveraging projects built with maven

Some projects are built with maven and assume that the environment from which the tests are run is also maven (I’m looking at you Hadoop and HBase projects). For the most part, this is fine… until it’s not. Frequently, you will end up with cases where your tests create an extra {project}/target directory and store temporary data there. To fix this, you can add a cleanup for that directory to every project pretty easily.

allprojects{
   ...
    task deleteMavenBuildDirs(type: Delete) {
        delete "target/"
    }

    // Add removal of the maven build directory since the HBase/Hadoop tools all assume maven build
    cleanTest.dependsOn deleteMavenBuildDirs
}

This way, every time the ‘cleanTest’ target is run, you will also delete all the target/ directories.

Building Jars & Tars

There are probably a bunch of jars you will want to build for your project. For a single-project gradle build, this is pretty straightforward from the docs. However, once you are into multi-project gradle builds, this can start to get a bit more complicated, especially when looking to release.

For the below, I’m just using a single build.gradle file - I find it’s easier to reason about the different projects when you can see them all together. However, gradle also lets you have a build.gradle per project directory, allowing you to decouple things when they start getting too complicated.

Tests

By default, the ‘java’ plugin will just build a java jar. However, you frequently will want to reuse your test sources across projects. To do this, you need to build a “tests” type jar (the maven equivalent is the com.mycompany:project:1.0:test artifact) which can be depended on by other projects:

subprojects {
    jar {
        baseName = "$project.parent.name-$baseName"
    }

    // build a testjar so we can use the tests resources other places
    task testJar(type: Jar, dependsOn: testClasses) {
        baseName = "test-${project.archivesBaseName}"
        from sourceSets.test.output
    }

    configurations {
        tests
    }

    artifacts {
        tests testJar
    }
}

This will build not only the standard jar with an intelligent name - by default, it would just be the name of the project, but you may have multiple sub-projects with the same name, and hence no way to differentiate them - but also the tests jar with the standard maven naming conventions.

Building a distribution tarball

Ok, now you want to package up all the hard work you have done and make a release of the built jars (e.g. something that would run on another box).

For this case, consider two projects: myproject:fs and myproject:rest. We want to package up these two projects into a single gzipped tarball.

apply plugin: 'base'

// Building the distributions
// --------------------------

configurations {
    distLibs
}

dependencies {
    distLibs project(':myproject:fs'),
            project(':myproject:rest')
}

task distTar(type: Tar) {
    description = "Build a runnable tarball of all the subprojects"
    duplicatesStrategy = DuplicatesStrategy.EXCLUDE
    compression = Compression.GZIP

    // set the base directory for all the files to be copied
    into("$baseName-$version")

    // generic directory/file includes
    into("conf") {
        from 'conf'
    }

    into("bin") {
        from 'bin'
    }

    // other helpful/necessary top-level files
    from("LICENSE.txt", "README.md")

    // depend on all the sub-projects
    into("project-lib") {
        from { subprojects.jar }
    }

    // brings in all the runtimes dependencies of the sub-projects
    dependsOn configurations.archives.artifacts
    into("lib") {
        from configurations.distLibs
    }
}

In that tarball we are going to have a handful of directories:

/myproject-1.1.0-SNAPSHOT   // basename + version as we defined in the gradle.properties
../bin
..../start.sh              // a start script that is in the top-level bin directory of the project, just copied in here
../conf
..../conf.xml              // basic config files, also from the project's top-level /conf directory, copied into the tar
../lib                     // all the dependencies for each of the projects, put into the same directory.
..../some-lib-0.1.jar
..../another-lib-1.1.jar
   ...
../project-lib            // the jars from the projects that we wrote
..../myproject-fs-1.1.0-SNAPSHOT.jar
..../myproject-rest-1.1.0-SNAPSHOT.jar

All we need to do to build that tarball is then just run $ gradle distTar

And the final tarball will be in the ‘distributions’ directory.

Summary

Hopefully, this has been somewhat useful. We’ve covered how you can leverage some of the features of having the groovy language in your build scripts, how to add new tasks, how to manage your scala versions and how to roll a distribution and its dependent jars.

NOTES:

(1) Using hadoop with other projects that do logging ‘better’ generally means having to exclude this dependency as other projects will use slf4j-over-XXXX as the adapter, rather than the direct slf4j-log4j pipe, causing a runtime conflict. There are newer log systems (logback, log4j2, etc) that are faster and more efficient - slf4j-log4j12 is just good enough to get by, but you can - and should! - do a lot better.

Scalable Real Time Query

2015-04-23T00:00:00+00:00

How do you manage realtime queries and analytics over the same logical data?

Disclaimer

a lot of the ideas here come from a post by Confluent, repackaged over software that exists today
I haven’t actually built any of this - it’s merely a thought experiment to flesh out some ideas that have been circulating. Your mileage may vary :)

TL;DR

Real time queries get answered by a combination of a stream processor watching for matching events and a single lookup in a row store. Older and larger queries (e.g. roll ups) are served via a column-oriented store, which takes longer but works really well for analytics. There is some copied data, but it’s aged off the row store to keep the row store fast and save space.

The Setup

Say you want to query for logs that have the word “ERROR” in them from 2hrs ago up to ‘now’. What you want to see is all the logs that exist at that time… but wouldn’t it be nice to see any updates for new log lines that come in? Think about when you search twitter or scroll the facebook wall - that little blue bar telling you there are new updates?

That is the real-time query problem.

Historically, this is managed by a single query against your DB of choice to populate the initial results. Then periodically, you re-run the same query on the DB and just look for anything that occurred after the previous query.

This ends up being very costly and scales poorly as you increase the number of queries on a partition.

Digging In

Instead, imagine that the query can sit on the stream of incoming updates and only update the querier when there is a new document/record that matches the query[1]. Then you only need to do the historical lookup once and then register a listener for any new updates that match your query.
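To make that concrete, here is a minimal sketch in plain Python of the "look up once, then listen" pattern. The `StandingQuery` class, the `historical_lookup` helper and the dict-shaped log records are all made up for illustration; in a real system the matching would live inside your stream processor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StandingQuery:
    """Hypothetical query that sits on the stream and notifies on each match."""
    predicate: Callable[[Dict], bool]
    listener: Callable[[Dict], None]

    def offer(self, record: Dict) -> None:
        # Only bother the querier when the new record matches the query.
        if self.predicate(record):
            self.listener(record)


def historical_lookup(store: List[Dict], predicate: Callable[[Dict], bool]) -> List[Dict]:
    # The one-time lookup that populates the initial results.
    return [record for record in store if predicate(record)]


def is_error(record: Dict) -> bool:
    return "ERROR" in record["line"]


store = [{"ts": 1, "line": "ERROR disk full"}, {"ts": 2, "line": "INFO all good"}]
results = historical_lookup(store, is_error)      # run once against the DB
query = StandingQuery(is_error, results.append)   # then just listen for updates

for incoming in [{"ts": 3, "line": "WARN slow"}, {"ts": 4, "line": "ERROR timeout"}]:
    query.offer(incoming)

print([record["ts"] for record in results])
```

The DB is hit exactly once; everything after that arrives through `offer`, so the cost per query no longer grows with how often you poll.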

Now, we can stick a full-text search engine into a stream processor (say, Samza or Storm or Spark Streaming), but the question is how do we get the queries to the search nodes? The simple answer would be “put the queries into the stream as well!”. However, for things like Samza, there is an outstanding bug that doesn’t make this possible. To work across the stream processor of your choice, we can use Zookeeper as a realtime monitor of queries and a basic RPC/notification mechanism.

Ok, so now your stream processors are watching for new queries, each keyed to a query ID. Those queries get there from your endpoint of choice (let’s go with, say, a web service!), which then:

listens on the bus for the messages keyed to that ID
registers the query in ZooKeeper

When the user no longer cares about the query (say, they log off), you deregister the query and stop listening.
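The register/listen/deregister lifecycle can be sketched roughly like this. Everything here is an in-memory stand-in I made up for illustration: `QueryRegistry` plays the role of the ZooKeeper node, `Bus` plays the log bus, and `process` is what each stream processor would do per event. It is not how any of those systems' real APIs look.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class QueryRegistry:
    """In-memory stand-in for the ZooKeeper node holding the active queries."""

    def __init__(self) -> None:
        self.queries: Dict[str, Callable[[dict], bool]] = {}

    def register(self, query_id: str, predicate: Callable[[dict], bool]) -> None:
        self.queries[query_id] = predicate

    def deregister(self, query_id: str) -> None:
        self.queries.pop(query_id, None)


class Bus:
    """In-memory stand-in for the log bus; messages are keyed by query ID."""

    def __init__(self) -> None:
        self.listeners: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def listen(self, key: str, callback: Callable[[dict], None]) -> None:
        self.listeners[key].append(callback)

    def stop_listening(self, key: str) -> None:
        self.listeners.pop(key, None)

    def publish(self, key: str, message: dict) -> None:
        for callback in self.listeners.get(key, []):
            callback(message)


def process(event: dict, registry: QueryRegistry, bus: Bus) -> None:
    """What a stream processor would do for every incoming event."""
    for query_id, predicate in list(registry.queries.items()):
        if predicate(event):
            bus.publish(query_id, event)


registry, bus, received = QueryRegistry(), Bus(), []

# The web service listens first, then registers the query.
bus.listen("q-1", received.append)
registry.register("q-1", lambda e: "ERROR" in e["line"])
process({"line": "ERROR timeout"}, registry, bus)

# The user logs off: deregister and stop listening.
registry.deregister("q-1")
bus.stop_listening("q-1")
process({"line": "ERROR again"}, registry, bus)

print(received)
```

Listening before registering mirrors the order of the steps above; my assumption is that it avoids dropping a match that arrives between the two calls.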

Ok, so now you have a way of pushing out queries and getting updates from the log bus.

Historical Lookup

Keeping with the stream processing everything (or the Kappa Architecture) we would want to say that we just reprocess the historical data for documents that match our query and then stick them on the bus for the listening service.

The problem with this is you need to store all the events, but you don’t necessarily know which queries are going to match. If you don’t have more events than fit on a single machine, you are golden - just use a standard DB and off you go. Maybe it even fits in memory across machines, in which case standard Spark can manage your query quite well.

Chances are, you probably don’t, so queries are going to be expensive and you don’t want to keep rerunning them (this was a premise above).

Things like HBase can help you cheat a bit on the time since you can filter by timestamp and it will only ever read files that might have that time range. In the end though, you are still going to end up scanning a good bit of data.

This is where you need to start managing schemas. Backing with a NoSQL store means you can keep evolving schemas for the same type and still keep them all in the same table - you need a schema service to help you manage the different types[2]. When you have defined columns, you can then filter on just the ones that match your query and do very fast scans of a lot of data. These kinds of scans are still only good for looking at a relatively small slice of the whole dataset (see your favorite query engine for the cross-over point for a single query engine vs. a bulk framework).

Age-off and Parallel Query

The only problem with a row-oriented store is that it’s going to get slower or cost more as you scale out and add more data. In the end, you will still end up having to either read through more data off disk or go through more bandwidth - it’s going to hit a break point and just take forever.

You manage this by aging off data. You know your query pattern (for more real time queries) is going to mostly hit recent data - the stuff in your row store that you can access quickly and sequentially. Events older than a certain time - 2 days, 2 weeks, whatever - are removed (many systems can do this naturally, like HBase/Accumulo’s TTL).

While you have been writing events from the bus into your row store (you have been reading off that same bus you are using to answer your real-time queries, right?), you also need to be writing into a column-oriented store. This gives you a way to keep data around after the age-off, but optimized for a slightly different access pattern (see Managing Analytics below).
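A rough sketch of the dual write plus age-off, with two Python lists standing in for the row store and the column store. The `DualWriter` class and the 2-day TTL are just assumptions for the example; in practice the row store's own TTL support would do the purge for you.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, Dict]  # (timestamp in seconds, payload)


class DualWriter:
    """Writes each event to both stores; only the row store is aged off."""

    def __init__(self, ttl_seconds: int) -> None:
        self.ttl_seconds = ttl_seconds
        self.row_store: List[Event] = []      # recent data, fast lookups
        self.column_store: List[Event] = []   # everything, for analytics

    def ingest(self, event: Event) -> None:
        self.row_store.append(event)
        self.column_store.append(event)

    def age_off(self, now: int) -> None:
        # Drop anything older than the TTL from the row store only.
        cutoff = now - self.ttl_seconds
        self.row_store = [e for e in self.row_store if e[0] >= cutoff]


two_days = 2 * 24 * 60 * 60
writer = DualWriter(ttl_seconds=two_days)
writer.ingest((0, {"line": "old"}))
writer.ingest((200_000, {"line": "recent"}))
writer.age_off(now=250_000)

print(len(writer.row_store), len(writer.column_store))
```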

Managing Analytics

This is where a column-oriented store makes perfect sense. Analytics don’t care about being answered ‘right now’ (though faster is always better). Since we control who is sending the queries, we can direct those queries to the right source.

When you are updating rows in the column store, you are going to take a hit if you are frequently updating the same columns - you will be building up increasingly large lists of source row IDs. Since we know we are managing immutable events, we take the standard tricks:

files are immutable and you just write events as you get them
leverage time range bounded partitions (e.g. one partition per week)

Updates now go away - when you get a new event, that state just gets written down (with the timestamp[3]) and you just use the new value for the columns you care about. Then when displaying the results you end up with nicely ordered historical changes.
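Here is a small sketch of those two tricks together: append-only writes into week-bounded partitions. The partition naming (ISO year and week), the `write`/`history` helpers and the `meter-1` key are all hypothetical; a real column store would have its own partitioning scheme.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def partition_for(ts: datetime) -> str:
    """Name of the week-bounded partition an event belongs to (ISO week)."""
    year, week, _ = ts.isocalendar()
    return f"{year}-W{week:02d}"


# partition name -> append-only list of (timestamp, key, value)
partitions: Dict[str, List[Tuple[datetime, str, int]]] = defaultdict(list)


def write(ts: datetime, key: str, value: int) -> None:
    # Never update in place: every new state is just appended with its timestamp.
    partitions[partition_for(ts)].append((ts, key, value))


def history(key: str) -> List[Tuple[datetime, int]]:
    """All recorded states for a key, ordered by time; the last one is current."""
    rows = [(ts, v) for part in partitions.values() for ts, k, v in part if k == key]
    return sorted(rows)


write(datetime(2015, 4, 23, tzinfo=timezone.utc), "meter-1", 10)
write(datetime(2015, 4, 30, tzinfo=timezone.utc), "meter-1", 12)

print(sorted(partitions))
print(history("meter-1")[-1][1])
```

Reading the ordered history back is what gives you the "nicely ordered historical changes" for free; the current value is simply the last entry.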

So your analytics jobs (forked because the query asks about a large time range or is doing rollups) just query the column store, leaving your realtime store alone; these kinds of queries on a row-oriented store would perform very poorly.

Age-off Spanning Queries

For queries that span the age-off time, we can immediately serve results from the row store and, as the user scrolls through, serve the remaining results from the column store. This is nice as the column store will not be interacting with changing data at this point, but will instead already be ‘warm’ - there are no in-flight new rows, those are all buffered in later time partitions - letting us have much lower contention on the access.
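One way to picture the routing is as a split of the requested time range at the age-off boundary. This is only a sketch with a made-up `plan_query` function and integer timestamps; it assumes the row store holds exactly the last `age_off` units of time.

```python
from typing import List, Tuple

Plan = List[Tuple[str, int, int]]  # (store name, range start, range end)


def plan_query(start: int, end: int, now: int, age_off: int) -> Plan:
    """Split the time range [start, end) at the age-off boundary."""
    boundary = now - age_off
    plan: Plan = []
    if end > boundary:
        # Recent slice: answer immediately from the row store.
        plan.append(("row_store", max(start, boundary), end))
    if start < boundary:
        # Older slice: served from the column store as the user scrolls back.
        plan.append(("column_store", start, min(end, boundary)))
    return plan


print(plan_query(start=0, end=100, now=100, age_off=30))
```

A query entirely inside the last `age_off` window only ever touches the row store, and a purely historical one only touches the column store.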

Reconciling old updates

Common in some IoT architectures, like home power metering, is the need to fix older events. In this case, you need to go and find the source file for the event and build up a ‘modifications’ file. This modifications file is always read in alongside the data in the source file and is joined to the original data (modifications are assumed to be smallish) and reconciled to the new event.

It would be nice to rebuild the original file so we don’t need to keep around the modifications forever and to prevent modifications from growing too large. What’s nice, is this kind of data is built to be processed in a fairly large batch (by something like MapReduce or Giraph). Thus, as we are building the result set from the client, we can fork-off a ‘compaction’ (to steal from HBase parlance) to rebuild the file.
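As a toy illustration of the read-time join and the later compaction, assuming events are dicts with an `id` field and the modifications file is a dict from event id to corrected fields (both are my assumptions, as are the function names):

```python
from typing import Any, Dict, List, Tuple

Event = Dict[str, Any]
Modifications = Dict[str, Event]  # event id -> corrected fields


def read_reconciled(source: List[Event], mods: Modifications) -> List[Event]:
    """Join the (smallish) modifications onto the source events at read time."""
    return [{**event, **mods.get(event["id"], {})} for event in source]


def compact(source: List[Event], mods: Modifications) -> Tuple[List[Event], Modifications]:
    """Rewrite the source with the fixes applied; the modifications start empty again."""
    return read_reconciled(source, mods), {}


source_file = [{"id": "e1", "kwh": 5}, {"id": "e2", "kwh": 7}]
mods_file = {"e2": {"kwh": 9}}  # a late correction for event e2

print(read_reconciled(source_file, mods_file))
new_source, new_mods = compact(source_file, mods_file)
```

After `compact`, readers get the same answer from `new_source` alone, which is the whole point: the modifications no longer need to be carried around or joined on every read.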

This is not a novel idea - Hive does compactions already in MapReduce - but is nice in that it becomes just a normal part of the computation. Like with HBase compactions you will see increasing latency during compactions…but there is no magic :-/.

Wrap Up

This kind of architecture is nice if you are looking to get rid of tools like Splunk that aren’t able to handle the kind of scale your company needs.

There are a good amount of prerequisites:

log bus (e.g. Kafka)
stream processor (Samza, Storm, Spark Streaming)
metadata service
scalable row store
scalable column store

Currently, there is no open source solution that I’ve found that can manage schema as a service that works on a heterogeneous data set (update in the comments if you find something that works!).

Chances are you probably already have the last two in some form. If not, just rolling out a standard row store will probably be fast enough to answer your real-time queries and a column-store will let you answer the big queries.

What we are talking about here is how to manage all of them together without having to mess with extra bulk-processing for ingest and having blazingly fast and useful real time queries.

Notes

[1] you can make this even faster using the ideas discussed in the blog post by leveraging Luwak to only search documents that might be used by your query.

[2] There is some work in Hive coming that kind of does this, but it’s an ‘everything fits in Hive’ model, rather than an external service that you can plug into your own framework/tools. The problem with ‘everything is Hive’ is that, well, everything isn’t Hive; the more you bend the model, the worse mechanical sympathy you get and then things get slower, and code becomes harder to read and reason about (basically, it all goes to shit).

[3] This is a bit harder to manage. Most column stores will just store the column value, so if you append the timestamp to every column value you barely have any matching values and a whole lot of bloat. What you need is tags per-cell (like cell-level tags in HBase/Accumulo) that are not used for comparison. Maybe this exists already, but if not it seems like a reasonable add to [your favorite column store here].

Ironman Cozumel 2014 Race Report

2014-12-06T00:00:00+00:00

Overall:

11:31:09
26th division
226th gender
275th overall

But clearly, that doesn’t tell the whole picture :).

We arrived on Cozumel 3 days before the race. Enough time on the first day to get in a short swim.

Nothing in the Bay Area can compare to the current and the waves on Cozumel. Rough does not even begin to describe the water. But we took heart - at least the current was in the right direction for the race.

The next few days were a flurry of short workouts (swims and bike rides), checking in equipment and picking up the last support crew members.

Unfortunately, it was not smooth sailing even going into the race. But, it’s not an Ironman if everything goes as planned - it’s what you do when it doesn’t.

Bike Preparations

Two days before the race, I picked up the bike from TriBike Transport (a company that will ship your bike down - fully built - to various races). Something had happened with the frame/bottom bracket, and in the heaviest 2 gears the pedal would clip the chain.

This means that, essentially, I was riding without my two hardest gears (the two smallest cogs in the back) - from a 22 speed bike (2x11) to a 20 speed (1x11, 1x9). Fortunately, Cozumel is completely flat and you don’t need gears with that much torque.

Certainly not the end of the world, but the first sign this wasn’t going to be smooth.

Race Build

My race build was provided by my coaches over at purplepatch (who are fantastic) and included a variety of swims and runs over the next few days. While they weren’t always the most fun (e.g. only hour long rides and mostly below race pace), they were intended to ‘prime the pump’ and get the body ready for the race.

The build also gave me a chance to sample the course a bit - Cozumel is basically only one road, so all the taper rides were on the course. So come race day, I wasn’t surprised to hit the monstrous winds on the backside of the island, the very helpful swim current was welcomed heartily and the heat was…to be expected.

Race Day

0400

Brrrrring! Brrrring!

It’s up early for Ironman athletes on race day - you need plenty of time to get your food in, drop off bags in transition and go to the bathroom a good 10 times (nerves).

The only hiccup of the morning was with the Nuun tablets (electrolyte tablets that dissolve in water) that I had left on my bike overnight. Left in a ziplock bag in my bento box, they had congealed together into a yellow, sticky mass - not exactly amenable to easy consumption. Fortunately, I had brought extra backups and a ziplock bag (always pays to be doubly prepared for Ironman), so I could scramble and make the switch.

0701 - 2.4mi swim

Into the water and we are off! Not much time for warmup, so it’s a bit of a slow start in the water.

Even less help was the fact that the current was against us, going north, a rare, but not unheard of, occurrence on the island. We were lucky in that about halfway through the current changed direction 180 degrees and finally gave us poor swimmers some relief as we headed towards the swim exit.

1:03 swim - exactly what I was targeting. I could have gone faster, but was having weird troubles breathing out while drafting (seems to happen every couple races), so relegated to mostly swimming on my own and not always the straightest. But I’m not complaining.

0805 - Transition and 112 mi bike

I had a 4:06 transition - not exceptionally fast, not exceptionally slow, especially given how far we had to run the bikes out (far). In the showers after the swim exit I was hoping to catch Jan (my mom’s now husband, then fiance) during the swim, but this was good enough for me.

Then it was onto the bike.

The course is 1 loop of 34 miles followed by two 39 mile loops. Each loop is comprised of a neutral-wind section (maybe a bit of a tailwind) heading south, then a turn left at the southern point and into a very strong headwind back north, up the east side of the island (very beautiful, very desolate), followed by another left turn back towards downtown (and happily out of the wind).

At least, that is what it was supposed to be.

It ended up being that the headwinds started even before you reached the southern point, and were pretty much neutral for the entire stretch south. One could wax poetic about the furor of those winds, but I’ll leave it more simply - they sucked, they sucked hard.

For a while I debated which depth wheels to run. As you can see in the picture, I ended using a Zipp 404 (about 40mm deep) in the front and a HED Jet 9 (about 90mm deep) in the back. While not always the most comfortable with the cross-winds, I’m happy I went that deep. The general advice I follow is, “go as deep as you can handle”; you lose all the aero gains if you are constantly sitting up to keep from falling over.

Overall, I’m ‘eh’ about my bike. It took a bit to start to find the legs, but by the end of the first loop I was bang on my goal pace - 1:40 per loop. However, I was not meant to hold that. I had been training to hold 200 watts for the entire bike - in fact, I had done it for a 125mi training ride in Tahoe 2 months before. Alas, training rides are not races.

It was somewhere around mi 60 or 70 that my nutrition started to go off the rails. I just didn’t want to eat any more. I’d gotten down 3 bars (370 calories each), a couple of tubes of shot blocks, and all my salt tabs. By the last water stop I had finished my Nuun tablets, but was fine for most of the race so didn’t worry there. However, not wanting to eat meant I wasn’t getting nearly enough in for the last half…. something that will play a major role in the run.

There were some minor inner thigh cramps around mi 80, but it went away quickly. A little more hydration and keep rolling.

To cap it all off, I had 2 mechanical issues. (1) End of the second lap I slowed down to pee (trouble while pushing power) and the back wheel started rubbing the brake. After hopping off and messing with the wheel alignment for a bit - to no avail - I ended up just taking off one of the pads; no back brake anymore, but you don’t need one for this course. (2) Soon after the start of lap 3 the right shifter started rattling. At first, not much of a concern… but then it got worse. 20mi later the shifter had fallen out of the bar and become very difficult to use - not an option for the backside where I needed to shift. Cue another stop, this time at an aid station where the mechanic took apart the shifter. I got back most of my gears - not perfect, but usable for the last 20mi.

I ended up with a 5:54 bike; not something I’m entirely happy with, but certainly not something to scoff at either.

1402 - Transition 2 and Run

The run is a simple out and back. You start right downtown, which is lined with tons of amazing folks cheering, and then head out of town into progressively thinning crowds until you make the turn ~4.5 miles out in a quiet - though nicely shaded - part of the island. Then you turn around and come back into town. Then you turn around and do it again. Then, because you loved the view so much, you turn around and complete the loop a third time, when really, all you want to do is head down that bright, beckoning finisher chute.

T2 took all of 2:59 - could be better, but I’m happy. Should have spent an extra minute and gotten some heavy duty sunscreen… yup, definitely should have.

So I was off to the run. Already, the pros were well into their run, but at this point I was just looking for a solid performance. Alas, things are never as smooth as one hopes, and certainly never in Ironman.

First 5 mi, hitting sub-8s and feeling good. In the second half of the lap start walking the aid stations, but the first lap was exactly where I wanted to be - 8min/mile. I was eating fine and though I dropped my salt tabs around mi 6, I had some spares in my special needs bag; I just had to make it back there.

However, my fingers had started to tingle. Didn’t affect anything right away, so I didn’t worry about it too much (though I did ask the aid station for advice and they said more water… a fateful comment, as I came to find out).

It was great being able to see my SF Tri peeps (Liz Abbet, Chris Segler) and Jan - great energy boosters and something to look forward to in what becomes a very repetitive run.

Shortly into the second lap I started to not feel so great. Suddenly, it was walk every aid station and then run between, rather than the every-other I was doing.

Cue the nausea. Now it’s a matter of how fast can I go without throwing up. 8:30/mi. 9:00/mi. Around the 12mi mark, I completely lost my appetite. 10:00/mi. Then 15:30/mi (speed walking) - as fast as I could go without hurling. But at this point, I knew that even if I walked that last damn 5 miles, I would make it under the cutoff.

Fortunately, this means my ankle didn’t take too much damage on the run. I felt some twinges around mile 16, but that was shortly followed (2mi later) by mostly walking, so there is some positive to this!!

I ended up with a 4:27 run - better than my worst marathon, but not nearly where I wanted to be (close to 3:30).

For the last two miles I decided I could go deeper and ran them. Ended up doing ~8min/mi, but knew knew knew something was wrong.

1832 - Finish!

Finally, I was across the line…and right into the medical tent.

After a cup-o-soup and sitting for over an hr, made my way out to the family, but after getting back to the hotel, some vomiting and not being able to get anything more down, it was off to the hospital.

I had hyponatremia - turns out that much clear urine is not good.

It took a good 18hrs of various drips and bed rest, but got released on Tuesday and was feeling much better at that point (at least good enough for a greasy quesadilla!). A few days later and I was walking around without pain and just excited to come back to SF.

Retrospective

Overall, I’m happy with my experience. You can’t control everything, so there were some good learnings from the bike. Also, definitely more swim bricks to get the legs straightened out there! I’m thinking about switching to calories in the bottles, just to get enough down (was maxed at 450/hr), but something to play with.

But next year, no Ironman. A few halves and focusing on getting faster, better swim technique and bike power. Training has just been too much and this race was just so damn hard.

But we’ll see about in two years :)

Ironman Cozumel 2014

2014-11-17T00:00:00+00:00

On November 30th I’ll be racing Ironman Cozumel.

You can follow the live coverage ——–> here <——–

I’m number 418.

I’ll be updating as I can - not sure about internet down in Mexico.

Yes. Ironman.

That’s a 2.4 mile swim, followed by a 112 mile bike ride and then a 26.2 mile run (or as it’s more commonly known, a marathon).

And let’s not forget that Cozumel at this time of year is a balmy 85F with 100% humidity and headwinds that will knock you around like a toy.

So it’s looking like a fun day out :)

And it hasn’t been an easy journey - here’s the numbers from the last 8 months of training:

450+ hrs Total
120hrs of swimming - 160,000 yds
240hrs of cycling - 3600 mi
90hrs of running - 575 mi

And that’s just in training hours. It doesn’t take into account time foam rolling, sports massage and doing physical therapy.

I’ve been incredibly lucky to have supportive people in my life that enable me to do it - my loving girlfriend Megan for putting up with everything, my friends for still wanting to hang out (even if it’s just once a month and all I talk about is training) and my tri club for helping feed the addiction.

HBase Consistent Secondary Indexing

2013-06-11T00:00:00+00:00

When using a database you don’t always want to read the data in the same way each time - it may make sense to query it on “orthogonal properties”. For instance, if you have a database of food in your grocery store, you might have your database sort them by name, but some days might want to query the database for all the items that came in on Monday. The brute force approach would be to scan through all the items in the database to find just those that came in Monday - clearly not a very scalable solution!

Secondary indexes allow you a ‘secondary’ mechanism by which to read the database; they store the data in an index which is optimized to be read for an orthogonal facet of your data (for instance, date of arrival).
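The grocery example can be sketched in a few lines. This is a single-process toy with invented names (`GroceryDB`, `by_arrival`), so it sidesteps every distributed problem discussed later; it only shows the difference between the brute-force scan and an index lookup.

```python
from collections import defaultdict


class GroceryDB:
    def __init__(self):
        self.primary = {}                    # name -> record (primary key is the name)
        self.by_arrival = defaultdict(set)   # arrival day -> names (secondary index)

    def put(self, name, arrival_day, price):
        old = self.primary.get(name)
        if old is not None:
            # Drop the stale index entry so the index never points at old data.
            self.by_arrival[old["arrival_day"]].discard(name)
        self.primary[name] = {"arrival_day": arrival_day, "price": price}
        self.by_arrival[arrival_day].add(name)

    def arrived_on_scan(self, day):
        # Brute force: touches every record in the primary table.
        return sorted(n for n, r in self.primary.items() if r["arrival_day"] == day)

    def arrived_on(self, day):
        # Index lookup: touches only the matching entries.
        return sorted(self.by_arrival.get(day, ()))
```

Note that `put` has to update two structures together; keeping those two writes consistent is exactly what gets hard once they live on different servers.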

The Oracle documents on Berkeley DB define secondary indexing as:

A secondary index, put simply, is a way to efficiently access records in a database (the primary) by means of some piece of information other than the usual (primary) key.

Traditional, single-server (or small-cluster) databases achieve secondary indexes by updating an ‘index table’, which stores the data in the query-organized layout, in the same transaction as the update to the primary table. This works fine because there is very little overhead - no need to go across the network or rely on complex coordination. Everything is nicely ACID and works with the existing model.

HBase doesn’t play as nice.

HBase Problems

HBase is built to scale by sharding the data between different ‘regions’ that could live anywhere on the cluster. Each region is (almost) entirely independent from every other region in the cluster - this allows us to scale up the number of regions as our data size grows and not worry about performance.

The problem with secondary indexing then is that we are then attempting to add this cross-region interaction on a system whose very basis is to not have cross-region dependencies! At first blush, this is the very definition of an impedance mismatch.

Old News - An Overview of Existing Options

People have tried many times in the past to implement secondary indexing over HBase - things like Lily and HBaseSI attempt to tackle the problem head on.

Lily builds its own Write-Ahead Log (WAL) framework on HBase - this gives us most of the expected semantics but at a rather high latency cost. For some use cases, this is fine - if this is you, you can stop reading and go call up the Lily folks.

HBaseSI is an alternative approach but doesn’t work well with Scans - it’s designed for point Gets. Since the general use-case for HBase is multi-row scans, this doesn’t translate into a general solution.

People have also attempted to do full transactions on HBase, things like Omid and Percolator - once you have full transactions between tables, adding secondary indexes is trivial; they are just another transaction. The downfall here, as expected, is the overhead. In a distributed system, you end up creating massive bottlenecks that dramatically reduce the throughput (and increase the latency) of the entire system. For most people, this has proven too much overhead.

Then people have attempted to do secondary indexing through the application. While this could very well work, it is rarely going to be generally applicable and further, is going to be very brittle. Secondary indexing is properly a function of the database and should be closely tied to its internals to support efficient and correct implementations. In particular, dealing with failure scenarios to guarantee correctness outside of the database layer is often a losing proposition.

Recently, some work has come up to provide in-region indexing. Essentially, we provide a secondary index on a given Region. Then when querying along the index, we need to talk to each region’s index to determine if that region contains the row. The obvious downside is a dramatic effect on throughput and latency. Where previously we only had to talk to one server, suddenly we have to talk to all the servers and cannot continue until we get a response back from all of them (otherwise, we might miss a positive response). If you are willing to take this latency hit, it can be an acceptable solution - it’s fully ACID within the HBase semantics.
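The cost of in-region indexing is easy to see in a toy model. The `Region` class and `query_by_index` function below are invented for this sketch and are not HBase classes; each region indexes only its own rows, so a lookup has to visit every one of them.

```python
class Region:
    """Toy region holding some rows and an index over just those rows."""

    def __init__(self, rows):
        self.rows = dict(rows)      # row key -> indexed value
        self.local_index = {}       # indexed value -> row keys in THIS region
        for key, value in self.rows.items():
            self.local_index.setdefault(value, []).append(key)

    def lookup(self, value):
        return self.local_index.get(value, [])


def query_by_index(regions, value):
    """Ask every region; return the matches and how many regions were contacted."""
    matches = []
    for region in regions:
        # We cannot skip a region: any one of them might hold a match.
        matches.extend(region.lookup(value))
    return sorted(matches), len(regions)
```

Even when only two regions hold matching rows, the number of regions contacted is always the full region count, which is where the latency hit comes from.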

There has already been a lot of published thought on other ways we could do secondary indexing - I took a crack here and my colleague, Lars Hofhansl has written some thoughts here and here. However, all these proposals are either (1) wrong in small corner cases or (2) inefficient. We can do better…

Redefine the problem

What if we don’t need to support full ACID semantics? I mean, HBase doesn’t provide them, so why should our indexing solution need to provide stronger guarantees than HBase?

Hmmm, okay, maybe this could take us somewhere…. Let’s look at what we do need to provide.

Durability

First, we certainly need to provide durability (aciD).

Lily does this by providing its own WAL implementation. A simpler version (remember, we don’t need ACI of ACID) would write to a WAL table and then replay the WAL when doing updates. However, this ends up requiring at least a 4x write of the data (once to the WAL table, which writes to an HBase WAL, and then to each of the involved tables). And keep in mind that you also need to read the WAL each time you are doing a read and then merge those changes back into the Results on the client. This is going to get rough really fast.

Well, wait a second - HBase already has a WAL! Maybe we can tap into that…

By tacking on custom KeyValues (let’s call them, oh, I don’t know, IndexedKeyValues) to the WALEdit we can serialize our index updates to the WAL using the usual Writable#write method. In fact, this means we don’t even need to have a backing byte array in our IndexedKeyValue! In HBase 0.96/0.98 it’s a little bit different, but conceptually the same.

The only problem then is making sure that we can read these edits back again. In HBase 0.94.9 (the next release), we can provide a custom WALEditCodec which manages the reading/writing of KeyValues in the WALEdit to/from the WAL - this is by far the cleaner mechanism and exactly how we would support indexing on 0.96/0.98 (we don’t yet, but it’s a minor port). In <0.94.9, we need to provide a custom HLogReader - an IndexedHLogReader - that can figure out the type of the serialized KeyValue, either an IndexedKeyValue or a regular KeyValue.

Great! Now we have durability of our index update AND a way to read it back.

Getting back to ACID

Now, what kind of guarantees can we provide? So far we only have the “D” in “ACID”. We were able to make some big strides by thinking about how we can leverage HBase; let’s see if we can do that again.

Previously, we always expected the client to define all the index updates to make at write time. It was always smart enough to break out the update into the required updates to the index table and then just send all of those to the database. The database here just needs to provide the base intelligence to apply the updates.

What if, instead, we push down the work to the server? It would be the same amount of data transfer. Originally, it was once to the primary table and then once to each index table. If we push down to the server it’s a primary update to the primary region and then from there out to each of the index tables. There is a bit of a throughput concern here (we have to serialize the process a bit, rather than making the updates in parallel), but it’s relatively minimal… and we’ll talk about how we could alleviate this later.

Since the region - or rather a RegionObserver Coprocessor - builds and writes the index update it should be able to manage the consistency (aCiD) of the updates. Remember that HBase doesn’t make any serializability guarantees between clients (see my previous blog post about managing this with external time) - all we need to guarantee is that the index updates eventually make it.

Therefore, let’s tie the primary and the index updates together. When we get a write, the coprocessor builds up IndexedKeyValues that contain the index update information and we attach them to the WALEdit for the primary table Mutation. Once this gets written to the WAL it’s expected to be durable - we can then attempt to send the index updates to the index tables.

Facing Failure

If any of the index updates fails, we need to ensure that it gets reattempted. The simplest way to do this is to kill the server, which will trigger the standard WAL replay mechanisms. By hooking into this replay mechanism, we can pull out our index updates and replay them to the index table, which has hopefully recovered by this time. Killing the server has a dual benefit of being hard to miss - if the index table is incorrectly configured (i.e. it doesn’t exist), your cluster will quickly shut down, alerting you to the problem. This gets us atomicity and isolation in the HBase world - updates to the index will always occur, but are not guaranteed to be performed at exactly the same time or order with other updates.
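Here is a toy, single-process model of that write path and the fail-hard recovery. None of these names (`ToyRegionServer`, `ToyIndexTable`, `IndexWriteFailed`) are HBase or hbase-index classes, raising an exception stands in for killing the server, and applying the primary update before the index update is an assumption of the sketch. What it does try to show faithfully is that the index updates ride along in the same WAL entry as the primary update, so a replay can re-send them.

```python
class IndexWriteFailed(Exception):
    """Stands in for 'kill the server' in this toy model."""


class ToyIndexTable:
    def __init__(self):
        self.available = True
        self.entries = {}   # indexed value -> set of primary rows

    def put(self, key, row):
        if not self.available:
            raise ConnectionError("index table unavailable")
        # Adding to a set makes replaying the same update harmless.
        self.entries.setdefault(key, set()).add(row)


class ToyRegionServer:
    def __init__(self, index_table):
        self.wal = []        # durable log of ((row, value), index_updates)
        self.primary = {}    # row -> value
        self.index_table = index_table

    def put(self, row, value):
        # 1. Build the index updates on the server (the coprocessor's job).
        index_updates = [(value, row)]
        # 2. Log the primary update and its index updates as ONE WAL entry.
        self.wal.append(((row, value), index_updates))
        # 3. Apply the primary update.
        self.primary[row] = value
        # 4. Attempt the index updates; any failure is fatal for this server.
        try:
            for key, target in index_updates:
                self.index_table.put(key, target)
        except ConnectionError as err:
            raise IndexWriteFailed(f"aborting server: {err}") from err

    @classmethod
    def recover(cls, wal, index_table):
        """Replay the WAL on a fresh server, re-sending every index update."""
        server = cls(index_table)
        server.wal = list(wal)
        for (row, value), index_updates in wal:
            server.primary[row] = value
            for key, target in index_updates:
                index_table.put(key, target)
        return server
```

Because the replay re-sends updates that may already have been applied, the index writes need to be idempotent; the set-based index above gets that for free.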

That is a bit of an extreme failure scenario, but follows a ‘fail fast, fail hard’ paradigm - not always robust, but ensures correctness. There are other, potential mechanisms to handle missed index updates, for instance, marking an index as invalid and rebuilding later. However, this is a bit more complex to handle and outside the scope of this ‘bare bones’ indexing solution.

HBase ACID

If you are using HBase, there are some things you give up - cross-row guarantees. However, once you can see the data, it’s durable. By leveraging the WAL replay mechanisms in concert with careful management of the WAL (ensuring the correct edits get replayed) we get the same ACID guarantees with our index updates that HBase makes of our primary row updates.

See the HBase Reference Guide if you want a more in-depth treatment of what ACID guarantees HBase makes.

Not just fluff

The above discussion is not just a theoretical investigation on how one might implement secondary indexing - this is actually what we have done. Initially, hbase-index is being released as a subproject under Phoenix, but there are discussions around moving this into the HBase core.

hbase-index is designed to be a transparent layer between the client and the rest of HBase - nothing is tying it to Phoenix and it can be used entirely independently. However, Phoenix support for hbase-index is currently in progress at Salesforce (see the github issue where James lays out the internals) and will be completely transparent to Phoenix clients. If you don’t want to use Phoenix, you can easily create your own IndexBuilder to create the index updates that need to be made.

Constraints

There are several constraints of the current implementation. None are insurmountable, but merely artifacts of a new project.

First, we only support Put and Delete mutations. This is because they provide sufficient hooks into the WAL for a RegionObserver. There is no theoretical reason we can’t support other types for HBase, but rather the practical matter of getting the support into HBase.

That brings us into the realm of HBase versioning - we only support WAL Compression on HBase >= 0.94.9 (soon to be released). As of 0.94.9, we can plug in our custom WALEditCodec which manages the compression logic. We don’t support HBase 0.94.[5-8] as there are several minor bugs that prevent the hbase-index from functioning. Indexing is supported in HBase 0.94.[0…4], but without WAL Compression. Currently, only the 0.94 series is supported as we are initially targeting Phoenix adoption, but moving to 0.96/0.98 should be a trivial matter.

Right now, we don’t support the built-in HBase replication. The major problem here is that the replication mechanisms are not pluggable, making it impossible to use the same custom KeyValue mechanisms that we previously employed. This is not an insurmountable change (and plays well with much of the current work on the HBase internals already being done in the community), but merely one that takes time.

Also, as mentioned above, we end up putting a lot of load on each Region - it has to build up all the index updates and write them to the other tables, all while doing all its usual work. This could slow us down a little bit. Alternatively, we could use the same WALEdit/IndexedKeyValue mechanism but just provide a locking mechanism on the WAL. The client is required to make the index updates after making the indexed Mutation to the primary table - all the server does is ensure that we don’t roll the WAL until the index updates have been made. While this sounds great, it introduces a lot more complexity around when to trigger failures and managing client writes concurrently with the WAL.
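A rough sketch of that alternative, with an invented `GuardedWal` class: the server does no index work itself and only refuses to roll the log while any logged edit still has unconfirmed index updates. How a client would actually confirm its updates, and what happens when it never does, is the complexity this toy leaves out.

```python
class GuardedWal:
    """Toy WAL that will not roll while index updates are outstanding."""

    def __init__(self):
        self.entries = []      # current log segment: (edit_id, edit)
        self.pending = set()   # edits whose index updates are not yet confirmed
        self.archived = []     # rolled (closed) segments

    def append(self, edit_id, edit):
        self.entries.append((edit_id, edit))
        self.pending.add(edit_id)

    def confirm_index_update(self, edit_id):
        # Called once the client has made the index updates for this edit.
        self.pending.discard(edit_id)

    def try_roll(self):
        if self.pending:
            # Rolling now could discard the only durable record of an
            # index update that has not been applied yet.
            return False
        self.archived.append(self.entries)
        self.entries = []
        return True
```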

Conclusion

There has been a lot of discussion and work around secondary indexes over the last few years. Everyone wants it, but no one is willing to give up certain things (speed, traditional ACID, HBase features) to get secondary indexing. We aren’t proposing that this solution is one-size-fits-all; if you need full consistency between indexes and the primary table, then this won’t be enough. However, if you are already using HBase and willing to continue those semantics, hbase-index provides an easy framework to build your own indices.

By leveraging a RegionObserver that creates custom KeyValues we can be sure all updates are stored into the WAL, giving us the expected durability. This coprocessor then also makes the index updates and fails the server if we cannot make them, triggering a WAL replay and another attempt to update the index. While a bit drastic, these ‘fail hard’ semantics make it difficult to avoid seeing an error - quickly alerting when your index table is misconfigured.

This isn’t vaporware; the code is already out there [on github] and support is coming to Phoenix. Think this stuff is cool? Then we would love to have you comment on the project or even write some code!

Upcoming

Now the careful reader will have noticed one glaring omission in this blog post - how do you actually maintain the index in a way that makes sense? We mention an IndexBuilder, but not how you would use it. hbase-index comes with a very simple implementation of an IndexBuilder - merely how one would create and publish updates to the index. However, this example doesn’t cover how it would translate to scanning; in fact, it translates very poorly - there is no index cleanup, which makes it very difficult to reason about at scan time.

This is not to say it’s not possible to create a fully-covered index using the IndexBuilder model. However, it starts to get somewhat complex (a future blog post) - you have to deal with data table lookups and managing which index elements can be deleted at which timestamps.

Guest Blogging

2012-12-01T00:00:00+00:00

I was recently asked to write a few guest blog posts about HBase. I’d had some ideas bouncing around for a while (and a little personal brand expansion is never a bad thing), so I started working on it. Here’s some of my thoughts from the experience.

The posts I wrote were all drawn from my recent experiences working with and on HBase:

HBase Replication - Promise and Peril
- A discussion on how cross-datacenter replication works in HBase and some of the more interesting things you can do beyond the obvious disaster recovery; it’s not all sunshine and roses though, as I talk about in this post.
HBase .META. Layout
- How does HBase organize .META.? A definitive description (as of HBase v0.96) and some info about how splitting works, why we get occasional ‘holes’ in .META. and some hints on how to fix it.
Modularizing HBase: Lessons in Maven’s Black Magic
- A while back, I spent a bunch of time modularizing HBase (HBASE-4336) - taking it from a single monolithic layer to a set of smaller building blocks. Along the way, I learned some useful lessons in dealing with Maven that anyone thinking about modularizing a big project will probably encounter.
HBase File Retention for Backup and Testing
- Recently HBase (0.94 and higher) gained the ability to start retaining HFiles when they are deleted. Combined with the existing archival of HLogs we have the building blocks of a comprehensive backup and testing solution for your HBase cluster.

When I was asked to write the posts, I had about 1.5 weeks - a pretty daunting timeline for four 1000+ word posts that I’d only half formed ideas about. I had another idea that didn’t get written, about integrating HBase with legacy applications, but thought four posts were more than enough work for my deadline (maybe something I post on here?). In the end, I’m glad I put a limit.

Generally, my blogging process is fairly organic - I jot down a couple notes over a few weeks about things I might want to write about, then sit down at a cafe for an afternoon and hammer out the post, and then do a couple review passes before posting (by the way, I use git locally for all my documents + jekyll hosted on github - it’s a great, free way to host a blog and iterate on posts). This is the method I’m using for this post.

With my deadline, my usual meandering pace wouldn’t cut it. Luckily, I’d been a bit lax about writing, so I had a couple general ideas already written down. Two weeks ahead already.

Once I narrowed down the topics I wanted to work on, I spent two afternoons writing outlines. Turns out the skills I learned in college aren’t all that rusty. This was great for two reasons: (1) it organized my thoughts into a coherent story and (2) it made the writing only require the specific prose, not the ideas as well.

In software terms, the outline became a bit of an abstraction layer - suddenly the amount of things I needed to keep in my head was halved; the ideas were already there, I just needed to make them sound good. Outlines inherently also use a markdown style syntax, so translation to the final document was even easier as I write in markdown (definitely worth digging into the positives of a leaky abstraction - both in terms of mental model and efficiency - but that’s for another post).

Once I had the outlines, it was just a matter of another two afternoons in the coffee shop to write up the posts. Probably the worst part of that whole process was making the images (aren’t my words clear enough??) and converting the formatting to the publisher’s desired Word format (as you can expect, not a big fan).

Follow up with a single editing session over all the posts and I was good to go! I would never have written with the same volume in the timeframe without the deadline. Instead, it would have been spread over a few months, with ‘recovery time’ between posts. There is a certain dark enjoyment though out of burning hot - getting things done is always appealing, even at the cost of a few hours of sleep.

It’s hard to say if this was any faster than my usual process. The rigor led to a more… boring experience, but to overall higher quality posts. I have a tendency to ramble (noticed?) using the organic method (unless it’s writing about solving a technical problem - see fixing java GC logging) and be a bit more long winded than necessary. This blog is certainly not my masterpiece and while I like producing a higher quality work product, my time is certainly limited.

“Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte. (I have made this letter longer than usual, because I lack the time to make it short.)”

Blaise Pascal, Lettres Provinciales (1656-1657), no. 16.

But that’s how things go in this era of more, faster, now (if not yesterday); c’est la vie.

What kind of writing techniques do you use? Do you find it more fun to write in a more structured or unstructured environment?

Rolling Java GC Logs

2012-11-05T00:00:00+00:00

If you are running a java process, you probably want to keep track of what the garbage collector is doing. You can access this via jconsole or by logging the gc actions by adding:

-Xloggc:gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

which logs to the ‘gc.log’ file.

And for simple cases, that will probably work just fine…until your process starts running for more than a few days. The GC log is not rolled automatically, potentially resulting in a log that can easily grow out of control and fill up your filesystem.

Bad news bears!

What you really want to do is roll the logs periodically. You could do this manually with a cron job (which means you might miss some entries), or every time you restart the process (but if you don’t restart often, you’re up a creek) or send the log to your own custom logger (which can be tricky to get right).

All pretty ugly solutions. I sure wish we had something better…

As of Oracle Java 1.6_34 (or 1.7_2 in the latest minor version), we do! GC logs can be automatically rolled at a certain size and retain only a certain number of logs.

To turn on simple log rolling, you only need to add the following to your java command line options (in addition to the necessary GC log arguments mentioned above):

-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=<number of files> -XX:GCLogFileSize=<size>

where <number of files> is just an integer and <size> is the size of the file (e.g. 16K is 16 kilobytes, 128M is 128 megabytes, etc.). Rolled files are appended with .<number>, where earlier numbered files are the older files.

Suppose you ran a java program with the parameters:

$ java -Xloggc:gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=128K

you might see something like the following show up in your directory:

-rw-r--r--   1 jyates  staff    90K Nov  5 18:25:39 2012 gc.log.0
-rw-r--r--   1 jyates  staff   128K Nov  5 18:25:25 2012 gc.log.1
-rw-r--r--   1 jyates  staff   128K Nov  5 18:25:29 2012 gc.log.2
-rw-r--r--   1 jyates  staff   128K Nov  5 18:25:33 2012 gc.log.3
-rw-r--r--   1 jyates  staff   128K Nov  5 18:25:36 2012 gc.log.4

What’s really nice to note here is that GC logs beyond the specified number are automatically deleted, ensuring that you know exactly (+/- a few kilobytes for the occasional heavy load) how many log files you will have.

Pretty cool!
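To put a number on that bound, here is a minimal Python sketch that turns the two rotation settings into a worst-case disk footprint. The helper names (parse_size, max_gc_log_bytes) are made up for illustration and are not part of any JVM tooling; the sketch assumes the K/M/G suffixes are 1024-based and does not model the small ‘best effort’ overshoot.

```python
# Hypothetical helpers for illustration only; not part of any JVM tooling.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(size):
    """Convert a size such as '128K' or '16M' into bytes (assumes 1024-based units)."""
    size = size.strip().upper()
    if size and size[-1] in _UNITS:
        return int(size[:-1]) * _UNITS[size[-1]]
    return int(size)  # plain byte count, no suffix


def max_gc_log_bytes(number_of_files, file_size):
    """Approximate upper bound on the disk space used by the rotated GC logs."""
    return number_of_files * parse_size(file_size)


# Settings from the example command line: 5 files of 128K each.
budget = max_gc_log_bytes(5, "128K")  # 655360 bytes, i.e. 640K
```

With the settings from the example above, the logs should stay at roughly 640K in total, give or take the occasional oversized file.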

Unfortunately, if you attempt to turn on log rolling and forget to include the number of files or the size, the JVM will not turn on log rolling and instead tell you:

To enable GC log rotation, use -Xloggc:<filename> -XX:+UseGCLogRotaion -XX:NumberOfGCLogFiles=<num_of_files> -XX:GCLogFileSize=<num_of_size>
where num_of_file > 0 and num_of_size > 0
GC log rotation is turned off

This message is wrong!

Double check your other parameters, and try again; you definitely want to use -XX:+UseGCLogFileRotation.

Hopefully this helps you set up your own log rolling. If you have any other JVM/GC tricks, I’d love to hear about them.

Notes:

This is actually a best effort rolling process. If you are doing a lot of GC work (e.g. leaning on the ‘garbage collect’ button in jconsole), the log may grow larger. However, as soon as the JVM has a chance it will then roll the log.

Consistent Enough Secondary Indexing

2012-07-09T00:00:00+00:00

In databases, data is organized into tables, sorted by the ‘primary key’ of each data row. The primary key is generally either a globally unique id (GUID) or some other uniquely identifying information for that row.

For instance, in a database of people, you could use social security numbers (SSNs) as the primary key for each person-row. Then to find a person by SSN, you can do an O(lg(n)) lookup and find that row (assuming the SQL database of your choice implicitly creates an index on the primary key - otherwise, this is also a full table scan since SQL-esque databases usually don’t store data in sorted order, though indexes are stored in B-Trees). However, suppose you wanted to look up the person by their address - maybe you want to find all the people living at ‘123 Jump Street’. With the current table setup, you would have to scan the entire person table, looking at each record to see if that person lives at ‘123 Jump Street’ - a potentially huge, time consuming query.

The idea behind secondary indexes is that we ‘index’ the address field of all the people in our database into another table. The primary key of the secondary index table is just addresses and then it stores all the primary keys (social security numbers) of people living at that address. We are trading space for speed and for very large queries, this trade-off is entirely acceptable. This means that it becomes very fast: lg(n) to find the indexed row and then lg(n) again to do the lookup in the primary table.

In our people database example, to find all the people living at ‘123 Jump Street’ you can then just jump right to that row in the index table, and get all the primary keys in the people table that have that address. Then you can jump right to those keys, giving you the information for all the people directly from the ‘people’ table. This is vastly more time efficient than doing a scan of the entire people table, looking at each record to see if they live at ‘123 Jump Street’.
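A toy version of the two lookups, as a minimal Python sketch. The table and function names are invented for illustration, as are the extra people. Python dicts are hash tables rather than sorted trees, so the per-lookup cost differs from the lg(n) above; the point is only the shape of the access pattern (one index lookup, then one primary lookup per key, versus touching every row).

```python
from collections import defaultdict

# Primary table: SSN -> row. Index table: address -> set of SSNs.
people = {}
address_index = defaultdict(set)


def add_person(ssn, name, address):
    """Store the row and record its primary key under the indexed address."""
    people[ssn] = {"name": name, "address": address}
    address_index[address].add(ssn)


def find_by_address_scan(address):
    """Without an index: look at every row in the primary table."""
    return sorted(ssn for ssn, row in people.items() if row["address"] == address)


def find_by_address_indexed(address):
    """With an index: one index lookup, then one primary lookup per key."""
    return sorted(ssn for ssn in address_index.get(address, ()) if ssn in people)


# Made-up sample data.
add_person("111-22-3333", "Jane Doe", "123 Jump Street")
add_person("333-44-5555", "John Smith", "123 Jump Street")
add_person("555-66-7777", "Ann Lee", "9 Elm Road")
```

Both functions return the same SSNs for ‘123 Jump Street’; only the amount of data touched along the way differs.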

Creating secondary indexes in an RDBMS is a very natural fit - every time you write a row (either a new row or an update to an old one), the write is turned into a transaction that also updates the index for each indexed field. For example, if a person moves, we update that information in the ‘people’ table and at the same time update the address index table. Since this is a full transaction, if the client or server fails halfway through the transaction, we will never have an inconsistent state, so reading the location row always gives us the latest information.

The problem with this approach for a distributed system is that transactions are hard when spread across machines. To be completely safe they require a Paxos-like protocol to complete, which can be very costly time-wise. You can play some tricks with optimistic locking to get good overall performance, but it’s still very hard to ensure that even 90th percentile times are anywhere close to the average.

In the past I’ve worked on some code to do secondary indexing (Culvert) for a BigTable-like system, but the indexing was of secondary importance (no pun intended) to the rest of the work of doing SQL over BigTable-like systems. Culvert is almost there in terms of overall correctness, but trades off full consistency for speed and ease. We decided that a dirty index (false positives) and fast writes were more important than having a fully consistent view of the index.

There are ways to do the latter, and the guys over at [Lily](http://www.lilyproject.org/lily/about/playground/hbaseindexes.html) have done a terrific amount of work to make it possible. However, Lily has a lot of moving parts (secondary servers, a full write ahead log, etc.) that make it inherently hard to use, fragile and slow - somewhere on the order of 100s of writes per second in HBase, which natively can do millions of writes/sec. Don’t get me wrong - they are great guys and are doing great work, but we can do better.

There are a couple of pieces that we need to put together to enable consistent secondary indexes. The first key realization is that we don’t need to be transactional - indexing can be an idempotent operation if we bind the writes to a timestamp (the second part).

Let that sink in for a moment… it means we can essentially ‘cheat’ (traditional) indexing and still always be right (enough). It’s known that secondary indexing in a distributed environment is inherently easier than full transactions, but it’s rarely articulated why it’s easier. Idempotence allows us to retry without concern about currently running operations or worry about previous effects. This is huge. A game changer.
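To make the idempotence point concrete, here is a minimal Python sketch with a nested dict standing in for a versioned table. The apply_write helper and the cell layout are assumptions for illustration, not any real client API.

```python
def apply_write(table, row_key, column, value, timestamp):
    """Store a cell under (row, column, timestamp); replaying it changes nothing."""
    table.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value
    return table


index = {}
apply_write(index, "123 Jump Street", "ssn:333-44-5555", "", 14)
once = repr(index)

# A retry after a timeout or crash replays the exact same cell.
apply_write(index, "123 Jump Street", "ssn:333-44-5555", "", 14)
twice = repr(index)
```

Because the timestamp is part of the cell’s identity, the retry lands on the same cell and the table state after one attempt and after two is identical. That is what lets a client blindly retry a write it isn’t sure succeeded.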

We then have two major concerns - making sure we never get a ‘wrong’ answer from the index (even if the client or server crashes) and making indexed writes fast. If we didn’t care about the latter, we could do two-phase commit with a WAL and get correctness (if that’s good enough, just use Lily), but we are going to take a significant latency hit due to the high number of writes and further - depending on the workload - could be highly contentious with respect to locking (for multi-row writes).

Instead, we split the effort by making the client a little bit smarter and adding some more manipulation on the server-side.

Example and proof by hand-waving

Let’s walk through the implementation via an example.

Architecturally, the first piece we would like to have is a globally unique id generator, allowing each client to apply the same write number to each batch of writes. Something like this is discussed in [Percolator](http://research.google.com/pubs/pub36726.html), and it’s fairly trivial to implement something similar over HBase using the increment operation (I’ve recently done it and hope to open source it in the near future - check back on my [github](http://github.com/jyates) page).
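As a rough illustration of what such a generator has to guarantee, here is a minimal in-memory Python sketch. It is not the HBase-backed implementation mentioned above: the class name, method names and the lock-protected counter are stand-ins for an atomic increment on a shared row.

```python
import threading


class WriteNumberGenerator:
    """In-memory stand-in for an atomic counter (e.g. a row updated via increment)."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def next_write_number(self):
        """Return a number that no other caller has received or will receive."""
        with self._lock:
            self._value += 1
            return self._value

    def next_batch(self, size):
        """Reserve `size` consecutive numbers in one call to cut round trips."""
        with self._lock:
            first = self._value + 1
            self._value += size
            return range(first, first + size)


generator = WriteNumberGenerator()
first = generator.next_write_number()
second = generator.next_write_number()
batch = list(generator.next_batch(3))
```

The only properties the indexing scheme leans on are that numbers are never handed out twice and that they only go up; handing them out in batches is an optional optimization.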

Going back to the people example, say we want to add a new person to our database with an index on the address. This is just a simple inverted index example, but you could do smarter things like pre-joins, etc. in the index with nearly the exact same methodology.

The client then has to do a couple of things. First, it gets a write number from our global generator to apply to this batch of writes - this allows us to reason cleanly about the index state and the primary table state together, though the scheme is still possible if you set write timestamps from the client.

Once it has a write number, the client first writes to the address index table - the address row gets updated with the person’s SSN. It looks something like:

    Address     |      SSN     |  timestamp
123 Jump Street |  111-22-3333 |      2
                |  333-44-5555 |     14

supposing that someone with SSN=111-22-3333 already lives at 123 Jump Street and we added the person with SSN=333-44-5555. After successfully writing to the index table, the client then writes the same person to the primary table:

   SSN      |     Address     | timestamp
333-44-5555 | 123 Jump Street |    14

Then when another client attempts to look up who lives at 123 Jump Street, it will see two SSNs: the one that was already there and the one that we added. Secondary index built and working? Check.
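The write path just described can be sketched in a few lines of Python. The dict-based tables and the indexed_put name are assumptions for illustration; a real client would be issuing two RPCs here and waiting for the first to succeed before sending the second.

```python
def indexed_put(index_table, primary_table, ssn, address, write_number):
    """Write the index cell first, then the primary row, both at `write_number`."""
    # Step 1: index table, keyed by address; each SSN cell carries the write number.
    index_table.setdefault(address, {})[ssn] = write_number
    # Step 2: primary table, keyed by SSN; the address cell carries the same number.
    primary_table.setdefault(ssn, {})["address"] = (address, write_number)


# State before the new write: one person already lives at the address.
index_table = {"123 Jump Street": {"111-22-3333": 2}}
primary_table = {"111-22-3333": {"address": ("123 Jump Street", 2)}}

indexed_put(index_table, primary_table, "333-44-5555", "123 Jump Street", 14)
```

After the call, the index row for ‘123 Jump Street’ holds both SSNs, and the new primary row carries the same write number (14) as its index cell.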

Failure Situations

While this indexing scheme works when no component fails, things get interesting when we deal with failure situations, which are very likely on the commodity hardware that runs the largest clusters in the world.

Suppose the client fails before either write hits the wire - no problems, we still have a consistent system (okay, that was a gimme). Same story if both the writes succeed - another easy one.

Note that we can never get into a situation where there is a write to the primary table, but not to the index table, because we always guarantee that the index table write succeeds before writing to the primary table. You can do this by having the client write to the index table, waiting for success and then writing the primary table, or via an indexing coprocessor that writes the indexes to all the index tables before it writes to the primary table - these are just implementation details (for the record the latter would be faster and leaner from a client perspective, but a bit more difficult to implement). Either way, you will never have false negatives in the index table; you will only get false positives in the worst - partial failure - case.

False positives occur when the first write - to the index table - succeeds, but the write to the primary table fails. This puts us in a little bit of an odd situation because the index table says a row should exist, but the primary table doesn’t have that edit. We know the edit belongs to that index because of two key points:

- both writes have the same timestamp
- each timestamp is unique (via the timestamp generator), so it must have come from a partially-successful write

both of which allow us to ignore the failed edit. Basically any row in the primary table with a key whose latest update doesn’t match the expected timestamp (allowing us to keep multiple versions of the table back) can be considered a broken link and lazily cleaned up. Note that we don’t actually need to use the latest timestamp, but rather only a timestamp matching or greater than the index timestamp can be considered valid - proof is left as an exercise to the reader :)
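Here is a minimal Python sketch of that broken-link check, using the relaxed rule from the note (a primary timestamp matching or greater than the index timestamp counts as valid). The table layout and the is_broken_link name are assumptions made for the example.

```python
def is_broken_link(primary_table, ssn, address, index_ts):
    """True if the index cell (address, ssn, index_ts) has no backing primary write."""
    cell = primary_table.get(ssn, {}).get("address")
    if cell is None:
        return True  # the primary write never landed
    value, primary_ts = cell
    # Valid only if the primary cell holds this address, written at the
    # index's write number or later.
    return value != address or primary_ts < index_ts


# Index write at number 14 succeeded, but the client died before the primary write.
index_table = {"123 Jump Street": {"111-22-3333": 2, "333-44-5555": 14}}
primary_table = {"111-22-3333": {"address": ("123 Jump Street", 2)}}

broken = [
    ssn
    for ssn, ts in index_table["123 Jump Street"].items()
    if is_broken_link(primary_table, ssn, "123 Jump Street", ts)
]
```

Only the half-written entry (333-44-5555) is flagged; the older, fully written entry still checks out against the primary table.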

Reading the index and lazy cleanup

Reading becomes a bit more complicated when dealing with failures. If there are no failures, each write to the index will create a row (specifically, a row key, column family, column qualifier, timestamp, value tuple) that corresponds to the written index row. When we go to retrieve the primary row (the set of all key-values for that row), we have to consider which timestamp versions of each key-value to take since a Delete may have deleted specific versions of the key state between when the index was written and when we retrieve the key.

Note that we will never find the row from the index if the row (or the column that was indexed) was deleted since we always update the index before the primary table. This means that at least the primary key and the information for the indexed value will be correct when we do a read.

Well, there is a case when it won’t be exactly correct - we could actually have written the index and then failed to write the primary table. This can also happen if we are just slow to see the primary table write while reading the index table. In this case, we could proactively update the primary table with the correct information for that indexed value and timestamp. However, this easily leads to corrupted data in the primary table because we only know the indexed value, not the full primary row. The right thing to do then is to assume that the client write failed and mark that index for deletion (we don’t delete it right away to avoid slow primary table race conditions - we’ll get back to this point).

A read can then always return the latest state of the row in the primary table, if it has the indexed column. This is obvious because if we find the row in the index and that indexed value is found, the row is properly indexed and we want to get the current state of the entire row. We still need to consider the rest of the information associated with the primary key we found from the index. The safest way to retrieve the row from the primary table is to return only those columns with a timestamp equal or greater than the indexed timestamp (for columns not equal to the column we used in the index - if that column is newer, then our index is out of date and we can clean that up on the way out). Let’s go back to the person example again, and see if this actually works out.

Suppose our theoretical person with SSN=333-44-5555 is also named John Smith and has a cat. His entry in our primary table would look something like the following, if we put all of his information in at the same time:

   SSN      |     Address     |   Name     |    Pets   |  timestamp |
333-44-5555 | 123 Jump Street | John Smith |    cat    |     14     |

Doing a lookup into the primary table then works as expected. However, consider the case where we knew John’s name and that he had a cat, but not his address, at timestamp 7, but then later, at timestamp 14, learned his address. His row in our primary table then looks something like this:

   SSN      |  Column  |      Value      | timestamp |
333-44-5555 | Address  | 123 Jump Street |    14     |
            | Name     | John Smith      |     7     |
            | Pets     | cat             |     7     |

If we just return the columns with timestamps equal to or greater than that of the edit we wrote, then we will actually miss most of the information about John. The same applies to returning columns that have timestamps greater than the one we indexed. The only time we shouldn’t return the latest values is if the column we indexed has a timestamp that doesn’t match the timestamp in the index.
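Putting the read rule together, here is a minimal Python sketch using the stricter ‘timestamps must match’ check from the paragraph above. The table layout, the read_via_index name and the dangling 999-88-7777 entry are all invented for the example.

```python
def read_via_index(index_table, primary_table, address):
    """Return ({ssn: latest row}, [suspect index cells]) for one indexed address."""
    rows, suspects = {}, []
    for ssn, index_ts in index_table.get(address, {}).items():
        versions = primary_table.get(ssn, {})
        indexed = versions.get("address")  # (value, timestamp) or None
        if indexed != (address, index_ts):
            # Missing or mismatched: a candidate for lazy cleanup, not a result.
            suspects.append((address, ssn, index_ts))
            continue
        # The indexed column matches, so return the latest value of every column,
        # including ones written before the index entry (Name and Pets at 7).
        rows[ssn] = {column: value for column, (value, _ts) in versions.items()}
    return rows, suspects


primary_table = {
    "333-44-5555": {
        "address": ("123 Jump Street", 14),
        "name": ("John Smith", 7),
        "pets": ("cat", 7),
    }
}
# The second index cell has no primary row behind it (a partial write).
index_table = {"123 Jump Street": {"333-44-5555": 14, "999-88-7777": 21}}

rows, suspects = read_via_index(index_table, primary_table, "123 Jump Street")
```

John’s full row comes back, cat included, even though his name and pets were written at timestamp 7; the dangling entry is set aside for cleanup rather than returned.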

It’s interesting to note that if you update a key (say, the person moved again), you just need to do a single write for each field you are indexing and write to the primary table with the same timestamp. We can let the lazy cleanup take care of removing the stale entry when it’s found. This is a slight optimization for a write-heavy system, but could prove incredibly valuable in the long term (but could be rough in a read-centric system, in which case you would want to clean up the index when doing a write).

The careful reader might now ask, “What about partial writes that haven’t completed yet? Won’t we end up cleaning those too?” If we follow the above methodology, then yes, we will likely end up with false negatives in our index - one of the things we are attempting to avoid to make this indexing scheme all make sense.

However, what one can do is just ensure that if you see a potentially failed write, you don’t delete it immediately, but rather add it to a queue of elements to delete, sorted by timestamp. No edit can be deleted before you reach the timeout for a successful write. There are a couple of ways to clean up the index in the background.

The easiest way is to just periodically (e.g. daily) run a MapReduce job that compares the state of the primary table to the index table and removes index entries whose primary writes you are sure should have been committed by now. You can be sure by, for example, only cleaning up broken index entries with timestamps older than a day. If you are using a timestamp generator, it will probably push batches of timestamps, where each batch has a TTL, which you can then use to check cross-table consistency.

Using a daily job will likely cause a lot of overhead, since in the meantime you keep looking up data that you really shouldn’t; you can get around this by doing cleanup when you find the broken link. This costs you an extra round trip with a delete for that index cell, but this is a single operation that can be done asynchronously to the primary client and prevents other clients from finding that broken link in the index. There are two further options for doing this on the fly, depending on your implementation:
- client-side - in this implementation, the client gets the primary keys after reading the index table, then looks up each key in the primary table. The results are then filtered at the client (as mentioned above) to ensure the end-user only sees the correct key-values. Any borked index entries can then be cleaned up in a background request from the client.
- server-side - when doing an indexed lookup, the index table actually goes to the primary table to retrieve the results and then validates those against the index before pushing those values back to the client. This has a network hop for the primary table (which has the bulk of the information), but saves the return of the primary key list to the client (a relatively small overhead).
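Whichever side does the cleanup, the ‘don’t delete before the write timeout’ rule described earlier can be sketched as a small queue ordered by timestamp. This is a minimal Python sketch; the class and method names are made up, and it assumes write timestamps are clock-like so that `now - write_ts` is meaningful.

```python
import heapq


class DeferredIndexCleaner:
    """Holds suspect index cells until the write timeout has passed."""

    def __init__(self, write_timeout):
        self.write_timeout = write_timeout
        self._queue = []  # min-heap ordered by write timestamp

    def mark_suspect(self, write_ts, address, ssn):
        heapq.heappush(self._queue, (write_ts, address, ssn))

    def collect_deletable(self, now, primary_has_row):
        """Pop cells older than the timeout; keep the ones whose write has landed."""
        to_delete = []
        while self._queue and now - self._queue[0][0] >= self.write_timeout:
            write_ts, address, ssn = heapq.heappop(self._queue)
            # Re-check: a slow (not failed) writer may have finished by now.
            if not primary_has_row(ssn, address, write_ts):
                to_delete.append((address, ssn, write_ts))
        return to_delete


cleaner = DeferredIndexCleaner(write_timeout=5)
cleaner.mark_suspect(14, "123 Jump Street", "333-44-5555")
cleaner.mark_suspect(18, "9 Elm Road", "555-66-7777")

# Pretend the second writer was merely slow and its primary write did land.
landed = {("555-66-7777", "9 Elm Road", 18)}


def has_row(ssn, address, ts):
    return (ssn, address, ts) in landed


early = cleaner.collect_deletable(now=16, primary_has_row=has_row)
later = cleaner.collect_deletable(now=30, primary_has_row=has_row)
```

At time 16 nothing is old enough to touch. At time 30 both entries are past the timeout, but only the one whose primary write never showed up is returned for deletion.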

Note that any of these read and cleanup methods can be used with indexing of writes through either the client or server and can lead to some really nice implementations that are more or less useful, depending on your environment. Imagine a 2x2 grid with reads on one axis, writes on the other and client/server as the ‘unit’ of the axis.

          | client   |   server  |  <--- Reads 
   client |    1     |     2     |
   server |    3     |     4     |
 Writes

Let’s break down what you get in each quadrant:

1 - the client writes the index and then writes to the primary table. On reads, the client first gets the primary keys from the index, then gets the rows from the primary table and filters out incorrect rows. These incorrect writes are asynchronously removed from the index table.
2 - same writes as (1), but you only query the index server and let the server take care of retrieving the primary keys, filtering them, passing them back to the client and then locally updating the index links.
3 - The client just writes the primary keys to the index tables, and one table is picked as the ‘leader’ table and writes all the data to the primary table. In this case, we can actually mark all the indexes as completed or not on the leader table. On reads, the client still asynchronously updates the index table used in the read.
4 - Writes are the same as (3), but then reads are done through the index table, as in (2), with all the pruning and updates that implies.

There isn’t an immediately apparent answer for which quadrant is the right one to go with every time. In fact, it is likely to depend on your cluster; specifically, you will need to consider (1) the bandwidth between the clients and the servers, (2) bandwidth between servers, and (3) the flakiness of clients. Depending on the BigTable implementation you use, there are some more tricks you can play with things like filters, Iterators and/or Coprocessors for optimizing where the filtering, updates, etc. happen to get even better performance.

Performance - more credible hand-waving

If this was a single-system database, then we would update the index table, then go over and write to the primary table. Each lookup is about O(lg(n)), where n is the number of keys, giving us about 2lg(n) time to write a row with a single index (adding 1x for each new index). Lookup times are again 2lg(n) based on any of the indexed fields for that key.

From the quadrants above, let’s look at quadrant 1 - the client writes first to the index table (assuming a single index, but it’s trivial to multiply for other indexes) and then writes to the primary table; on reads we query the index to get the primary keys, get the data from the primary table, filter incorrect results and then asynchronously push the corrections to the index table.

In the proposed implementation, writes are going to be O(lg(m)), where m = number of keys in memory on each server (for HBase, it’s actually the number of keys in each region), to the index table, and the same again to the primary table, giving us 2lg(m). Since m << n, even though we need 2 network round-trips (one for the index, one for the primary table), writes are probably about as fast as on the single server system - ignore the man behind the curtain. Keep in mind that we don’t need to update old indexes when we change a key - we can let the lazy cleanup handle that, amortizing those network hops quite nicely and speeding up a write-heavy workload.

We are going to be hurt a little bit on reads. We have a lookup time comparable to the single system: lg(n) for the index + lg(n) for the primary table + primary table round trip (so far the same) + round trip for the index. This last round trip is a fairly small value given that we are just pulling across primary keys, not the entire primary row.

Since we may have a dirty index, we might have to pull data across the wire that we don’t actually want to read. It’s conceivable to use a coprocessor here that returns only the key-values from the matching row that are correct, saving you that data over the wire, but not the lookup and round trip. Even without use of this optimization, I submit that client failures causing large amounts of partially-written indexes are very unlikely - at worst you might get a handful of incomplete indexes. This can be calculated as probability of failure * number of concurrent writes per client * average write size, giving you the worst case expected overhead when doing a single read.

On average, this overhead is still just going to be a constant value, and for increasingly stable hardware, this approaches a fairly small constant. For example, at the rate of 1 node failure per day, in a thousand node cluster, with 100 concurrent writes and an average write size of 1KB, a node failure means <= 100KB of extra data sent across the wire an extra time, which on 1Gb ethernet links is roughly 0.8 milliseconds of extra latency for a single read. The correction is done asynchronously, and transparently to the client, so we can ignore that overhead. (99th percentile calculations are left as an exercise to the reader - wow, you’re really getting a workout today!).
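For the curious, the transfer-time part of that estimate is easy to check. A minimal Python sketch, assuming 1KB = 1024 bytes and a 1Gb link that moves 10^9 bits per second, and ignoring protocol overhead and round-trip latency; the function name is made up.

```python
def extra_transfer_ms(concurrent_writes, avg_write_bytes, link_bits_per_sec):
    """Time to move the partially written data across the link, in milliseconds."""
    extra_bits = concurrent_writes * avg_write_bytes * 8
    return extra_bits / link_bits_per_sec * 1000


# 100 concurrent writes of 1KB each over a 1Gb/s link.
worst_case_ms = extra_transfer_ms(100, 1024, 1_000_000_000)
```

Under those assumptions, 100KB of stray data costs a bit under one millisecond (about 0.82 ms) on the wire.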

So, in the end reads are 2lg(n) + small constant time, for n = number of keys in a region (<< total data set), + time to transfer all the data over the wire. In the worst case, you will have to do an async update to the index, but that can either be batched or just done as its own RPC without affecting reads. This is almost as good, and in non-failure cases exactly as fast, as if the entire dataset was immutable and we just used the index for speeding up lookups into our primary table.

This means we can do secondary indexing in a distributed, consistent and network partition tolerant system with only the overhead of going over the wire to do our writes + a constant factor on average - arguably as good as a single system (even with network latency overheads), and in almost every case as fast as indexing into a static dataset. This makes this system far faster than using distributed transactions, either through Paxos, optimistic locking or two-phase commit, but just as correct.

Wrap up

There are a lot of potential optimizations you can make on top of the proposed implementation - utilizing things like filters, coprocessors, and iterators more efficiently to minimize data across the wire, pre-joining in the index to avoid going to the primary table entirely, etc. - that can lead to even faster secondary indexing.

However, even without them we can still get to incredibly performant, correct indexes. It might take some time to build a super-optimized implementation of client-consistent indexes, but there are no apparent technical problems; it’s likely that the simple solution will be performant enough for all but the most demanding use cases.

Think I’m crazy or full of it? I’d love to hear your thoughts in the comments.

Edits:

slight technical correction on traditional database indexing.
correct spelling
There is a weird state where the following could happen:

Client 1 --> write to index table
Client 2 --> read index table
Client 2 --> read primary table, doesn’t find the primary table write from Client 1
Client 2 --> delete Client 1’s write to the index table
Client 1 --> write to the primary table

Which means you get into a state where the index is out-of-date. To resolve this condition, you just set up the constraint that index elements less than ‘t’ time old can’t be deleted, but if you find an older index element without a matching primary table row, then you are free to delete it. This gives you the constraint on clients writing - they have to complete all writes, from the time the write hits the index table to finishing the primary table writes, in at most ‘t’. However, if you set this to even a few seconds, this should be sufficient in most cases.

An alternative implementation that is a bit more heavy-weight (in terms of RPCs) is to write to the primary table but mark the writes as ‘hidden’, then the index table, and then the primary table to ‘reveal’ the writes. This gets around the timeout issue at the cost of RPCs and the need to put a filter on reads from the primary table.

Table References in the HBase Shell

2012-05-02T00:00:00+00:00

As of HBase v0.96 (currently trunk), one can now get a reference to a table in the client shell. This is huge news for the hbase shell - the biggest update since the security features were added.

The HBase shell is actually a specialized jruby REPL, preloaded with a bunch of specialized HBase functionality. One of the things that always bothered me about the shell was that even though Ruby is object-oriented AND HTables are objects, you couldn’t get a reference to an HTable; you had to use the top-level put, get, scan, etc. methods and specify the table name each time. A typical test that the shell is working might look something like this:

hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
hbase> put 't1', 'x', 'f1', 'v'
hbase> scan 't1'
hbase> disable 't1'
hbase> drop 't1'

In this little test, you are doing the following:

creating a table named ‘t1’ with the column family ‘f1’ and keeping 5 versions
putting a single row ‘x’ into table ‘t1’
scanning all the rows in ‘t1’
disabling ‘t1’
dropping ‘t1’

Anything in there seem really redundant? Accumulo solves this by having a ‘table context’ in their shell, but I’ve always found it a little odd and easy to forget which commands apply in which context (general or context or both?).

Instead, in HBase we use a table reference which lets you use all the commands without having to worry about context and at the same time simplifies manipulating tables. To do the same test as before, but with a lot less effort (especially with longer named tables) you can do the following:

hbase> t =  create 't1', {NAME => 'f1', VERSIONS => 5}
hbase> t.put 'x', 'f1', 'v'
hbase> t.scan
hbase> t.disable
hbase> t.drop

What is really neat is the addition of the ‘get a table’ functionality. If you have already created a table, say named ‘t1’, the above example could look something like:

hbase> t =  get_table 't1'
hbase> t.put 'x', 'f1', 'v'
...

Any of the more complex invocations of the table methods (get, put, etc) also work on the table reference - just like you would expect!

To get more information on how to use a command, you can use either:

hbase> help 'put'

OR if you have a reference to a table,

hbase> t.help 'put'

Similarly, to get general help for a table, you can:

hbase> table_help

OR if you have a reference to a table,

hbase> t.help

Note that table references will also allow you to tab-complete the manipulations on the table reference. This is one of the great advantages of using the ruby REPL and a real Table object.

High Level Implementation Details

Internally, a Table has a bunch of ‘internal methods’ that do the actual low level calls to the HTable reference (with a little massaging). This allows the user to much more easily tab-complete and find the correct methods first, e.g. get, put, scan, describe, etc., while the internal methods are accessed via calls to _name_internal, which in practice are things like _get_internal or _delete_internal.

Each of the top level commands binds its named command to the Table at load time, allowing it to wrap both call paths - from the shell and from a table reference - with a formatter and do all the ‘nice’ things we have come to expect from things in the HBase shell.
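As a rough illustration of that load-time binding, here is a minimal sketch in plain Ruby. It is not the HBase source: `SketchTable`, `add_shell_command` and `format_result` are hypothetical names, and an in-memory hash stands in for the HTable.

```ruby
# Hypothetical stand-in for the shell's Table class; not the HBase source.
class SketchTable
  attr_reader :name

  def initialize(name)
    @name = name
    @rows = Hash.new { |h, k| h[k] = {} } # in-memory stand-in for an HTable
  end

  # 'Internal' methods do the low level work and return raw data.
  def _put_internal(row, column, value)
    @rows[row][column] = value
    nil
  end

  def _get_internal(row)
    @rows.fetch(row, {})
  end

  # Bind a friendly command name to its internal method, wrapping the
  # result with a formatter so every caller gets the same output.
  def self.add_shell_command(command)
    define_method(command) do |*args|
      result = send("_#{command}_internal", *args)
      format_result(command, result)
    end
  end

  def format_result(command, result)
    "#{command} on '#{name}': #{result.inspect}"
  end
end

# Done once, at load time, for every table command.
%w[put get].each { |cmd| SketchTable.add_shell_command(cmd) }

t = SketchTable.new('t1')
t.put 'x', 'f1', 'v'
puts t.get('x')
```

The user only ever sees `put` and `get` when tab-completing, while the underscore-prefixed methods stay out of the way.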

The admin commands on a table are a little bit different than the table commands. Since the admin creates a new HTable each time it modifies a table, we only need to ensure that we pass in the name of the current table. Implementation-wise, this means we have a static callback to the shell that takes the name of the current table and is bound at load time via a list of strings in the table. This allows us to keep track of which admin commands we are binding to a table, but keeps all the implementation details out of the table class (in the same way that the HBaseAdmin doesn’t have code in the HTable, except for all the late binding sugar).

All of these changes were enabled by the highly dynamic nature of Ruby. You can dynamically bind methods at run-time to classes - the ability to reopen a class and modify it. Further, since everything in Ruby is a message, we can also bind via strings in the method table for a class. Pretty cool stuff!
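To make the reopen-and-bind idea concrete, here is a small sketch of how admin commands could be attached to a table reference from a list of strings. The class names (`SketchAdmin`, `SketchTableRef`) and `add_admin_command` are made up for illustration; the real shell code is organized differently.

```ruby
# Hypothetical admin object: every command takes the table name first.
class SketchAdmin
  def disable(table_name)
    "disabled #{table_name}"
  end

  def drop(table_name)
    "dropped #{table_name}"
  end
end

class SketchTableRef
  ADMIN = SketchAdmin.new

  def initialize(name)
    @name = name
  end
end

# Reopen the class later (e.g. when a command file is loaded) and bind
# admin commands from a plain list of strings.
class SketchTableRef
  @admin_commands = []

  class << self
    attr_reader :admin_commands

    def add_admin_command(command)
      @admin_commands << command
      define_method(command) do |*args|
        # Only the current table's name is handed to the admin.
        SketchTableRef::ADMIN.public_send(command, @name, *args)
      end
    end
  end
end

%w[disable drop].each { |cmd| SketchTableRef.add_admin_command(cmd) }

t = SketchTableRef.new('t1')
puts t.disable
puts t.drop
p SketchTableRef.admin_commands
```

The table class never learns how `disable` or `drop` work; it only keeps the list of names it has been asked to forward.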

If you are interested in taking a look at how we do this in HBase you can look at [shell.rb] (https://github.com/apache/hbase/blob/trunk/src/main/ruby/shell.rb) and the class methods in [table.rb] (https://github.com/apache/hbase/blob/trunk/src/main/ruby/hbase/table.rb).

Caveats

Currently a table reference does not support many of the admin commands on a table, things like ‘truncate’ or ‘alter’. This is still a pretty new feature, so as people require/want this functionality, they can add it (yay open source!); recently [HBASE-5921] (https://issues.apache.org/jira/browse/HBASE-5921) was filed around this issue. It’s a pretty easy fix and hopefully I’ll get around to it next week. Further, the per-table security features are not yet supported either - it’s likely they will be as the security code becomes more widely used, but that remains to be seen.

Heads Down, Thumbs Up

2012-05-01T00:00:00+00:00

If you have been following this blog at all, you have probably left by now. But for those of you still around (and those new!) here’s a little explanation as to the recent gap in posts.

At the end of January 2012 I started working at Salesforce.com on the team dedicated to bringing HBase to one of the world’s foremost CRM software providers. Despite being a large company, they are still incredibly agile. Further, I get to spend almost every day working on a project (HBase) that I used to work on for fun. On top of that my team is top-notch. And we are running real code, that impacts real people and makes a difference every day.

Some people say they have great jobs, but I have the best job (and snacks!). Here is the view from my window:

This post was not meant to be an explicit plug for Salesforce.com, but we are hiring.

Also, in the meantime, I’ve picked up my marathon training (following this great book: [Run Less, Run Faster] (http://www.amazon.com/Runners-World-Less-Faster-Revolutionary/dp/159486649X)) and am doing 30-40 miles of running a week with cross-training via cycling (fixie or road bike, depending on the day) and yoga at [Planet Granite] (http://www.planetgranite.com); I’m also starting to get back into climbing now that my finger has healed from a trip to Bishop, CA earlier this month and a two year long fight with tendonitis. This isn’t an excuse, merely the reasons I’ve been busier of late.

Oh, and I also just moved into a new apartment in Lower Haight with a few great guys. Tons of space, peek-a-boo view of downtown, good landlord and quiet neighbors; pretty much what everyone is looking for in an apartment (and crazy low rent). This means I’ve been spending an exorbitant amount of time doing things for my apartment; it’s mostly little things or massive Ikea runs, but time sinks nonetheless.

And in the rest of my free time I’ve been working as much as I can on HBase. It’s been a couple months, but the code is just starting to come together on a lot of these Jira issues. For those interested, here are the main things I’ve been working on:

[HBase-50] (https://issues.apache.org/jira/browse/HBASE-50) - Snapshots (also the longest standing ticket in HBase)
[HBase-5547] (https://issues.apache.org/jira/browse/HBASE-5547) - Get a reference to a table in the shell
[HBase-4336] (https://issues.apache.org/jira/browse/HBASE-4336) - Convert HBase into maven modules
[HBase-5548] (https://issues.apache.org/jira/browse/HBASE-5548) - Don’t delete HFiles in backup mode

So it’s been a bit busy. Sorry. I’m going to try and be better about it. There is a list of things I’ve been wanting to talk about, but I just haven’t gotten around to them yet.

Here is what I have planned (I’ll attempt to remember to add links here as they are written):

using table references in HBase shell
tips and tricks for using maven in multi-module projects
philosophical discussion on why money is not an indicator of a person’s value, but seems to be all we have
use and the design of hfile backups - code for this is nearly done, and the only gating factor in my writing
using HBase snapshotting (and a second post on the architecture) - have to wait until the code is done for this

Definitely some stuff to look forward to, you know, if you’re into that kind of thing.

HBase Eclipse Support

2012-02-03T00:00:00+00:00

(WARNING: I’m going to preach a little bit about IDEs. If you don’t care, skip on down to the HBase changes). Using a ‘real’ IDE is becoming increasingly popular among developers today - it’s come to the point that a large factor in considering language maturity is its IDE support. In particular, it’s not uncommon to hear people ask, “Is there an Eclipse plugin for that?”

Now, there are many ‘purists’ who would argue that using an IDE makes for worse developers. I won’t dispute that it can make for lazy developers who are tied to their tools, but truly ‘worse’ is really extreme. The reality of the situation is that these tools are really powerful and open up development to people who are not command line wizards, to whom vim or emacs is an incomprehensible jumble (in fact, universities are only reinforcing this attitude, rather than really teaching kids these days how to really use the terminal).

The new argument has now become IntelliJ vs XCode vs Eclipse, not command line vs IDE. This really is the reality now, and we are likely to progress more towards a GUI interface, though any real work will probably always be done with text, not shiny blocks and lines (which is why LabVIEW is an absolute kludge, at least for any ‘real’ coder). The power of the IDE really comes from the fact that it lets you work on multiple levels at once - you can see the line you are writing, but at the same time, have visual cues as to how it fits into the rest of the project structure as well as the rest of the file. This is incredibly powerful as it lets you keep more in your head than before, which anyone who works on code must see as a benefit. And this doesn’t even take into account all the power of the refactorings, searching, hot-links, auto-building, etc. that these environments provide.

On the counterpoint, a lot of IDEs can be a huge pain in the a$$, making it seem like your build is working when it’s not (which is why things like maven from the command line must be considered the final source of truth) or not finding classes when you do things outside the IDE (like running ‘mvn clean’). So yeah, they aren’t perfect, but they are getting better all the time and at an increasingly rapid pace as they see increasing adoption. This isn’t anything new, so get used to it folks.

So yes, you could do all the fancy stuff that an IDE can do from emacs, but let’s face it, most people don’t care to put all the time in to learn all the tricks of emacs, write their own modules, etc. They just want it to work. And for a first timer to a project, this is really important to ensure they want to work on it. By lowering the barriers to entry, we make it easier for more people to become involved, which makes the project better. At least in open source.

##HBase Change

For a long time, HBase has technically ‘supported’ Eclipse and even provides instructions on how to get it up and running. However, it usually takes a lot of ‘jiggering’ and then doing a bunch more ‘rejiggering’ to make it actually compile. And then if you run any maven commands on the build, well, you are probably going to redo a lot of the ‘rejiggering’ (and because it’s a GUI, scripting is a pain - point for the command liners).

This all changed with the release of Eclipse Indigo with inclusion of m2eclipse. For those of you who don’t know, m2eclipse is the best plugin for Eclipse to integrate with maven. It automatically pulls in maven dependencies, supports pom modification and can do full maven builds from within the IDE, along with a bunch of other nice utilities; pretty sweet overall. By rolling it into the ‘official’ java developer release of Eclipse, m2eclipse has gotten much better support and integration with Eclipse.

In the most recent upgrade, m2eclipse added the idea of a ‘connector’ to handle ‘interesting’ lifecycle events that previously would cause the borked project problems in Eclipse. Connectors take care of making sure Eclipse is aware of these lifecycle events and also make sure there are no leaked classloaders, no modification of random files inside the workspace and no nasty exceptions to fail the build.

Sounds great, right?

It is, except not all the maven plugins you could want to use have been updated to support it. Would have been nice if the m2eclipse plugin could just run without changing any existing files, huh? Anyways, back in HBase….

In particular, HBase has three plugins that do some pretty important stuff - maven-antrun-plugin, avro-maven-plugin, and jamon-maven-plugin - and that, unfortunately, are not supported by m2eclipse. m2eclipse gets tripped up in the lifecycles of these plugins (even though they bind to pretty standard goals) and just throws its hands up.

Usually this wouldn’t be a problem, but since m2eclipse is built into Eclipse, it means you can’t even get Eclipse to recognize it as a project you can build, so you get these spurious error messages, which prevent you from easily doing certain development within the IDE. Lame, right? And because HBase is open source, I wanted to make it as easy as possible for new people to get up and running. Since Indigo has been around for a while, it seemed time to add full support for what is the ‘standard’ java IDE.

Eclipse and m2e were actually nice enough to have a ‘Quick Fix’ for these lifecycle issues: it adds a few lines to the pluginManagement part of the pom for the

	<groupId>org.eclipse.m2e</groupId>
	<artifactId>lifecycle-mapping</artifactId>

plugin. Essentially, it just lists out the plugins that need to be ‘handled’ and then tells m2eclipse what ‘action’ should be taken for each plugin when its phase is used.

By default, it just ignores that plugin phase when it builds in Eclipse. That doesn’t make much sense as the default in most cases though, so frequently you will have to change the action from ‘ignore’ to ‘execute’, which does exactly what you think - allows the plugin to execute when its ‘goal’ is executed.
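For reference, the generated block looks roughly like the following. This is a sketch of the general m2e lifecycle-mapping layout rather than the exact HBase patch; the `versionRange` and the `run` goal shown for maven-antrun-plugin are assumptions for illustration, and the action has already been switched from ‘ignore’ to ‘execute’.

```xml
<pluginManagement>
  <plugins>
    <!-- Only read by m2eclipse; has no effect on the command line build. -->
    <plugin>
      <groupId>org.eclipse.m2e</groupId>
      <artifactId>lifecycle-mapping</artifactId>
      <version>1.0.0</version>
      <configuration>
        <lifecycleMappingMetadata>
          <pluginExecutions>
            <pluginExecution>
              <!-- Which plugin execution this mapping applies to (assumed values). -->
              <pluginExecutionFilter>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-antrun-plugin</artifactId>
                <versionRange>[1.6,)</versionRange>
                <goals>
                  <goal>run</goal>
                </goals>
              </pluginExecutionFilter>
              <!-- What m2eclipse should do: <ignore /> or <execute />. -->
              <action>
                <execute />
              </action>
            </pluginExecution>
          </pluginExecutions>
        </lifecycleMappingMetadata>
      </configuration>
    </plugin>
  </plugins>
</pluginManagement>
```

One `pluginExecution` entry is needed per plugin that m2eclipse cannot handle on its own.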

This modification is actually very nicely encapsulated - it doesn’t affect any of the other plugin definitions and it doesn’t actually need to be run at any point; it just acts as a heads up to m2eclipse. It is unfortunate that the pom does need to be modified, as I don’t think it would be that hard to dump this stuff into the Eclipse project properties file, but in the end it is not a big deal as Indigo has become the de facto standard. Maybe in the future they will make it a little easier and then we can rip the code back out of HBase.

For the actual code used for the patch, check out HBASE-5318. If you are interested in the full details of how to deal with m2eclipse lifecycle stuff, check out the official documentation.

Building Big

2012-01-02T00:00:00+00:00

A lot of this post is based on my recent discussions with a few companies - both big and small - who are attempting to ‘change the paradigm’ either of society in general, their industry, or just their company.

Turns out what they are really doing is moving to a cloud infrastructure which all of a sudden is enabling a huge amount of innovation (this stuff is pretty damn exciting and is really changing how we interact with the world). But in a world dominated by SQL and big iron, it is also changing how people think about building these systems.

And it turns out once you start throwing in tons of moving parts, it gets hard.

And then you have to make them go fast.

And then, once you have all of that, you need to make sure multiple products within your company can actually use this stuff.

I’ll get to all of them in turn. But let’s start with just figuring how all this stuff goes together.

Make sure you have a problem

I can’t stress this enough. It’s an epidemic among engineers that we build these massive systems that no one needs or cares about. And they end up sucking. Not having a user segment means that your stuff will never be useful to anyone. That may be fine for personal side projects. However, for systems that a company is spending money on, the ‘customer’ - in this case the internal devs using the platform - needs to be involved from the start.

Any time you are building something new, you need to really make sure that you need what you are building. Here is where Eric Ries’s The Lean Startup makes a ton of sense - talk to people who are going to use your backend, figure out what their problems are - not what they want to do with your technology. It’s about building in that feedback loop from the beginning. This also works really well with traditional agile methodologies - get feedback from the ‘client’ and iterate based on their responses.

This is all about eliminating waste - only build things that are really necessary; everything else is just stoking your ego.

Building from scratch

If you start by designing a massive system and then figure out how to mold that design to your problem, it will only generate a bloated architecture that half-solves your problem. You want to keep in mind the features the system needs as a way to avoid designing yourself into a corner, but the goal needs to be to figure out what you need _first_, then how to build it. We are at the point now where basically anything you can think up can be built - the hard part is figuring out what you want to build (and if you should).

First, start with the dumb solution - it doesn’t scale and it has single points of failure, but it takes care of the functionality you need; it works. Then start removing/replacing things following the principles below and you will end up with a scalable system that actually solves the problem. From there, you can refine for certain properties - scale, fault tolerance, security, etc. The beauty here is that you can literally scrap the entire design so far if it doesn’t conform to what you need (and you shouldn’t be afraid to erase everything) since it is the swipe of an eraser, rather than huge swaths of code.

###Design it right the first time (or scrap it and try again)

The design here is iterating up from that first, naive solution - you have something that works, but doesn’t really scale. If you are a startup, then what works here might actually be a physical system, but if you are expecting to scale or have a bit more time, it behoves you to start iterating on that design immediately. Otherwise, you are liable to go through a pain period very soon when things start falling over and you have to rip out everything or limp along on a legacy system that can’t really do what you need.

What you really need to do is design up from the ground, before a line of code is written, for it to do the right thing. This is really easy now as we can iterate the naive solution for the properties we need. In working through the design, you might run into situations where the dumb solution’s assumptions break at scale and you need to redesign. However, you just go back to the start and redo each property from a revised starting point. This new start may be a little more complex than the original, but it will scale better later. The far worse case is trying to take a system beyond its original constraints - it only leads to lots of duct tape, dirty hacks and heartache. Oh yeah, and you will probably have to rip it out in the end.

Let’s take a look at an example that has seen some controversy recently: MongoDB. I’ll readily admit that there are a lot of benefits to the system - it’s wicked easy to use, it integrates easily, it’s schemaless and as expressive as necessary, and it handles a lot of annoying things for you.

However, what Mongo isn’t is a cloud-scale database. From the start, it was designed as a single-server NoSQL database (don’t get me started on why NoSQL != cloud), with sharding and ‘scalability’ then added on later. This led to data loss in some cases and a set of really painful ‘features’ - lack of good monitoring, manual restarts and re-partitioning on failure, collections don’t shard, etc. These added properties were not designed into the system from the start - they break the original paradigm and really need a system redesign to be done ‘properly’.

Another case is adding security into Hadoop - a freaking mess. On the flip side is security and scale in Accumulo; it had two basic goals and it does those two things pretty well (not saying anything about the quality of the code…).

Once you start accumulating enough indicators that it has become a legacy system, it’s time to consider doing a massive rewrite. What you are pivoting the system on is no longer solid, but instead has become an unstable bog that doesn’t quite do what you need and now also sucks at its original design goals. So go back, and design it again. You learned some lessons from building it for real, so now you can do it right.

That’s also the value in hiring people who have built big before - they know the problems, they can see the crevasses and can build it right the first time. Unfortunately, these people - the ones who really “get it” - are few and far between in terms of today’s cloud technology.

##General principles

When doing the original design, or a rewrite, there are certain properties you need to make sure are built into the system from the start. Otherwise, it’s going to be a huge pain later or just be duct-taped together. These are the things you need to consider when iterating up from the naive solution.

1. No single points of failure

This has been the bane of the Hadoop stack for years now and mountains of work have gone into making a high-availability namenode. Multiple companies have rolled their own solutions because they realized that if shit goes down and their namenode crashes, so does their business. That can’t happen.

Now consider MapR - they are popular now because they did a great implementation and made it plug-and-play (more or less) for enterprise. This is really closer to the way things need to be; it’s what Oracle has become and part of why Apple is great: it just works. Oh, and it scales.

This means you need multiple pieces running the process, either dynamically parallelizing the work and/or with hot failover. Either way, you need to have a system that can immediately pick up the slack if a component goes down. It all depends on your SLA as to whether a true hot failover is necessary or if you can just cut down polling intervals and implement a dynamic system.

You also need to make sure that if something goes down you get the replacements (either promotion, resharding of work or fail-over) within hard bounds. Eventual consistency is nice here in that you can play fast and loose with these constraints by designing your data to fit the model. However, the things you need to really worry about are availability of the system (recovery happens fast) and the holy grail: data is never lost. This means replicas, flushes to disk, and write-ahead-logs.
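As a toy illustration of the ‘flush to disk plus write-ahead-log’ part, here is a minimal sketch in Ruby. `TinyStore` is a hypothetical single-process example, not how HBase or any real system implements its log, and it ignores replication entirely.

```ruby
require 'json'
require 'tmpdir'

# Minimal write-ahead log: a write is appended and fsynced to disk
# before it is applied in memory, so a crash can be replayed.
class TinyStore
  attr_reader :data

  def initialize(log_path)
    @log_path = log_path
    @data = {}
    replay
    @log = File.open(@log_path, 'a')
  end

  def put(key, value)
    @log.puts(JSON.generate([key, value]))
    @log.flush
    @log.fsync # force the entry to disk before acknowledging the write
    @data[key] = value
  end

  def close
    @log.close
  end

  private

  # Rebuild the in-memory state from whatever made it into the log.
  def replay
    return unless File.exist?(@log_path)

    File.foreach(@log_path) do |line|
      key, value = JSON.parse(line)
      @data[key] = value
    end
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'wal.log')
  store = TinyStore.new(path)
  store.put('row1', 'a')
  store.put('row2', 'b')
  store.close # pretend the process died here

  recovered = TinyStore.new(path)
  puts recovered.data.length
  recovered.close
end
```

The ordering is the point: the log entry is durable before the in-memory state changes, so anything acknowledged can be recovered by replaying the log.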

###2. Parallelize everything

This is highly complementary to not having a single point of failure - on this side of the coin it’s really about making sure your producers and consumers match up. What this parallelization doesn’t mean is just throwing a bunch of machines at the problem - if you don’t design your system to handle that, you are going to just waste money and resources.

This generally means randomizing your key space (hashes work well for this), matching up producer and consumer machines, and leveraging ‘server-side’ resources (part of why HBase’s coprocessors are so freaking awesome). Essentially you want to get to the point where you have one writer talking to one consumer (+/- a margin of acceptable parallelization on the consumer). Think of one ingest client talking to one data server, with a margin of parallelization of regions per server. If you design your key space correctly, then it is pretty reasonable to expect this kind of behavior. This also means you will probably (nearly definitely) need to do some indexing to avoid doing full table scans (this time a one client to many server situation - which is also going to kill you). The key here is that we can leverage the fact that storage is cheap, so replicate data as much as you need to avoid locking. Here things like Culvert are really nice as they work in-system, scale, and are very flexible to accommodate variable indexing and data schemas.
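A quick sketch of what ‘randomizing your key space’ with a hash can look like. The bucket count, the key names and the `salted_key` format are all assumptions for illustration; real key design depends on your access patterns.

```ruby
require 'digest/md5'

# Prefix each key with a short hash-derived salt so that sequential
# keys (timestamps, counters) spread across buckets instead of all
# landing on one server. BUCKETS is an assumed number of servers.
BUCKETS = 4

def bucket_for(key)
  Digest::MD5.hexdigest(key)[0, 8].to_i(16) % BUCKETS
end

def salted_key(key)
  format('%02d-%s', bucket_for(key), key)
end

keys = (1..1000).map { |i| "event-#{i}" }
counts = keys.group_by { |k| bucket_for(k) }.transform_values(&:size)
counts.sort.each { |bucket, n| puts "bucket #{bucket}: #{n} keys" }
puts salted_key('event-1')
```

The cost is that neighboring keys are no longer stored next to each other, which is part of why you end up needing an index rather than relying on scans.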

But these are really the basic things - the final answer on how to do this well is, “It depends.” It depends on your system requirements. It depends on the kind of data you are storing. It depends on your access patterns.

In the end, all of these dependencies will lead you to a set of requirements. It’s pretty likely that someone else has come up with these same requirements before (we are not all unique snowflakes) and written something to handle them (give or take). Oh and it’s probably open source. It is very easy to think you need to build a tool, but don’t build in-house if you don’t have to; if you do, know why you are building it.

###3. Make scaling easy (and separate services)

This is a little more subtle as it relies on the fact that you already have 1 and 2. Consider the way Amazon is set up ¹: everything is based on APIs between different internal products - even if it isn’t exposed as a user service, you have to act like it is. This is great because you can do a lot of things: innovate independently of other pieces, ease new developers into the company as they only need to learn their one portion (rather than one monolithic system) and, if you want to make any portion run fast, just put more machines behind that facade.

On the downside, cascading failures can be incredibly hard to debug. Further, there is no tight integration between pieces, which can easily lead to a fragmented technology stack that could run way faster, if only people could work more closely along the stack (the very thing Jobs was trying to avoid at Apple). One of the worst parts of this setup (and this is true of any system based on APIs) is that the proliferation of tools within a company can start to become overwhelming - people need to have seminars and brown bags on the tools available in their own company.

Wait, what? Sounds like it’s time to simplify - software is built on abstractions, so what happened to get away from that? You really only need a couple types of backend: traditional, in-memory, cloud scalable; within that you can split on what 2 parts of the CAP Theorem you are covering. That’s something like six different types of database you might need to maintain, but in reality it will be closer to two or three. However, above that you really should be more tightly integrating products - middle-ware up to frontend.

Yes, there will be some overlap, but then your architects meet regularly and talk about what they are doing (right?). And that’s why you have people rotate teams, sharing best practices.

I personally like the idea that you have three independent teams build a tool with the same basic requirements. They will probably come up with different architectures, each with their own benefits. However, once you have three different use cases and examples, then you have a chance to really build the system with proper abstraction and using the knowledge gained from trying it three different ways. I think this is exactly what Apple does internally.

What we end up with then is a trend upward in services that are tightly coupled, but general enough that people can build new things on them. Combined with a culture of collaboration, this leads to new tools built with the understanding that it may, one day, be scrapped and rebuilt the ‘right way’.

4. Do the right thing

A recent methodology I like is following a server-based approach using APIs as a guideline. Essentially, you can end up with a set of servers that will respond to a given set of APIs, and you don’t worry about how they do it on the backend - you could even make them part of the bigger cluster so you can share basic admin costs (nudge, nudge, wink, wink)!
This means each product claims ownership (to a degree) over all the tools to build their product. This gives you a vertical integration from the bare metal up to the product/service - that leads to great products.

This type of setup starts to get really bad-ass when you can start doing automated deployments over a shared cluster, using things like Chef and Mesos combined with some automated cluster load monitoring. All of a sudden you can roll out pieces of your backend as you need them, have them humming along and configured correctly right away, and, if you design correctly, ensure linear(ish) scalability.

You still need the guys who make the services run, but they can (and should) work with multiple teams. That has the added advantage of making the tools more generalized (so new products can just be bolted on top) and also makes the systems more accessible and bomb proof.

Horizontal-Vertical integration

You can think of this approach as the horizontal-vertical integration. Horizontally across the organization for each service, vertically within each product or service.

On the low level side of things, this means you have to be able to add machines on the fly to handle load. If you do this correctly, failure recovery comes pretty much for free.

For example, you have a bunch of pollers reading from a queue. Well, make the key space each poller needs to cover dynamically assigned, and combine that with acks to make sure messages aren’t lost. All of a sudden, suppose a ton more messages come in - add more pollers. On the other hand, suppose a poller goes down; pollers will be notified of a key-space modification and can immediately repurpose and pick up this keyspace change. Kinda like how the Dynamo-like systems handle their key-space. If you keep the pub-sub ratio close to one-to-one (or one-to-more) you can get some really blazing fast systems. I find using ZooKeeper to handle these notifications and monitoring works really well - it scales as much as you need, is pretty freaking reliable (hundreds of days of uptime for a given machine in the cluster are not unusual), handles all the heartbeating for you and is open source.
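The reassignment logic can be sketched in a few lines of Ruby. This is a toy: the slice count and poller names are made up, it simply recomputes a round-robin split whenever membership changes, and the notification part (ZooKeeper in my case) and the acks are left out.

```ruby
# Toy coordinator: splits a fixed set of key-space slices across the
# pollers that are currently alive. SLICES and the poller names are
# made up for illustration.
SLICES = (0...12).to_a

def assign(pollers)
  assignment = Hash.new { |h, k| h[k] = [] }
  SLICES.each_with_index do |slice, i|
    assignment[pollers[i % pollers.size]] << slice
  end
  assignment
end

pollers = %w[p1 p2 p3]
p assign(pollers)

# Load spike: add a poller and everyone gets a smaller share.
p assign(pollers + ['p4'])

# p2 dies: the survivors are notified and pick up its slices.
p assign(pollers - ['p2'])
```

Every slice always has exactly one owner, which is what makes failure recovery fall out of the same mechanism as scaling. A round-robin split moves a lot of slices on each change; a consistent-hashing scheme, as in the Dynamo-like systems, would move fewer.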

Running at scale

If you don’t run whatever you are building at scale, you are building a toy. Unless you are running at scale, no one gives a damn about what you build, what great design patterns you are using, what language it’s written in, or how hard you worked on the system.

This means getting your stuff into production as fast as possible (again, following Ries’s advice) because it will show you where your shit breaks. This is a good thing, because fixing those pain points will make it useful and fast.

Most platform people (people in the company using your backend) don’t really care how you are storing the data, just that you don’t lose their data and meet certain criteria. Yeah, they may be excited to work with NoSQL, but most people can’t handle the complexity. Because you know what? This shit is hard - running at scale is not the same as running on one big-ass server.

Right now we need to have some really smart developers managing our clusters, making sure we have great key design, that data gets aged off appropriately, and writing specialized Map/Reduce jobs to clean up our tables.

This is crazy.

This stuff should be dead simple. We need simpler abstractions on top of the really scalable stuff. Developers shouldn’t be running the servers, but focusing on building on those abstractions. This is a big part of why traditional SQL-based databases became so popular - they could be run by just a DBA, rather than database software engineers (and at least one order of magnitude cheaper).

Key-Value stores are not a natural way to think about things, so people struggle with low level things like key design, rather than what their stuff actually does (Disclaimer: I enjoy optimizing key design and think it is really important for getting high performance applications. But I also think that most people shouldn’t have to worry about it).

What we need is a couple layers of abstraction - this idea of only having a few types of databases, of periodically killing off a group of tools to rebuild it the right way, and tight communication between teams to avoid massive overlap.

If we can build simple tools for services, that are built into the technology stacks for each product, you end up with very clean designs which make it easy to modify and understand the system. Simple, intuitive interfaces will then tend to generate more innovation. If you build the bottom of these stacks with basic scale in mind, then scaling each new product becomes a matter of flipping a switch rather than doing a massive redesign.

¹ Stevey’s Google Platforms rant - https://gist.github.com/1281299

High Tech Colors

2011-12-10T00:00:00+00:00

“…Sony’s high-tech look, which is gunmetal gray, maybe … black”

~ Steve Jobs, 1982, in "Steve Jobs" by Walter Isaacson

This is the point where Jobs fell out of love with the style of cool, dark colors and moved more in favor of the style of many of Apple’s current products - the shiny, the white, the colorful, the soft-rounded corners. Granted, he had been moving towards this for years; just read about his designs for the Apple II - it’s all molded plastic and beige.

Most of the Apple products do hold to 1982 Steve’s vision of what things should look like - soft, inviting, and usually white. Take the entry level Macs, the iPod, and the ‘alternative’ iPhone (but who really gets the white one?). It’s interesting to note that we really haven’t gotten away from that gunmetal grey as ‘geek chic’. All the really ‘hardcore’ products are still holding to that old aesthetic.

Just look at the MacBook Pro. Gunmetal grey - no other choices.

Or the ‘nicer’ version of the generic MacBook. You want black? Well, it will cost you a RAM upgrade.

This also has implications for the cell phone market. By far, the black iPhone outsells the white. But people keep buying all these different cases to make their phone individual.

But they are still intimidating devices. My dad said, “I need the iPhone for Dummies book - I’m still never going to use even 10% of the functionality of this phone.” WHAT!? As a technologist, this was very shocking and frustrating to hear - there is so much cool stuff and it can make your life easier and and and. But it’s black, so in the back of his head he is thinking he is part of the cutting edge. And it’s cool to be out there (or at least think that you are).

So maybe that’s Apple’s whole deal - the ‘archaic’ coloring to make it seem cool and sexy and advanced. Grandma wants soft and white, but all the cool kids are using the industrial steel. That’s still true to a degree, but they are doing it from scratch. It’s like shopping at Hot Topic (don’t get me started), but hey, it seems to work.

Don’t get me wrong: I’m writing this on a MacBook Pro. I have the latest iPhone (though I was 2.5 years behind before upgrading). I run with an iPod. I don’t count myself a fanboy, but my wallet says otherwise. But you know what? Apple makes products that just work. Can’t say that about Windows. Can’t say that about anything Linux.

Working from the command line makes you strong; I highly recommend spending a couple years running Linux at home and tweaking it to get it just right for you. It’s ‘interesting’. After putting in my time though, I’m happy to let Apple make a bunch of choices for me as long as I still have a (mostly) working terminal.

To paraphrase Winston Churchill, Apple is the worst form of product except for all those other forms that have been tried from time to time.

So black and gunmetal are still cool, but it seems only because we have this view of the future from 20 years ago. Maybe Steve was right and everything is going to be shiny and white, we’ll just have to wait and see.

In the meantime, where are my flying cars?

This may read like a plug for Isaacson’s book, but it really is that good. I’m three hours into it at the end of a day when I wanted to be asleep by 8:30, and this is the first time I’ve put it down (it’s nearly 1am). Go out and get Steve Jobs. Read it. It’s worth it.

NOTE that I’m not advocating much of his personal life, but that man had a serious amount of vision, taste and overall a burning drive to succeed. These are what you should take away. Not that you should insult your coworkers, be arrogant and a general ass. 99% of the time, that just ruins relationships, the product and the company.

Unless you are that 1% of brilliance - here’s to you Steve.

Technical Leadership

2011-12-02T00:00:00+00:00

Yesterday I was talking with a potential future boss about what I was looking for in my next position. One of the things I mentioned was that I wanted a technical leadership role. Now, someone already in management who has been around for a while will naturally be a little suspicious - as well he should be; the very first thing he asked me was, “So why do you want to be a lead? Because you then have power to boss people around?”

Don’t trust someone who wants power until you understand the why. I’ve found the best leaders are rarely leaders by choice, but rather because there is no one else better around. Being a leader is stressful, hard, and often times filled with minutiae not immediately related to the project. In short, it can really suck.

So why in the world would people want to be a leader? When it comes down to it, it really means control. Control over the vision, the execution and the final product. It is the chance to fix the problems that you have seen in previous projects. It’s the chance to nurture your developers and watch them grow. It’s about getting to that feeling of project ‘flow’ and knowing you enabled it. But in the end, it is the chance to build something bigger than you can alone - having a vision and building something impactful from scratch.

You can be a technical leader and embrace all the pain and the joy of building great software or you can just technically be the leader and with some certainty just crank out crap and make everyone (including you!) unhappy. Just think about all the people who build great software - you have to jump in head first or you are going to crash and burn.

Coming from being a software developer, assuming technical leadership really is just a higher level abstraction - you have to think about the whole software ecosystem: the people building it, the architecture, the clients, the goals. Which means yes, you have to have authority to ‘boss people around’. Otherwise, you cannot orchestrate (and I chose that word intentionally) the entire system to succeed, to reach its full potential. If you don’t have any de jure power, it can take years to build up enough rapport and cachet to build great things. Sometimes you don’t have years - it needs to get done now. In that case, having a boss hand down authority certainly helps. Yes, it doesn’t work if the people under you don’t respect you and believe in you, but those are basic qualities of any good team. Once the leader-follower dynamics are established, it becomes much easier to actually guide the work, rather than wringing your hands and waiting. (Full disclosure, some of this comes from experience having to lead a team from an unofficial position - it’s a very nasty situation which leads to lots of frustration and roadblocks. Having the dynamic established is critical to moving quickly.)

But with great power comes great responsibility.

That responsibility is to build a great product. What does that mean? The first is responsibility to your team. It becomes your job to enable them to succeed. That entails a multitude of things: providing cover from “bs” work, making sure people have work to do, ensuring the project is synchronized, ensuring that the right people are working on the right things, and ensuring that all their hard work gets communicated back up to management. Making it easy for your team to succeed makes it easier for them to build something great. This also means helping where necessary and (this is harder) stepping back and letting the really smart people just work.

If everything is good, the team should feel like it’s floating; it’s just easy. It just works. There are no major hitches. Everything is clicking with the developers. Work is getting done on sprint boundaries (+/- technical difficulties). But this is a hard thing to get to, and there are a lot of tools and books out there to help leaders just let developers write code (agile, xp, cms, etc.), so clearly this isn’t a solved problem. Hey, writing software is a craft not engineering (nor did anyone say it was easy).

There are risks in any project and things can fall behind because of unexpected delays. If that’s the case, you shouldn’t crack the whip - that will only break spirits over time, though it may get this project done - but instead make sure everyone (the team, management, the client) understands what’s going on and why this happened. It’s all about communication, about making sure everyone knows what’s happening with the project. If it’s behind because people are slacking then, by all means, start raising some hell (politely, of course), but that is not the general recourse.

The worst thing that could happen is people thinking that everything is hunky dory, when in fact things are going off the rails. This leads to people being angry, a shoddy product, an overworked team, or some combination of the previous.

As a lead, your main job is really to facilitate communication. Make sure the developers are talking and coordinated. Make sure the client knows what’s happening. Make sure upper management is apprised of the project status. No surprises. This is your moat - the first line of defense against failure.

In this, however, you have to be the conduit between the developers and the client and the management. If the managers start bothering the developers or the client pesters them all the time, they won’t be able to do what they do best - write code. This often means taking on the boring tasks like writing PowerPoint slides, sitting through long meetings, and producing extra-extra documentation (though the severity of all this depends on the size of the company; at smaller companies much of this pain will be gone). This doesn’t mean the developers shouldn’t be able to talk to the managers or client if needed, but that they should do it only when they need to - not right in the middle of working through some really tough, gnarly code that requires an hour to even get in the right frame of mind to work on (or more concisely, needs ‘flow’). By providing that buffer, you can keep the developers happy, which keeps everyone else up the stack happy too.

However, there are a couple things people can encounter to make it seem like they are slacking:

black holes - a piece of the architecture that no one knows that much about, requires some investigation to complete, and is important to the system
wormholes - even worse than black holes, wormholes take you out of the product you are working on and two or three layers deep (I hope it’s not more) into the internals of other projects, to fix a bug in their code.

Black holes are obviously bad - they can be a huge time sink and lead to schedules longer than the amount of time in the universe (see Dreaming in Code). Wormholes are a term I recently started using after spending about 2 weeks finding a workaround for a bug that should have been fixable in hours, in the process uncovering bugs in another component of the system as well as two crucial ones in the database we were using.

In cases like this, it is important to try to mitigate those risks by lending as much guidance as possible. As the tech lead, you need to have seen lots of problems, worked through a bunch of them yourself, know when to tear it out and start again, and (possibly most importantly) know when to file a bug and move on. If you don’t mitigate these risks, the whole project starts to fail - people get frustrated, the excitement goes away - you quickly go from floating to falling.

At the same time, it’s likely you have some junior developers on the team (you do, right? if not, get some - what happens when the senior guys leave?). If so, what may be a black hole for them is really simple for one of the experienced developers. It’s all about figuring out how much help people need and if you or another team member need to step in for support. Start with small questions (any problems today?), and then if things seem fishy start to escalate until you get to the root of the problem (it’s all just communication!). At that point, maybe you step in and more closely monitor their work or maybe do some pair programming (or set some up with an experienced member) or ‘realign’ them to something more suited to their skills. It’s surprisingly frequent to find that people are just tasked with the wrong thing; they may seem completely incompetent because it doesn’t match the way they think, but if you find the right thing they can just fly. So there are a couple different things to think about there, and a couple different options for help - if the team is open enough, the problems become apparent quickly, making the solutions that much easier.

So we’ve talked about the responsibilities (and there is no lack of them). Now let’s talk about why you would even want to be a technical lead. At least for me, it’s all about what I was talking about originally - building something bigger than you could alone. Big doesn’t just mean a lot of lines of code; there are implications for the complexity and the usefulness of the software at the end. Honestly, who doesn’t want to engineer something that no one else has ever done that makes a huge impact in the world? In fact, that’s the premise of most startups.

In building something big, as a leader providing the vision is as important as shaping the product around the vision and the vision to the product. Chances are, what you intended to build in the beginning isn’t really what you have in the end - it may be close, but as you build it, you learn new things - what works, what doesn’t, if it has impact, etc. If you don’t adapt to what you have learned, you are shooting yourself in the foot and will probably end up with a pile of junk. That’s not to say you should radically change the plan every week (though 180s are necessary from time to time), but that change needs to be tempered.

In the end, it’s all about taking that vision you have, getting people excited about that idea, and building your own software castle in the sky.

To recap, if you want to lead well (at least as far as devs are concerned), you just need to provide:

vision and direction for the project
guidance when people need it
clear channels of communication (and ensure they are used)
cover so your developers can do what they do best.

Yeah, it’s not an easy job - lots of stress from above and below, worries about schedules, concerns over providing direction and leadership, and all in hopes of turning your dream into reality. So go on and build your castles - revel in the pain, the work, and take some time to enjoy what you’ve accomplished…

Then I’m moving onto the next big thing because for me the joy is in building, not sightseeing.

Vagrant + Chef - Tips and Tricks

2011-11-26T00:00:00+00:00

There is a good chance you got here looking for a solution to some finicky problems with vagrant and/or chef. Congrats on using some cool tools! If not, I’m going to introduce the tech I’m talking about and why you should even care. So if you ‘get it’ already, then skip on down to the interesting features discussion - some caveats and tricks you might need to keep in mind when working with vagrant + chef.

Chef

Imagine that you are expanding your Hadoop cluster and want to add another data node. Well, that means you are going to have to install Java, download the correct Hadoop version, update the configuration files, and start up the datanode process. That can be a bit of a pain for your sys-admins, though with practice you can probably get it down to 5-10 minutes of scping and fiddling. What if you could do it with a push of a button and know it’s exactly the same as every other datanode in your cluster?

Or what if you need to replace your Job Tracker? Push a button.

Or your web server goes down? Push a button.

If you haven’t heard of it by now, Chef is one of the easiest ways (feel free to rant about the qualities of CFEngine or Puppet in the comments) to manage the configuration of your computers, particularly in a cluster (though there are certain places where they really aren’t ‘cluster aware’, see Ambari). There are a bunch of tools out there to manage this, but they all follow the same basic idea.

You store the configuration of each ‘role’ on some remote server. Then when you fire up a new machine, you point it at that server, it downloads the configuration it ‘should’ look like, and then tries to build itself up to that configuration. If everything goes as planned, then your machine looks like the configuration you specified. Every time. What if you want to update that configuration? Just do it in one central location and then push it out to the necessary receivers. All of a sudden, instead of doing tons of parallel-sshing and manually setting things up (or home rolling your own scripts) you can do it all in this one tool.

Yes, there is pain in learning the system. And yes, there is pain in writing the recipes. However, do it once and then you have it, potentially, for years.

It’s worth it - trust me. If you don’t, then do the math - still worth the time to learn the system and write your node recipes.
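To make the "build itself up to that configuration" idea concrete, here is a minimal sketch in plain Ruby. It is a toy model, not Chef's actual API: the `plan` and `converge` helpers, the role hash, and the version strings are all made up for illustration. The point it tries to show is that running the same configuration a second time has nothing left to do.

```ruby
# Toy model of configuration convergence (NOT Chef's real API).
# desired: what the central server says a role should look like.
# current: what is actually installed on the machine right now.

# Return the list of actions needed to bring `current` up to `desired`.
def plan(desired, current)
  desired.each_with_object([]) do |(name, version), actions|
    if !current.key?(name)
      actions << [:install, name, version]
    elsif current[name] != version
      actions << [:upgrade, name, version]
    end
  end
end

# Apply the planned actions and return the new machine state.
# The input hash is left untouched.
def converge(desired, current)
  plan(desired, current).each_with_object(current.dup) do |(_action, name, version), state|
    state[name] = version
  end
end

# Hypothetical role definition; the versions are placeholders.
datanode_role = { "java" => "1.6", "hadoop" => "0.20.2" }
fresh_machine = {}

first_run  = converge(datanode_role, fresh_machine)
second_run = converge(datanode_role, first_run)

puts plan(datanode_role, fresh_machine).inspect # everything needs installing
puts plan(datanode_role, first_run).inspect     # nothing left to do
```

Real tools do far more (ordering, services, templates, failure handling), but the "compare desired state to actual state, then act on the difference" loop is the part worth keeping in your head.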

Chef is nice because it is all in Ruby (and a special, Ruby-like DSL). This means anything you can do in Ruby, you can do in Chef. This means it is really easy and natural to do more complex configurations. There is also a pretty strong (and growing) community around Chef, with tons of people open sourcing their own Cookbooks. The short story of that is, you can get up and running in minutes AND you have great examples of how to write your own cookbooks.

If you want to learn more about Chef, I would recommend the official wiki/tutorial - which is pretty dang good - and this terms guide to help keep your head on straight. To people not used to these systems, it can be a lot to wrap your head around (it was at first for me too), but once you understand the paradigm, it’s simple to write your own recipes and leverage the others out there. On top of that there are a variety of chef cookbooks out there. They will help you get started.

There are also a couple of ways you can try out chef. First, you could run your own chef server. It’s a little daunting to jump to that immediately (though the official wiki has some great info on how to do that). The next easiest step is Opscode’s option to try out ‘Hosted Chef’, where they run the configuration server for you and (even better) it’s free to try on up to 5 nodes (awesome!). The last thing you can also try is to use Vagrant. Vagrant is a tool to dynamically build virtual machines; the dynamic part comes from using Chef to configure the VM. No remote server and minimal configuration pain (though there is some ‘fun’ associated with it).

So there really isn’t any reason not to try chef, if you haven’t already.

Vagrant

Ever wanted to …

try out some new software but don’t want to mess with your home machine
setup a standard developer environment for new developers
automatically build a virtual machine from code (without wanting to shoot yourself)

Then Vagrant is for you! Vagrant will let you do all of these things and more. Essentially, with the push of a button you create and then configure the virtual machine. Another one-line command tears it down (actually it’s just vagrant destroy). What’s really great is that Vagrant can leverage either Chef OR Puppet (two of the most widely used configuration management systems) so many people will be able to leverage a bunch of their current skills. Vagrant also provides a safe and (relatively) easy way to learn Chef or Puppet.

Personally, I wanted to learn Chef and a project using Vagrant (dynamically building a VM for doing Accumulo training) offered the perfect opportunity. Plus, I’m also in the process of learning Ruby (following some of the recommendations in [Pragmatic Programmer](http://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X). Try it out, it pays off in the end, and well before it too!), so a real world application was a great way of getting my hands dirty. Vagrant is also heavily Ruby based, so IMHO another point in its favor. In fact all the configuration is actually done in Ruby. Also, most of the documentation out there is for Chef, so that’s an easy call. Hmmm, interesting how the amount of documentation influences tool choice, isn’t it? I’ll get back to this later.

Everything in Vagrant is based around the Vagrantfile. Think of this as your pom.xml (if you like Maven) or build.xml (for those Ant people) or make script (remember those?). The default Vagrantfile is chock full of documentation around which options you can select and which you would need depending on using Chef or Puppet. The official Vagrant tutorial and getting started guides are actually pretty fantastic. They are definitely your first stop in getting the system up and running. Go ahead, try it. Work through the examples.

…(type type type)…

Buuuuut, the tutorial and docs don’t cover all the intricacies, you say? Fair enough, your problems (I’m guessing) are probably stemming from chef. Check out this great blog post about how to use chef+vagrant together.

At this point, I figure you have a pretty good handle on vagrant. Lots of Ruby goodness and some pretty good docs for things that aren’t totally apparent. If the virtual machine is actually going to be pushed out to a server and used as a vm, you can forget about turning on gui support. However, if you are just playing around, it’s rather nice to have. Just remember the defaults (username: vagrant, password: vagrant), as the gui doesn’t automatically log you in the way the usual ssh connection does.

Also, if you are already running Chef to do configuration, Vagrant is really nice in that it gives you the option of configuring from the Chef server, rather than from local files. This is more like a production(ish) situation, so it’s another, gentler way to work up to using chef in a ‘real world’ system.
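As a rough sketch of what pointing Vagrant at a Chef server can look like, the fragment below uses Vagrant's `chef_client` provisioner in the older `Vagrant::Config.run` Vagrantfile style (newer Vagrant releases open with `Vagrant.configure("2")` instead). The box name, organization name, key file and role name are placeholders I made up; substitute your own, and check the provisioner docs for your Vagrant version before relying on it.

```ruby
# Vagrantfile (sketch) - provision from a Chef server instead of local files.
Vagrant::Config.run do |config|
  config.vm.box = "lucid32" # placeholder base box name

  config.vm.provision :chef_client do |chef|
    # Placeholder Hosted Chef organization URL and validator key.
    chef.chef_server_url        = "https://api.opscode.com/organizations/ORGNAME"
    chef.validation_client_name = "ORGNAME-validator"
    chef.validation_key_path    = "ORGNAME-validator.pem"

    # Hypothetical role defined on the server.
    chef.add_role "dev-machine"
  end
end
```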

Tips and Tricks (pain points solved!)

There are some facets of the interaction of vagrant and chef that can cause problems. Hopefully, the resources above answered your questions, but if not, let’s dig into some of the things I found.

Read what’s out there

Seriously, go do it. A lot of the recipes are really nice (the java one in the standard community is great) and will teach you a ton about how to write good recipes. At the same time, there are a bunch of recipes out there that are crap - don’t worry, you’ll learn a lot from those too (about what not to do). That’s part of the beauty of open source, and you would be foolish to not take advantage of it.

Also, think about doing readme driven development. It works really nicely with chef recipes, which end up being very modular and easy to work with in the readme style. And at the end, you’ve already done all your documentation! No need to try and remember all the possible knobs and options, no double checking to make sure you have the right calls - you did it when you wrote it, so it’s pretty close to perfect.

Setting up your directories (and version control everything)

Keep a separate site-cookbooks for your own cookbooks. It’s mentioned in the chef guide, but only briefly. It makes all the difference in the world, especially when you start messing with others’ cookbooks.

Adding them to vagrant is as easy as adding the following to your Vagrantfile:

chef.cookbooks_path = ["chef-repo/cookbooks", "chef-repo/site-cookbooks"]
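For context, here is a hedged sketch of where that line sits in a complete Vagrantfile using the `chef_solo` provisioner. It is written in the older `Vagrant::Config.run` style (newer Vagrant releases use `Vagrant.configure("2")`), and the box name and the `devbox` recipe are placeholders of mine, not something from a real project.

```ruby
# Vagrantfile (sketch) - chef-solo with separate community and site cookbooks.
Vagrant::Config.run do |config|
  config.vm.box = "lucid32" # placeholder base box name

  config.vm.provision :chef_solo do |chef|
    # Community cookbooks first, then the ones you wrote yourself.
    chef.cookbooks_path = ["chef-repo/cookbooks", "chef-repo/site-cookbooks"]

    # Hypothetical top-level recipe living in chef-repo/site-cookbooks/devbox.
    chef.add_recipe "devbox"
  end
end
```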

All the regular (external) cookbooks that you pulled down from various open source repositories should go into chef-repo/cookbooks. Then everything that you write should go into chef-repo/site-cookbooks. That way you know which things came from where and who to email if things start breaking.

It’s debatable whether you want one master git repository (say in chef-repo) with a submodule for each cookbook, or a new git repo for each cookbook. Personally, I like to go with the latter since it ends up being much cleaner than dealing with submodules in git (for those interested, you can read about submodules - and the ‘fun’ associated with them - [here](http://progit.org/book/ch6-6.html)).

Chef Solo

The whole way vagrant works is to run a chef-solo instance using files copied into its /tmp directory. Specifically, you will have a directory under /tmp for each of the cookbooks folders (copies of those folders in fact). This is where everything is run from and where you will need to check to grep the logs and see what happened (though vagrant has some pretty good info when it fails already).

That being said, it can often be incredibly convenient to leverage vagrant’s data copying over using the data bags available in Chef. It just ends up being less to write using Vagrant. However, this will impact all your recipes and is not recommended unless you are 100% sure that you will need to copy those files over every time. Otherwise, put it into a recipe; it will take you about 5 extra minutes, but could save you 10x as much in debugging later on.

Do the right thing.

Managing packages

For some reason, the ‘package’ command in chef doesn’t always work well in vagrant. Chef runs under the vagrant user, rather than root, so anything that requires super-user powers needs to be handcoded. So to install emacs, you can’t just do:

package "emacs" do
  action :install
end

But instead have to do (assuming you are on a Debian system; adjust for your own package manager):

bash "install emacs" do
  user "root"
  code <<-EOH
  apt-get update
  apt-get install -y -q emacs
  EOH
end

Here, we are essentially just running a shell script, as root, that (1) updates the apt package index and then (2) installs emacs.

One top level recipe

If you are building a dev machine or a handful of roles, it is much easier to just make a single recipe. Yes, Chef provides the idea of roles…but do you really need to have a dev machine that is a Datanode and gerrit server? Probably not. So just make a recipe for each. If you really need to add things together, then you can write one recipe that pulls in each of the others as needed.
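One way to do that "pull in each recipe" step is Chef's `include_recipe` call. The sketch below is a hypothetical top-level recipe; the `devbox` cookbook name and the `emacs` and `gerrit` cookbooks are stand-ins for whatever you actually have in your cookbooks directories (the `java` cookbook is the community one mentioned earlier). It only runs inside a Chef run, not as a standalone Ruby script.

```ruby
# chef-repo/site-cookbooks/devbox/recipes/default.rb (hypothetical cookbook)
# One top-level recipe that composes the smaller ones.

include_recipe "java"   # community cookbook
include_recipe "emacs"  # placeholder: your own editor recipe
include_recipe "gerrit" # placeholder: only if this box really needs it
```

With that in place, the Vagrantfile only ever has to name the one top-level recipe.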

Veewee

Veewee is the easiest way out there to build your own Vagrant ‘base box’ from scratch. It will take care of a lot of the hard work for you, if you aren’t happy with the standard, available base boxes out there.

It’s currently on github [here](https://github.com/jedi4ever/veewee). It’s definitely worth looking into if you are doing serious customizations.

Those are all the tips and tricks I have for you today boys and girls. Hopefully you found this helpful and are going to go out and devops your wildest dreams.

The Worst Case

2011-11-22T00:00:00+00:00

So I’m sitting here in SeaTac Airport for the second time in as many weeks, on my way home for the holidays, and I can’t help but think of a saying my friend [Haris](http://www.syedharisali.com) and I have been using recently: “Just go and do it. The worst case is your life stays exactly the same.” This is something that has been on my mind frequently over the last couple months as I try to figure out the next steps for my life.

Two weeks ago I was hanging out with Haris in San Francisco. Ostensibly, I was in the Valley for an interview (more on that later), but I was taking the opportunity to hang out with an old friend. Before going out bar hopping on Saturday night with some of Haris’s friends we made a pact to each talk to five girls that night. Not for any sketchy reason, not trying to ‘hook-up’ with some random girl, but honestly just trying to get better at it as a general life skill.

Did I reach my goal? Honestly, I have no idea. I didn’t keep a strict count and we hadn’t worked out how to split odd numbered groups. But here is what did happen: I had an amazing night, talked to some cool people, found a couple new bars, and danced like crazy. It was a great night and all because I wasn’t afraid to go out there and just say, “Hi, I’m Jesse”. And do you know why I wasn’t afraid? Because the worst that could happen would be a conversation that goes nowhere and my life stays exactly the same. It’s important to note that I had such a good time specifically because I realized I wanted to change the way my life is now, in hopes that it can get better. It was easy, because the downside was that my life goes on - exactly the way it was before I tried. So the question becomes, why not?

I told you that anecdote because it is a nice, easy example of my point. However, that kind of situation is really just a small piece of the idea (nothing “nice” and “easy” ever really tells the whole story). How about a more in-depth example? (You get it already? Well, smarty pants, skip on down to the end for the punchline).

Going back to the real reason I was in San Francisco - I was there on a job interview. Why was I looking for a new job? I realized that my life needed to change. Say I took that next job and it was terrible? Well, in the worst case I could always go back to my last company and the way my life was before. No harm, no foul (it doesn’t always work out that way, but I’m a little lucky in that regard).

So what is the more likely ‘failure’ scenario? In the interim I would be doing something pretty cool, meeting new people, growing as a person, in a much better position for other opportunities and learning a ton along the way. Are you kidding me??? Sign me up!

But how did I realize I needed that change? We need a little bit of back story, a bit of perspective to understand my motivations.

I was born, raised, went to school, and got my first job in Maryland. However, I somehow became really outdoorsy and a technologist (not as uncommon as you would think). I had to get the heck out of Maryland. At the time, I was fortunate enough to be at a company that allowed me to move to Seattle and still have a job. A little bit of cushion for the transition to the west coast and still generally working on things I found interesting. So I sold or threw out most of my furniture, packed the rest of my stuff in my car and drove across the country. I was on my way.

At this point, I was starting to lean over the rabbit hole, looking to see how far down it really went.

Shortly after moving to Seattle, I did a couple things. I got involved in the vibrant Hadoop/cloud community and jumped in the startup community. The former I had been involved with in MD, so that was an easy choice. The latter was just out of curiosity and a growing entrepreneurial streak. Quickly, I came up with an idea for a company and started working at it part-time (evenings, weekends). That idea didn’t prove out, but it gave me access to a really exciting community. That’s not to say doing all of that was easy - my head was spinning like a top. Between working two jobs, contributing to open source projects (HBase and Accumulo), trying to stay in shape/train for a marathon, and getting out to meet new people in a new city, it was a pretty crazy couple of months.

I was starting down the rabbit hole. Along the way my sleep schedule got completely messed up (not being able to sleep and coding to 6 am on multiple occasions), I put on a good 10-15 lbs (not the good kind) and was stressed like crazy. But man, I was having fun.

Still at this point, I was basically doing the same thing I always had - looking at whatever I had in front of me and trying to pick the thing that sounded the best, most fun, etc. However, I quickly came to realize that I was having way more fun doing the stuff in the ‘spare time’ than I was at work. In fact, orders of magnitude more fun. So yeah, I could keep doing what I was doing, busy like crazy and probably headed for a heart attack by 25, or I could cut out the things that I really didn’t love in favor of trying to optimize happiness. And what was the worst? The 40hrs a week I spent at my job. Yeah, I was doing kinda interesting stuff at work, but the hours dragged by and I looked forward more to my side projects than coming into the office.

Hi, my name is Jesse. I have a type-A personality. I am a maker. I have to build to be happy.

Once I realized that, making the leap was easy. But where did I want to go? Being in the startup community, I had a lot of options. Cloud skills on a good dev are sexy right now and I quickly found I was a hot commodity in a dev’s market. All of a sudden I had a bunch of really good options doing really fun stuff. In the worst case, I could join a company (or start my own), do that for a year or two, and if those failed or I hated it, then I could always get a job at some big company or go back to the one I’m at now. Life could always go back to being comfortable.

Here is where it started to get really interesting (read: stressful). Not only did I have a lot of options, they were also varied - everything from being employee #1 at a currently 3 person startup up to getting a job at Amazon (I am in Seattle after all). So which was the best? Turns out, at the time I didn’t have a good metric for choosing. My basic criteria were: (1) doing fun cloud stuff, (2) working with smart people, (3) having leadership opportunities, and (4) making enough money to be comfortable. In short, I wanted to do awesome stuff in a cool place with good people.

Well, I could do that basically anywhere. I couldn’t do anything really ‘wrong’, just potentially not as great. So close my eyes, spin around and just pick one, right? Not really an option being the type-A nerd that I am - I want some real goals and real metrics to guide my decision. Otherwise, I can just flip a coin.

Before I had a perspective on life looking to the next year and not thinking about concrete, long-term goals, but rather “I want to be THERE doing X”. That is totally ok for a twenty-something just getting out into the world. However, not thinking more deeply about what you want means you are going to be skipping from one thing to the next. And hey, that could be just fine. But for me, I need structure and goals - somewhere to be (A-type, remember?). Interestingly, having all the options for my next step actually forced me of the one-hop lookahead to instead the 3-5 year lookahead (still only two to three hops out). Its probably apparent to older folks, but recently I’ve had to take a hard look at really where I want to be in two years, in five years; 10 still feels like a ways out, so I can again label it as ‘doing awesome stuff at a cool place’.

Okay, so that took about a week of introspection to build a solid plan (solid enough - it still changes weekly, but the basic scaffolding is there). For me, I realized I wanted to start my own company doing cloud (yup, still gonna keep the idea stealthy) on the west coast. Well, once you have a longer term goal, just start splitting the time in half until you figure out where you need to be. If I want to be Z in 5 years, then in two-ish years, I need to be doing Y and then to be at Y I need to do X next, and so on, until you find out what you need to do today to help enable that. Once you start working through the plan, a lot of the options become comparatively less attractive, until you can cut the list down to a few.

Then it becomes all about the story you want to tell with your life (well, at least the next five years). It frequently doesn’t turn out the way you planned it (the plans of mice and men, right?), but it builds a consistent logic that is not only inherently elegant but also puts you in the right place, to meet the right people at the right time - serendipity. And that is how you can go out and change the world - baby steps within your bigger plan. Don’t be afraid to change the plan if conditions are right, if you are having fun, if you are really living your life, but make sure you know what you want. Otherwise, you might as well be flipping a coin.

And all that magic happens because you went out there and did something. And yeah, maybe it was uncomfortable. But you know what? That probably means it’s the right thing to do. The one thing I do know is this: sitting in your basement by yourself, doing the same thing you do every day isn’t going to change anything. But if you start putting yourself out there, I bet 10:1 that people are going to come knocking.

Think about yourself in 30-40 years, sitting around the hyper-television drinking your future-scotch, telling your kids/grandkids about your life story. You want it to be fun, exciting, and smattered with some life lessons. You do want to be the cool parent, right? That means taking some risks, diving in head first when the time is right (and sometimes when it’s not) and making a dent in the world. Because you know what? If you don’t, that’s going to be one heck of a boring story.

So yes, start that company. Find the job that will make you happy. Grab a beer with your friends. Talk to that cute guy/girl at the coffee shop. Write your own story. Follow your own rabbit hole.

“Just go and do it. The worst case is that your life stays exactly the same.”

Intro To Culvert

2011-11-17T00:00:00+00:00

Culvert is a secondary indexing platform for BigTable, which means it provides everything you need to write indexes and use them to access and, well, index your data. Currently Culvert supports HBase and Accumulo, though new adapters are in the pipeline. If you don’t know why we need a secondary indexing platform for BigTable, I recommend checking out my previous post.

I’ll pause while you go and catch up.

…

Ok, at this point I’ll assume that (1) you know what secondary indexing is and (2) you want to know how to actually use Culvert to solve your secondary indexing problems.

Let’s start with how you can actually get a Culvert client up and running. Turns out it’s pretty simple.

We are going to use an example of connecting to an Accumulo Database:

// start configuring how to connect to the instance
Configuration conf = new Configuration();
conf.set(AccumuloConstants.INSTANCE_CLASS_KEY, ZooKeeperInstance.class.getName());
conf.set(AccumuloConstants.INSTANCE_NAME_KEY, INSTANCE_NAME);
// set all your other configuration values
...
// create the database adapter with a configuration
DatabaseAdapter database = new AccumuloDatabaseAdapter();
database.setConf(conf);
// create a client to configure
Client client = new Client(CConfiguration.getDefault());
//setup the client to talk to your database
Client.setDatabase(database, client.getConf());

That wasn’t too bad, right? At this point we’ve got a client to talk to the database. Since you are using Culvert for indexing, the next thing you would want to do is add an index. It’s actually pretty simple programmatically:

// create term-based index: index each of the words in the value, where the
// row key is the word and the row id is stored in the rest of the key
Index index = new TermBasedIndex(INDEX_NAME, database, PRIMARY_TABLE_NAME,
	INDEX_TABLE_NAME, COLUMN_FAMILY_TO_INDEX, COLUMN_QUALIFIER_TO_INDEX);
// other index definitions could also be loaded from the configuration
...
// and programmatically add the index to the client's configuration
client.addIndex(index);

It’s important to note that each index needs to be given a unique name, otherwise namespace conflicts will occur. But generally this is not a problem, and it is useful when you want to have more than one index of the same type (e.g., you want to do a TermBasedIndex on two different tables, two different fields, two different whatever).

You can also save yourself some effort by setting up your indexes in the configuration – the client will pick these up when it starts and automatically make sure the indexes you specified are used.

Once you have the client set up and all the indexes specified, the next step is to put data in the table. All data is wrapped in the high level Culvert key and value type - a CKeyValue. A CKeyValue is then transformed into the correct key and value for the underlying database. This makes doing an insert very similar to how inserts are done already in a BigTable system:

// build the list of values to insert
List<CKeyValue> valuesToPut = Lists.newArrayList(new CKeyValue("foo"
      .getBytes(), "bar".getBytes(), "baz".getBytes(), "value".getBytes()));
//wrap them in a put
Put put = new Put(valuesToPut);
//and just make the put
client.put(PRIMARY_TABLE, put);

Pretty simple, right? Not only are these items being inserted into the database, Culvert also takes care of all the heavy lifting for you by making sure those values get indexed by all the indexes you have added to the client.
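To make that heavy lifting a bit more concrete, here is a minimal in-memory sketch of what indexing on put means for a term-based index. This is not Culvert’s implementation: `ToyIndexedTable`, `put` and `rowsContaining` are hypothetical names, and sorted maps stand in for the BigTable tables. Following the comment on the index definition above, the index key starts with the word and the row id is stored in the rest of the key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for a primary table plus one term-based index table.
// Sorted maps play the role of the sorted BigTable tables; this is NOT Culvert code.
class ToyIndexedTable {
    // primary table: row id -> value
    private final SortedMap<String, String> primary = new TreeMap<>();
    // index table: word + '\0' + row id -> empty value
    private final SortedMap<String, String> index = new TreeMap<>();

    // Write the value, then write one index entry per word in that value.
    // Simplification: stale index entries from an overwritten row are not cleaned up.
    void put(String rowId, String value) {
        primary.put(rowId, value);
        for (String word : value.split("\\s+")) {
            if (!word.isEmpty()) {
                index.put(word + '\0' + rowId, "");
            }
        }
    }

    // Scan only the index range for one word and return the row ids, in sorted order.
    List<String> rowsContaining(String word) {
        List<String> rowIds = new ArrayList<>();
        // every key for this word is >= word + '\0' and < word + '\1'
        for (String key : index.subMap(word + '\0', word + '\1').keySet()) {
            rowIds.add(key.substring(word.length() + 1));
        }
        return rowIds;
    }
}
```

The point of the sketch is that the caller only ever calls `put`; keeping the index table in step with the primary table is the library’s job, not the application’s.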

Secondary indexes are only useful if you can actually access the data. Culvert also handles doing this via “Constraints”. A constraint is the way you query the index; it’s the way you get the columns associated with the row ids that the index stores, and it’s also the way you can do efficient SQL-like queries.

For those interested, we used the decorator design pattern here to make it really easy to nest constraints. Every constraint takes another constraint and some parameters.

Querying your data back out using the indexes is a little bit more complex, as you have to build up your constraints, but once you pick up the general strategy, it isn’t too bad. Let’s start with just doing a simple query of the index looking for any records that have the word “value” in them:

Index c1Index = client.getIndexByName(INDEX_NAME);
Constraint c1Constraint = new RetrieveColumns(new IndexRangeConstraint(
     c1Index, new CRange("value".getBytes())), c1Index.getPrimaryTable());
// check the first constraint
SeekingCurrentIterator iter = c1Constraint.getResultIterator();

First, we get the index out of the client that you want to use when querying (to make sure you are searching for the right field). Then you build a constraint to use as a query.

That constraint is actually a nested constraint, describing each step in the process. First you scan the index to get the row ids of the field you are looking for (in this example, rows that have the word “value”) – this is the IndexRangeConstraint. You can basically think of this as a ‘WHERE’ clause where we explicitly specify the index to search for that value. This is because for things like the TermBasedIndex, you would be looking for values in different fields, depending on which index you use – you don’t want to look for email sender names in the content field, right? Most queries are going to start with an index range constraint.

Then once you have all the row ids, you can go and actually get the rows specified using RetrieveColumns – retrieving all the columns associated with that row id. It’s just like what you would already be doing with indexes, just formalized and prebuilt for you. Makes sense, right?

That is the simple case - you just want to pull values out of your table that you have indexed.
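The nesting described above can be sketched in plain Java. The names below (`ToyConstraint`, `ToyIndexRange`, `ToyRetrieveRows`) are hypothetical and only loosely mirror the shape of the query shown earlier; plain maps stand in for the index and primary tables, so treat this as an illustration of the decorator idea rather than Culvert’s actual classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical constraint interface: every constraint yields a stream of results.
interface ToyConstraint {
    Iterator<String> results();
}

// Innermost step: look up the row ids stored in the index for one term.
class ToyIndexRange implements ToyConstraint {
    private final Map<String, List<String>> index; // term -> row ids
    private final String term;

    ToyIndexRange(Map<String, List<String>> index, String term) {
        this.index = index;
        this.term = term;
    }

    @Override
    public Iterator<String> results() {
        return index.getOrDefault(term, new ArrayList<>()).iterator();
    }
}

// Decorator: wraps another constraint and turns each row id it yields into the stored row.
class ToyRetrieveRows implements ToyConstraint {
    private final ToyConstraint inner;
    private final Map<String, String> primaryTable; // row id -> row contents

    ToyRetrieveRows(ToyConstraint inner, Map<String, String> primaryTable) {
        this.inner = inner;
        this.primaryTable = primaryTable;
    }

    @Override
    public Iterator<String> results() {
        Iterator<String> rowIds = inner.results();
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return rowIds.hasNext();
            }

            @Override
            public String next() {
                // fetch the row lazily, one row id at a time
                return primaryTable.get(rowIds.next());
            }
        };
    }
}
```

Building a query is then just wrapping: `new ToyRetrieveRows(new ToyIndexRange(index, "value"), primaryTable)` reads inside-out as “scan the index for the term, then fetch the rows”. Because each layer only sees an iterator, another constraint can be wrapped around the outside without touching the inner ones.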

Now consider a slightly more complex case – doing an AND between the results of two queries. The simple, home-rolled solution is to load all of the left side of the AND into memory, then check to see which values from the right side match up. This is actually pretty bad if you pick the wrong side of the AND to load into memory – you will probably blow out memory and crash your client before getting any result. Culvert takes a different approach - each side of the AND is streamed to the client and only matching values are kept around, so you never have more than the number of matches +2 elements in memory.

Each side of the AND is streamed back to the AND logic on the client, where we can decide which rows to keep and which rows to discard. Note here that Culvert is leveraging the fact that the BigTable model guarantees that each TabletServer or RegionServer will return ordered results, so all we need to do is make sure we match the right results up. Here is the code to do an index based AND:

Constraint and = new And(c1Constraint, c2Constraint);
iter = and.getResultIterator();

This assumes that you already have the Constraints for each side of the AND, based on the index you want to search (just like we did before).
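The streaming idea itself can be illustrated with a sorted-merge intersection. This is a sketch of the general technique, not Culvert’s `And` implementation: `StreamingAnd` and `intersect` are hypothetical names, and it assumes both inputs yield row ids in ascending order with no duplicates. Apart from the matches it collects, it holds only one pending element from each side.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: intersect two ascending streams of row ids
// without buffering either side in memory.
class StreamingAnd {
    static List<String> intersect(Iterator<String> left, Iterator<String> right) {
        List<String> matches = new ArrayList<>();
        if (!left.hasNext() || !right.hasNext()) {
            return matches; // an empty side means an empty AND
        }
        String l = left.next();
        String r = right.next();
        while (true) {
            int cmp = l.compareTo(r);
            if (cmp == 0) {
                // same row id on both sides: keep it and advance both streams
                matches.add(l);
                if (!left.hasNext() || !right.hasNext()) {
                    return matches;
                }
                l = left.next();
                r = right.next();
            } else if (cmp < 0) {
                // left is behind: it can never match the current or later right values
                if (!left.hasNext()) {
                    return matches;
                }
                l = left.next();
            } else {
                // right is behind: advance it instead
                if (!right.hasNext()) {
                    return matches;
                }
                r = right.next();
            }
        }
    }
}
```

Because the inputs are sorted, the smaller of the two pending values can always be discarded safely, which is why it does not matter which side of the AND is the large one.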

Culvert also supports a variety of other SQL-like constructs, such as OR and JOIN. OR works very similarly to AND, just with slightly different logic. JOIN, on the other hand, can be either naïve – just joining two tables – or index based. However, in both cases if the underlying database supports it, the JOIN is actually implemented as a server-side join. This means it is incredibly efficient and powerful. Currently only the HBase adapter supports server-side joins, but Culvert developers are working on extending Accumulo to support this functionality (see ACCUMULO-14).

If you don’t want to use straight Java to interact with the index, Culvert also (soon!) works with Hive. It integrates directly with Hive as just another handler (similar to how the HBase-Hive handler works). When you send an HQL query to Hive, Culvert pulls out the predicates that it can handle and then queries the indexes you have specified via configuration to serve out only the results that Hive will actually use. This means you get huge speedups using Culvert with Hive.

And that is really all there is to using Culvert!

I’m thinking about continuing the Culvert series with a more in-depth look at the underlying architecture - really examining some of the core components and how all the pieces fit together. In the meantime, if you are interested in learning about how it works under the hood, you can see the original talk from [Hadoop Summit 2011](http://www.slideshare.net/jesse_yates/culvert-a-robust-framework-for-secondary-indexing-of-structured-and-unstructured-data).

The code is available on [GitHub](http://www.github.com/booz-allen-hamilton/culvert). Feel free to check it out, provide feedback, and if you are feeling really generous, contribute some code :)

Filling in the BigTable Gaps

2011-11-16T00:00:00+00:00

Big Data is big innovation, big headaches and in the end, big money. The only problem is that it can be a huge pain to get running… and then to get running correctly. Recently – in the last 5-8 years – we have seen major companies (Yahoo, Facebook, Twitter, etc.) put a lot of resources behind these technologies, doubling down on certain stacks to enable next generation business. So there is definitely something there.

The technology is still very immature and driven almost entirely by the open source community. The implications of that are another blog post¹, but the punch line is there are a lot of rough edges, incomplete features and frequently a sloooow process².

More than a few companies have also been started (Cloudera, Datastax, Opscode, Hortonworks, etc.) around the idea of making these tools stable, fast and enterprise ready. Oh, and they sell support (gotta make money somehow, right?). So clearly this cloud stuff needs a lot of help and a lot more features.

But I’m getting ahead of myself.

Let’s jump back 5 years – Google released the BigTable paper and the open source world jumped at the idea, quickly spinning up [HBase](http://org.apache.hbase) under the Hadoop umbrella. And for a while it seemed great! I can store petabytes of data - awesome. I can access it in real time - even better. And do appends, updates and deletes over a write-once file system? Fantastic. It was even so great the US Government came up with their own version of BigTable, Accumulo, optimized for high throughput, though still faithful to many of the aspects of the original BigTable.

So great, we have this massively scalable database. Well, turns out that BigTable doesn’t cover everything we want to do with the database, particularly if you want to do fast lookups or scale out even farther or do traditional RDBMS operations. So along comes Megastore. Now, there are a lot of things going on in Megastore that most of the companies outside of Google don’t need or are covered via alternative means (see [Hive](http://hive.apache.org) or [Pig](http://pig.apache.org)). However, one of the things that isn’t really covered by external tools is indexing.

Now you are probably thinking, “Whoa, hold on! What about Lily? Or Solr? Or etc…???” Well, these things are good if you are doing indexing on just one thing – unstructured text in a given field. And a lot of times, that’s all you need. This is especially true as these tools integrate with search tools. However, what about the case where you need to index across multiple fields? Or build your own special indexes to make it go fast? What about trying out new index schemes? Then you are going to be out of luck and hand rolling your own.

To make sure these indexes scale you then have to store them in a cloud (probably the same database as the one hosting your data). Okay, doable, but it can get a little tricky to make sure it scales well. Then you have to make sure that when you update your database the indexes also get updated. And then you have to build a tool to use those indexes. What about using something SQL-like? Then you are writing your own SQL parser, piping that into your indexing and then using that to pull out and combine data. Now consider that you have to make that performant on a cloud scale. Ouch.

Clearly this is a hard problem. And no organization wants to solve this problem from the ground up, scalably, every time.

You just want to write indexes and have it integrate with all your tools right away. You don’t want to deal with writing a specific client to handle indexing your data on ingest. You don’t want to have to worry about using those indexes on query.

You just want to get some data out as fast as possible. And with standard BigTable, this isn’t possible. Yeah, it’s pretty fast. And yeah, it scales like crazy. But you need to do a lot of work to make sure it goes fast. And you need it to go fast.

Enter Culvert.

Culvert is a secondary indexing platform, which means it provides everything you need to write indexes and use them to access and, well, index your data.

It takes care of all the pains of indexing your data as it comes in – you don’t need to worry about making sure your index tables match your real table. Culvert ensures that when you do a query, indexes are used properly to get you back the answer as fast as possible.

All you have to do is write your own indexes so your data can be accessed quickly. Cut out all the developers needed to build a custom interaction. Drop all the people worrying about maintaining a special database for the indexes. All you need is a couple of smart people with a good idea of what the data looks like to write down the best way to access the data.

Sounds easy, right?

In the next post I’ll talk about how you can actually use Culvert in your own system. Then we’ll finish it up with a post about how the internals of Culvert really work.

¹ Working on that post.
² Not always – there are many cases where the open source stuff is way better than the closed version. However, this tends to be the exception, not the rule. It is interesting that in the cloud space, open source software has proven to be far more widespread (and higher quality) than the closed source solutions. For the counterexample, see [MapR](http://mapr.com/).

Welcome to My Blog

2011-11-11T00:00:00+00:00

Hi, I’m Jesse. If you are reading this, you probably got here after reading one of my other posts. Congratulations on being one of the few people compelled to read everything.

In case you haven’t gathered from some of the posts here, I really like playing with Big Data. As in, I really believe it can change (is changing?) the world. Right now I’m working on a way to do that, but more on that later - let’s just call it in ‘stealth mode’.

So if you like hearing about big data, cloud, programming, entrepreneurism, and (among other things) general nerding out, stick around!