<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>umbrant blog</title><link href="http://www.umbrant.com/blog/atom.xml" rel="self"/><link href="http://www.umbrant.com"/><updated>2017-09-03T17:22:50Z</updated><id>http://www.umbrant.com</id><entry><title>Paper review: Relational Cloud, Database Scalability</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/relational_cloud_and_database_scalability.html"/><updated>2011-12-11T22:15:00Z</updated><published>2011-12-11T22:15:00Z</published><id>http://www.umbrant.com/blog/2011/relational_cloud_and_database_scalability.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a combined paper review for &quot;Relational Cloud: A Database-as-a-Service for the Cloud&quot;, a paper from MIT published at CIDR &#39;11, and &quot;Database Scalability, Elasticity, and Autonomy in the Cloud&quot;, an extended abstract from UCSB which appeared in DASFAA &#39;11. These papers deal with the strategies used to transition databases and storage systems to the unique challenges of the cloud&amp;nbsp;environment.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Database&amp;nbsp;Scalability&lt;/h3&gt;
&lt;p&gt;I&#39;m covering the UCSB paper first, since it&#39;s essentially a survey paper. It covers the required properties of a cloud storage system or database, and the different techniques used to achieve these properties. The important thing to remember here is that it&#39;s always a game of tradeoffs and choosing your point in the design space; databases (including cloud databases) normally mean ACID properties and an SQL interface, storage systems (BigTable, Dynamo) normally mean a wider array of consistency guarantees and a more programmatic interface. Once you&#39;ve chosen what exact consistency properties or programming interface you want, the underlying techniques used are about the&amp;nbsp;same.&lt;/p&gt;
&lt;p&gt;As is stated in the title, there are some core requirements for any database or storage system in the cloud: scalability, elasticity, and autonomy. Scalability means scale-out, the ability to use multiple nodes to gain increased storage capacity and performance. Elasticity is one of the core selling points of the cloud: pay-as-you-go pricing as a cloud consumer, adding and removing nodes in response to the load on your service. Autonomy refers to the ability to do these things automatically, reducing management overhead; people are expensive, and the cloud means you could potentially be dealing with a lot of nodes (and thus potentially a lot of&amp;nbsp;people).&lt;/p&gt;
&lt;p&gt;The paper then establishes the design space. Pure key-value stores are an unfriendly programming model (not enough consistency), but you also can&#39;t just run MySQL in the cloud, meaning that you want something in between. The authors describe taking a key-value store and providing strong consistency on an entity group (think Megastore) to be &lt;em&gt;data fusion&lt;/em&gt;, while taking a database and sharding it to be &lt;em&gt;data fission&lt;/em&gt; (of which Relational Cloud is an example). They both share the same property of intra-group/shard operations being efficient (on the same node), but cross-group/shard operations being expensive (two-phase commit!). Like I said before, they end up sounding about the same once you choose the same point in the design&amp;nbsp;space.&lt;/p&gt;
&lt;p&gt;The real difference here is in the provided API. A data fusion approach is more explicit about performance, since going cross-group to do an expensive operation requires more code wrangling (do 2PC yourself). On the other hand, data fission will still run your naughty cross-shard SQL query, it&#39;ll just do it slowly. Partitioning data in data fusion is also generally more explicit (Megastore makes you define your entity groups), while data fission tries to do it automatically under-the-hood based on your query access&amp;nbsp;pattern.&lt;/p&gt;
&lt;h3&gt;Relational&amp;nbsp;Cloud&lt;/h3&gt;
&lt;p&gt;This is essentially an implementation of a data fission approach from the MIT databases group. They identify many of the same points brought up in the UCSB paper, and add another of their own:&amp;nbsp;privacy.&lt;/p&gt;
&lt;p&gt;Scalability is achieved through data partitioning. Rel Cloud uses a graph partitioning strategy to identify min cuts on a graph representing query execution traces, basically trying to group together data that is used together. The clear problem here is speed. It&#39;s slow to turn a cloud database&#39;s worth of tuples into a graph, run the partitioning algo, and then move the data around. Ideally, the system would be able to do this in reaction to load spikes (on the order of minutes), but that&#39;s unlikely. Unless the algo is weighted properly too, it could result in bad &quot;full shuffle&quot; data movement patterns, and the inability to manually tune is classic monolithic &quot;let us handle everything&quot; database&amp;nbsp;thinking.&lt;/p&gt;
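&lt;p&gt;The first step of that workload-driven partitioning is easy to sketch: count how often pairs of tuples are touched by the same query trace, which gives the edge weights a min-cut partitioner would then operate on. This is my own toy representation, not Rel Cloud&#39;s actual code:&lt;/p&gt;

```python
from collections import Counter
from itertools import combinations

def co_access_weights(traces):
    """Edge weights for a co-access graph built from query traces.

    Each trace is the list of tuple IDs one transaction touched. A
    min-cut partitioner would cut low-weight edges so that frequently
    co-accessed data lands on the same shard. (Toy sketch with
    hypothetical names; Rel Cloud builds a real graph and runs a
    graph-partitioning library over it.)
    """
    weights = Counter()
    for trace in traces:
        for pair in combinations(sorted(set(trace)), 2):
            weights[pair] += 1
    return weights
```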
&lt;p&gt;Rel Cloud introduces &lt;em&gt;Kairos&lt;/em&gt; to take care of autonomous elasticity. It monitors load and the current working set of the database, and adds or removes nodes in response to this. Kairos also can migrate data partitions to take care of load imbalances. It also does pretty deep modeling of I/O performance to figure out the capacity of the system, which I&#39;d like to hear more&amp;nbsp;about.&lt;/p&gt;
&lt;p&gt;The final property is privacy. This refers to CryptDB, a paper that is also on the reading list for CS294. It is essentially a way of doing a limited subset of SQL operations on encrypted data, where the data is stored in the most secure format that can still support the requested operation. In this way, the database only ever sees encrypted data, though there are some assumptions about keys and the threat model that I find slightly&amp;nbsp;unconvincing.&lt;/p&gt;
&lt;h3&gt;Analysis&lt;/h3&gt;
&lt;p&gt;It&#39;s important to realize that there&#39;s always going to be a core group of business users who won&#39;t want to learn some new API, and for whom a data fission/Rel Cloud approach is the only solution they&#39;re ever going to use. SQL, even if the database community disagrees, is one of the defining attributes of a database, and that&#39;s a major selling point to some. Data fusion key-value stores are well and fine for hip Ruby-on-Rails hackers and Google, but small or medium sized businesses that don&#39;t own their own datacenter but want to use the cloud probably want the Rel Cloud database-as-a-service. They want something that looks just like a normal DBMS, but has the additional scalability, elasticity, and autonomy properties that come with the&amp;nbsp;cloud.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Bigtable</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/bigtable_review.html"/><updated>2011-12-11T21:32:00Z</updated><published>2011-12-11T21:32:00Z</published><id>http://www.umbrant.com/blog/2011/bigtable_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review of &quot;Bigtable: A Distributed Storage System for Structured Data&quot;, published at OSDI in 2006. This is Google&#39;s columnar key-value store built on top of GFS, and I believe that it&#39;s the preferred storage system within Google. It&#39;s also important to remember that even though it has rows, columns, and the word &quot;table&quot; in the name, it doesn&#39;t provide the attributes traditionally associated with a&amp;nbsp;database.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Highlights&lt;/h3&gt;
&lt;p&gt;BigTable uses a three-level indexing scheme to resolve a value: &lt;code&gt;(row, column, time)&lt;/code&gt;. This &lt;code&gt;time&lt;/code&gt; field is the surprising addition; apparently it&#39;s used for versioning and garbage collection of old values (an optional, per-table feature). It also allows multiple values for the same &lt;code&gt;(row,column)&lt;/code&gt; tuple. Tables in BigTable are also sparse and stored in columnar format, meaning that any given row probably only populates a fraction of the hundreds of columns in a table. The number of columns in a table isn&#39;t limited, but columns do each have to belong to a single &lt;em&gt;column family&lt;/em&gt;, which is a more permanent entity. Column families are used as a means of access control, and maybe also to optimize access&amp;nbsp;patterns.&lt;/p&gt;
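&lt;p&gt;To make the indexing scheme concrete, here&#39;s a toy Python sketch of &lt;code&gt;(row, column, time)&lt;/code&gt; cells with the optional garbage collection of old versions. The names and structure here are my own, not BigTable&#39;s actual API:&lt;/p&gt;

```python
import bisect

class VersionedTable:
    """Toy single-node model of (row, column, time) cells.

    Hypothetical sketch: real BigTable is distributed and columnar,
    with configurable version GC.
    """
    def __init__(self, max_versions=3):
        self.cells = {}          # (row, column) maps to [(time, value), ...]
        self.max_versions = max_versions

    def put(self, row, column, time, value):
        versions = self.cells.setdefault((row, column), [])
        bisect.insort(versions, (time, value))   # keep versions time-ordered
        del versions[:-self.max_versions]        # optional GC of old values

    def get(self, row, column):
        versions = self.cells.get((row, column), [])
        if not versions:
            return None
        return versions[-1][1]                   # newest version wins
```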
&lt;p&gt;One thing that kind of threw me at first is how columns are used in BigTable. Unlike a traditional RDBMS, where the table schema is fairly fixed, BigTable encourages the developer to add lots more columns, in fact storing &lt;em&gt;data&lt;/em&gt; as a new column. This is demonstrated in their web page example: the schema actually adds a new column for each webpage, and for each domain that links to the webpage. It&#39;s really best to just think of it as a big, columnar, scalable key-value&amp;nbsp;store.&lt;/p&gt;
&lt;p&gt;BigTable also builds heavily on other internal Google systems. It uses GFS for persistent storage of data. Chubby is used heavily for things like bootstrapping connections to BigTable, master election, detecting partitions, storing BigTable schema, and access control. Note that a single BigTable tablet server handles all the reads and writes for a tablet; GFS takes care of durability, so all BigTable has to watch out for is load imbalance (which can be handled by migrating tablets to other tablet servers and&amp;nbsp;caching).&lt;/p&gt;
&lt;p&gt;Metadata lookups all happen in memory, meaning reads take just a single disk access if they aren&#39;t served from cache. Writes go to a commit log stored durably in GFS, and are also kept in memory in a &lt;em&gt;memtable&lt;/em&gt;. When the memtable fills up, it gets frozen and written to disk as an &lt;em&gt;SSTable&lt;/em&gt; (which is an immutable key-value format), and the corresponding log entries can be discarded. SSTables are merged periodically; otherwise lookups and recovery could have to consult an awful lot of&amp;nbsp;SSTables.&lt;/p&gt;
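<p>Here&#39;s a minimal sketch of that write path as a toy single-node Python class. Names are hypothetical; the real system logs to GFS and stores SSTables as immutable files:</p>

```python
class TabletServer:
    """Toy model of the BigTable write path (hypothetical names)."""
    def __init__(self, memtable_limit=2):
        self.commit_log = []     # stands in for the durable GFS log
        self.memtable = {}
        self.sstables = []       # immutable snapshots, newest last
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) == self.memtable_limit:
            self.flush()

    def flush(self):
        # minor compaction: freeze the memtable into an immutable SSTable
        self.sstables.append(dict(self.memtable))
        self.memtable = {}
        self.commit_log = []     # flushed entries no longer needed for recovery

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # newest SSTable wins; periodic merging keeps this list short
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None
```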
&lt;h3&gt;Analysis&lt;/h3&gt;
&lt;p&gt;Compactions are expensive and can lead to bursty performance. Tablet servers also aren&#39;t necessarily co-located with the GFS nodes that durably store their tablets, meaning reads often have to go remote. Load imbalances also seem like they&#39;d be a big problem, even with the ability to split and migrate tablets, since only one node can handle reads and writes for a given tablet (which does simplify&amp;nbsp;consistency).&lt;/p&gt;
&lt;p&gt;All that said, it does scale up pretty huge, and sees widespread use within Google. There are lots of good points too. I like their shootdown protocol for maintaining BigTable membership; if the master can&#39;t reach a tablet server, it force-removes its Chubby lock to make sure the tablet server kills itself. Masters also kill themselves if they lose their Chubby lock on master-ship. This prevents partitions from leading to divergent state, and is a cute idea (despite the name and terminology). BigTable also shows that a simple design with just row-level transactions can work for a broad array of applications, which is a heartening thought for the systems guy in&amp;nbsp;me. &lt;/p&gt;

   </content></entry><entry><title>Paper review: DryadLINQ and FlumeJava</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/dryadlinq_flumejava_review.html"/><updated>2011-11-02T16:57:00Z</updated><published>2011-11-02T16:57:00Z</published><id>http://www.umbrant.com/blog/2011/dryadlinq_flumejava_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I&#39;m combining two paper reviews this time, for &quot;FlumeJava: Easy, Efficient Data-Parallel Pipelines&quot; (PLDI &#39;10) and &quot;DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language&quot; (OSDI &#39;08). These are both high-level languages that compile down to MapReduce and Dryad respectively, and I think they share a lot of&amp;nbsp;similarities.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main Ideas of&amp;nbsp;DryadLINQ&lt;/h3&gt;
&lt;p&gt;The basic premise for both FlumeJava and DryadLINQ is that it&#39;s hard to write &quot;raw&quot; MapReduce or Dryad programs (especially true for Dryad), and that really, they should be treated as an underlying execution engine for a higher-level, declarative DSL embedded directly in a common productivity language. These higher-level languages then compile down into a lower-level execution plan (e.g., MapReduce jobs, or a Dryad DAG). This makes it somewhat analogous to how a database works, especially since LINQ can use SQL Server as an execution engine. It doesn&#39;t enforce a schema or give the other nice properties of a traditional RDBMS, and it lets queries be written in either a more SQL-like or a more object-oriented&amp;nbsp;approach.&lt;/p&gt;
&lt;p&gt;Talking a little more about LINQ, it&#39;s a query language that can be used directly in .NET languages. Microsoft basically swapped out the normal SQL Server backend for Dryad, meaning that there&#39;s excellent language integration because of the maturity of LINQ. The &quot;schema&quot; is thus defined by the application&#39;s use of datatypes, rather than enforced by the database. This abstraction is made even better by the fact that DryadLINQ will automatically partition data across nodes (I think according to access pattern or data&amp;nbsp;type).&lt;/p&gt;
&lt;p&gt;Because it compiles down to Dryad (a pretty flexible execution engine), it allows for optimizations beyond what is possible with MapReduce, namely in smartly reusing in-memory data, avoiding disk writes after each stage, more efficient modes of moving data, and more flexible execution DAGs. Furthermore, they can also do runtime optimization for making efficient aggregation&amp;nbsp;trees.&lt;/p&gt;
&lt;h3&gt;Main Ideas of&amp;nbsp;FlumeJava&lt;/h3&gt;
&lt;p&gt;FlumeJava is a pure Java library that provides special Java collections and operations which get translated into MapReduce jobs. It also serves a similar purpose as something like Pig, where one of the primary advantages is transparently chaining together multiple MapReduce jobs into a processing pipeline. FlumeJava also does a bunch of optimizations on the resulting dataflow graph to combine and optimize the different stages, but still has to deal with ultimately reading from and writing to disk between stages (unlike DryadLINQ, which supports in-memory transfer). It also takes care of messy things like creating and deleting the inter-stage files, as you&#39;d&amp;nbsp;expect.&lt;/p&gt;
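&lt;p&gt;The deferred-evaluation idea is easy to sketch: operations only build up a plan, and the &quot;optimizer&quot; fuses chained stages into a single pass instead of materializing intermediate files. This is a toy Python model of the concept, not FlumeJava&#39;s actual API:&lt;/p&gt;

```python
class DeferredCollection:
    """Toy model of FlumeJava-style deferred evaluation (hypothetical API).

    parallel_do records a function instead of running it; run() fuses the
    recorded chain into one pass, the way FlumeJava fuses ParallelDo
    stages before emitting MapReduce jobs.
    """
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []

    def parallel_do(self, fn):
        # no work happens here; we only extend the execution plan
        return DeferredCollection(self.data, self.ops + [fn])

    def run(self):
        # "fusion": a single pass applies every recorded stage, instead
        # of writing intermediate files between stages
        out = []
        for item in self.data:
            for fn in self.ops:
                item = fn(item)
            out.append(item)
        return out
```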
&lt;p&gt;The result is something that comes really close to the performance of a hand-optimized MapReduce pipeline, meaning there really isn&#39;t much reason for people to write raw MapReduce at Google anymore. Since it&#39;s just a library, it&#39;s easy to bring the same sort of functionality to other languages. FlumeC++ already exists, and it shouldn&#39;t be that hard to make a FlumePython or the like&amp;nbsp;too.&lt;/p&gt;
&lt;h3&gt;Future&amp;nbsp;Relevance&lt;/h3&gt;
&lt;p&gt;Not writing raw MapReduce / Dryad code is a lesson we&#39;ve learned from all of the higher level languages (Pig, Hive, Spark, and these two). The future definitely looks more declarative, and I like the direct language integration of DryadLINQ and FlumeJava more than introducing a new DSL like Hive or Pig. It makes it effortless to do large scale computation in a language that you already&amp;nbsp;know.&lt;/p&gt;
&lt;p&gt;That said, all of these approaches are sort of converging. There really aren&#39;t that many types of operations that map well to the MapReduce model, and all the approaches pretty much have all of them. I don&#39;t think there&#39;s much more to be done here on the research front. It comes down to ease of use and debugging at this point rather than the programming model itself, which is actually one of the big wins of Pig (the debugging&amp;nbsp;console).&lt;/p&gt;

   </content></entry><entry><title>Paper review: PNUTS</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/pnuts_review.html"/><updated>2011-11-01T12:59:00Z</updated><published>2011-11-01T12:59:00Z</published><id>http://www.umbrant.com/blog/2011/pnuts_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review of &quot;PNUTS: Yahoo!&#39;s Hosted Data Serving Platform&quot;, published at VLDB in 2008. This is a distributed, cross-datacenter key-value store that introduces the notion of &quot;timeline consistency&quot; for records, which is stronger than mere eventual consistency, and is still easy for programmers to reason about. One of my favorite papers from the reading&amp;nbsp;list.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main&amp;nbsp;ideas&lt;/h3&gt;
&lt;p&gt;I like PNUTS a lot, and it&#39;s always a little hard to criticize industry papers that present a real-world production system that is being used by thousands of internal programmers and serves millions of records a day. That said, PNUTS does a better job than others (&lt;em&gt;cough&lt;/em&gt; Dremel &lt;em&gt;cough&lt;/em&gt;) in presenting where it lies in the great distributed storage system design space. It&#39;s got the sweet spot of an architecture that is conceptually simple, a consistency model that is easy to understand, and a complete API. One of the core questions while I was reading Megastore was how application programmers were possibly supposed to design their applications around such a complicated system; PNUTS feels comparatively much easier. An easy-to-understand consistency model and API is actually way more important than slightly stronger guarantees, unless you&#39;re only designing for the ubermensch programmer (foolish, even at a place like&amp;nbsp;Google).&lt;/p&gt;
&lt;p&gt;So, let&#39;s talk about timeline consistency, the model presented (if not invented) by PNUTS. PNUTS uses a &lt;em&gt;record-level mastering&lt;/em&gt; scheme that requires each record to be &quot;owned&quot; by a single replica. All writes to this record have to go through this replica, meaning that we have record-level serializability (the same sort of guarantee given by lots of key-value stores). Write propagation is done asynchronously by using the pub/sub Yahoo! Message Broker to avoid synchronous inter-datacenter roundtrips. This means there is some potential durability/availability loss of writes if the YMB fails, but Raghu in his talk indicated this was a very low probability. There&#39;s also a write availability loss if the master replica goes down, since there might be pending writes at the&amp;nbsp;master.&lt;/p&gt;
&lt;p&gt;API-wise, we&#39;re presented with a &quot;choose your own consistency&quot; model for reads and a test-and-set write operation, besides normal blind reads and writes. Blind reads and writes don&#39;t have any special semantics; timeline consistency says that reads are always consistent, just potentially stale. Reads can also specify a minimum version, or ask for a fully up-to-date version. Test-and-set write lets apps do lightweight optimistic concurrency: do a read (getting a version), then do a test-and-set write that only succeeds if the version still matches the version read, aborting and retrying if&amp;nbsp;not.&lt;/p&gt;
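&lt;p&gt;A sketch of that optimistic-concurrency loop, using a toy single-master record store in Python (hypothetical names, not the PNUTS API):&lt;/p&gt;

```python
class MasterReplica:
    """Toy per-record-versioned store with test-and-set writes."""
    def __init__(self):
        self.store = {}    # key maps to (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))

    def put(self, key, value):
        # blind write: always succeeds, bumps the record version
        version, _ = self.store.get(key, (0, None))
        self.store[key] = (version + 1, value)

    def test_and_set(self, key, expected_version, value):
        version, _ = self.store.get(key, (0, None))
        if version != expected_version:
            return False          # stale read: caller re-reads and retries
        self.store[key] = (version + 1, value)
        return True
```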
&lt;p&gt;You can effectively emulate &quot;cross record&quot; transactions by packing all your data into the same PNUTS record or denormalizing (with, of course, a loss in flexibility or consistency), which might be why Raghu says that Yahoo!&#39;s developers don&#39;t need cross-record consistency&amp;nbsp;guarantees.&lt;/p&gt;
&lt;p&gt;PNUTS also will dynamically transfer master responsibilities to a geographic replica closer to where writes are being sent, to reduce latency. My impression is that this is a fairly lightweight operation, since all that really needs to happen is transferring the master bit, and delaying writes while waiting for the old master&#39;s writes to flush. YMB only gives total ordering on messages sent from the same datacenter, which is why the new master has to&amp;nbsp;wait.&lt;/p&gt;
&lt;h3&gt;Future&amp;nbsp;relevance&lt;/h3&gt;
&lt;p&gt;It&#39;s still an open question whether web applications really need multi-record transactions or not, since the claim by Yahoo! and the PNUTS team is that they haven&#39;t seen a need from their own developers. Staleness is okay, inconsistency and reordering is not. I find the consistency model easy to grok, and Raghu indicated that there&#39;s no real desire to significantly change or redesign the system. The paper states &quot;multi-record updates&quot; and &quot;eventual consistency&quot; as future work, but that hasn&#39;t happened in the 3 years since the paper was published because of a lack of demand. I find that tremendously interesting, and a very compelling backing for this intermediate kind of consistency&amp;nbsp;model.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Dremel</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/dremel_review.html"/><updated>2011-10-31T13:16:00Z</updated><published>2011-10-31T13:16:00Z</published><id>http://www.umbrant.com/blog/2011/dremel_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review of &quot;Dremel: Interactive Analysis of Web-Scale Datasets&quot;, published in short form at VLDB in 2010. This is a large-scale, interactive analytics engine built by Google that handles adhoc queries on terabytes to petabytes of data, returning aggregate results in seconds to minutes (and if that&#39;s not cool, I don&#39;t know what is). However, the paper didn&#39;t cover query execution, and instead talked about the novel, but kind of boring, nested hierarchical storage format. I&#39;m still hoping for a more systems-y paper in the&amp;nbsp;future.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main&amp;nbsp;ideas&lt;/h3&gt;
&lt;p&gt;Dremel was optimized for one thing: scanning through lots of read-only data really fast, and generating aggregate results that reflect something like 99% or all of the data (99% letting you chop off the latency tail). This makes it great for doing adhoc drilldown analytics, when you&#39;re trying to poke at data from many angles to identify what you want, before writing a more involved analytics program in a different language to analyze it. It&#39;s really &lt;em&gt;not&lt;/em&gt; optimized for doing point lookups, updates, or more complicated analytics: big scans that result in aggregate numbers is the name of the game. Unlike Pig or Hive, it does this with its own custom query execution engine, rather than compiling down to the &quot;common substrate&quot; of&amp;nbsp;MapReduce.&lt;/p&gt;
&lt;p&gt;The nested data format makes use of &quot;repetition&quot; and &quot;definition&quot; levels to specify where in the hierarchy any given value in a column sits. These let us reconstruct the entire nested data structure while storing only the leaves of the tree, but it means the system has to scan to find out which record a value belongs to, since the position depends on all previous entries in the column. The use of null values together with repetition and definition levels also allows really easy compression of null values and supports the &quot;wide but sparse&quot; style of BigTable, where there are a lot of columns but not all are filled&amp;nbsp;out.&lt;/p&gt;
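&lt;p&gt;For the simplest case of a single repeated field, record assembly from a flat column can be sketched like this. This is a simplification of my own; the real format also carries definition levels to encode missing fields:&lt;/p&gt;

```python
def assemble(column):
    """Rebuild records from a flat Dremel-style column (simplified).

    Each entry is (value, repetition_level): level 0 starts a new
    record, level 1 appends to the current record's repeated field.
    Note the record boundary is only recoverable by scanning from the
    start, as described above.
    """
    records = []
    for value, rep_level in column:
        if rep_level == 0:
            records.append([value])
        else:
            records[-1].append(value)
    return records
```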
&lt;p&gt;The rest is less interesting. Dremel is queried in an SQL-like language, and works best when computing aggregates over a few columns. Using many columns is expensive because they need to be joined. It also makes good use of multi-level aggregation trees to get better parallelism, since each individual aggregator has to process less data. In-memory caching and prefetching further improve performance. Because of the efficient data storage format, it can often read an order of magnitude less data than a comparable row-oriented MR job. Dremel&#39;s more efficient execution engine results in another order of magnitude&amp;nbsp;speedup.&lt;/p&gt;
&lt;p&gt;Fault-tolerance and straggler detection also play in to execution time. When trying to run a 10-sec query on thousands of nodes, it&#39;s very likely that you&#39;re going to be hitting a slow node or two. This is why Dremel allows for &quot;99.9%&quot; type results, that reflect almost all, but not quite all, of the&amp;nbsp;data.&lt;/p&gt;
&lt;h3&gt;Future&amp;nbsp;relevance&lt;/h3&gt;
&lt;p&gt;I like the idea of custom systems besides MapReduce built for specific tasks like this. Google chose to make a system that does one thing really well, with clear tradeoffs in terms of performance and features. They gave up the ability to modify the data or do point lookups, and got in return a system that is two orders of magnitude faster than MapReduce on its target workload. There&#39;s clearly a need for query systems more interactive than MapReduce, though Dremel is also clearly not as general-purpose and not a complete MR&amp;nbsp;replacement.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Dynamo</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/dynamo_review.html"/><updated>2011-10-27T10:35:00Z</updated><published>2011-10-27T10:35:00Z</published><id>http://www.umbrant.com/blog/2011/dynamo_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review for Dynamo, the tunable consistency/availability/durability key-value store built by Amazon. It&#39;s based on the Chord DHT, and was published at SOSP in 2007. It&#39;s also one of my favorite&amp;nbsp;papers.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main&amp;nbsp;idea&lt;/h3&gt;
&lt;p&gt;This is an industrial paper, so the novelty comes from the engineering effort that goes into making Chord practical for the datacenter. The authors clearly did their homework before building the system, resulting in the practical application of a number of different&amp;nbsp;techniques.&lt;/p&gt;
&lt;p&gt;I want to start by talking about the usecase that Dynamo was designed for. A DHT key-value store has the major benefits (generally speaking) of being relatively simple, quite fault-tolerant, good at spreading load, and easy to scale. The downsides (again, generally speaking) are the slightly erratic behavior in terms of consistency and routing performance, and undefined behavior when it comes to actually storing and moving data around. Chord, for instance, defines just a routing protocol. After you finish hopping around to get to the node responsible for a key, the data itself isn&#39;t necessarily on that node; that node might only point at the true location of the data, meaning one more&amp;nbsp;lookup.&lt;/p&gt;
&lt;p&gt;Amazon&#39;s primary usecase for Dynamo seems to be for its shopping cart, where it&#39;s really important to have highly available, even if slightly inconsistent, writes. This works really well since shopping cart updates are pretty commutative; it&#39;s easy to just take the union of divergent shopping carts, and reach a mostly consistent state. There can still be problems (what if the user adds the same item once in each cart? What if they add and delete in one and add in the other?), but these can be kicked up to the user at checkout time and resolved manually. It&#39;s not to say that this happens very often at all, but when nodes do fail, almost normal-looking operation can&amp;nbsp;continue.&lt;/p&gt;
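&lt;p&gt;As a sketch, a union-style merge of divergent cart replicas might look like the following (my own toy version, which also shows why a deleted item can come back and get kicked up to the user at checkout):&lt;/p&gt;

```python
def merge_carts(replica_a, replica_b):
    """Union-merge two divergent cart replicas (toy sketch).

    Carts map item to quantity; on conflict we keep the larger
    quantity. Because the merge is a union, an item deleted in one
    replica but present in the other survives the merge, one of the
    anomalies resolved manually at checkout.
    """
    merged = dict(replica_a)
    for item, qty in replica_b.items():
        merged[item] = max(qty, merged.get(item, 0))
    return merged
```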
&lt;p&gt;The secret sauce here is Amazon&#39;s tunable R+W&amp;gt;N consistency model. The application programmer using Dynamo specifies the number of replicas that must be contacted on a read (R) or updated on a write (W). As long as R+W is greater than N, the total number of replicas, we should be able to provide consistency to the user (assuming we can correctly merge writes). This means for a typical replication factor of N=3, the programmer can specify highly available writes and slower but consistent reads (3+1&amp;gt;3), a more balanced approach (2+2&amp;gt;3), or fast reads for a read-heavy workload (1+3&amp;gt;3). Increasing N increases the replication factor, meaning better durability. Choosing R+W&amp;le;N lets you play the brave game of eventual consistency, relying more on your merge function to do the right&amp;nbsp;thing.&lt;/p&gt;
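&lt;p&gt;The quorum arithmetic is simple enough to write down directly. A minimal sketch (my own helper names, not Dynamo&#39;s API):&lt;/p&gt;

```python
def quorum_overlap(r, w, n):
    """Replicas guaranteed common to a read quorum and the last write quorum."""
    return max(r + w - n, 0)

def is_consistent(r, w, n):
    # R+W greater than N: every read quorum intersects every write
    # quorum in at least one replica, so reads see the latest write
    return min(quorum_overlap(r, w, n), 1) == 1
```

Clamping the overlap to 1 just turns "at least one replica in common" into a boolean.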
&lt;p&gt;A couple notes to close out. Amazon&#39;s metric for Dynamo was 99.9th percentile latency, the first time it was indicated to me that variation in latency, rather than average latency, is the real killer. Dynamo also utilized the Chord ring-membership protocol, but used O(1) routing instead of Chord routing, since it&#39;s a datacenter environment where all the nodes are known and presumably long-lived. They used cool things like vector clocks and Merkle trees to do efficient detection and merging of updates. When the vector clocks diverge, the programmer has to provide the merge function (the default, and most heavily used, being last-writer-wins). These, and other details, are what made it such a revelation to&amp;nbsp;me.&lt;/p&gt;
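&lt;p&gt;Since vector clocks carry so much of the weight here, a quick sketch of the two operations that matter: checking whether one clock descends from another, and merging divergent ones (toy code of my own, not Dynamo&#39;s):&lt;/p&gt;

```python
import operator

def vc_descends(a, b):
    """True when clock a has seen every event recorded in clock b."""
    return all(operator.ge(a.get(node, 0), count) for node, count in b.items())

def vc_concurrent(a, b):
    # neither history contains the other: the app-supplied merge
    # function (or last-writer-wins by default) must reconcile them
    return not vc_descends(a, b) and not vc_descends(b, a)

def vc_merge(a, b):
    """Pointwise max of counters: the smallest clock dominating both."""
    merged = dict(a)
    for node, count in b.items():
        merged[node] = max(count, merged.get(node, 0))
    return merged
```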
&lt;h3&gt;Future&amp;nbsp;relevance&lt;/h3&gt;
&lt;p&gt;I think all of academia had a love affair with DHTs for a while, because of all the nice probabilistic and mathematical properties that they have. Chord is still one of the coolest papers ever to me. However, for the datacenter environment, we have to wonder if this is the right model. I wonder how many of the properties of a DHT are really necessary. Fault-tolerance via replication is not unique to DHTs, neither is elastic scaling or load balancing. I find the &quot;choose your own consistency&quot; to be cool, but the apparent result was that most programmers just left everything at default. Default R, W, N, default merge function. Eventual consistency is also a weak model, and Dynamo can give you either fully consistent (slow, low availability), consistent if you rely on your merge function (dubious), or eventually consistent&amp;nbsp;(eww).&lt;/p&gt;
&lt;p&gt;Thus, I&#39;m making the call that for datacenters, pure DHTs like Dynamo don&#39;t really make sense. We need a stronger consistency model, and we need it to be more automatic and easy for programmers to reason&amp;nbsp;about.&lt;/p&gt;

   </content></entry><entry><title>Paper review: SCADS</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/scads_paper_review.html"/><updated>2011-10-12T15:51:00Z</updated><published>2011-10-12T15:51:00Z</published><id>http://www.umbrant.com/blog/2011/scads_paper_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review of &quot;SCADS: Scale-Independent Storage for Social Computing Applications&quot; by Armbrust et al. This was published at CIDR in 2009. In a nutshell, SCADS is a key-value store that lets programmers choose their own consistency model and semantics, and restricts queries to be &quot;scale-independent&quot;, i.e. requiring a constant amount of&amp;nbsp;work.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main&amp;nbsp;idea&lt;/h3&gt;
&lt;p&gt;I think SCADS chooses an interesting point in the scalable storage design space to focus&amp;nbsp;on.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple key-value storage&amp;nbsp;interface&lt;/li&gt;
&lt;li&gt;A query language that only allows constant-time requests (no O(n) operations that fail at&amp;nbsp;scale)&lt;/li&gt;
&lt;li&gt;Declarative, tunable consistency models, letting the programmer specify consistency at the level of application&amp;nbsp;requirements&lt;/li&gt;
&lt;li&gt;Scale-up/scale-down architecture designed for incremental cloud&amp;nbsp;pricing&lt;/li&gt;
&lt;/ul&gt;
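&lt;p&gt;A toy illustration of the scale-independence restriction (the API here is invented for the example, not SCADS&#39;s actual interface): per-key operations and small bounded ranges are fine, while queries whose cost grows with the data size are rejected up front:&lt;/p&gt;

```python
# Hypothetical interface for the example; not SCADS's actual API.
class ScaleIndependentStore:
    RANGE_LIMIT = 100  # arbitrary per-query cap

    def __init__(self):
        self.data = {}

    def put(self, key, value):      # O(1)
        self.data[key] = value

    def get(self, key):             # O(1)
        return self.data.get(key)

    def get_range(self, prefix, limit):
        # The caller must declare a bound under the cap, so no accepted
        # query's cost grows with the total size of the store. (The toy
        # scan below is O(n); a real system would use an index.)
        if limit > self.RANGE_LIMIT:
            raise ValueError("not scale-independent: unbounded query cost")
        hits = [v for k, v in sorted(self.data.items()) if k.startswith(prefix)]
        return hits[:limit]
```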
&lt;p&gt;This makes me feel that the comparison against Facebook&#39;s heavily sharded MySQL cluster behind memcached is kind of unfair, because they serve pretty different use cases, but there is still a lot of merit behind the ideas in&amp;nbsp;SCADS.&lt;/p&gt;
&lt;p&gt;SCADS doesn&#39;t seem designed for ad-hoc queries, since handling requests in constant time can require building indexes, which is potentially quite expensive. Updating those indexes is also a potentially high cost. I&#39;m not really sure how to keep both reads and writes in constant time here, since denormalizing means a single logical write might fan out into O(n) physical&amp;nbsp;writes.&lt;/p&gt;
&lt;p&gt;I really, really like the idea of declarative specifications for consistency, performance, and other application constraints. I feel like application developers really shouldn&#39;t have to reason about the details of replication, data placement, and consistency; they should be able to state what they want at a high level in terms of application requirements, and have the system figure out how to achieve this. This pushes the responsibility down to the people running the storage system, who are hopefully better able to reason about machine failure rates, the types of failures, and the consistency and durability properties of the&amp;nbsp;system.&lt;/p&gt;
&lt;h3&gt;Future&amp;nbsp;relevance&lt;/h3&gt;
&lt;p&gt;As I hinted above, I really like the idea of declarative specification of application requirements. It&#39;s not an easy problem to translate this into the low-level SLOs that can actually be enforced by a cluster resource manager, but it&#39;s a good one. Providing these guarantees to all the different kinds of applications running on a cluster is the end goal. This is probably hard to do generally without some application-level semantics about the incoming requests (SCADS and its constant-time-only requests, for&amp;nbsp;instance).&lt;/p&gt;
&lt;p&gt;At the very least, I&#39;d like to see this done for a more general purpose storage layer, perhaps something like GFS or&amp;nbsp;Bigtable.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Hive and Pig</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/hive_pig_paper_review.html"/><updated>2011-10-09T17:44:00Z</updated><published>2011-10-09T17:44:00Z</published><id>http://www.umbrant.com/blog/2011/hive_pig_paper_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review of &quot;Hive: Data Warehousing &lt;span class=&quot;amp&quot;&gt;&amp;amp;&lt;/span&gt; Analytics on Hadoop&quot; by the Facebook Data Team (a &lt;a href=&quot;http://www.slideshare.net/zshao/hive-data-warehousing-analytics-on-hadoop-presentation&quot;&gt;set of slides&lt;/a&gt;), and &quot;Pig Latin: A Not-So-Foreign Language for Data Processing&quot; by Olston et al. The Hive slide deck is, I believe, from 2009, and Pig was published at SIGMOD in 2008. I supplemented this with the Hive paper published at VLDB in&amp;nbsp;2009.&lt;/p&gt;
&lt;p&gt;These are Facebook and Yahoo&#39;s approaches to higher-level languages that compile down to MapReduce on Hadoop. Measured by the percentage of Hive and Pig jobs on their production clusters, they have both been extremely successful. Hive takes a traditional SQL/database-like approach, while Pig looks more imperative. At face value they seem quite different, but there are actually a bunch of underlying&amp;nbsp;similarities.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Hive main&amp;nbsp;ideas&lt;/h3&gt;
&lt;p&gt;Hive is effectively a traditional database that just uses HDFS and MapReduce for data storage and query execution. Tables are serialized and deserialized to files in HDFS, and can be partitioned and further bucketed by column values. The query language, HiveQL, looks exactly like SQL minus some of the more complicated operators, owing to limited engineering effort and the limitations of MapReduce. UDFs are also supported, meaning that normal MapReduce code can be slid right in. HiveQL is compiled down into an MR query plan which can consist of multiple MR jobs. The logical plan is optimized by a rule-based optimizer (future work being an adaptive cost-based&amp;nbsp;optimizer).&lt;/p&gt;
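&lt;p&gt;The partitioning scheme is easy to picture: each partition is just a subdirectory in HDFS, so partition pruning means reading fewer directories. A rough sketch (table and column names invented; real Hive lays partition directories out in declaration order, while I sort them here for determinism):&lt;/p&gt;

```python
# Sketch of Hive-style partition-to-path mapping; names are made up.
def partition_path(warehouse, table, **partition_cols):
    parts = "/".join(f"{k}={v}" for k, v in sorted(partition_cols.items()))
    return f"{warehouse}/{table}/{parts}"

print(partition_path("/user/hive/warehouse", "page_views",
                     ds="2009-01-01", country="US"))
# /user/hive/warehouse/page_views/country=US/ds=2009-01-01
```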
&lt;p&gt;Queries can be submitted via a Thrift server, which enables Hive usage from a variety of different programming languages. A small note is that table metadata is stored outside of HDFS, in a normal database. This is simply because the amount of data is small, and the access pattern is pretty random, making HDFS&amp;nbsp;ill-suited.&lt;/p&gt;
&lt;h3&gt;Pig main&amp;nbsp;ideas&lt;/h3&gt;
&lt;p&gt;Pig is designed explicitly for ad-hoc data analysis by programmers. The query language looks like Python with operators pulled from SQL, and instead of tables, users are given more programmer-friendly data structures like maps and lists. UDFs are also first-class citizens in Pig, and can have arbitrary inputs and outputs (non-atomic&amp;nbsp;values).&lt;/p&gt;
&lt;p&gt;All this means that the query language and data format are more flexible. Hive needs to do the classic ETL (extract-transform-load) to get data into tables before it can query it. With Pig, you just pass it a file and a function explaining how to interpret it. This likely comes at a performance cost, but using the standard deserializers and a schema would ameliorate this. Pig also allows for more explicit control over the query plan, since each stage in the execution DAG is as&amp;nbsp;programmed.&lt;/p&gt;
&lt;p&gt;As in Hive, Pig does not provide some operators because of the limitations of MapReduce. Also as in Hive, statements using the SQL-like operators can be optimized, and multiple MapReduce jobs are chained together for you&amp;nbsp;automatically.&lt;/p&gt;
&lt;p&gt;One thing I really like about Pig is the focus on debugging. Trying to reason about a page of SQL is really difficult, and it&#39;s much easier to reason about Pig&#39;s series of steps. It looks extremely similar to how I do ad-hoc text parsing in Python: gradually applying operators to collections of strings until I get the result I want. Pig also provides an &quot;example execution table&quot; that shows what the Pig program does on a small amount of data, which is much quicker than running the actual MapReduce&amp;nbsp;jobs.&lt;/p&gt;
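&lt;p&gt;For flavor, here&#39;s that &quot;gradually apply operators&quot; style on plain Python collections (the log format and field names are made up), with the rough Pig Latin equivalent of each step in the comments:&lt;/p&gt;

```python
# Step-by-step dataflow over toy log lines; Pig Latin analogues in comments.
lines = [
    "alice /index.html 200",
    "bob /missing 404",
    "alice /about.html 200",
]
records = [line.split() for line in lines]          # LOAD ... AS (user, url, status)
ok = [r for r in records if r[2] == "200"]          # FILTER records BY status == 200
by_user = {}
for user, url, status in ok:                        # GROUP ok BY user
    by_user.setdefault(user, []).append(url)
counts = {u: len(urls) for u, urls in by_user.items()}  # FOREACH ... GENERATE COUNT(...)
print(counts)  # {'alice': 2}
```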
&lt;h3&gt;Analysis&lt;/h3&gt;
&lt;p&gt;It&#39;s handy that both Hive and Pig automatically string together MR jobs as part of one program, but you still pay the serialization overhead of writing things into intermediate files between jobs. This is something that isn&#39;t true with a more general execution framework like Dryad. The move towards more declarative languages, as I&#39;ve said previously, isn&#39;t surprising at all, since actually programming a MapReduce job is way more work than using something more high-level and declarative. For ad-hoc queries, it&#39;s way better to optimize for programmer productivity than try to squeeze out that last 20% of performance from writing it in raw&amp;nbsp;MR.&lt;/p&gt;
&lt;p&gt;Hive has been extremely popular at Facebook, and I think the same is true of Pig at Yahoo. I think the future is going to be improving the underlying Hadoop execution engine to better support ad-hoc queries by keeping intermediate files in memory, and improving the number of operators and optimizers for both&amp;nbsp;languages.&lt;/p&gt;

   </content></entry><entry><title>Paper review: MapReduce and Dryad</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/dryad_mapreduce_review.html"/><updated>2011-09-30T11:58:00Z</updated><published>2011-09-30T11:58:00Z</published><id>http://www.umbrant.com/blog/2011/dryad_mapreduce_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is another combined paper review, because the ideas are again pretty similar, and it&#39;s a useful compare/contrast. The first is the famous MapReduce paper from Google, and the second is Microsoft&#39;s response,&amp;nbsp;Dryad.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;MapReduce: Simplified Data Processing on Large Clusters&quot;, Dean and Ghemawat. Published at OSDI&amp;nbsp;2004.&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks&quot;, Isard et al. Published at Eurosys&amp;nbsp;2007.&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main ideas from&amp;nbsp;MapReduce&lt;/h3&gt;
&lt;p&gt;MapReduce is a parallel data processing framework designed initially for a very specific task: scanning large amounts of textual data to create a web search index. It was essentially co-designed with GFS for this purpose. As a result, it boils computation down to just two phases: &lt;em&gt;map&lt;/em&gt;, followed by &lt;em&gt;reduce&lt;/em&gt;. Programmers have to write just two functions, one for each phase. Then, these two functions are run in massively parallel fashion: mappers run the map function, and the output from mappers is then fed to reducers, which run the reduce function. Mappers are scheduled for data-locality, moving computation to where data is stored to minimize network communication. The map phase essentially does some data-parallel operation, while the reduce phase aggregates results from the map phase to produce the final&amp;nbsp;output.&lt;/p&gt;
&lt;p&gt;The cool parts of this paper are twofold: first, that such a simple, limited programming model can accommodate such a wide variety of tasks, and that almost all of the complexity of running code on thousands of machines can be abstracted away from the&amp;nbsp;programmer.&lt;/p&gt;
&lt;p&gt;Regarding the programming model, it&#39;s something that can be taught in a matter of days. Map and reduce are familiar from functional programming languages, and really small amounts of code can do very powerful things. It is quite limited (only works for data-parallel operations), but when dealing with big data, your operations basically have to be data-parallel to complete in any reasonable time. Google used MapReduce to do a wide variety of tasks, so the proof of utility is in the&amp;nbsp;pudding.&lt;/p&gt;
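&lt;p&gt;A minimal word count in the two-function style, with a toy local runner standing in for the real framework (this is a sketch of the model, not Google&#39;s implementation):&lt;/p&gt;

```python
from collections import defaultdict

def map_fn(_, line):             # emit (word, 1) for each word in a line
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):     # sum the partial counts for one key
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)   # stand-in for the shuffle phase
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))

print(run_mapreduce(enumerate(["the quick fox", "the fox"]), map_fn, reduce_fn))
# {'the': 2, 'quick': 1, 'fox': 2}
```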
&lt;p&gt;The distributed, fault tolerant framework is what really drew me in. A single master manages all the workers (which potentially means a single point of failure), but a master failure is way less likely than a worker failure. Worker failures are handled transparently, by re-executing the failed worker&#39;s tasks elsewhere. This is possible because the output from the map stage is durably written to disk storage, then read by the reducers. Mapper input, of course, is durably stored as well, so map tasks can also be easily restarted. Another feature I liked is chopping off the latency tail caused by &quot;straggler&quot; workers by starting duplicate tasks towards the end of the&amp;nbsp;job.&lt;/p&gt;
&lt;p&gt;In summation, MapReduce is both a powerful and simple framework that saw a lot of use at Google for a variety of&amp;nbsp;tasks.&lt;/p&gt;
&lt;h3&gt;Main ideas from&amp;nbsp;Dryad&lt;/h3&gt;
&lt;p&gt;Dryad is what some people see as &quot;MapReduce done right&quot;, but this is a contentious claim. It&#39;s a more general framework in two important ways. First, it allows for more general styles of computation, meaning more than just two phases, and more than just map and reduce elements in the graph. Second, it allows communication between stages to happen over more than just files stored in the DFS: Dryad allows for sockets, shared memory, and pipes to be used as channels between elements. It ultimately ends up looking like a DAG of user-defined elements. Data flows between elements over a choice of channels, and the elements are all user-defined. This has a number of benefits: more efficient communication, the ability to chain together multiple stages, and the ability to express more complicated&amp;nbsp;computation.&lt;/p&gt;
&lt;p&gt;This leads to a number of complications. While it does subsume the MapReduce paradigm, with generality comes complexity. The programming model is nowhere near as simple (the authors cite &quot;a couple weeks&quot; to get started), and to me, it feels like doing the work of a database: designing all the elements and communication in a physical query plan, and optimizing it. They even do a direct comparison against SQL Server in the paper, in fact showing that they have similar query plans but Dryad comes out a little bit faster. Doing this really isn&#39;t simple at all, and the example queries they show do nothing to deny it. I consider dataflow programming (what this is, essentially) to be difficult to reason about for most&amp;nbsp;programmers.&lt;/p&gt;
&lt;p&gt;Dryad also incorporates the same fault-tolerance as MapReduce, and is able to restart failed tasks correctly. It also has this idea of &quot;dynamic runtime optimization&quot;, which sounds very DB; this is hard for MapReduce to do, since user-written map and reduce functions are opaque to the framework, the equivalent of UDFs in a&amp;nbsp;database.&lt;/p&gt;
&lt;h3&gt;Comparison and&amp;nbsp;evaluation&lt;/h3&gt;
&lt;p&gt;My impression is that people were pretty unhappy with Dryad when it came out. It&#39;s not nearly as elegant as MapReduce, there aren&#39;t any cool operational insights, and it feels very &quot;me too&quot;. However, as stated in the Dryad paper, programmers aren&#39;t really meant to interface with Dryad directly, and are instead supposed to use things like DryadLINQ (which turns declarative LINQ queries into Dryad execution graphs, exactly how everyone wanted). This is true for MapReduce too, since FlumeJava has seen heavy use at Google, and Hive and Pig dominate Hadoop workloads at Facebook. As nice and &quot;simple&quot; as MapReduce is compared to Dryad, no one is directly programming on either these days, instead doing the DB-like thing and using declarative query&amp;nbsp;languages.&lt;/p&gt;
&lt;p&gt;Dryad also did correctly identify all the flaws with MapReduce, flaws that have to be papered over and hacked around to get the same kind of performance and generality. Hadoop is going to have to become more memory aware to eke out additional performance, and there are &quot;workflow management&quot; tools that allow chaining of multiple MapReduce jobs to effectively achieve multi-stage workflows. As long as the user never has to worry about the details, declarative execution engines built on top of Dryad rather than Hadoop have an&amp;nbsp;advantage.&lt;/p&gt;
&lt;p&gt;In terms of future relevance, I think that the basic idea of hiding faults and communication from the programmer is totally the right idea. It&#39;s way easier to write programs within a Dryad or MapReduce framework than something like MPI, which didn&#39;t hide anything. The DB community had it right though in calling for declarative query languages, and Hadoop and MapReduce these days are essentially being used as distributed query execution engines. I think we&#39;re going to see a wider variety of query languages in the future though, since there&#39;s a tradeoff between generality and simplicity. I doubt Hive and FlumeJava are the final word. There&#39;s also room for other types of query execution engines; Pregel&#39;s BSP is an&amp;nbsp;example.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Megastore</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/megastore_review.html"/><updated>2011-09-24T19:35:00Z</updated><published>2011-09-24T19:35:00Z</published><id>http://www.umbrant.com/blog/2011/megastore_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review of &quot;Megastore: Providing Scalable, Highly Available Storage for Interactive Services&quot; by Baker et al. This was published at CIDR in 2011. The basic idea is providing ACID semantics across geographically-distant datacenters with highly partitioned datasets and an efficient Paxos replication&amp;nbsp;scheme.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main&amp;nbsp;ideas&lt;/h3&gt;
&lt;p&gt;The basic premise of Megastore is that some applications require strong ACID semantics, while also desiring the fault-tolerance that comes with cross-datacenter replication. They claim that existing solutions (like a heavily sharded MySQL database) do not fill this niche because they are hard to manage and scale, driving a need for Megastore. Megastore does this by asking application developers to partition their data into &lt;em&gt;entity groups&lt;/em&gt;, where each group represents a relatively small amount of data: the profile for one user, or a single blog account. Operations within the group get full ACID semantics; cross group operations have to build their own consistency model, perhaps two-phase commit, or something looser. Megastore also allows applications to do less-consistent reads for lower&amp;nbsp;latency.&lt;/p&gt;
&lt;p&gt;The data model and query language for Megastore also differs from traditional RDBMSs. The data model isn&#39;t relational since it&#39;s built on top of Bigtable (which in turn, is on top of GFS), but is still strongly-typed and consists of properties within tables. The query language is more limited; being based on Bigtable means that there isn&#39;t support for joins. This is fixed either by denormalizing the data, or doing it in application code. There seem to be a lot of tricks for creating indexes and doing data placement&amp;nbsp;efficiently.&lt;/p&gt;
&lt;p&gt;Log replication is done by using Paxos to resolve each log entry before applying it. Multiple writers race to get a single leader to accept their write; failed attempts have to be retried. Performance-wise, they still have to do an inter-datacenter roundtrip even in the best case of a stable leader and being able to piggyback accepts and prepares. This means that they&#39;re never going to do better than a few writes per second per entity group; they quote a figure of 100-400ms of latency per write. This is okay as long as the entity groups are small and the per-group write rates are thus low. Reads can be done without a roundtrip by having a special coordinator in each datacenter which tracks when replicas become out of&amp;nbsp;date.&lt;/p&gt;
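&lt;p&gt;The back-of-the-envelope behind &quot;a few writes per second&quot; (the latency figures are from the paper; the arithmetic is mine): with one Paxos write at a time per entity group, per-group throughput is capped at roughly one over the write latency:&lt;/p&gt;

```python
# Serialized Paxos writes: throughput per entity group is 1 / latency.
for latency_ms in (100, 400):
    writes_per_sec = 1000 / latency_ms
    print(f"{latency_ms} ms per write caps a group at {writes_per_sec:.1f} writes/sec")
# 100 ms per write caps a group at 10.0 writes/sec
# 400 ms per write caps a group at 2.5 writes/sec
```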
&lt;h3&gt;Future&amp;nbsp;relevance&lt;/h3&gt;
&lt;p&gt;The biggest thing that stuck with me when reading this paper was that as a developer, this sounds really painful to use. Partitioning data that finely is painful, and you have to build inter-group consistency yourself. This indicates to me that schema changes might be common, but that&#39;s really painful since data is denormalized and there&#39;s all this schema-specific app-level code built on top to do joins and consistency. The claim that there is &quot;predictable performance&quot; from a lack of joins seems unsubstantiated. Megastore being built on top of Bigtable, which is in turn on top of GFS, means that it&#39;s very hard for developers to reason about what is actually going on under the hood. Furthermore, developers have to program around the super slow write rate. Hiding a slow Megastore write behind an asynchronous Javascript call sort of defeats the purpose of having&amp;nbsp;ACID.&lt;/p&gt;
&lt;p&gt;Compared to other Google papers like GFS and MapReduce, Megastore just seems way too complicated. It doesn&#39;t convince me that it&#39;s chosen the right point in the design space, or that it&#39;s fulfilling a particularly pressing need for real applications. I think it&#39;s still interesting to hear about, but I wouldn&#39;t pick this for a 10 year best paper&amp;nbsp;award.&lt;/p&gt;

   </content></entry><entry><title>Paper review: The Google File System</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/gfs_review.html"/><updated>2011-09-24T15:39:00Z</updated><published>2011-09-24T15:39:00Z</published><id>http://www.umbrant.com/blog/2011/gfs_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review for &quot;The Google File System&quot; by Ghemawat et al., published at SOSP in 2003. This is a fairly important paper, and directly inspired the architecture of&amp;nbsp;HDFS.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main&amp;nbsp;ideas&lt;/h3&gt;
&lt;p&gt;What Google did was look very carefully at their desired workload, and build a distributed filesystem specifically for that. GFS is very much not a general purpose filesystem, and I really like how they lay out quite clearly early on the assumptions they&amp;nbsp;make:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Files are almost all large, many&amp;nbsp;GBs&lt;/li&gt;
&lt;li&gt;Target throughput, not&amp;nbsp;latency&lt;/li&gt;
&lt;li&gt;Append-only. Cannot overwrite existing&amp;nbsp;data.&lt;/li&gt;
&lt;li&gt;Must be distributed and&amp;nbsp;fault-tolerant&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The problem they were basically trying to solve was doing log analytics at scale, meaning mostly long sequential writes of very large files. Spreading files across multiple disks is crucial to getting enough throughput and getting fault-tolerance. GFS can be viewed sort of like a distributed version of RAID&amp;nbsp;1.&lt;/p&gt;
&lt;p&gt;The architecture is a single GFS master which stores metadata for all the files, and a lot of chunkservers that store chunks (64MB) of files. The master is only used to look up chunk locations for a given range of a file; the actual reads and writes are done by directly accessing the appropriate chunkserver. All chunks are replicated across multiple chunkservers for durability and load balancing. Chunkservers talk to the master via heartbeat messages, upon which the master can piggyback commands like re-replicating or requests for chunk&amp;nbsp;lists.&lt;/p&gt;
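&lt;p&gt;The client-side arithmetic is simple: a byte range maps to a range of chunk indexes, and the master is consulted once for their locations before reading from chunkservers directly (the 64MB chunk size is from the paper; the helper function is my own sketch):&lt;/p&gt;

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, per the GFS paper

def chunk_indexes(offset, length):
    # Which chunks does the byte range [offset, offset + length) touch?
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))

# A 100 MB read starting at 10 MB touches chunks 0 and 1; one master
# lookup for their locations, then direct chunkserver reads.
print(chunk_indexes(10 * 1024 * 1024, 100 * 1024 * 1024))
# [0, 1]
```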
&lt;p&gt;Data consistency is made a lot easier by not having to worry about overwrites. It also means clients can cache chunk locations, since they change&amp;nbsp;rarely.&lt;/p&gt;
&lt;h3&gt;Future&amp;nbsp;relevance&lt;/h3&gt;
&lt;p&gt;GFS clearly does a good job at the application it was designed for: sequential reads for large files by data-parallel workloads. However, since HDFS has become sort of an industry standard for storing large amounts of data, it&#39;s increasingly being used for other types of workloads. HBase is one example of this (a more database-like column store), which definitely does a lot more random I/Os. Facebook also published a paper on doing real-time queries with MapReduce (and thus HDFS). The question is how well HDFS can be squeezed into these roles, and if other storage systems are necessary. For low-latency web serving this is definitely true (memcached and other k-v stores&amp;nbsp;dominate).&lt;/p&gt;
&lt;p&gt;In short, I don&#39;t think the MapReduce paradigm is going anywhere, and HDFS already feels like the standard answer to storing big data. I don&#39;t think it&#39;s going anywhere in the next&amp;nbsp;decade.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Paxos, Paxos, and Chubby</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/paxos_papers_review.html"/><updated>2011-09-22T14:53:00Z</updated><published>2011-09-22T14:53:00Z</published><id>http://www.umbrant.com/blog/2011/paxos_papers_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I&#39;m combining my paper reviews this time, since they are pretty closely&amp;nbsp;coupled:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;Paxos Made Simple&quot;, Lamport,&amp;nbsp;2001&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;Paxos Made Practical&quot;, Mazieres,&amp;nbsp;2007&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;The Chubby Lock Service for Loosely-coupled Distributed Systems&quot;, Burrows,&amp;nbsp;2006&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Paxos Made&amp;nbsp;Simple&lt;/h3&gt;
&lt;p&gt;Paxos is a distributed consensus algorithm. At its essence, it&#39;s a quorum-based fault-tolerant way of arriving at a single consistent value among a group of machines. This can be used to do leader election (consensus on who&#39;s the master), or synchronous strong consistency (replicating writes in a distributed database). In the case of Chubby, it&#39;s used for both: determining who holds a lock (leader election), and serving strongly consistent small&amp;nbsp;files.&lt;/p&gt;
&lt;p&gt;Paxos has two basic types of&amp;nbsp;actors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Proposers&lt;/em&gt; are nodes that are trying to get a value accepted as the &quot;true&quot; value during a round of consensus. Normally, which node is the proposer is pretty stable, and only changes on&amp;nbsp;failure.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Acceptors&lt;/em&gt; receive proposals from proposers, and vote as part of the quorum. They act as replicas for the global state of the&amp;nbsp;system.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Paxos also has two rounds of communication. Here I&#39;m showing the basic version. There are four types of messages here: propose, promise, accept, and&amp;nbsp;accepted.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Propose&lt;ul&gt;
&lt;li&gt;The proposer sends a &lt;em&gt;proposal&lt;/em&gt; to all the acceptors in the system, and waits for a reply. This proposal is tagged with a round number &lt;strong&gt;N&lt;/strong&gt;, which has to increase each time this proposer makes a new proposal. A common thing to do is make this number out of the IP address and some local counter, so they are globally&amp;nbsp;unique.&lt;/li&gt;
&lt;li&gt;The acceptors &lt;em&gt;promise&lt;/em&gt; to accept the proposal if the proposal&#39;s &lt;strong&gt;N&lt;/strong&gt; is the highest &lt;strong&gt;N&lt;/strong&gt; they&#39;ve seen from any proposer. A promise indicates that the acceptor will ignore any proposal with a lower number. This is so there&#39;s a total ordering on proposals; we don&#39;t care which proposer wins, we just want one to&amp;nbsp;win. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Accept&lt;ul&gt;
&lt;li&gt;If the proposer hears back from a quorum of acceptors, it sends out &lt;em&gt;accept&lt;/em&gt; requests to the quorum with the value it wants to&amp;nbsp;set.&lt;/li&gt;
&lt;li&gt;If the acceptor hasn&#39;t promised a higher-numbered proposal in the meantime, it tells the proposer it &lt;em&gt;accepted&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If the proposer gets back a quorum of accepted messages, then it commits. At this point, it can further replicate the value to other nodes, or respond to the client that the operation worked. It&#39;s normal for all nodes in the system to in fact fulfill the roles of both proposer and acceptor as&amp;nbsp;needed.&lt;/p&gt;
&lt;p&gt;There is one detail that prevents future proposers from changing a chosen value when issuing a higher-numbered proposal. In the first round, each acceptor&#39;s promise also carries the number and value of the highest-numbered proposal it has accepted. The proposer must then propose the value of the highest-numbered accepted proposal among the promises it gets back, falling back to its own value only if there is none, in the accept&amp;nbsp;round.&lt;/p&gt;
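&lt;p&gt;Putting the two rounds together, here&#39;s a toy, failure-free, single-value Paxos round in Python (an illustration of the messages above, not a production implementation):&lt;/p&gt;

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised
        self.accepted = (-1, None)  # (number, value) of highest accepted

    def on_propose(self, n):        # round 1: promise if n is the highest seen
        if n > self.promised:
            self.promised = n
            return self.accepted    # report highest accepted (number, value)
        return None                 # ignore lower-numbered proposals

    def on_accept(self, n, value):  # round 2: accept unless promised higher
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def run_round(acceptors, n, my_value):
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.on_propose(n) for a in acceptors) if p is not None]
    if len(promises) >= majority:
        # adopt the value of the highest-numbered accepted proposal, if any
        prev_n, prev_v = max(promises)
        value = prev_v if prev_v is not None else my_value
        accepted = sum(a.on_accept(n, value) for a in acceptors)
        if accepted >= majority:
            return value            # chosen
    return None

acceptors = [Acceptor() for _ in range(5)]
print(run_round(acceptors, n=1, my_value="leader=node3"))  # leader=node3
```

&lt;p&gt;Running a second, higher-numbered round against the same acceptors with a different value returns the already-chosen value, which is exactly the detail&amp;nbsp;above.&lt;/p&gt;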
&lt;h3&gt;Paxos Made&amp;nbsp;Practical&lt;/h3&gt;
&lt;p&gt;This paper goes into a lot of detail about how to actually implement Paxos, handling nodes joining and leaving a Paxos-replicated state machine (a la an RPC server). All nodes run the same deterministic code which transitions based on input. Non-deterministic bits are decided by a single machine externally before being passed into the state machine. One of the machines acts as the master (proposer) in Paxos operations, handling a client request, running Paxos, and executing and responding to the client after a majority of nodes have logged the operation. Changes in group membership (the view) are handled by again running Paxos to agree on&amp;nbsp;membership.&lt;/p&gt;
&lt;h3&gt;Chubby Lock&amp;nbsp;Service&lt;/h3&gt;
&lt;p&gt;This is an engineering industry paper from Google, always interesting reads. The basic idea of Chubby is running Paxos on a small &lt;em&gt;cell&lt;/em&gt; of five machines to solve consensus problems. One of the five machines in the cell acts as the master, and runs Paxos to decide writes. Reads are all served from the master. This lets Chubby provide &lt;em&gt;coarse-grained&lt;/em&gt; locking services, where locks are expected to be held for long periods of time (minutes, hours, days). Finer grained locking is deferred to application-level servers. Chubby&#39;s API is a simple Unix-like filesystem interface, upon which clients can open file handles and acquire and release reader/writer locks. This is also convenient for advertising election results, and can be used to store small files in a very consistent manner. A typical use within Google is as a better version of DNS (no worrying about stale entries and&amp;nbsp;TTLs).&lt;/p&gt;
&lt;p&gt;The rest of the paper describes other features of Chubby: failover, cache consistency, event notifications. Master failover is handled by leader election in the Chubby cell, with a new node brought online in the background. Client caching is also an essential part of reducing load on the Chubby master, but the master has to synchronously invalidate caches on writes. Invalidation is piggybacked on heartbeat &lt;em&gt;KeepAlive&lt;/em&gt; messages (default every 12 seconds), a lease mechanism that keeps client locks and cache alive. Event notifications can also be used to watch files for certain events: modification, master failover, etc. This is most often used to wait for modifications to a file, indicating that a new leader has locked and written its address to the&amp;nbsp;file.&lt;/p&gt;
&lt;h3&gt;Future&amp;nbsp;trends&lt;/h3&gt;
&lt;p&gt;Seeing how this is being used to great extent at Google, it&#39;s got great future applicability. Chubby seems to be used mostly for inter-datacenter locking; I wonder if there are any important modifications that have to be done for intra-datacenter locking. The whole space of eventual consistency gives a lot of alternatives to the strong guarantees that Paxos offers, so there are lots of ways to trade off availability and performance with&amp;nbsp;consistency.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Cluster-Based Scalable Network Services</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/cluster_based_scalable_network_services_review.html"/><updated>2011-09-22T14:01:00Z</updated><published>2011-09-22T14:01:00Z</published><id>http://www.umbrant.com/blog/2011/cluster_based_scalable_network_services_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review of &quot;Cluster-Based Scalable Network Services&quot; by Fox et al., published at SOSP in 1997. It describes an architecture for datacenter services that proved to be prescient, using Inktomi as the running&amp;nbsp;example.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h2&gt;Main&amp;nbsp;ideas&lt;/h2&gt;
&lt;p&gt;This paper has to be put into context. At this point there was still contention over whether &quot;clusters of workstations&quot; were the right approach for handling web-sized workloads. Inktomi was at the forefront of saying that yes, clusters were the right choice, and this paper demonstrates why this is true, and how datacenter services can be structured to achieve their key goals of &lt;em&gt;scalability&lt;/em&gt;, &lt;em&gt;availability&lt;/em&gt;, and &lt;em&gt;cost effectiveness&lt;/em&gt; by using consistency semantics weaker than ACID:&amp;nbsp;BASE.&lt;/p&gt;
&lt;p&gt;The advantages of clusters are manifold. They allow easy incremental scaling and upgrading, they can be built out of commodity parts, and they have natural redundancy through replication. Disadvantages are primarily in the programming model and management; it can be difficult to harness a group of machines to complete a task, and since it&#39;s a distributed system, there are issues with data consistency and&amp;nbsp;failures.&lt;/p&gt;
&lt;p&gt;The idea of BASE is a crucial component of this. BASE stands for &lt;strong&gt;B&lt;/strong&gt;asically &lt;strong&gt;A&lt;/strong&gt;vailable, &lt;strong&gt;S&lt;/strong&gt;oft state, &lt;strong&gt;E&lt;/strong&gt;ventual consistency. This is a significant relaxation of strict ACID semantics, since it allows servers to temporarily serve stale data while state converges. This allows better performance, and many applications do not require strict ACID semantics to provide a good user&amp;nbsp;experience.&lt;/p&gt;
&lt;p&gt;The cluster architecture proposed also looks shockingly similar to what is in use today. Within a datacenter, machines are split into two major groups: front-ends and workers. Front-ends handle actual client requests from outside the datacenter. To handle a client request, a front-end might harness a number of workers running different services to get data or do computation, before assembling and returning the response. This allows all the front-ends to share the same pool of stateless workers, which is good for utilization, and also allows pools of workers to be scaled up and down in response to&amp;nbsp;overload.&lt;/p&gt;
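&lt;p&gt;A minimal sketch of this front-end/worker split (the names and structure here are my own illustration, not from the paper): stateless workers draw tasks from a shared pool, so any front-end can fan a request out and assemble the&amp;nbsp;response.&lt;/p&gt;

```python
import queue

# Hypothetical sketch (my names, not the paper's) of the front-end/worker
# split: stateless workers draw tasks from a shared pool, so any front-end
# can use any worker and pools can grow or shrink under load.

class WorkerPool:
    def __init__(self):
        self.tasks = queue.Queue()

    def submit(self, task):
        self.tasks.put(task)

    def run_all(self):
        # Workers are stateless: each result depends only on the task itself.
        results = []
        while not self.tasks.empty():
            results.append(self.tasks.get()())
        return results

class FrontEnd:
    def __init__(self, pool):
        self.pool = pool

    def handle_request(self, subtasks):
        # Fan a client request out to the worker pool, then assemble the response.
        for task in subtasks:
            self.pool.submit(task)
        return self.pool.run_all()

pool = WorkerPool()
front_end = FrontEnd(pool)
response = front_end.handle_request([lambda: "fetched data", lambda: "computed value"])
```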
&lt;h2&gt;Future trends and&amp;nbsp;relevance&lt;/h2&gt;
&lt;p&gt;Seeing how Brewer wrote this paper in 1997 and we&#39;re still using roughly the same architecture today in 2011, I don&#39;t think there&#39;s any doubt that the paper had a lot of future relevance. I think there&#39;s still room for improvement in the cluster management side of things (Mesos), but the idea of clusters for datacenters has reached complete acceptance. Interestingly though, we&#39;re seeing the return of &quot;big iron&quot; to the datacenter for some applications. People are starting to wonder about the possibilities offered by a machine with 1TB of memory (purchasable today), and the &quot;disk is tape, memory is disk&quot; argument along with a strong focus on latency might lead to further development on the cluster programming model front. SSDs present yet another level in the storage hierarchy with unique cost and performance&amp;nbsp;tradeoffs.&lt;/p&gt;

   </content></entry><entry><title>Paper review: The Datacenter Needs an Operating System</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/datacenter_needs_an_operating_system_review.html"/><updated>2011-09-14T11:40:00Z</updated><published>2011-09-14T11:40:00Z</published><id>http://www.umbrant.com/blog/2011/datacenter_needs_an_operating_system_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review of &quot;The Datacenter Needs an Operating System&quot; by Zaharia et al. This is a short 5 page paper published at HotCloud&amp;nbsp;2011.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main&amp;nbsp;ideas&lt;/h3&gt;
&lt;p&gt;This is a high-level ideas paper focusing on the abstractions that should be provided in a cluster programming environment. The authors identify the following as the core traits of traditional operating&amp;nbsp;systems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Resource&amp;nbsp;sharing&lt;/li&gt;
&lt;li&gt;Data sharing between&amp;nbsp;programs&lt;/li&gt;
&lt;li&gt;Programming abstractions for software&amp;nbsp;development&lt;/li&gt;
&lt;li&gt;Debugging and&amp;nbsp;monitoring&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The authors argue that these same things should be provided to cluster applications as a common layer, instead of having each programming paradigm separately implement them in an ad hoc fashion. If I interpret the article correctly, they want to provide a common set of abstractions on top of which programming models like MapReduce or Dryad can be built, benefiting from code sharing as well as more efficient&amp;nbsp;utilization.&lt;/p&gt;
&lt;h3&gt;Problems&lt;/h3&gt;
&lt;p&gt;What I really would have liked to have seen (perhaps in a longer paper) is more of a focus on where the abstractions are going to be drawn between the &quot;datacenter OS&quot; and the &quot;datacenter application&quot;. This is a classic problem in traditional OS literature (should it be in the kernel, or in user space?), and I&#39;m betting we&#39;ll see the same theme being explored here. Since all these different frameworks already exist, the initial question is figuring out what can actually be pushed down into the&amp;nbsp;OS.&lt;/p&gt;
&lt;p&gt;Debugging is one issue raised that none of the computing frameworks have really solved, so that&#39;s the least-defined problem on the list in my mind. Scheduling and resource sharing are difficult, but&amp;nbsp;approachable.&lt;/p&gt;
&lt;h3&gt;Future&amp;nbsp;impact&lt;/h3&gt;
&lt;p&gt;Highly relevant. A proper datacenter operating system enables higher utilization, better performance, and an easier programming environment. As stated before, cloud computing is only becoming more prevalent and&amp;nbsp;important.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Performance Modeling and Analysis of Flash-based Storage Devices</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/performance_modeling_and_analysis_of_flash_review.html"/><updated>2011-09-12T23:43:00Z</updated><published>2011-09-12T23:43:00Z</published><id>http://www.umbrant.com/blog/2011/performance_modeling_and_analysis_of_flash_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review of &quot;Performance Modeling and Analysis of Flash-based Storage Devices&quot; by Huang et al. This paper compares and contrasts three different types of drives: a high-end Intel SSD, a low-end Samsung SSD, and a 5400 RPM Samsung HDD. My review focuses on the implications for cloud&amp;nbsp;computing.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;p&gt;Solid state disks have to be treated kind of like black boxes when it comes to performance modeling. As is evident from comparing the Intel and Samsung SSDs, we can get wildly different performance characteristics depending on how the manufacturer has configured the Flash Translation Layer (FTL), which maps block requests to actual storage locations. This is important because the FTL acts as a pseudo-filesystem, and has the ability to decide how data is really laid out on the drive independent of the ordering presented to the operating system. This has deep implications for filesystem design (don&#39;t bother minimizing seeks with reordering), and other characteristics of SSDs mean that coalescing and batching requests simply aren&#39;t as&amp;nbsp;effective.&lt;/p&gt;
&lt;p&gt;It seems really worthwhile for SSD vendors to invest in better FTL code, since the high-end Intel SSD appears almost unaffected performance-wise by the randomness of its workload. This indicates to me that it&#39;s probably doing LFS-like block placement to try and minimize the number of writes and erases. This is in stark contrast to the Samsung SSD, whose random write performance gets trounced by its sequential&amp;nbsp;performance.&lt;/p&gt;
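&lt;p&gt;A toy illustration of what LFS-like block placement might look like (this is my own sketch, not any vendor&#39;s actual firmware): every logical write is appended at the head of a log, so the cost of a write is independent of the logical address&amp;nbsp;pattern.&lt;/p&gt;

```python
# Toy sketch (my own, not any vendor's actual FTL) of LFS-like block
# placement: every logical write is appended at the log head, so the cost
# of a write is independent of the logical address pattern.

class LogStructuredFTL:
    def __init__(self, num_pages):
        self.flash = [None] * num_pages   # physical pages
        self.mapping = {}                 # logical block number to physical page
        self.head = 0                     # next free page in the log

    def write(self, logical_block, data):
        # Always append, regardless of which logical block is written.
        self.flash[self.head] = data
        self.mapping[logical_block] = self.head
        self.head += 1

    def read(self, logical_block):
        return self.flash[self.mapping[logical_block]]

ftl = LogStructuredFTL(num_pages=8)
for block in (5, 1, 7):                   # a "random" write pattern
    ftl.write(block, "data%d" % block)
```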
&lt;p&gt;I think there&#39;s definitely space for SSDs in the storage hierarchy on the cloud. It&#39;s safe to assume future SSDs will have FTL firmware with performance similar to the Intel SSD, so we&#39;re basically gaining a tier that has strictly better performance characteristics than hard disk drives. It won&#39;t displace HDDs entirely since the price/GB is still way higher and the amount of data is only growing, but it does sit sort of nicely between memory and HDDs. There&#39;s less of a value argument for MapReduce (an application designed specifically to make good use of HDDs by doing large sequential reads), but for more random workloads (think OLTP, or web caches) SSDs will be a big win. The limitations on the number of write cycles could be a problem, but can be countered through schemes like unbalancing writes to a single SSD in an array (making it fail first reliably, which is better), or by ironically sticking an HDD in front to buffer&amp;nbsp;writes.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Data-Level Parallelism in Vector, SIMD, and GPU Architectures</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/data-level_parallelism_review.html"/><updated>2011-09-12T23:21:00Z</updated><published>2011-09-12T23:21:00Z</published><id>http://www.umbrant.com/blog/2011/data-level_parallelism_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a review of Chapter 4 from the Hennessy and Patterson book, &quot;Computer Architecture, 5th Edition: A Quantitative Approach&quot;. This chapter covers the differences between vector processors, SIMD, and GPU architectures. This writeup focuses entirely on the future hardware trends of data-parallel SIMD hardware in the&amp;nbsp;cloud.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;p&gt;My claim is that we&#39;re going to see GPUs moving on-chip to coexist with normal CPU cores; we&#39;re already seeing this happen with the newest processors from Intel and AMD. The reason for this is that the latency cost in shifting computation to a GPU over PCI-E is high, and there isn&#39;t any memory coherency. Moving GPU cores on-chip solves both of these problems, though memory coherency still faces the same problems it does among many CPU cores. I can&#39;t make any solid predictions about vector processors, but it seems like the momentum is heavily in favor of GPU and SIMD CPU extensions for data-parallel&amp;nbsp;computation.&lt;/p&gt;
&lt;p&gt;How does this relate to cloud computing? GPUs face one major problem in a cloud environment, in that they are currently quite hard to virtualize. Preemption support is nascent, and there&#39;s a high cost to context switching a GPU because of the high latency bus. Concurrent sharing might also be difficult. There also aren&#39;t well-defined standards for programming a GPU (CUDA and OpenCL being competing&amp;nbsp;examples).&lt;/p&gt;
&lt;p&gt;Putting all of the problems aside as solvable however, GPUs are great for speeding up data-parallel tasks (and GPUs already can run MapReduce). I definitely see them gaining traction for batch processing. For normal web-serving workloads though, I&#39;m not sure where a GPU would be useful. I think it&#39;s a lot harder to derive data-parallelism from handling a single request, and the current limitations on multiplexing and latency make it less friendly for this type of&amp;nbsp;work.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Amdahl&#39;s Law in the Multicore Era</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/amdahls_law_review.html"/><updated>2011-09-11T19:08:00Z</updated><published>2011-09-11T19:08:00Z</published><id>http://www.umbrant.com/blog/2011/amdahls_law_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a paper review of &quot;Amdahl&#39;s Law in the Multicore Era&quot;, by Hill and&amp;nbsp;Marty.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main&amp;nbsp;ideas&lt;/h3&gt;
&lt;p&gt;There are two important equations in this paper that lay the foundation for the rest of the paper: Amdahl&#39;s law, and the processor performance&amp;nbsp;law.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;speedup = 1 / ((1-f) + f/s)
perf(r) = sqrt(r)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These two equations have some deep implications. Starting with the latter, there are diminishing returns from investing more chip resources to single core performance. It&#39;s the reason why per-core performance has basically peaked in recent years, and everything is moving toward multicore. Adding more cache and beefing up adders can only do so much; since clock rates have peaked because of power limitations, we&#39;re basically stuck with the per-core performance we&#39;ve&amp;nbsp;got.&lt;/p&gt;
&lt;p&gt;Moving on to Amdahl&#39;s law, &lt;code&gt;f&lt;/code&gt; is the fraction of a program that is parallelizable, and &lt;code&gt;s&lt;/code&gt; is the speedup of the parallelizable part. Interpreting this, we see there are essentially three ways of making your program run&amp;nbsp;faster:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run the serial part faster (&lt;code&gt;1-f&lt;/code&gt; component)&lt;/li&gt;
&lt;li&gt;Make &lt;code&gt;s&lt;/code&gt; bigger (increase the granularity of&amp;nbsp;parallelism)&lt;/li&gt;
&lt;li&gt;Make &lt;code&gt;f&lt;/code&gt; bigger (parallelize serial parts of the&amp;nbsp;code)&lt;/li&gt;
&lt;/ol&gt;
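&lt;p&gt;Plugging numbers into Amdahl&#39;s law makes the limits concrete: even with an effectively infinite &lt;code&gt;s&lt;/code&gt;, a 5% serial fraction caps speedup at&amp;nbsp;20x.&lt;/p&gt;

```python
# Amdahl's law: speedup = 1 / ((1-f) + f/s).
def amdahl_speedup(f, s):
    # f: parallelizable fraction of the program
    # s: speedup of the parallelizable part
    return 1.0 / ((1.0 - f) + f / s)

# A 5% serial fraction caps speedup at 20x no matter how large s gets.
print(round(amdahl_speedup(0.95, 1e12), 2))
```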
&lt;p&gt;Combining both equations, what are the limitations we&amp;nbsp;see?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Running the serial part faster gets expensive quickly in terms of chip&amp;nbsp;resources.&lt;/li&gt;
&lt;li&gt;Increasing the granularity of parallelism is a great approach, but only works really well for data-parallel tasks like serving webpages. Going beyond this natural parallelism into true fine-grained parallelism is expensive and&amp;nbsp;limited.&lt;/li&gt;
&lt;li&gt;Parallelizing serial parts of the code is also a great approach, but similarly limited. Some serial code simply can&#39;t be parallelized, and there&#39;s the same diminishing returns effect in terms of programmer&amp;nbsp;time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The article goes on to talk about different ways of allocating chip resources in light of these two laws, covering symmetric, asymmetric, and dynamic multicore chips. Basically, favoring bigger cores makes the serial part run faster but reduces the number of ways you can split the parallel part, with smaller cores having the opposite effect. The net result of the paper is that it all depends on your workload. Asymmetric and dynamic multicore offer better performance characteristics than symmetric since they can handle both the serial and parallel parts of a program&amp;nbsp;well.&lt;/p&gt;
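&lt;p&gt;The symmetric and asymmetric models are easy to compare directly. Using &lt;code&gt;perf(r) = sqrt(r)&lt;/code&gt; for a core built out of &lt;code&gt;r&lt;/code&gt; base-core-equivalents (BCEs) on a chip with a budget of &lt;code&gt;n&lt;/code&gt; BCEs, a quick calculation (the parameter values below are my own choice, not from the paper) shows the asymmetric design winning on a mixed&amp;nbsp;workload.&lt;/p&gt;

```python
import math

# Sketch of the paper's symmetric vs. asymmetric chip models, using
# perf(r) = sqrt(r) for a core built from r base-core-equivalents (BCEs)
# on a chip with a budget of n BCEs and parallel fraction f.

def perf(r):
    return math.sqrt(r)

def speedup_symmetric(f, n, r):
    # n/r identical cores of size r: the serial part runs on one core,
    # the parallel part is spread across all n/r cores.
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))

def speedup_asymmetric(f, n, r):
    # One big core of size r plus n-r single-BCE cores; the big core
    # also contributes to the parallel part.
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))

# Illustrative parameters (my own choice): 256-BCE chip, one 16-BCE core.
f, n, r = 0.975, 256, 16
print(speedup_symmetric(f, n, r), speedup_asymmetric(f, n, r))
```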
&lt;p&gt;Applying this to cloud computing, we see that there are a lot of parallels between multicore and cloud computing. Instead of getting speedup by parallelizing at the level of instructions in a program, web services are typically scaled through request-level parallelism: distributing requests among a cluster of machines. In this case, the &lt;code&gt;1-f&lt;/code&gt; factor in Amdahl&#39;s law can be effectively zero, since requests can be handled independently by each&amp;nbsp;machine.&lt;/p&gt;
&lt;p&gt;The idea of asymmetric cores is also already present in request-level parallelism, since bigger tasks can be allocated to bigger&amp;nbsp;machines.&lt;/p&gt;
&lt;h3&gt;Future&amp;nbsp;trends&lt;/h3&gt;
&lt;p&gt;Server processors are going to be manycore in the future, whether we want it or not. Peak per-core performance is probably not going to increase, and I don&#39;t hold out much hope for the effectiveness of dynamic multicore, but I think there&#39;s a strong argument for asymmetric multicore since there are demonstrated benefits over symmetric. Big cores will be used for serial computation, with data-parallel parts off-loaded to small cores as possible. This is also better for the coming scarcity of memory bandwidth; having big cores that can use it more effectively makes&amp;nbsp;sense.&lt;/p&gt;
&lt;p&gt;This also lends flexibility to request handling, since high-priority tasks can get the big core while latency-tolerant operations can run on a small core. The big problem here is going to be performance isolation for shared resources like memory, disk, and network; with more threads, there&#39;s more contention, and potentially more variability in performance (which is the killer). Scheduling also becomes more difficult, since &quot;slots&quot; on the same machine are now unequal in&amp;nbsp;size.&lt;/p&gt;
&lt;p&gt;For batch processing (like Hadoop), throughput matters more than latency. Here, SMP is the name of the game: we want to optimize &lt;code&gt;perf(r)&lt;/code&gt; and have a bunch of wimpy cores. I still worry about I/O demands on durable storage, since HDD throughput doesn&#39;t seem to be increasing at the same rate that cores are being&amp;nbsp;added.&lt;/p&gt;
&lt;p&gt;Software stack&amp;nbsp;predictions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We&#39;re going to see a focus on better asymmetric-aware, load-based cluster scheduling algorithms. This needs to balance out I/O load&amp;nbsp;too.&lt;/li&gt;
&lt;li&gt;We&#39;re going to see moderate fine-grained parallelism added to existing request-level parallelism, since it&#39;s really the only way of reducing&amp;nbsp;latency.&lt;/li&gt;
&lt;li&gt;As a corollary, I think there&#39;s going to be a big focus on performance isolation even at the cost of throughput, since variation in latency is the real&amp;nbsp;killer.&lt;/li&gt;
&lt;li&gt;MapReduce as a programming model and framework probably will not go away, but we&#39;re going to see mappers making better use of multi-core. Whether this is going to be MapReduce-in-MapReduce or OpenMP-in-MapReduce or Pthreads-in-MapReduce is unclear, but it makes sense to further break down an already data-parallel task into core-sized&amp;nbsp;chunks.&lt;/li&gt;
&lt;/ul&gt;

   </content></entry><entry><title>Paper review: The Datacenter as a Computer Ch. 3, 4, 7</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/datacenter_as_a_computer_3_4_7_review.html"/><updated>2011-09-08T20:31:00Z</updated><published>2011-09-08T20:31:00Z</published><id>http://www.umbrant.com/blog/2011/datacenter_as_a_computer_3_4_7_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a review of chapters 3, 4, and 7 from &quot;The Datacenter as a Computer&quot;. The topics covered are hardware for servers, datacenter basics, and fault-tolerance and&amp;nbsp;recovery. &lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main&amp;nbsp;idea&lt;/h3&gt;
&lt;p&gt;The three chapters here cover essentially how to design the hardware, software, and operational concerns of a&amp;nbsp;datacenter. &lt;/p&gt;
&lt;p&gt;In terms of datacenter hardware, the main concern is choosing the most cost-efficient type of node that still runs your workload sufficiently&amp;nbsp;fast.&lt;/p&gt;
&lt;p&gt;Operationally, the concerns deal with power and cooling of nodes, factors which limit the density of nodes in a datacenter. Cooling can account for a major part of the power cost of a datacenter, and AC is just as critical a service as power since the datacenter can survive for only a matter of minutes if the AC unit&amp;nbsp;dies.&lt;/p&gt;
&lt;p&gt;The chapter on fault-tolerance and recovery talks about the different types of faults that can present in hardware and software, and how they might affect service availability. The ultimate goal of the service is to be able to survive faults without significantly affecting availability, either through overprovisioning or graceful degradation of service&amp;nbsp;quality.&lt;/p&gt;
&lt;h3&gt;Problems&amp;nbsp;presented&lt;/h3&gt;
&lt;p&gt;I divided this up into three sections, based on the&amp;nbsp;chapters.&lt;/p&gt;
&lt;h4&gt;Hardware/software&amp;nbsp;scaling&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;How low can you go? Small nodes are more cost-efficient than beefy ones in terms of computation-per-watt and price-per-computation, but might start becoming a bottleneck due to the limits of request parallelism. Further parallelization of an application can be really painful, and having to deal with coordination between more nodes has its drawbacks. Never forget Amdahl&#39;s Law: if the serial parts of your program dominate, and you&#39;re running that serial code on slow nodes, your program is going to run slowly too. Beefy nodes can run the serial part&amp;nbsp;quickly.&lt;/li&gt;
&lt;li&gt;Another point about small nodes is that they are harder to schedule efficiently. Resources effectively get fragmented; there might not be enough left on a small node to schedule a new task. Big nodes pack more&amp;nbsp;efficiently.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Operations&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;What is the most effective way of cooling servers? This is heavily related to power density (# of servers / volume), with a higher power density being a more efficient use of a datacenter. The current canonical strategy seems to be alternating hot and cold aisles, with cold air pumped up through the floor by a central AC unit. Modular container datacenters seem to be another strategy, which are effective because they can be designed very tightly and are&amp;nbsp;self-contained.&lt;/li&gt;
&lt;li&gt;What is the most efficient way of cooling servers? This relates to economic costs; it&#39;s claimed that cooling can account for 40% of load, which is a big power&amp;nbsp;bill.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One question I had here was that they mention that UPSes are generally in a separate room, with just the power distribution unit on the floor. I remember reading that Google integrated batteries right into their servers, but maybe one source or the other is outdated. Having a battery right in the server might increase fault-tolerance and&amp;nbsp;modularity.&lt;/p&gt;
&lt;h4&gt;Fault-tolerance&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;What are the real limitations on availability? The Internet is apparently only 99.99% available, and software faults dominate hardware faults (which account for only 10%). Machines apparently last an average of 6 years before replacement, after factoring out &quot;infant mortality&quot;, a number I found surprisingly high considering how much talk there is of nodes keeling over; then again, this just indicates that most keeling is due to software&amp;nbsp;faults.&lt;/li&gt;
&lt;li&gt;How do you maintain service when faults happen? In a datacenter, a new node fails on the order of hours, so it&#39;s not acceptable to become unavailable. This is done through some degree of overprovisioning of resources, and designing software that is fault-tolerant. When operating with faults, it&#39;s desirable to have the property of &lt;em&gt;graceful degradation&lt;/em&gt;, where the quality of service gradually degrades (e.g. using older cached data, serving reads but denying writes, disabling some features). Fault-tolerance also makes the need for repairs less urgent and thus less&amp;nbsp;expensive.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Tradeoffs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Programming cost vs. hardware cost. You can speed up a service either by throwing programmers at the problem (squeezing more parallelism from the code) or by buying better hardware (running the same code faster). This is an economic balancing&amp;nbsp;act.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Impact&lt;/h3&gt;
&lt;p&gt;This textbook isn&#39;t going to be winning any awards, but I find it to be a fascinating look into operations and system design at Google. Like I mentioned in my last writeup, there are only going to be more datacenters being built in the future, so advice like this on how to design and build datacenters and datacenter applications is very&amp;nbsp;useful.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Warehouse-Scale Computing: Entering the Teenage Decade</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/warehouse_scale_computing_summary.html"/><updated>2011-09-06T22:10:00Z</updated><published>2011-09-06T22:10:00Z</published><id>http://www.umbrant.com/blog/2011/warehouse_scale_computing_summary.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This review is based on a presentation by Luiz Andre Barroso from Google, titled &quot;Warehouse-Scale Computing: Entering the Teenage Decade&quot;. I believe it was given this past year (2011) at FCRC. I really strongly recommend it, since it talks about the operational issues in running a Google datacenter, and also identifies a lot of the research issues surrounding &quot;warehouse-scale&amp;nbsp;computation&quot;.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Key&amp;nbsp;Points&lt;/h3&gt;
&lt;p&gt;There were a couple of problems identified in the talk, some of which have been solved and some of which have&amp;nbsp;not.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I/O latency variability right now is terrible, with basically all durable storage displaying a long latency tail. Random accesses to spinning disks are slow, flash writes are slow, and these high-latency events muck up the latency for potentially fast&amp;nbsp;events.&lt;/li&gt;
&lt;li&gt;Network I/O suffers a similar problem. Using TCP and interrupts adds orders of magnitude of latency to network requests, making fast network hardware slow again in&amp;nbsp;software.&lt;/li&gt;
&lt;li&gt;Datacenter power efficiency as defined by PUE (Power Usage Effectiveness) has gotten pretty good (&amp;lt;10 percent is used on operational overhead). The real problem now is making better use of servers, to get CPU load up into the 80% range instead of the current&amp;nbsp;30%.&lt;/li&gt;
&lt;li&gt;This leads into another problem: how do you share all of the resources in a cluster among many different services, while also chopping off the latency&amp;nbsp;tail?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To summarize, there are two big ideas in the talk. First, latency and variation in latency are the key performance metrics for services these days; today&#39;s web-based applications demand both to provide a good user experience. This may require reexamination of a lot of fundamental assumptions about IO. Second, increasing the utilization of resources in a cluster is important from an efficiency and performance standpoint. Server hardware should be a fungible resource that can be easily shared among different&amp;nbsp;services.&lt;/p&gt;
&lt;p&gt;The differences here between warehouse-scale computing and datacenter-scale computing lie in scale and the type of user experience provided. Warehouse-scale computing operates in the many petabyte range, and allows for complete system integration of the hardware, software, power, and cooling of the cluster. The types of services hosted by a warehouse-scale computer are also scaled way up, in terms of the latency requirements and the size of the data that is being&amp;nbsp;crunched.&lt;/p&gt;
&lt;h3&gt;Trade&amp;nbsp;offs&lt;/h3&gt;
&lt;p&gt;One of the key tradeoffs mentioned was between latency and throughput. Most of the I/O software stack these days is designed to optimize throughput, in response to relatively slow disks and networks. An example of this is Nagle&#39;s algorithm in TCP: small packets are delayed and batched to be sent in bulk, reducing TCP overhead (fewer bytes need to be sent) but also increasing latency. New technologies like flash and fast networks mean that these assumptions should be&amp;nbsp;reexamined.&lt;/p&gt;
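&lt;p&gt;Latency-sensitive services commonly opt out of this batching per-socket with the standard &lt;code&gt;TCP_NODELAY&lt;/code&gt; option, trading per-packet overhead for lower&amp;nbsp;latency:&lt;/p&gt;

```python
import socket

# Nagle's algorithm delays and batches small writes to reduce per-packet
# overhead, at the cost of latency. A latency-sensitive service disables
# it on a per-socket basis with the standard TCP_NODELAY option.

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```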
&lt;h3&gt;Long-term&amp;nbsp;impact&lt;/h3&gt;
&lt;p&gt;Web-based services are here to stay, and I feel confident in saying that this is an area that is going to see yet more growth. Large internet companies like Google and Facebook are already dealing with these issues internally, and there are only going to be more warehouse-scale datacenters built. It&#39;s clear that these are hard problems that aren&#39;t going away because of some &lt;em&gt;deus ex machina&lt;/em&gt; like Moore&#39;s Law, so any solutions are likely to have a big&amp;nbsp;impact.&lt;/p&gt;

   </content></entry><entry><title>Lottery and stride scheduling</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/lottery_stride_scheduling.html"/><updated>2011-07-17T02:50:00Z</updated><published>2011-07-17T02:50:00Z</published><id>http://www.umbrant.com/blog/2011/lottery_stride_scheduling.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;Today is a shorter post than previous topics, since I didn&#39;t want to lump the last paper (Paxos, ick) in with these. I&#39;m covering lottery and stride scheduling, two very related approaches to doing efficient proportional-share scheduling. I believe this is the canonical way of doing things, since mClock (by Gulati et al., presented at OSDI 2010) used stride scheduling successfully to schedule disk I/O in a&amp;nbsp;hypervisor.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;Lottery Scheduling: Flexible Proportional-Share Resource Management&quot;, Waldspurger and Weihl,&amp;nbsp;1994&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;Stride Scheduling: Deterministic Proportional-Share Resource Management&quot;, Waldspurger and Weihl,&amp;nbsp;1995&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Background&lt;/h3&gt;
&lt;p&gt;The basic problem for proportional-share scheduling is, given a set of processes that have been assigned relative &lt;em&gt;weights&lt;/em&gt; (e.g. 3:2:1), schedule the processes with some kind of quantum such that they all get their assigned proportion of CPU time (e.g. 1/2, 1/3, 1/6). A naive way of doing this is to simply schedule each process for &lt;em&gt;weight&lt;/em&gt; number of scheduling quanta, but this penalizes processes that do not use their entire allocation (for instance, if they block on I/O early) and results in unfairness at small timescales (think of a 100:1:1 weighting). It&#39;s also desirable that scheduling is responsive to changes in&amp;nbsp;priorities.&lt;/p&gt;
&lt;p&gt;These are some of the practical problems that any good proportional share scheduler has to&amp;nbsp;solve.&lt;/p&gt;
&lt;h3&gt;Lottery&amp;nbsp;Scheduling&lt;/h3&gt;
&lt;p&gt;Waldspurger and Weihl&#39;s first approach is a probabilistically fair one. Processes are assigned a number of &lt;em&gt;tickets&lt;/em&gt; based on their relative weight (bigger weight = more tickets), and each scheduling quantum the scheduler holds a &lt;em&gt;lottery&lt;/em&gt; (choosing a random ticket) where the winner gets scheduled. A quantum is a small unit of time, in this case 10ms, which is the smallest unit that the scheduler will assign to a process. This means processes with more tickets get scheduled more often, and over time the actual scheduling probabilistically approaches the desired relative weighting between the processes, with standard deviation proportional to &lt;code&gt;sqrt(n)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This can be implemented by logically storing the number of tickets of each process in an array, and keeping a running sum of tickets as one traverses the array, advancing until the winning ticket is found. This is an &lt;code&gt;O(n)&lt;/code&gt; operation, but sorting the array in descending order with insertion sort (or, as the authors do it, the &quot;move to front&quot; heuristic) can make this a lot faster in practice. For large n, it&#39;s better to keep the tickets in a tree and binary search toward the winning leaf&amp;nbsp;process.&lt;/p&gt;
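&lt;p&gt;The &lt;code&gt;O(n)&lt;/code&gt; version is only a few lines (this sketch is my own; the process names are illustrative). Over many quanta, the win counts converge toward the ticket&amp;nbsp;ratios.&lt;/p&gt;

```python
import random

# Minimal sketch (my own) of the O(n) lottery described above: store each
# process's ticket count, draw a random winning ticket, and scan with a
# running sum until the holder of that ticket is found.

def hold_lottery(processes, rng):
    # processes: list of (name, tickets) pairs
    total = sum(tickets for _, tickets in processes)
    winning_ticket = rng.randrange(total)
    running = 0
    for name, tickets in processes:
        running += tickets
        if running > winning_ticket:
            return name

# Over many quanta, scheduling converges toward the 3:2:1 weighting.
rng = random.Random(42)
procs = [("a", 3), ("b", 2), ("c", 1)]
wins = {"a": 0, "b": 0, "c": 0}
for _ in range(60000):
    wins[hold_lottery(procs, rng)] += 1
```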
&lt;p&gt;If a process ends early or runs over its quantum, it is issued a &lt;em&gt;compensation ticket&lt;/em&gt; that adjusts the size of its ticket pool until it gets scheduled again. The compensation ticket is valued at &lt;code&gt;(1/f)*num_tickets&lt;/code&gt;, where &lt;code&gt;f&lt;/code&gt; is the fraction of its allocated time that it actually used. This increases or decreases its likelihood of getting scheduled&amp;nbsp;appropriately.&lt;/p&gt;
&lt;p&gt;It&#39;s relatively straightforward to build a hierarchical scheduling system via what the authors term &lt;em&gt;ticket currencies&lt;/em&gt;: different types of tickets that are backed by other tickets. In this way, groups of processes can be weighted based on one currency, and then the members of each group weighted based on another currency. This is basically just a nice management feature; in the end, everything gets translated back into the base&amp;nbsp;currency.&lt;/p&gt;
&lt;p&gt;They also have this idea of letting clients pass their ticket allocations off to a server, which is kind of cool when combined with a microkernel where everything is a server. It means you can give tickets only to clients, and make it so all server requests have to be paid for with a ticket allocation. Then, tickets accurately capture all work done in the system on behalf of a client process. My random&amp;nbsp;idea.&lt;/p&gt;
&lt;p&gt;They also have a similar idea with a priority-inheritance scheme of sorts for locks, where all the tickets of processes waiting on a lock are passed to the process currently holding the lock. They also test lottery-scheduled lock acquisition, which seems of somewhat lower utility to me. I don&#39;t know how often I would ever want to use this as a&amp;nbsp;programmer.&lt;/p&gt;
&lt;p&gt;Lottery scheduling isn&#39;t really that responsive to dynamic changes, since it takes time for the probabilities to converge. Its probabilistic nature also means that it can&#39;t really give predictable performance: it might be decently suited for &quot;average throughput&quot; schedulers, but probably sucks for interactive uses where you want SLA-like performance&amp;nbsp;guarantees.&lt;/p&gt;
&lt;h3&gt;Stride&amp;nbsp;Scheduling&lt;/h3&gt;
&lt;p&gt;The follow-up paper from the year after solves many of the problems with lottery scheduling by introducing &lt;em&gt;stride scheduling&lt;/em&gt;, which keeps the benefits of lottery scheduling while also being deterministic, more responsive to dynamic changes, and having better error properties. This is slightly different from how it&#39;s presented in the paper, but I think it makes more&amp;nbsp;sense.&lt;/p&gt;
&lt;p&gt;This is something most easily explained visually, but the basic idea is the concept of &lt;em&gt;virtual time&lt;/em&gt;, where each process has a clock that ticks at a different rate depending on its priority. High priority clocks tick slowly, while low priority clocks tick quickly. The rate at which a clock ticks is called its &lt;em&gt;stride&lt;/em&gt;. The scheduler makes decisions by finding the clock with the oldest time, scheduling the corresponding process, and then advancing the clock by the clock&#39;s stride. This is implemented simply by keeping the clocks in ascending sorted order via a heap, insertion sort, or something like&amp;nbsp;that.&lt;/p&gt;
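&lt;p&gt;The core loop can be sketched with a heap of per-process clocks (a toy version; &lt;code&gt;STRIDE1&lt;/code&gt; stands in for the paper&#39;s large fixed-point constant, and the rest of the names are&amp;nbsp;mine):&lt;/p&gt;

```python
import heapq

STRIDE1 = 1000000   # large constant so integer division keeps precision

def make_queue(allocations):
    # one (local clock, name, stride) entry per process; more tickets
    # means a smaller stride, so the clock ticks more slowly
    return [(0, name, STRIDE1 // tickets) for name, tickets in allocations]

def schedule(queue, quanta):
    """Run `quanta` scheduling decisions, returning the order chosen."""
    heapq.heapify(queue)
    order = []
    for _ in range(quanta):
        clock, name, stride = heapq.heappop(queue)   # oldest local clock
        order.append(name)
        heapq.heappush(queue, (clock + stride, name, stride))
    return order
```

&lt;p&gt;With a 3:2:1 ticket split, six quanta come out exactly 3:2:1, deterministically, unlike the&amp;nbsp;lottery.&lt;/p&gt;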
&lt;p&gt;Fractional quanta are handled via a multiplicative compensation factor: simply multiply the stride by the fraction of the quantum actually used before advancing the local&amp;nbsp;clock.&lt;/p&gt;
&lt;p&gt;There is also the concept of &lt;em&gt;global virtual time&lt;/em&gt; that advances at the rate of the slowest possible tick (&lt;code&gt;~1/sum(tickets)&lt;/code&gt;). This is used to calculate a compensation factor, &lt;code&gt;remain&lt;/code&gt;, which compensates a process for time spent waiting when there is a dynamic change. &lt;code&gt;remain&lt;/code&gt; is the amount of virtual time until the process would next be scheduled, i.e. the difference between its local virtual time and the global virtual time. When the process re-enters the system, its local time is set to &lt;code&gt;global_time+remain&lt;/code&gt;. In this way, if the process waited to be allocated before leaving (&lt;code&gt;remain&amp;lt;stride&lt;/code&gt;) it gets scheduled sooner. The opposite happens if the process previously got an early&amp;nbsp;allocation.&lt;/p&gt;
&lt;p&gt;Since the stride and &lt;code&gt;remain&lt;/code&gt; are both related to the size of the ticket allocation, in the case of a dynamic change the stride is recomputed and used to scale &lt;code&gt;remain&lt;/code&gt; appropriately, immediately reflecting the new&amp;nbsp;allocation.&lt;/p&gt;
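&lt;p&gt;A sketch of the leave/re-join bookkeeping, under my reading of the paper (function and variable names are&amp;nbsp;mine):&lt;/p&gt;

```python
def on_leave(global_time, local_time):
    """Record how much virtual time remained until this process would
    next have been scheduled."""
    return local_time - global_time          # remain

def on_join(global_time, remain, old_stride, new_stride):
    """Re-enter the process, scaling remain to reflect a possibly new
    ticket allocation, relative to the current global virtual time."""
    remain = remain * new_stride // old_stride
    return global_time + remain              # new local clock value
```

&lt;p&gt;Doubling a process&#39;s tickets halves its stride, so its pending &lt;code&gt;remain&lt;/code&gt; is halved too, and the new allocation takes effect&amp;nbsp;immediately.&lt;/p&gt;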
&lt;p&gt;This is extremely predictable since scheduling is deterministic, and processes are guaranteed to be scheduled at least once every complete cycle of virtual time (where a cycle is the slowest stride). This is a major boon for responsiveness since we no longer have to wait for probabilities to converge. Glancing at the evaluation section, I see major improvements to predictability, responsiveness to priority changes, and accuracy. The same ideas of ticket currencies and ticket passing also apply to stride&amp;nbsp;scheduling.&lt;/p&gt;

   </content></entry><entry><title>Concurrency review</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/concurrency.html"/><updated>2011-07-08T19:14:00Z</updated><published>2011-07-08T19:14:00Z</published><id>http://www.umbrant.com/blog/2011/concurrency.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I assume that everyone has already read Andrew Birrell&#39;s &lt;a href=&quot;ftp://apotheca.hpl.hp.com/pub/dec/SRC/research-reports/SRC-035.pdf&quot;&gt;seminal paper on &quot;Programming with Threads&quot;&lt;/a&gt; or at least has a basic conception of parallel programming. This is going to deal with locking and concurrency at a higher level. At-bat today are five selected papers on&amp;nbsp;concurrency:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;Granularity of Locks and Degrees of Consistency in a Shared Data Base&quot;, by Gray et al.,&amp;nbsp;1975&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;Experience with Processes and Monitors in Mesa&quot;, Lampson and Redell,&amp;nbsp;1980&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;On Optimistic Methods for Concurrency Control&quot;, Kung and Robinson,&amp;nbsp;1981&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;Threads and Input/Output in the Synthesis Kernel&quot;, Massalin and Pu,&amp;nbsp;1989&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;dquo&quot;&gt;&quot;&lt;/span&gt;Concurrency Control Performance Modeling: Alternatives and Implications&quot;, Agrawal, Carey and Livny,&amp;nbsp;1987&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Background&lt;/h3&gt;
&lt;p&gt;I know I said I expected Birrell&#39;s paper as base knowledge, but here&#39;s a TLDR that might let you skip&amp;nbsp;it.&lt;/p&gt;
&lt;p&gt;The need for locking is derived from the concurrent reader/writer problem. It&#39;s safe for multiple threads to be reading the same data at the same time, but it&#39;s not safe to read or write while someone else is writing since you can get corrupted results. This requires the idea of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Readers-writer_lock&quot;&gt;reader-writer lock&lt;/a&gt;, which allows any number of concurrent readers, but will make sure that any writer gets exclusive access (i.e. no other readers or writers are accessing the protected data). This is also called shared/exclusive locking, and is an especially common construct in parallel&amp;nbsp;programming.&lt;/p&gt;
&lt;h3&gt;Granularity of&amp;nbsp;Locks&lt;/h3&gt;
&lt;p&gt;The important takeaway from this Jim Gray paper is the idea of &lt;em&gt;hierarchical locking&lt;/em&gt;, where locking a database table also locks all the rows and row fields in that table. This hierarchical structure allows locking at an almost arbitrary granularity depending on the needs of the executing query, which ameliorates the issues with too-fine-grained locking (excessive lock overhead from doing lots of acquisitions and releases) or too-coarse-grained locking (poor concurrency from unnecessary lock&amp;nbsp;contention).&lt;/p&gt;
&lt;p&gt;This scheme applies to exclusive (X) locks used for writes as well as share (S) locks used for reads, but also requires the introduction of a third lock type: &lt;em&gt;intention locks&lt;/em&gt;. Intention locks are used to indicate in an ancestor node that one of its children has been locked, preventing another query from falsely locking the ancestor, and thus, the child as well. This is refined into both an &lt;em&gt;intention share lock&lt;/em&gt; (IS) and an &lt;em&gt;intention exclusive lock&lt;/em&gt; (IX) to allow concurrent reads, since intention share locks are compatible with each other. Intention exclusive locks are also compatible with each other, since their holders still have to ultimately exclusively lock the children they want to modify. Queries are required to leave a breadcrumb trail of correct intention locks behind as they traverse toward what they ultimately want to access. Locks also must be released in leaf-to-root order, so the locking hierarchy remains&amp;nbsp;consistent.&lt;/p&gt;
&lt;p&gt;One more intention lock type is introduced for yet better concurrency: the &lt;em&gt;share and intention exclusive lock&lt;/em&gt; (SIX). This is interpreted as &quot;read access to a whole subtree, with exclusive locks on the parts being written&quot;. This is necessary because normally you can&#39;t have concurrent reads and writes (you cannot just acquire the share lock first and then an intention exclusive lock, since they&#39;re incompatible), but since these rights are being granted to the same query, it can be depended upon not to read a row that it&#39;s currently writing. This read-modify-write behavior over a subtree is super common in databases, which is why SIX locks are&amp;nbsp;important.&lt;/p&gt;
&lt;p&gt;Table 1 on page 5 of the paper is a concise rundown of what locks are compatible with each other. It might be a nice exercise to work through (the lock types being null, IS, IX, S, SIX,&amp;nbsp;X).&lt;/p&gt;
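&lt;p&gt;My reconstruction of that compatibility matrix as a lookup table (worth checking against the paper&#39;s Table 1; the null mode is compatible with everything and is&amp;nbsp;omitted):&lt;/p&gt;

```python
# Which requested modes can be granted alongside an already-held mode.
COMPATIBLE = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}

def compatible(held, requested):
    """True if `requested` can be granted while `held` is outstanding."""
    return requested in COMPATIBLE[held]
```

&lt;p&gt;The matrix is symmetric; note that SIX is compatible only with IS, since the SIX holder may be writing anywhere in the&amp;nbsp;subtree.&lt;/p&gt;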
&lt;p&gt;The rest of the paper seems less relevant. Gray et al. explain how the same strategy can be applied to a more dynamic graph based on locking ranges of mutable indexes. I don&#39;t think this works for multiple indexes, since then the locking hierarchy is no longer a DAG. They also cover the idea of &quot;degrees of consistency&quot; with tradeoffs between performance (concurrency) and recovery (consistency). I don&#39;t think real-world databases use anything except the highest degree of consistency, since the idea of unrecoverable, non-serializable transactions isn&#39;t pleasant. Anything with eventual consistency (looking at you, NoSQL) has made this&amp;nbsp;tradeoff.&lt;/p&gt;
&lt;h3&gt;Experience with Processes and Monitors in&amp;nbsp;Mesa&lt;/h3&gt;
&lt;p&gt;This paper specifies a new parallel programming language, Mesa, used to implement a new operating system, Pilot. Introducing a new language and OS at the same time is pretty common, and is how we arrived at C and Unix. This is where the idea of &quot;Mesa semantics&quot; for monitors came from (compared to Hoare semantics). To put the work in proper context, apparently in 1979 one had to justify why a preemptive scheduler is required over a non-preemptive design even assuming a uniprocessor (the obvious reason being interrupt&amp;nbsp;handling).&lt;/p&gt;
&lt;p&gt;Mesa is kind of a neat language, in that any procedure can be easily forked off as a new process, and processes are first class values in the language and can be treated like any other value. This isn&#39;t to say that everything is expected to be able to run concurrently, just that the &lt;code&gt;FORK&lt;/code&gt; language construct is easy to apply. The core organizational construct in Mesa is the &quot;module&quot; or the &quot;monitor module&quot;. This is basically a way of logically organizing procedures, and specifying which of them need to acquire the monitor lock as part of&amp;nbsp;execution.&lt;/p&gt;
&lt;p&gt;This is also where &quot;Mesa semantics&quot; come in. Instead of immediately switching to a waiting process on a signal, the signaller continues running. This seems like a great win, since although it means slightly different semantics to the program, it also means fewer context&amp;nbsp;switches.&lt;/p&gt;
&lt;p&gt;The paper goes on to describe more about monitors and the&amp;nbsp;implementation.&lt;/p&gt;
&lt;h3&gt;On Optimistic Methods for Concurrency&amp;nbsp;Control&lt;/h3&gt;
&lt;p&gt;When I read this paper for 262A, it was a big eye-opener. I felt that the idea wouldn&#39;t hold up in real usage (and I think that this is true, except in the specific situations noted), but it was a refreshing approach to handling concurrency I had never thought&amp;nbsp;of.&lt;/p&gt;
&lt;p&gt;The idea behind &quot;optimistic concurrency&quot; is doing away with locking, and instead doing checking at the end before commit to see if there are any conflicts from concurrent queries, and aborting if so. In this way, even if incorrect query results are generated along the way, they are not externalized. This is speculative and will result in lots of aborted transactions (and thus wasted work) under write-heavy workloads, but as the paper says, this works wonderfully for read-heavy workloads where it&#39;s unlikely to have a&amp;nbsp;conflict.&lt;/p&gt;
&lt;p&gt;The motivation here is that locking often imposes unnecessary overhead, and can complicate things. In a locking scheme, even read-only queries need to lock rows even though they aren&#39;t modifying the data, just to indicate that the read is happening. All this checking and verifying adds up, increasing the complexity of the system and leading to potential deadlocks, which have to be resolved through a deadlock-free locking scheme or deadlock detection and abort. Locking can also operate at too coarse a granularity: the root node in a hierarchical locking scheme as described by Gray et al. is basically constantly under lock contention, oftentimes unnecessarily, since queries are not necessarily operating on the same&amp;nbsp;subtrees.&lt;/p&gt;
&lt;p&gt;This is implemented with &lt;em&gt;two-phase transactions&lt;/em&gt;, which proceed through a &lt;em&gt;read phase&lt;/em&gt;, &lt;em&gt;validation&lt;/em&gt;, then a &lt;em&gt;write phase&lt;/em&gt;. In the read phase, the transaction gathers up the names of all the objects it needs to read, defining a &lt;em&gt;read set&lt;/em&gt;. Validation then checks whether the transaction T_j is &lt;a href=&quot;http://en.wikipedia.org/wiki/Serializability&quot;&gt;serializable&lt;/a&gt;, by making sure that for all prior transactions T_i one of the following is&amp;nbsp;true:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;T_i completes its write phase before T_j starts its read phase. T_i comes entirely before T_j, so it&#39;s&amp;nbsp;fine.&lt;/li&gt;
&lt;li&gt;The write set of T_i does not intersect with the read set of T_j, and T_i finishes writing before T_j starts writing. As long as there&#39;s no intersection, T_j&#39;s reads are safe, and as long as T_i finishes writing before T_j starts, T_j will not be overwritten by&amp;nbsp;T_i.&lt;/li&gt;
&lt;li&gt;The write set of T_i does not intersect the read or write set of T_j, and T_i finishes reading before T_j starts writing. Similar to the previous, no intersection with the read or write set makes T_j very safe, and T_i needs to finish reading before T_j starts writing to protect T_i&#39;s read&amp;nbsp;set.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This means that validation needs to check the write sets of all transactions that had finished reading but not finished writing (conditions 2 and 3). This poses an issue for long-running transactions, since the validator might be expected to keep around write-sets almost indefinitely. The proposed answer is to abort and restart the transaction, which leads to the question of how to deal with transactions that repeatedly fail validation. This is answered by write-locking the entire database and letting the &quot;starving&quot; transaction run to completion in isolation. This isn&#39;t great at all, but the assumption is that both long-running transactions and repeat-failures are&amp;nbsp;rare.&lt;/p&gt;
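&lt;p&gt;The three validation conditions are easy to sketch directly. Here transactions are plain dicts carrying read/write sets and logical timestamps marking the ends of their phases (the field names are mine, not the paper&#39;s):&lt;/p&gt;

```python
def may_commit(ti, tj):
    """Serializability check for committing T_j against a prior T_i."""
    # 1. T_i completed its write phase before T_j started reading.
    if tj["read_start"] >= ti["write_end"]:
        return True
    # 2. T_i's writes don't touch T_j's reads, and T_i stopped writing
    #    before T_j started writing.
    if (ti["writes"].isdisjoint(tj["reads"])
            and tj["write_start"] >= ti["write_end"]):
        return True
    # 3. T_i's writes don't touch T_j's reads or writes, and T_i finished
    #    its read phase before T_j started its write phase.
    if (ti["writes"].isdisjoint(tj["reads"])
            and ti["writes"].isdisjoint(tj["writes"])
            and tj["write_start"] >= ti["read_end"]):
        return True
    return False
```

&lt;p&gt;Two transactions touching disjoint data validate fine; a concurrent read of something T_i wrote does not, and that transaction gets&amp;nbsp;aborted.&lt;/p&gt;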
&lt;p&gt;The evaluation in this paper is kind of spotty. It&#39;s purely theoretical, and they chose to do their analysis on a B-tree, which is one of the better (though also common) situations because of the high fanout and low depth leading to lock contention on root nodes. They also assume a uniformly random distribution for accesses which is probably untrue (accesses are normally temporally correlated, which is why LRU caching&amp;nbsp;works).&lt;/p&gt;
&lt;p&gt;All-in-all, this is the system design maxim of &quot;optimize for the common case&quot; taken to the extreme. The common case of no-conflict transactions will be faster with optimistic concurrency control, but it&#39;ll collapse under load a lot worse. Like the authors say, it&#39;s only for situations where transaction conflict is&amp;nbsp;rare.&lt;/p&gt;
&lt;h3&gt;Threads and Input/Output in the Synthesis&amp;nbsp;Kernel&lt;/h3&gt;
&lt;p&gt;This seems to be a more meta paper, where optimistic concurrency and lock-avoiding techniques were applied to an OS to improve performance. It&#39;s also chock-full of system-specific jargon, which I will kindly avoid introducing. Honestly, most of what&#39;s laid out in the paper feels like a bunch of small optimizations for a specialized kernel that add up to something that performs demonstrably better than SunOS. Some of these techniques might be translatable back to Unix-y implementations, some are unique to the system (runtime optimization of syscalls?), and some exist because it&#39;s a special-purpose kernel. I like how the references are to the SunOS source code, GEB (yes, the Hofstadter book), three of the author&#39;s own papers, and two external ones. Certainly a different&amp;nbsp;time.&lt;/p&gt;
&lt;p&gt;The takeaways here are unclear. The conclusion of &quot;avoid synchronization when possible&quot; seems hardly novel, and it feels too much like they implemented to optimize their microbenchmarks (no real apps were&amp;nbsp;written).&lt;/p&gt;
&lt;h3&gt;Concurrency Control Performance&amp;nbsp;Modeling&lt;/h3&gt;
&lt;p&gt;This paper does a deep comparison between three concurrency control algorithms: blocking locks, immediate restart on lock contention, and optimistic concurrency. I really love papers like this one, since they take a bunch of different algorithms that all tested well under different model assumptions, carefully dismantle said assumptions, and reveal real truths with their own meticulous performance model. It really demonstrates the authors&#39; complete understanding of the problem at&amp;nbsp;hand.&lt;/p&gt;
&lt;p&gt;There are a number of model parameters that are crucial to performance here. The &lt;em&gt;database system model&lt;/em&gt; specifies the physical hardware (CPUs and disks), associated schedulers, characteristics of the DB (size and granularity), load control mechanisms, and the concurrency control algorithm itself. The &lt;em&gt;user model&lt;/em&gt; specifies the arrival process for users, and the type of transactions (batch or interactive). The &lt;em&gt;transaction model&lt;/em&gt; specifies the storage access pattern of a transaction, as well as its computational requirements (expressed in terms of&amp;nbsp;probabilities). &lt;/p&gt;
&lt;p&gt;I consider this to be about as complete as possible. They ignore I/O patterns and cache behavior, but those are just damn hard to model. Using a Poisson distribution for transaction inter-arrival rates is canonical absent evidence to disprove it (see &quot;On the Self-similar Nature of Ethernet Traffic&quot; by Leland et al. for a situation where Poisson does not hold). They also do not take into account processing time spent on the concurrency control algorithm itself, which feels like a slight copout: I think this means they ignore lock overhead and use a completely granular locking system (not hierarchical locking), which disfavors optimistic concurrency. But this is implementation specific and a lot of additional work to add to the model, and considering there&#39;s some prior work showing that the costs are roughly comparable and negligible compared to access costs, I&#39;m willing to let it&amp;nbsp;go.&lt;/p&gt;
&lt;p&gt;The interesting part comes when they manipulate all the model parameters, and explain how different papers arrived at their different performance results. Basically, under the assumption of infinite resources, optimistic concurrency does splendidly as the level of multiprogramming increases, since locking strategies run into contention and transactions get blocked up. Optimistic transactions still face more conflicts and have to be restarted, but since the pipeline is always full of running transactions (none are just blocked, using up a queue slot while doing no work), overall throughput continues to increase. Immediate-restart reaches an interesting performance plateau, due to its scheme of trying to match the rate of transaction completion with the rate of re-introducing restarted transactions. This was the model used in a number of prior&amp;nbsp;papers.&lt;/p&gt;
&lt;p&gt;Introducing a very resource limited situation turns things sharply in favor of blocking algos. Blocking performs much better until very high levels of multiprogramming, immediate-restart hits the same plateau for the same reason, and optimistic concurrency performs linearly worse beyond a very small multiprogramming level. Basically, every optimistic conflict detected at the end of a transaction just wasted all of the resources used; immediate restart does better since it will restart if it detects a conflict midway, and also delays restarts to match the completion&amp;nbsp;rate.&lt;/p&gt;
&lt;p&gt;Increasing the number of resources begins to favor optimistic concurrency again, but the price/performance isn&#39;t there, since doubling the number of resources does not lead to a doubling in performance. They run a few more scenarios, examining different workloads and model assumptions, which you can read yourself if you want to know&amp;nbsp;more.&lt;/p&gt;
&lt;p&gt;Basically, it&#39;s hard to make general statements about performance; things are dependent on your model. It seems that for most real-world use cases though (limited resources, high utilization), blocking is the concurrency control method of choice. It&#39;s also important to carefully control the level of multiprogramming for optimal throughput, since performance tends to peak and then decline as things&amp;nbsp;thrash.&lt;/p&gt;
&lt;p&gt;I also just find it really cool that they explained the performance results of a lot of previous papers within the scope of their own model, basically saying that no one was wrong, just incomplete in their&amp;nbsp;analysis.&lt;/p&gt;

   </content></entry><entry><title>Virtual memory review</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/virtual_memory.html"/><updated>2011-06-11T18:33:00Z</updated><published>2011-06-11T18:33:00Z</published><id>http://www.umbrant.com/blog/2011/virtual_memory.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I&#39;m taking the OS prelim this fall, which means I have to read ~100 papers this summer for background material. Since repetition aids retention, I&#39;m putting notes for papers I read up on my blog. The topics are wide-ranging, so I&#39;m trying to start with the fundamentals and then move on up to the whole-system&amp;nbsp;papers.&lt;/p&gt;
&lt;p&gt;I&#39;m kicking it off with &quot;Virtual Memory, Processes, and Sharing in MULTICS&quot; by Daley and Dennis (1968) and &quot;The Multics Virtual Memory: Concepts and Design&quot; by Bensoussan, Clingen, and Daley (1972). Learn about the joys of segmentation and dynamic linking from classic papers from the 70s! These are slightly infamous papers for some systems students here at Berkeley, due to a certain past OS prelim examiner grilling them on exactly these&amp;nbsp;details.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Background&amp;nbsp;Review&lt;/h3&gt;
&lt;p&gt;Some people consider &lt;a href=&quot;http://en.wikipedia.org/wiki/Memory_segmentation&quot;&gt;segmentation&lt;/a&gt; to be the most natural way of structuring a program. Most programs are basically a collection of libraries (something really true in modern software engineering). In a segmented virtual memory system, each distinct library is placed in its own separate segment such that it has its own address space. They&#39;re still all mapped onto the same flat physical memory address space, but through per-segment base and offset&amp;nbsp;addresses.&lt;/p&gt;
&lt;p&gt;This isn&#39;t actually used much in modern operating systems for reasons I&#39;m not entirely aware of (I&#39;d guess simplicity and performance), but it&#39;s a pleasingly abstract and indirect way of organizing a program (a hallmark of&amp;nbsp;Multics).&lt;/p&gt;
&lt;h3&gt;Virtual Memory, Processes, and Sharing in&amp;nbsp;MULTICS&lt;/h3&gt;
&lt;p&gt;Multics structured its programs in terms of segments, which could be read/write/execute protected. Segmentation is great for doing memory protection (and something that only recently reemerged as the &lt;a href=&quot;http://en.wikipedia.org/wiki/NX_bit&quot;&gt;NX bit&lt;/a&gt; for flat memory models), since it&#39;s easy to do a compare on any &lt;code&gt;(base+offset)&lt;/code&gt; calculation and see if it falls into a protected range. Segments can still of course be paged, and segmentation and paging are complementary: segmentation for protection, and paging for working set&amp;nbsp;management.&lt;/p&gt;
&lt;p&gt;Addressing in Multics is done in terms of a &lt;em&gt;generalized address&lt;/em&gt;: &lt;code&gt;(segment num + word num)&lt;/code&gt;. The segment number of the currently executing segment is stored in the &lt;em&gt;procedure base register&lt;/em&gt;, so most instructions just need to specify a word number. Indirect addressing (i.e. referencing an address stored at an address) is done with a pair of instructions to have enough bits: one for the segment number, one for the word&amp;nbsp;number.&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;descriptor table&lt;/em&gt; is kept of all the segments in a program to map them to hardware addresses. This table maps seg nums to a physical address, and then adds the word num as the offset. This is similar to a page table: virtual to physical address translation. A pointer to the descriptor table is saved as part of the context information of the&amp;nbsp;process.&lt;/p&gt;
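&lt;p&gt;Ignoring paging, the translation step is just a bounds-checked table lookup (a toy model, not the actual Multics&amp;nbsp;hardware):&lt;/p&gt;

```python
def translate(descriptor_table, seg_num, word_num):
    """Map a generalized address (seg num, word num) to a physical
    address via the per-process descriptor table."""
    base, length = descriptor_table[seg_num]
    if word_num >= length:                 # simple bounds/protection check
        raise MemoryError("word number outside segment")
    return base + word_num
```

&lt;p&gt;The same lookup point is where read/write/execute protection checks naturally&amp;nbsp;fit.&lt;/p&gt;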
&lt;p&gt;Now for the complicated bits: dynamic linking. Clearly, we don&#39;t want each program to have to have its own copy of shared segments (say, libc), and we want some abstraction so we aren&#39;t hardcoding word numbers into our program. This also needs to work for segments linking to other segments which link to other segments, etc., so it gets a little hairy. We also want this to be reasonably fast, e.g. don&#39;t do multiple memory accesses for every dynamically linked call, at least after the first time. In list&amp;nbsp;form:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dynamically linked accesses are specially marked in the program&amp;nbsp;text&lt;/li&gt;
&lt;li&gt;Dynamically linked segments are present at a well-known &lt;em&gt;path name&lt;/em&gt;, e.g. in Linux &lt;code&gt;/usr/lib/ld-linux.so.2&lt;/code&gt;, that the system can search for and find (see&amp;nbsp;LD_LIBRARY_PATH).&lt;/li&gt;
&lt;li&gt;Each segment presents a &lt;em&gt;symbol table&lt;/em&gt;, which defines the call-in points for the segment (static vars, functions,&amp;nbsp;etc)&lt;/li&gt;
&lt;li&gt;Initially, all calls to an external function are to an indirect reference stored as &lt;em&gt;link data&lt;/em&gt; in a per-segment &lt;em&gt;linkage table&lt;/em&gt;. This table is initially set to trap to an OS lookup&amp;nbsp;function. &lt;/li&gt;
&lt;li&gt;A &lt;em&gt;linkage pointer&lt;/em&gt; to the &lt;em&gt;linkage table&lt;/em&gt; is maintained to switch around the table as context changes to different&amp;nbsp;segments.&lt;/li&gt;
&lt;li&gt;On the first reference, the OS lookup function finds the file of the external segment, examines its symbol table, and then &lt;em&gt;links&lt;/em&gt; the two segments by updating the link data in the linkage table. Future references use that indirect address to go straight to the external&amp;nbsp;segment.&lt;/li&gt;
&lt;li&gt;A further complication enters when switching the linkage pointers between linked segments. To determine the new value for the linkage pointer, the calling procedure actually calls into the new segment&#39;s linkage table, which has special instructions to fixup the linkage pointer and then call the called procedure.&lt;ul&gt;
&lt;li&gt;Thus: Caller -&amp;gt; Caller&#39;s linkage table ~&amp;gt; Callee&#39;s linkage table ~&amp;gt;&amp;nbsp;Callee&lt;/li&gt;
&lt;li&gt;This is direct -&amp;gt; indirect -&amp;gt; indirect, plus a fixup, seems&amp;nbsp;expensive...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This really isn&#39;t that different from how Linux does it; the basic idea of &quot;keep a well-known table that fixes itself on the first reference&quot; is a winner. I feel like there are a lot of memory accesses required to traverse all these layers of indirection, since you are calling through multiple layers of indirect addressing, each of which is a pair of&amp;nbsp;instructions.&lt;/p&gt;
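&lt;p&gt;The &quot;fix itself on the first reference&quot; trick can be modeled with a linkage table whose entries trap to a lookup routine once and are then patched (purely illustrative; Multics did this with indirect addressing hardware, not&amp;nbsp;dictionaries):&lt;/p&gt;

```python
class LinkageTable:
    def __init__(self, symbol_tables):
        self.symbol_tables = symbol_tables   # segment name to {symbol: fn}
        self.links = {}                      # patched (segment, symbol) links
        self.lookups = 0                     # how many OS lookups happened

    def call(self, segment, symbol, *args):
        key = (segment, symbol)
        if key not in self.links:
            # first reference: trap to the OS, search the target segment's
            # symbol table, and patch the link data in place
            self.lookups += 1
            self.links[key] = self.symbol_tables[segment][symbol]
        return self.links[key](*args)        # later calls go straight through
```

&lt;p&gt;After the first call, subsequent calls skip the lookup entirely, just as linked Multics segments&amp;nbsp;do.&lt;/p&gt;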
&lt;h3&gt;The Multics Virtual Memory: Concepts and&amp;nbsp;Design&lt;/h3&gt;
&lt;p&gt;It&#39;s weird to hear that back in the days of yore, files could not be easily loaded as program text, and the idea of virtual memory for protection, abstraction, and programmer benefit was a new idea. Users were just allocated a range of memory, or &lt;em&gt;core image&lt;/em&gt;, with no sharing between users; if you wanted to work on a &quot;shared file&quot;, that meant doing I/O to copy it into your range, and then another I/O to put it back into the filesystem. Each user&#39;s core image was also an unstructured jumble of instructions and data, which makes system-level sharing and memory protection basically&amp;nbsp;impossible.&lt;/p&gt;
&lt;p&gt;These two goals motivated the design of virtual memory in Multics: sharing and protection. This led to the &lt;em&gt;segmentation&lt;/em&gt;, where each segment appears as a flat, linear namespace to the user program, with read/write/execute/append access rights attached as metadata to the&amp;nbsp;segment.&lt;/p&gt;
&lt;p&gt;Segments are also paged to ease the allocation problem and to support large segments. &lt;em&gt;Descriptor segments&lt;/em&gt; (aka descriptor tables) are also paged for good reason, meaning 4 memory lookups to access a memory location, going through two page tables (one for the descriptor, one for the segment). TLBs work here, but it still sounds slow. Page tables are also a static size, not a&amp;nbsp;tree.&lt;/p&gt;
&lt;p&gt;This paper is a decent overview of Multics, probably would have made sense to read it before the dynamic linking&amp;nbsp;one.&lt;/p&gt;

   </content></entry><entry><title>Android: the Good, the Bad, and the Ugly</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/android_the_good_and_bad.html"/><updated>2011-05-15T23:54:00Z</updated><published>2011-05-15T23:54:00Z</published><id>http://www.umbrant.com/blog/2011/android_the_good_and_bad.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;Over the last week, I&#39;ve been doing a crash course in Android programming. &lt;a href=&quot;http://www.eecs.berkeley.edu/~rxin/&quot;&gt;Reynold&lt;/a&gt; and I have been working on our combination CS294-35 Mobile Development project and AMP Lab retreat demo, which is next week. Coming from a traditional web and application development background, there are some things Android does comparatively well, some comparatively poorly, and some that just irritate&amp;nbsp;me.&lt;/p&gt;
&lt;p&gt;This isn&#39;t a tutorial, but it&#39;d probably be a useful read if you&#39;re thinking of getting started with Android development. My pain, your&amp;nbsp;gain.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;The&amp;nbsp;Good&lt;/h3&gt;
&lt;p&gt;I like the relatively smooth integration of the Android SDK with Eclipse. It&#39;s pretty easy getting to the &lt;a href=&quot;http://developer.android.com/guide/tutorials/hello-world.html&quot;&gt;Hello World&lt;/a&gt; stage with the Android emulator. Autocompletion works as expected. Debugging is easy, since the emulator will throw stacktraces in LogCat even when not in debug mode, and it&#39;s just as easy to do these things on actual phone hardware. It was also easy to get things running on a real phone; the Motorola Droid I was using didn&#39;t require Verizon activation or rooting or paying for a developers license. Check a few boxes in the settings menu, and you&#39;re off to the&amp;nbsp;races.&lt;/p&gt;
&lt;p&gt;There&#39;s also a thriving community of Android developers. It&#39;s very easy to Google your problems (and believe me, that&#39;s important). StackOverflow seems very&amp;nbsp;helpful.&lt;/p&gt;
&lt;p&gt;I also appreciate having an emulator that works so effectively. I wish that the emulator&#39;s camera had a more useful test pattern, but that&#39;s forgivable. Otherwise, I had very few situations where the emulator behavior differed from actual&amp;nbsp;hardware.&lt;/p&gt;
&lt;h3&gt;The&amp;nbsp;Bad&lt;/h3&gt;
&lt;p&gt;There&#39;s a pretty large jump between Hello World and the next, less trivial tutorial in the series (&lt;a href=&quot;http://developer.android.com/guide/tutorials/notepad/index.html&quot;&gt;Notepad&lt;/a&gt;), and there&#39;s a huge jump from Notepad to making your own&amp;nbsp;app.&lt;/p&gt;
&lt;p&gt;Notepad introduces what has to be the messiest part of Android (and indeed, mobile) development: the &lt;a href=&quot;http://developer.android.com/reference/android/app/Activity.html&quot;&gt;application lifecycle&lt;/a&gt;. Take a look at the flowchart in that link. An Activity is basically a single screen of an application. When a new activity is switched in, the old one goes through &lt;code&gt;onPause()&lt;/code&gt; and maybe &lt;code&gt;onStop()&lt;/code&gt;. After that, it&#39;s fair game for the Android task killer, which starts indiscriminately killing off applications if memory is&amp;nbsp;low.&lt;/p&gt;
&lt;p&gt;This is a huge hassle from a traditional app developer standpoint, since it used to be that the OS would save all your application state on a context switch, and restore it when your app is switched back in. Modify some variables, context switch out, context switch back in, and the variables are how you left them. In Android, that&#39;s no longer true. Now, you are forced to serialize all (all!) of your live state out to one of the Android &lt;a href=&quot;http://developer.android.com/guide/topics/data/data-storage.html&quot;&gt;persistent datastores&lt;/a&gt;, which most of the time means using &lt;a href=&quot;http://developer.android.com/guide/topics/data/data-storage.html#db&quot;&gt;SQLite&lt;/a&gt;. If you don&#39;t do this, it means that your app works correctly most of the time (since the task killer doesn&#39;t always kick in), but occasionally, bad things will happen: settings get reset, entered form data disappears, just plain&amp;nbsp;bugs.&lt;/p&gt;
&lt;p&gt;This becomes especially terrible when you&#39;re doing any kind of network programming. This introduces a whole mess of concepts that&#39;d require another full blog post (&lt;a href=&quot;http://developer.android.com/reference/android/app/Service.html&quot;&gt;Service&lt;/a&gt;, &lt;a href=&quot;http://developer.android.com/reference/android/content/ContentProvider.html&quot;&gt;ContentProvider&lt;/a&gt;, and &lt;a href=&quot;http://developer.android.com/guide/topics/fundamentals/processes-and-threads.html&quot;&gt;background reading&lt;/a&gt;), but the basic problem is, how do you reliably make a request to a server if your request might die at any time (due to the application&amp;nbsp;lifecycle)?&lt;/p&gt;
&lt;p&gt;You end up having to watch this &lt;a href=&quot;http://www.youtube.com/watch?v=xHXn3Kg2IQE&quot;&gt;Google I/O talk on how to correctly implement REST in Android&lt;/a&gt;, which features this wonderful diagram that Reynold and I&amp;nbsp;implemented:&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;ContentProvider for REST&quot; src=&quot;android_the_good_the_bad_and_the_ugly/android_rest_diagram.png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Yuck. Wasn&#39;t REST supposed to be the easy&amp;nbsp;way?&lt;/p&gt;
&lt;p&gt;The very naive way of doing network requests is directly in your Activity, which blocks the UI thread (leading to the app freezing) and is clearly bad practice, even to a novice. The slightly-less-naive way is using a Service, in which you have to start a new thread to make the request (or else it will again block the UI thread). This will mostly work, since Services sort of run in the background, but is still prone to erratic bugs because Services are still under the authority of the Android task killer. So, you end up resorting to ContentProvider, and filling out the 6 boxes in the diagram you see&amp;nbsp;above.&lt;/p&gt;
&lt;p&gt;The same story can be found for getting phone location. The &lt;a href=&quot;http://developer.android.com/guide/topics/location/obtaining-user-location.html&quot;&gt;documentation page&lt;/a&gt; isn&#39;t bad, but it quickly becomes obvious that it&#39;s complicated to do it both correctly and well. Your app has to balance using GPS vs. celltower triangulation based on accuracy and availability, cache old locations to get an initial fast fix, invalidate said cached locations if they&#39;re too old or too inaccurate, and minimize overall usage of these radios, since they&#39;re the biggest battery killers in a phone. It&#39;s a lot of manual heavy lifting to do it right, and it&#39;s easy to do it incorrectly (presenting inaccurate results) and poorly (quickly draining&amp;nbsp;battery).&lt;/p&gt;
&lt;h3&gt;The&amp;nbsp;Ugly&lt;/h3&gt;
&lt;p&gt;These are some random warts in the platform, not fundamental issues, but annoying (and&amp;nbsp;fixable).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;There are a lot of concepts thrown at you in the tutorials, and Notepad doesn&#39;t go far enough. I&#39;d like to see some more tutorials, and more beginner-friendly explanations of classes like Activity, Service, Intent, and View, and some high-level advice on designing for the antagonistic application&amp;nbsp;lifecycle.&lt;/li&gt;
&lt;li&gt;The development emulator is really slow. It takes a few minutes to start up, and there&#39;s definitely lag while operating it. Comparatively, the iPhone emulator starts almost instantaneously, and feels just as snappy as a real&amp;nbsp;iPhone.&lt;/li&gt;
&lt;li&gt;No tutorial on how to programmatically build a UI. I get that XML is the preferred way since you can do it graphically in Eclipse, but that falls short pretty&amp;nbsp;fast.&lt;/li&gt;
&lt;li&gt;The XML layout is decidedly less powerful than CSS+HTML. No templates, no style rules, lots of repeating the same padding, margin, and textsize parameters in each file. It&#39;s a nightmare for maintainability; fortunately mobile UIs are&amp;nbsp;simple.&lt;/li&gt;
&lt;li&gt;There isn&#39;t a provided library of icons. I think this is a no brainer; I want some basic icons for things like &quot;list&quot;, &quot;settings&quot;, &quot;home&quot; that I see used in core Android, but these aren&#39;t available in the&amp;nbsp;SDK.&lt;/li&gt;
&lt;li&gt;I can&#39;t figure out how to do not-fullscreen Google Maps with panning and zoom. iPhone can do it, but somehow all the Android maps I see that support pan+zoom are&amp;nbsp;fullscreen.&lt;/li&gt;
&lt;li&gt;Hardware and software keyboards have different event listeners and behaviors. Software keyboard has an unfortunate habit of staying open even when changing tabs in a&amp;nbsp;TabView.&lt;/li&gt;
&lt;li&gt;I disliked having to write something like 4 serialization/deserialization routines for every object. This was due to having to store all my state as Java objects, in SQLite, in JSON to talk to the server over REST, and also as visible data on screen. An ORM or something would be&amp;nbsp;great.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;This experience made me realize why many Android apps suck: it&#39;s hard to do things the right way, and easy to hack it together the wrong way. I don&#39;t think Android is alone on this one; from what I hear, iPhone isn&#39;t much different in terms of application lifecycle. From the OS point of view, it&#39;s great that any app can be killed at any time to save resources. I&#39;m betting this results in huge wins in battery life, performance, and code size. However, it just shifts that burden onto app developers, who aren&#39;t used to doing this kind of thing, and there aren&#39;t libraries or APIs in place to make this as easy as it should&amp;nbsp;be.&lt;/p&gt;
&lt;p&gt;I&#39;m not totally turned off of mobile app development, since I still believe that mobile is essentially the future of computing, but I really think it could be a lot better. Personally, HP&#39;s webOS appeals to me since HTML+CSS+JS is a &lt;em&gt;much&lt;/em&gt; more natural way of writing applications (and that&#39;s not just my bias as a web developer), and more pure Linux-based OSs like MeeGo are certainly easier to program (but then you lose the noted benefits of the &quot;kill anything at anytime&quot; model). I&#39;m still willing to bet on Android, but it still needs a lot of work before it&#39;s a first-class application development&amp;nbsp;environment.&lt;/p&gt;

   </content></entry><entry><title>External sorting of large datasets</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/external_sorting.html"/><updated>2011-04-16T17:24:00Z</updated><published>2011-04-16T17:24:00Z</published><id>http://www.umbrant.com/blog/2011/external_sorting.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a common interview question: how do you sort data that is bigger than memory? &quot;Big data&quot; in the range of terabytes or petabytes can now almost be considered the norm (think of Google saving every search, click, and ad impression ever), so this manifests in reality as well. This is also a canonical problem in the database world, where it is referred to as an &quot;external&amp;nbsp;sort&quot;.&lt;/p&gt;
&lt;p&gt;Your mind should immediately turn to divide and conquer algorithms, namely merge sort. Write out intermediate merged output to disk, and read it back in lazily for the next round. I decided this would be a fun implementation and optimization exercise to do in C. There will probably be a follow-up post, since there are lots of optimizations I haven&#39;t yet&amp;nbsp;implemented.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;Guido van Rossum (the creator of Python) did this a while ago for the rather smaller (and simpler) case of &lt;a href=&quot;http://neopythonic.blogspot.com/2008/10/sorting-million-32-bit-integers-in-2mb.html&quot;&gt;sorting a million 32-bit integers in 2MB of RAM&lt;/a&gt;. I took the same approach of a merge sort that writes intermediate runs out to files on disk, buffering file I/O to improve performance. However, since I&#39;m targeting file sizes that are actually larger than RAM (e.g. a couple gigabytes), I need to do more complicated&amp;nbsp;things.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://en.wikipedia.org/wiki/Merge_sort&quot;&gt;basic merge sort&lt;/a&gt; you learn in CS 101 recurses down to the base case of runs of just 1 element, which are progressively merged together in pairs in a logarithmic fashion (arriving at the ultimate O(n*log n) time complexity). This is inefficient for large datasets, because the merging rate is too low. If you&#39;re sorting a 1GB file of 32-bit integers, the first round of merging would generate &lt;code&gt;(1GB/sizeof(int)/2) = (2^30/2^2/2) = 2^27&lt;/code&gt; 8-byte files, which is just too many files. This also leads to the second core problem: small disk I/Os are highly inefficient, since they result in expensive disk seeks. Writing a bunch of 8-byte (or even 8-kilobyte) files effectively randomizes your access pattern, and will choke your throughput. To avoid bad seeks, reads and writes need to be done at about the size of the disk&#39;s buffer (about 16MB these&amp;nbsp;days).&lt;/p&gt;
&lt;p&gt;All of my code is also &lt;a href=&quot;https://github.com/umbrant/extsort&quot;&gt;available on github&lt;/a&gt; if you want to follow along, this post is based more-or-less on the &lt;a href=&quot;https://github.com/umbrant/extsort/tree/3ce53516063bff05570736c412eed032b803ea15&quot;&gt;initial commit&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Basic&amp;nbsp;Approach&lt;/h3&gt;
&lt;p&gt;So our goal is to reduce the number of files written by the first merge step, and also write these files in much bigger chunks. This can be accomplished by increasing the quantum for merging, and doing n-way instead of 2-way&amp;nbsp;merging.&lt;/p&gt;
&lt;p&gt;I increased the merge quantum by sorting each page (4KB) of initial input with quicksort. This way, even with just 2-way merging, the first round for our 1GB of integers only generates &lt;code&gt;(1GB/page_size/2) = (2^30/2^12/2)&lt;/code&gt; = 2^17 intermediate files, which is a lot better than 2^27, but still too large (over a hundred thousand files is a&amp;nbsp;lot). &lt;/p&gt;
&lt;p&gt;N-way merging merges more (many more) than two runs together at once, and is basically the same algorithm as 2-way merging. This finally brings the fan out down to manageable levels, and makes the output runs much larger, so disk I/O can be more easily batched into large 16MB chunks. Merging the 2^18 sorted pages 64-way, the first round gets down to &lt;code&gt;(2^18/2^6) = 2^12&lt;/code&gt;, or 4096 intermediate files, which is a pleasant&amp;nbsp;number.&lt;/p&gt;
&lt;p&gt;A further necessary improvement is to incrementally pull large runs off disk (required for later merge steps, when the runs are too large to all fit into memory). I do this at the same granularity as my other I/O operations: 16MB. Currently, this decides the degree of fan out as well, since I pack as many 16MB buffers into memory as I&#39;m allowed, and n-way merge across all of them. This could be a problem if oodles and oodles of memory are allocated to the sort (since n gets large), but my computer with 4GB of RAM can only hold 256 runs, which isn&#39;t that&amp;nbsp;many.&lt;/p&gt;
&lt;h3&gt;Miscellaneous&amp;nbsp;notes&lt;/h3&gt;
&lt;p&gt;There are a few other miscellaneous notes. I ran into the per-process fd limit when doing large merges, so files have to be closed and reopened at the correct offset. I also parallelize the initial quicksorting of pages with a simple worker pool, which really helps speed up the first layer of merging. 
My quicksort also reduces recursion depth by bubblesorting for runs smaller than 5, which is okay since bubblesort is efficient on tiny sets (worst case 6 compares, 6 swaps, compare that to insertion sort). This might or might not increase performance, but it&#39;s fun. Finally, even if 256 buffers can fit into memory, one buffer must always be reserved to be an output buffer (meaning you can do at most a 255-way merge). There&#39;s also some &lt;code&gt;O(n)&lt;/code&gt; memory overhead outside of just storing the data buffers, which you need to be aware of if your memory bound is especially&amp;nbsp;tight.&lt;/p&gt;
&lt;h3&gt;Benchmarking&lt;/h3&gt;
&lt;p&gt;Enough discussion, onto the numbers! This is a situation where I feel like building an autotuner, since my envisioned final version will have a number of knobs to tweak (a future project I suppose). Right now, the two knobs I have to play with are the size of the overall buffer, and the size of I/O&amp;nbsp;buffers. &lt;/p&gt;
&lt;p&gt;I took two sets of numbers. The first set was taken on my laptop, which has an Intel Core i7-620M with 4 hyperthreads, 4GB of RAM, and a 7200 RPM disk. The second set was taken on my desktop, an AMD Phenom II X4 965 Black Edition with 4 hardware threads, 4GB of RAM, and a 60GB OCZ Vertex 2 SSD. The SSD should help for the smaller I/O buffer sizes, but sequential access shouldn&#39;t be too far&amp;nbsp;apart.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;Desktop mergesort&quot; src=&quot;external_sorting/extsort_desktop.png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;Laptop mergesort&quot; src=&quot;external_sorting/extsort_laptop.png&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I found these numbers pretty interesting. Each line represents a different total memory size. The graphs indicate that increasing the number of I/O buffer pages leads to better performance as expected, but the small total memory sizes end up performing generally better. Furthermore, my laptop performs better than my desktop with the&amp;nbsp;SSD.&lt;/p&gt;
&lt;p&gt;This can be interpreted as follows. First, linking the fan out of the merge to total memory size is a bad idea. The following table helps make this&amp;nbsp;clear.&lt;/p&gt;
&lt;table border=1&gt;
    &lt;tr&gt;
        &lt;th colspan=4&gt;Fan out of n-way merge&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;/td&gt;
        &lt;th colspan=3&gt;Number of I/O buffer pages (4k)&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;th&gt;Total memory (MB)&lt;/th&gt;
        &lt;td&gt;1024&lt;/td&gt;&lt;td&gt;2048&lt;/td&gt;&lt;td&gt;4096&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;64&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;128&lt;/td&gt;&lt;td&gt;31&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;256&lt;/td&gt;&lt;td&gt;63&lt;/td&gt;&lt;td&gt;31&lt;/td&gt;&lt;td&gt;15&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;384&lt;/td&gt;&lt;td&gt;95&lt;/td&gt;&lt;td&gt;47&lt;/td&gt;&lt;td&gt;23&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;512&lt;/td&gt;&lt;td&gt;127&lt;/td&gt;&lt;td&gt;63&lt;/td&gt;&lt;td&gt;31&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;By looking at the laptop graph and this table together, we see that high fan out for 512MB is killing performance, since performance recovers once the fan out drops to 31 at 4096 buffer pages. Conversely, the 64MB case suffers the opposite problem at 4096 pages; a fan out of 3 is too low. Since the two fastest completion times were both with a fan out of 7 (64MB with 2048 pages, 128MB with 4096 pages), I&#39;m betting the sweet spot is around there, but this requires further tuning to decide for&amp;nbsp;sure.&lt;/p&gt;
&lt;p&gt;The second finding is that the sort is currently CPU bound. This isn&#39;t what I expected since there&#39;s a lot of disk I/O, but it seems that the I/O batching techniques are effective. Otherwise, the desktop with the SSD should outpace the laptop. Furthermore, since merging is still single-threaded, the i7 laptop actually might have an advantage because of &lt;a href=&quot;http://en.wikipedia.org/wiki/Intel_Turbo_Boost&quot;&gt;Turbo Boost&lt;/a&gt; kicking up single core performance above the Phenom II&amp;nbsp;desktop.&lt;/p&gt;
&lt;p&gt;Also note that for the relatively low fan outs at 64 and 128MB, the desktop with the SSD has very flat performance as the size of the I/O buffer changes. This is the beauty of fast random accesses, and might be exploitable for better performance since you can save on memory usage by shrinking the I/O&amp;nbsp;buffers.&lt;/p&gt;
&lt;h3&gt;Future&amp;nbsp;work&lt;/h3&gt;
&lt;p&gt;Both of the aforementioned performance issues can be solved by parallelizing the merge step by running multiple n-way merges simultaneously. This lowers the fanout while still using all available memory, and will better balance CPU and I/O time. The number of threads and fan-out of the merge can be parameterized separately, adding two more tuning knobs to the existing knobs of total memory usage and size of I/O buffer (autotuner&amp;nbsp;time?).&lt;/p&gt;
&lt;p&gt;Another potential performance improvement is &lt;a href=&quot;http://en.wikipedia.org/wiki/Multiple_buffering&quot;&gt;double buffering&lt;/a&gt;. This is essentially asynchronous I/O; instead of waiting synchronously for an I/O operation to complete, the CPU switches over to a second buffer and continues processing data. This comes at the cost of doubling memory usage (two buffers instead of one), but is probably especially beneficial for the write buffer since it&#39;s so&amp;nbsp;active.&lt;/p&gt;
&lt;p&gt;There are a few more minor performance tweaks I can think of, but no more really fundamental ones. Let me know in the comments if there&#39;s something I&#39;ve&amp;nbsp;missed.&lt;/p&gt;
&lt;p&gt;A natural extension to this is parallel sorting with multiple machines, but I don&#39;t plan on taking this little C codebase that far. Better to do it properly with Hadoop in a lot less&amp;nbsp;code.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;My best case sorts 1GB of 32-bit integers in 127 seconds in 64MB of memory on my laptop, and I think there&#39;s at least a 2x improvement left with bigger memory sizes. I really enjoy this kind of performance analysis and tuning, since it requires thinking about the storage hierarchy, memory management, and parallelism. It&#39;s been a reasonable two-day project, and I could see this being assigned as an undergrad course project. It doesn&#39;t feel altogether too different from research either, just at a much smaller&amp;nbsp;scale.&lt;/p&gt;
&lt;p&gt;Once again, all the code is available &lt;a href=&quot;https://github.com/umbrant/extsort&quot;&gt;at github&lt;/a&gt;.&lt;/p&gt;

   </content></entry><entry><title>Album first impressions pt. 1</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/album_first_impressions_pt_1.html"/><updated>2011-04-08T00:00:00Z</updated><published>2011-04-08T00:00:00Z</published><id>http://www.umbrant.com/blog/2011/album_first_impressions_pt_1.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I&#39;ve gotten some large influxes of music recently, so here are my first impressions of eight albums I&#39;ve given a listen or two. It&#39;s a pretty eclectic mix of new and old stuff. Reviews are unordered, with only qualitative ratings. Unexpectedly, Kanye&#39;s newest work is my favorite album of 2010. More to&amp;nbsp;come.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;p&gt;&lt;strong&gt;James Brown - Love Power Peace (Live at the Olympia, Paris, 1971)&lt;/strong&gt;: I have a soft spot for live albums, and this has to be among the best. I don&#39;t have much exposure to funk (the closest thing being jazz fusion), but it&#39;s easy to fall in love with the layered big-band instrumentation, high energy, and evocative call-and-response&amp;nbsp;segments.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;LCD Soundsystem - LCD Soundsystem&lt;/strong&gt; and &lt;strong&gt;LCD Soundsystem - Sound of Silver&lt;/strong&gt;: I liked their most recent album &lt;strong&gt;This Is Happening&lt;/strong&gt; quite a bit, so I checked out the rest of their discography. LCD Soundsystem has a pretty unique sound, rough and unvarnished singing/spoken word over some of the dirtiest bass lines and drops I&#39;ve ever heard. It&#39;s easy to see why their work is such an appealing target for remixers and DJs. I&#39;d argue that James Murphy has taken the art of repetition to perfection, surpassing even French house demigods Daft Punk. I like these two albums less than This Is Happening, but they&#39;re still going on my coding/work playlist since they&#39;re quite listenable for long periods of&amp;nbsp;time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Radiohead - King of Limbs&lt;/strong&gt;: Disappointing. I had to come around to &lt;strong&gt;In Rainbows&lt;/strong&gt;, which is now my favorite Radiohead album, but I don&#39;t think I&#39;m going to acquire the taste of King of Limbs any time soon. It feels heavily influenced by dubstep, a genre of electronic music I&#39;m not particularly infatuated with. I&#39;m willing to give King of Limbs a few more listens because of brand loyalty, but it probably won&#39;t make my list of top 5 favorite Radiohead&amp;nbsp;albums.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Regina Spektor - Begin to Hope&lt;/strong&gt;: She reminds me of Feist, but with better clarity and range. I&#39;m slowly adding female vocal music to my collection (see also: Cat Power), and this is a nice find. Where Feist has this airy, carefree feeling that permeates her music, Spektor is more raw, emotional, and pure. She also has some adorable speech impediment that makes &lt;a href=&quot;http://www.youtube.com/watch?v=5zA4oG4FJFY&quot;&gt;&quot;better&quot; sound like &quot;betto&quot;&lt;/a&gt;.&amp;nbsp;Recommended.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Little Boots - Hands&lt;/strong&gt;: I think she has mild internet fame for her videos &lt;a href=&quot;http://www.youtube.com/watch?v=-CWBgm_-Ggs&quot;&gt;playing&lt;/a&gt; a &lt;a href=&quot;http://www.youtube.com/watch?v=N6tLRCDqJ2c&quot;&gt;Tenori-on&lt;/a&gt;, and now she has an album. It&#39;s pop, which is not normally my thing, so it&#39;s a bit hard for me to judge. It&#39;s a little formulaic (expected), and reminds me of Lady Gaga, with a hip hop or alt rock feel. The autotune still grates on my ears, and is a waste of her talent. Probably won&#39;t go on my regular listening rotation, but recommended for the&amp;nbsp;genre.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Robyn - Body Talk&lt;/strong&gt;: Robyn is a Swedish pop/dance artist who turned down the recording contract that eventually went to Britney Spears. Happily, Robyn also makes much better music. I&#39;ve been watching her since happening across &lt;strong&gt;Body Talk Pt II&lt;/strong&gt;; 2010 saw the release of &lt;strong&gt;Body Talk Pt I&lt;/strong&gt; and &lt;strong&gt;Body Talk Pt II&lt;/strong&gt; as well as this most recent album, which pulls the hits from the first two Body Talk albums and adds a few new songs, forming what is effectively a &quot;best of&quot; compilation. Releasing three albums of new material in a year is impressive; even more impressive is the quality of each song on Body Talk. I could see any song off this album being a single. Recommended even if you don&#39;t like pop. Listen to &lt;a href=&quot;http://www.youtube.com/watch?v=fA3j3VTAsTk&quot;&gt;Fembot&lt;/a&gt; and &lt;a href=&quot;http://www.youtube.com/watch?v=ketX6HITIDU&quot;&gt;Indestructible&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Kanye West - My Beautiful Dark Twisted Fantasy&lt;/strong&gt;: There&#39;s a reason this album made numerous &quot;best of 2010&quot; lists. &lt;em&gt;It&#39;s the best fucking album of 2010&lt;/em&gt;. I can hardly believe I&#39;m saying this, but I actually like it more than Arcade Fire&#39;s 2010 entry, The Suburbs. I heard MBDTF&#39;s &lt;a href=&quot;http://www.youtube.com/watch?v=S-qKboHKPEA&quot;&gt;opener&lt;/a&gt;, and was already overwhelmed. I could go through and recommend each track on the album for a different reason, but I&#39;ll promote in particular &lt;a href=&quot;http://www.youtube.com/watch?v=Bm5iA4Zupek&quot;&gt;Runaway&lt;/a&gt; and &lt;a href=&quot;http://www.youtube.com/watch?v=2VOVStL0NW0&quot;&gt;Blame Game&lt;/a&gt;. Kanye was always one of the best hip-hop producers in the business, and now he&#39;s got the rapping chops to match. Sick beats abound. I love the super smooth production, every note is polished to a glossy sheen. I&#39;m impressed with the variety in styles. To conclude: I don&#39;t particularly like hip hop, I don&#39;t like Kanye&#39;s earlier work, I don&#39;t like Kanye the person, but &lt;strong&gt;My Beautiful Dark Twisted Fantasy&lt;/strong&gt; is undeniably brilliant. It&#39;s a hip hop album for the&amp;nbsp;ages.&lt;/p&gt;

   </content></entry><entry><title>Static website hosting on Amazon S3</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/static_hosting_on_s3.html"/><updated>2011-04-01T02:49:00Z</updated><published>2011-04-01T02:49:00Z</published><id>http://www.umbrant.com/blog/2011/static_hosting_on_s3.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Werner_Vogels&quot;&gt;Werner Vogels&lt;/a&gt;, Amazon CTO, posted on his &lt;a href=&quot;http://www.allthingsdistributed.com/&quot;&gt;blog&lt;/a&gt; about a month ago on &quot;&lt;a href=&quot;http://www.allthingsdistributed.com/2011/02/website_amazon_s3.html&quot;&gt;New AWS feature: Run your website from Amazon S3&lt;/a&gt;&quot;. S3 now offers the ability to host static HTML pages directly from an S3 bucket, which is a great alternative for small blogs and sites (provided, of course, that you don&#39;t actually need any dynamic content). This has the potential to greatly reduce your hosting costs. A small Dreamhost/Slicehost/Linode costs around $20 a month, and I used to run this site out of an extreme budget VPS (&lt;a href=&quot;http://virpus.com/&quot;&gt;Virpus&lt;/a&gt;) which was only $5 a month, but I expect to be paying only a few cents per month for S3 (current pricing is just &lt;a href=&quot;http://aws.amazon.com/s3/#pricing&quot;&gt;15&amp;cent; per GB-month&lt;/a&gt;). Of course, you also gain best-of-class durability, fault-tolerance, and scalability from hosting out of S3, meaning that your little site should easily survive a&amp;nbsp;slashdotting.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;p&gt;The difficulty here is that most of the popular blogging engines require a backing database, and do their content generation dynamically server side. That doesn&#39;t fly with S3; since it is, after all, just a Simple Storage Service, content has to be static and pregenerated. I chose to use &lt;a href=&quot;http://ringce.com/hyde&quot;&gt;Hyde&lt;/a&gt;, a Python content generator that turns templates (based on the &lt;a href=&quot;http://www.djangoproject.com/&quot;&gt;Django&lt;/a&gt; templating engine) into HTML. Hyde page templates are dynamic, written in &lt;a href=&quot;http://docs.djangoproject.com/en/dev/topics/templates/&quot;&gt;Django&#39;s templating language&lt;/a&gt; which supports variables, control flow, and hierarchical inheritance. Hyde will parse these templates, fill in the dynamic content, and finally generate static HTML pages suitable for uploading to S3. Ruby folks can check out &lt;a href=&quot;http://jekyllrb.com/&quot;&gt;Jekyll&lt;/a&gt; as an&amp;nbsp;alternative.&lt;/p&gt;
&lt;h3&gt;Caveats&lt;/h3&gt;
&lt;p&gt;To be clear, purely static content won&#39;t suffice for many sites out there (like anything with user-generated content). Even a simple blog like this one is only feasible because there are web services that fill in the gaps in functionality. &lt;a href=&quot;http://disqus.com/&quot;&gt;Disqus&lt;/a&gt; seems to have cornered the market for comments as a service; you just include a little bit of Javascript and it&#39;s good to go. It&#39;s similarly easy to include a &lt;a href=&quot;http://tweet.seaofclouds.com/&quot;&gt;Twitter widget showing your recent tweets&lt;/a&gt; with another little blob of Javascript, and &lt;a href=&quot;http://www.feedburner.com&quot;&gt;Feedburner&lt;/a&gt; and &lt;a href=&quot;http://www.google.com/analytics/&quot;&gt;Google Analytics&lt;/a&gt; are the de facto analytics tools. There&#39;s barely a need these days to scrape, store, and serve content yourself, further obviating the need for a real&amp;nbsp;server.&lt;/p&gt;
&lt;p&gt;This is also clearly a more coding heavy approach to blogging and site generation than most people need. With free blog services like &lt;a href=&quot;http://wordpress.com&quot;&gt;Wordpress.com&lt;/a&gt;, &lt;a href=&quot;http://www.blogger.com&quot;&gt;Blogger&lt;/a&gt;, &lt;a href=&quot;http://www.tumblr.com&quot;&gt;Tumblr&lt;/a&gt;, and &lt;a href=&quot;http://www.posterous.com&quot;&gt;Posterous&lt;/a&gt;, blogging has never been easier or more available. &lt;a href=&quot;http://sites.google.com&quot;&gt;Google Sites&lt;/a&gt; is also a great way of throwing up a quick website. I went with S3 and Hyde because I wanted more customization in the look and feel of the site, I like the Django templating system, and I wanted to play with S3 (especially since Amazon offers &lt;a href=&quot;http://aws.amazon.com/free/&quot;&gt;1 year of free AWS credit&lt;/a&gt;). I also feel a bit safer about my data, since it&#39;s backed by &lt;a href=&quot;http://aws.amazon.com/s3/faqs/#How_is_Amazon_S3_designed_to_achieve_99.999999999%_durability&quot;&gt;Amazon&#39;s eleven 9&#39;s of durability&lt;/a&gt; on S3, it&#39;s on my local machine, and &lt;a href=&quot;https://github.com/umbrant/umbrant-blog&quot;&gt;under version control at github&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Hyde&lt;/h3&gt;
&lt;p&gt;Hyde is pretty straightforward for anyone with experience writing Django templates, since it&#39;s basically the Django template engine plus some extra magic content and context tags. The &lt;a href=&quot;https://github.com/lakshmivyas/hyde/blob/master/README.markdown&quot;&gt;Hyde README&lt;/a&gt; and &lt;a href=&quot;https://github.com/lakshmivyas/hyde/wiki&quot;&gt;github wiki&lt;/a&gt; are somewhat helpful in laying this out. Essentially, Hyde lets you assign per-page metadata that can be accessed by other pages as they walk the directory structure of your content; your URL structure mirrors your folder structure. By default, this metadata includes a &lt;code&gt;created&lt;/code&gt; field that fuels the magic &lt;code&gt;recents&lt;/code&gt; template tag which gets the most recent content from a directory (like your blog). There are a few more Hyde specific features which you can read about on &lt;a href=&quot;https://github.com/lakshmivyas/hyde/wiki/Templating-Guide&quot;&gt;the wiki page on templating&lt;/a&gt;, and the &lt;a href=&quot;http://docs.djangoproject.com/en/dev/ref/templates/builtins/&quot;&gt;Django templating reference&lt;/a&gt; is also&amp;nbsp;useful.&lt;/p&gt;
&lt;p&gt;I still found myself a little stuck, and what was most useful was reading the source for the skeleton site that Hyde generates for you initially, and the &lt;a href=&quot;https://github.com/sjl/stevelosh/&quot;&gt;code that Steve Losh&lt;/a&gt; uses to generate his own blog. To help you out, I&#39;ve &lt;a href=&quot;https://github.com/umbrant/umbrant-blog&quot;&gt;published the code&lt;/a&gt; for this site on github too. It might be useful to read &lt;a href=&quot;http://stevelosh.com/blog/2010/01/moving-from-django-to-hyde/&quot;&gt;Steve&#39;s write up on moving from Django to Hyde&lt;/a&gt; as&amp;nbsp;well.&lt;/p&gt;
&lt;p&gt;A few Hyde features I like are the ability to automatically compress Javascript and CSS with jsmin and cssmin, and support for writing posts in &lt;a href=&quot;http://daringfireball.net/projects/markdown/&quot;&gt;Markdown&lt;/a&gt;, which is a lot easier and more portable than HTML. There&#39;s also support for writing &quot;higher level CSS&quot; (CleverCSS, HSS, LessCSS), but I never understood the point of these and didn&#39;t use&amp;nbsp;them.&lt;/p&gt;
&lt;p&gt;The features I had to add to the skeleton code are a draft status for posts, and the &quot;Recent Posts&quot; and &quot;Archive&quot; sections on the sidebar. Drafts were done by adding a metadata &lt;code&gt;draft: True&lt;/code&gt; tag to draft posts, and modifying all my &quot;listing&quot; pages to exclude these posts (like the home page, archive, recent posts, and atom feed). The &quot;Recent Posts&quot; and &quot;Archive&quot; sidebar sections use &lt;code&gt;page.walk&lt;/code&gt; to traverse the blog directory and the &lt;code&gt;recents&lt;/code&gt; tag to pull the most recent posts. These posts are then filtered with if statements to exclude draft content. This is all slightly hacky, since if you want to show the 5 most recent blog posts (as returned by &lt;code&gt;recents 5&lt;/code&gt;), you might have fewer than 5 posts after filtering out drafts. I work around this by not dating drafts until publication (which gives them a default date in&amp;nbsp;1900).&lt;/p&gt;
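&lt;p&gt;To make that listing logic concrete, here&#39;s a little Python model of it. This is just a sketch of the behavior described above, not Hyde&#39;s actual code; the page dicts and field names are made up for&amp;nbsp;illustration:&lt;/p&gt;

```python
from datetime import datetime

# Undated drafts sort to a default date, mirroring Hyde's "default
# date in 1900" behavior for posts without a created field.
DEFAULT_DATE = datetime(1900, 1, 1)

def recents(pages, n):
    """Like the recents template tag: the n most recently created pages."""
    return sorted(pages, key=lambda p: p.get("created", DEFAULT_DATE),
                  reverse=True)[:n]

def published(pages):
    """Like the if-statement filtering in the templates: drop drafts."""
    return [p for p in pages if not p.get("draft", False)]

posts = [
    {"title": "hello world", "created": datetime(2011, 3, 29)},
    {"title": "static hosting", "created": datetime(2011, 4, 2)},
    {"title": "wip", "draft": True},  # undated draft, sorts to 1900
]

# Asking for 3 recent posts but filtering drafts can leave fewer than 3,
# which is the slight hackiness mentioned above.
sidebar = published(recents(posts, 3))
```

&lt;p&gt;Since undated drafts sort to the very bottom, they only ever drop entries off the end of the sidebar rather than displacing real&amp;nbsp;posts.&lt;/p&gt;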
&lt;p&gt;I also had to modify Hyde&#39;s &lt;code&gt;page.walk&lt;/code&gt; and &lt;code&gt;page.walk_reverse&lt;/code&gt; to walk directories in lexicographically sorted order, but I&#39;m hoping that&#39;s been fixed in git (I was using version&amp;nbsp;0.4).&lt;/p&gt;
&lt;h3&gt;S3&lt;/h3&gt;
&lt;p&gt;There is &lt;a href=&quot;http://aws.typepad.com/aws/2011/02/host-your-static-website-on-amazon-s3.html&quot;&gt;plenty&lt;/a&gt; of &lt;a href=&quot;http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?WebsiteHosting.html&quot;&gt;documentation&lt;/a&gt; on &lt;a href=&quot;http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?WebsiteHosting.html&quot;&gt;how&lt;/a&gt; to set up an S3 bucket as a website. It&#39;s pretty easy; I didn&#39;t have any trouble with&amp;nbsp;this.&lt;/p&gt;
&lt;p&gt;Making your existing domain name point to your S3 bucket is a little trickier. S3 provides a URL for your bucket (in my case, &lt;a href=&quot;http://www.umbrant.com.s3-website-us-west-1.amazonaws.com/&quot;&gt;http://www.umbrant.com.s3-website-us-west-1.amazonaws.com/&lt;/a&gt;). The first problem is a limitation of DNS: you can&#39;t make your zone apex a CNAME. If that was gibberish, it means that you can&#39;t make your plain domain name (&lt;a href=&quot;http://umbrant.com&quot;&gt;http://umbrant.com&lt;/a&gt;) an alias for another domain name, like your S3 bucket&#39;s. Subdomains don&#39;t have this limitation, which is why you&#39;re viewing this blog at &lt;a href=&quot;http://www.umbrant.com&quot;&gt;http://www.umbrant.com&lt;/a&gt;, happily CNAME&#39;d to my S3 bucket. My zone apex then does a redirect to the &lt;code&gt;www&lt;/code&gt; subdomain; this redirect is a service provided by some registrars, or you can beg a friend with a&amp;nbsp;server.&lt;/p&gt;
&lt;p&gt;I just lied to you a little about how this works. Notice that if you &lt;code&gt;dig www.umbrant.com&lt;/code&gt;, you get the&amp;nbsp;following:&lt;/p&gt;
&lt;div class=&quot;code&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nv&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;dig&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;www&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;umbrant&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;com&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;sr&quot;&gt;&amp;lt;snip&amp;gt;&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;p&quot;&gt;;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;QUESTION&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SECTION:&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;www&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;umbrant&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;com&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;     &lt;span class=&quot;n&quot;&gt;IN&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;p&quot;&gt;;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ANSWER&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SECTION:&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;n&quot;&gt;www&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;umbrant&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;com&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;831&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IN&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;CNAME&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;s3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;us&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;west&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;amazonaws&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;com&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;n&quot;&gt;s3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;us&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;west&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;amazonaws&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;com&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;A&lt;/span&gt;  &lt;span class=&quot;mf&quot;&gt;204.246.162.151&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;sr&quot;&gt;&amp;lt;snip&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;

&lt;p&gt;My subdomain isn&#39;t actually CNAME&#39;d to my S3 bucket domain name; I&#39;ve set it to alias directly to &lt;code&gt;s3-website-us-west-1.amazonaws.com&lt;/code&gt;. This is a mild optimization that saves a DNS lookup; if you &lt;code&gt;dig&lt;/code&gt; my bucket domain name, you see that it&#39;s CNAME&#39;d to &lt;code&gt;s3-website-us-west-1.amazonaws.com&lt;/code&gt; anyway, which finally gets turned into the IP address for an S3 server (the A record). This server uses the requested domain name (&lt;code&gt;www.umbrant.com&lt;/code&gt;, from the HTTP Host header) to look up the S3 bucket with the same name. This system also means that if someone&#39;s already made a bucket in your region with the same name as your subdomain, you&#39;ve got to choose a different subdomain (thanks to S3&#39;s flat keyspace). In other words, when using S3, your bucket name and subdomain must be the&amp;nbsp;same.&lt;/p&gt;
&lt;p&gt;Uploading files to S3 isn&#39;t too bad. I&#39;m sure there are existing tools out there for interfacing with S3 on the command line, but I rolled my own in Python with the &lt;a href=&quot;http://pypi.python.org/pypi/simples3/1.0&quot;&gt;SimpleS3&lt;/a&gt; library available on PyPI. It&#39;s basically rsync-for-S3 with some issues: it doesn&#39;t delete old files from S3, the parsing isn&#39;t bulletproof, and it uses modtimes to check for updates instead of checksums (checksums are planned; since re-running Hyde regenerates every file&#39;s modtime, right now almost my entire blog gets re-uploaded each time). However, it does work, and it is really simple to&amp;nbsp;use.&lt;/p&gt;
&lt;div class=&quot;code&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;simples3&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;os&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;re&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;c&quot;&gt;# Config options&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;n&quot;&gt;ACCESS_KEY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;YOUR_ACCESS_KEY&amp;#39;&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;n&quot;&gt;SECRET_KEY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;YOUR_SECRET_KEY&amp;#39;&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;c&quot;&gt;# Change this&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;n&quot;&gt;BUCKET_NAME&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;s&quot;&gt;&amp;quot;www.umbrant.com&amp;quot;&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;c&quot;&gt;# Change this too, make sure to edit your region and bucket name&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;n&quot;&gt;BASE_URL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;https://s3-us-west-1.amazonaws.com/www.umbrant.com&amp;#39;&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;c&quot;&gt;# NO TRAILING SLASH&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;n&quot;&gt;SOURCE_DIR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;/home/andrew/dev/umbrant_static/deploy&amp;quot;&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span
class=&quot;n&quot;&gt;IGNORE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;br /&gt;          &lt;span class=&quot;s&quot;&gt;&amp;quot;\.(.*).swp$&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;~$&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# ignore .swp files&lt;/span&gt;&lt;br /&gt;         &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;c&quot;&gt;# code&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;n&quot;&gt;ignore_re&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IGNORE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;br /&gt;    &lt;span class=&quot;n&quot;&gt;ignore_re&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;re&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;c&quot;&gt;# open bucket&lt;/span&gt;&lt;br /&gt;&lt;span class=&quot;n&quot;&gt;bucket&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S3Bucket&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BUCKET_NAME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;access_key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ACCESS_KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;br /&gt;                  &lt;span class=&quot;n&quot;&gt;secret_key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SECRET_KEY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;base_url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BASE_URL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;c&quot;&gt;# recursively put in all files in SOURCE_DIR&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;root&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dirs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;files&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;walk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SOURCE_DIR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;br /&gt;    &lt;span class=&quot;n&quot;&gt;relroot&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;root&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SOURCE_DIR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt;&lt;br /&gt;    &lt;span 
class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;files&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;c&quot;&gt;# root directory files should not have a preceding &amp;quot;/&amp;quot;&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;c&quot;&gt;# puts the files in a blank named directory, not what we want&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relroot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relroot&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;/&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;root&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;/&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;        &lt;span class=&quot;c&quot;&gt;# check in the ignore 
list&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;n&quot;&gt;ignore&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ignore_re&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;re&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;match&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;br /&gt;                &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Ignoring&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;br /&gt;                &lt;span class=&quot;n&quot;&gt;ignore&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ignore&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;k&quot;&gt;continue&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;        &lt;span class=&quot;n&quot;&gt;stat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;br 
/&gt;        &lt;span class=&quot;n&quot;&gt;metadata&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;modtime&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;st_mtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)}&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;        &lt;span class=&quot;c&quot;&gt;# check if it&amp;#39;s changed with modtimes&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;n&quot;&gt;sf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;n&quot;&gt;sf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bucket&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;br /&gt;        &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;n&quot;&gt;contents&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;br /&gt;            &lt;span 
class=&quot;n&quot;&gt;bucket&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;put&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;contents&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;public-read&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Uploading&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;k&quot;&gt;continue&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;metadata&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;has_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;modtime&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt; \&lt;br /&gt;        &lt;span class=&quot;n&quot;&gt;sf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;metadata&amp;quot;&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;modtime&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;st_mtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;n&quot;&gt;bucket&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;put&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;acl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;public-read&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;br /&gt;                       &lt;span class=&quot;n&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Uploading&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;br /&gt;            &lt;span class=&quot;k&quot;&gt;continue&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&lt;br /&gt;        &lt;span 
class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Skipping&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;
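&lt;p&gt;For the checksum-based change detection mentioned above, one option is to compare a local MD5 against the object&#39;s ETag, which S3 sets to the MD5 of the content for simple (non-multipart) puts. This is a sketch of that planned improvement, not part of the script above, and it assumes the remote ETag is fetched separately (e.g. from a bucket&amp;nbsp;listing):&lt;/p&gt;

```python
import hashlib

def local_md5(path, chunk_size=2**20):
    """MD5 of a local file, read in 1MB chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_upload(local_digest, remote_etag):
    """True if the local file differs from the remote copy.
    S3 returns the ETag wrapped in double quotes, so strip them."""
    if remote_etag is None:  # file isn't on S3 yet
        return True
    return local_digest != remote_etag.strip('"')
```

&lt;p&gt;Unlike modtimes, the checksum survives regenerating the whole site with Hyde, so unchanged files get skipped instead of&amp;nbsp;re-uploaded.&lt;/p&gt;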

&lt;h3&gt;Final&amp;nbsp;remarks&lt;/h3&gt;
&lt;p&gt;This was a pretty reasonable and fun 2 days of effort, most of which was spent on tuning the CSS template and writing content, not wrangling code. Hyde doesn&#39;t feel very mature (documentation is lacking, the example skeleton site is slightly broken, and there&#39;s the directory-sorting bug), but it works well enough and is good for people transitioning from Django. I&#39;m very positive about S3 and Amazon Web Services in general (&lt;a href=&quot;http://blog.reddit.com/2011/03/why-reddit-was-down-for-6-of-last-24.html&quot;&gt;modulo Elastic Block Store being terrible&lt;/a&gt;, but that&#39;s a rant for another day), since my site is now essentially impervious to failure. It&#39;s also pleasing to see top management like Werner Vogels dogfooding Amazon&#39;s&amp;nbsp;features.&lt;/p&gt;

   </content></entry><entry><title>Hello world!</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2011/hello_world.html"/><updated>2011-03-30T01:34:29Z</updated><published>2011-03-30T01:34:29Z</published><id>http://www.umbrant.com/blog/2011/hello_world.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;Hello world! This is my new blog and profile site, purely static HTML hosted out of S3. 
Since I find this to be a neat trick, I&#39;m going to make the howto into my second-ever blog post.
This site is still very much under development; I have lots of ideas for neat features I want to add (comments being the first), but it&#39;s good enough to put&amp;nbsp;live.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;


   </content></entry><entry><title>Cake presented at SoCC</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2012/cake.html"/><updated>2012-10-15T14:00:00Z</updated><published>2012-10-15T14:00:00Z</published><id>http://www.umbrant.com/blog/2012/cake.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I just presented some of our work at Berkeley at SoCC, on &quot;Cake: Enabling High-Level SLOs on Shared Storage.&quot;
It&#39;s a coordinated, multi-resource scheduler for storage workloads, which enables consolidation of front-end and back-end workloads while meeting the high-level performance requirements of the front-end workload.
Consolidation reduces economic costs (less overprovisioning and underutilization), and also significantly cuts the latency of traditional unconsolidated copy-then-process analytics&amp;nbsp;cycles.&lt;/p&gt;
&lt;p&gt;A &lt;a href=&quot;/papers/socc12-cake.pdf&quot;&gt;PDF of the paper&lt;/a&gt; and the &lt;a href=&quot;/papers/socc12-cake-slides-awang.pptx&quot;&gt;slides from my presentation&lt;/a&gt; are available on my &lt;a href=&quot;/research.html&quot;&gt;research page&lt;/a&gt;.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;


   </content></entry><entry><title>MinuteSort with Flat Datacenter Storage</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2012/flat_datacenter_storage.html"/><updated>2012-07-02T22:42:00Z</updated><published>2012-07-02T22:42:00Z</published><id>http://www.umbrant.com/blog/2012/flat_datacenter_storage.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;Microsoft Research recently crushed the &lt;a href=&quot;http://sortbenchmark.org/&quot;&gt;world record for MinuteSort&lt;/a&gt;, sorting 1.4TB in a minute. This replaces the former record held by Yahoo&#39;s 1406-node Hadoop cluster in the Daytona MinuteSort category, and means that Hadoop no longer holds any world sorting record&amp;nbsp;titles.&lt;/p&gt;
&lt;p&gt;I found MSR&#39;s approach of &quot;&lt;a href=&quot;http://sortbenchmark.org/FlatDatacenterStorage2012.pdf&quot;&gt;MinuteSort with Flat Datacenter Storage&lt;/a&gt;&quot; (FDS) to be intriguing. Most of the prior sort winners (e.g. Hadoop, TritonSort) try to colocate computation and data, since you normally pay a throughput (and thus latency) cost to go over the network. FDS separates out compute from storage, heavily provisioning a full bisection bandwidth network to match the I/O rate of the hard disks on storage&amp;nbsp;nodes.&lt;/p&gt;
&lt;p&gt;I&#39;m going to give a rundown of the paper, and then pull out salient points for Hadoop at the&amp;nbsp;end.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Storage&lt;/h3&gt;
&lt;p&gt;FDS, the storage system, is pretty simple. It&#39;s a straight-up blob store, where &lt;em&gt;blobs&lt;/em&gt; are arbitrarily large, but are read and written in 8MB chunks called &lt;em&gt;tracts&lt;/em&gt;. Blobs are identified with a 128-bit GUID, and a tract number is an offset into the blob in 8MB units. Choosing 8MB means that the disks get essentially full sequential write bandwidth, since seek times are amortized over these long transfers. It provides asynchronous APIs for read and write operations, which is pretty necessary for doing high-performance parallel&amp;nbsp;I/O.&lt;/p&gt;
&lt;p&gt;Data is placed on storage nodes (aka &lt;em&gt;tractservers&lt;/em&gt;) through a simple hashing scheme. There&#39;s no Namenode like in Hadoop. Instead, clients cache a &lt;em&gt;tract locator table&lt;/em&gt; (TLT) which lists all of the tractservers in immutable order. Lookups for a tract in a blob are done by hashing the blob GUID, adding the tract number, and then moduloing to index into the TLT. The paper doesn&#39;t talk about replication for availability and durability, but it&#39;d be easy to tack it on the way MongoDB does: make identical clones of each tractserver, then do failover with Paxos or some&amp;nbsp;such.&lt;/p&gt;
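&lt;p&gt;The lookup scheme is simple enough to sketch in a few lines of Python. The paper doesn&#39;t specify the hash function, so MD5 is a stand-in here, and the names are&amp;nbsp;mine:&lt;/p&gt;

```python
import hashlib

def tlt_index(blob_guid, tract_num, tlt_len):
    """Find the TLT entry for a tract: hash the blob GUID, add the
    tract number, modulo the table length. Hashing the GUID spreads
    blobs across the table; adding the tract number means consecutive
    tracts hit consecutive table entries."""
    h = int(hashlib.md5(blob_guid.encode()).hexdigest(), 16)
    return (h + tract_num) % tlt_len

# Sequential reads of one blob walk through the table in order:
idxs = [tlt_index("my-blob-guid", t, 100) for t in range(4)]
```

&lt;p&gt;Since the table is cached at every client, lookups never touch a central server, which is exactly what removes the Namenode from the&amp;nbsp;picture.&lt;/p&gt;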
&lt;p&gt;There&#39;s some more complexity in the hashing scheme to handle load balancing, which is reminiscent of Chord&#39;s &lt;em&gt;virtual node&lt;/em&gt; approach. Client access through the TLT is normally going to be sequential, since clients will mostly be sequentially scanning through the tracts of a blob. Remember that to find the TLT index, the blob ID is hashed, but the tract number is just added. This means you can get &lt;em&gt;convoy effects&lt;/em&gt; where clients auto-synchronize as they wait on a slow tractserver, and from then on move in lockstep through the list. This is countered by taking 20 copies of the list of tractservers, shuffling each copy, and then concatenating them together. This super-list prevents auto-synchronizing lockstep, and also does a good job of balancing data distribution across&amp;nbsp;tractservers.&lt;/p&gt;
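The super-list construction is easy to sketch. The server names and the fixed seed below are my own assumptions for reproducibility, not details from the paper:

```python
import random

def build_tlt(tractservers, copies=20, seed=0):
    """Concatenate `copies` independently shuffled copies of the server
    list. A client scanning sequentially sees a different server order
    in each segment, which breaks the lockstep convoy."""
    rng = random.Random(seed)
    tlt = []
    for _ in range(copies):
        segment = list(tractservers)
        rng.shuffle(segment)
        tlt.extend(segment)
    return tlt

servers = ["ts%d" % i for i in range(10)]
tlt = build_tlt(servers)
assert len(tlt) == 200
# Every server appears exactly 20 times, so data stays evenly spread.
assert all(tlt.count(s) == 20 for s in servers)
```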
&lt;p&gt;An interesting note is that storage nodes are heterogeneous, with the authors saying they &quot;bought, begged, and borrowed&quot; a mix of machines. They again borrow from Chord&#39;s virtual nodes by adjusting the number of times a node appears in the TLT proportionally to its I/O capability. This assumes that a node&#39;s I/O rate and storage capacity are proportional, since this one mechanism of tweaking the number of appearances in the TLT has to distribute both I/O and data. This assumption doesn&#39;t hold up, since hard disk sizes are increasing faster than I/O rates, but maybe their hardware wasn&#39;t heterogeneous enough to care. In any case, it&#39;s not addressed in the&amp;nbsp;paper.&lt;/p&gt;
&lt;h3&gt;Network&lt;/h3&gt;
&lt;p&gt;Each node has dual 10GigE Ethernet, connected by a full-bisection bandwidth Clos network. Since their flows are all short and bursty (random 8MB accesses), TCP&#39;s slow start is a bad fit. Instead, they use a windowed RTS/CTS scheme which limits the number of flows at each receiver, and thus limits congestion in the Clos. Control messages are sent through separate TCP connections to improve latency; this is because TCP&#39;s per-flow fairness allows them to skip&amp;nbsp;buffering.&lt;/p&gt;
&lt;p&gt;They also try to zero-copy everything right from disk to network to application buffers. This is basically a requirement if you&#39;re trying to beat a sorting&amp;nbsp;record.&lt;/p&gt;
&lt;h3&gt;Phases&lt;/h3&gt;
&lt;p&gt;There are two phases: a read and partition phase, and a write phase. During the &lt;em&gt;read phase&lt;/em&gt;, compute nodes read tracts of unsorted data from storage nodes, sort the data into buffers based on partition, and then stream these buffers to the node responsible for the partition. After the data is partitioned, each node does a &lt;em&gt;write phase&lt;/em&gt; where it sorts its partition and writes it out to disk. Since they&#39;re doing MinuteSort, the dataset fits entirely in memory and only has to be read and written&amp;nbsp;once.&lt;/p&gt;
&lt;p&gt;Separated storage makes handling stragglers very easy. Since the initial read phase is completely stateless, any node can read and partition any tract of data. This means a dynamic work queue can be used for load balancing, which is an approach I really like. Whenever a node finishes a work item, it polls a central coordinator for more work to do. As long as the size of each work item isn&#39;t too big, stragglers no longer have a major impact on overall job completion&amp;nbsp;time.&lt;/p&gt;
&lt;p&gt;To determine the right size for work items, the authors use a clever dynamic &quot;Zeno allocation&quot; scheme. Basically, work items decrease in size as the job nears completion. This makes sense because stragglers hurt you at the end of the job, so it&#39;s worth the overhead of using smaller work items for that last wave of&amp;nbsp;processing. &lt;/p&gt;
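The paper does not spell out the exact formula, but one plausible shape of such a shrinking-allocation scheme, purely as a hypothetical sketch, is to hand each worker a fixed fraction of the remaining work:

```python
def zeno_sizes(total, fraction=0.25, minimum=1):
    # Hypothetical shape of a shrinking-allocation scheme: each work
    # item takes a fraction of what remains, so items get smaller as
    # the job nears completion (hence the "Zeno" name).
    remaining, sizes = total, []
    while remaining > 0:
        item = max(minimum, int(remaining * fraction))
        item = min(item, remaining)
        sizes.append(item)
        remaining -= item
    return sizes

sizes = zeno_sizes(1000)
assert sum(sizes) == 1000
assert sizes[0] == 250 and sizes[-1] == 1   # big first, tiny at the end
```

Big items up front keep coordinator polling overhead low; tiny items at the end bound how long any single straggler can hold up the final wave.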
&lt;p&gt;Computation and communication are overlapped during the read phase. As a node receives buffers of partitioned data, it saves up until it has a 250MB chunk and then quicksorts it in the background. Thus during the write phase, each node has to do maybe one more 250MB quicksort, then just merges all of its buffers together as they&#39;re written out to disk. This hierarchy of sorts is pretty much the same approach I used in my &lt;a href=&quot;/blog/2014/bucket_list_fish.html&quot;&gt;external sorting implementation&lt;/a&gt;. It also lets them do external sorting of datasets bigger than&amp;nbsp;memory.&lt;/p&gt;
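This sort-then-merge hierarchy is the standard external-sort pattern; a tiny sketch, with the 250MB chunks shrunk down to toy-sized runs:

```python
import heapq
import random

# Sort incoming buffers ("runs") as they arrive, then lazily merge all
# the sorted runs in one streaming pass at write time.
random.seed(0)
data = [random.randrange(1000) for _ in range(10_000)]
runs = [sorted(data[i:i + 250]) for i in range(0, len(data), 250)]

merged = list(heapq.merge(*runs))   # single merge pass over all runs
assert merged == sorted(data)
```

Because the merge only ever needs one buffered element per run, the same structure handles datasets bigger than memory: the runs just live on disk instead of in lists.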
&lt;h3&gt;Hadoop&amp;nbsp;comparison&lt;/h3&gt;
&lt;p&gt;There are a number of reasons why this is significantly faster than Hadoop&amp;nbsp;MapReduce.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In FDS, data is only read and written to disk once. Hadoop&#39;s intermediate data is sorted and spilled to disk by the mapper, while FDS keeps this entirely in&amp;nbsp;memory.&lt;/li&gt;
&lt;li&gt;FDS also streams and sorts data during the read phase, while Hadoop has a barrier between its map and reduce phases. Hadoop can do some pre-aggregation in the mapper with a combiner, but not as flexibly as in FDS, not on the reduce side, and not without hogging an entire task&amp;nbsp;slot.&lt;/li&gt;
&lt;li&gt;FDS has better task scheduling, with a dynamic work queue and dynamic work item&amp;nbsp;sizes.&lt;/li&gt;
&lt;li&gt;FDS uses a single process for both the read and write phases, so the JVM startup cost is gone, and there&#39;s no unnecessary movement of data between address&amp;nbsp;spaces.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That said, FDS gets away with some of this because of its problem domain. The MapReduce paradigm is optimized for simplicity and scale-out rather than raw performance. Serializing intermediate data and having non-overlapping map and reduce phases makes fault-tolerance easier. If a reducer dies, a new reducer can just read all the intermediate data off disk. If a mapper dies, there&#39;s no icky problems with a reducer having already processed some of the mapper&#39;s output. Hadoop is also optimized for overall cluster throughput rather than completion time of this single MinuteSort job, which is part of why it uses much larger splits (64MB or 128MB) and doesn&#39;t have dynamic task sizing (hurts overall&amp;nbsp;throughput).&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;What I get out of this is that MSR built a highly-optimized blob store, connected it to a 20Gbit Clos network, and ran an in-memory sort. While it does crush Hadoop, I don&#39;t think there&#39;s much here over TritonSort, since their 1.47TB/min for the Indy MinuteSort wasn&#39;t that much better than TritonSort&#39;s 1.35TB/min in 2011, and TritonSort was built with academic dollars rather than MSR dollars. I like the idea of separated storage, but building out a full Clos network is expensive. I liked their tricks of hashing to determine tract locations, but they really should present a full fault-tolerance and durability story. Network scheduling in the general case is also more complicated than in FDS sort, and this is part of why Clos networks aren&#39;t so common in practice: they&#39;re nice to program for, but hard to keep fully&amp;nbsp;utilized.&lt;/p&gt;
&lt;p&gt;I think the main areas where Hadoop could see improvement are increased use of zero-copy I/O and a better story for intermediate data and use of memory in the cluster. Since zero-copy has improved greatly in recent releases of Hadoop, the big remaining issue is memory caching. I need to toot the Berkeley horn here and mention &lt;a href=&quot;http://www.spark-project.org/&quot;&gt;Spark&lt;/a&gt; and &lt;a href=&quot;https://www.usenix.org/system/files/conference/nsdi12/pacman.pdf&quot;&gt;PACMan&lt;/a&gt;, which both address this problem. We really need the equivalent of HDFS for memory, since it can have huge performance benefits when there&#39;s iterative computation or hot data. The difficulty here is settling on the right abstractions and&amp;nbsp;mechanisms.&lt;/p&gt;

   </content></entry><entry><title>JVM Performance Tuning (notes)</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2012/twitter_jvm_tuning.html"/><updated>2012-01-18T00:15:00Z</updated><published>2012-01-18T00:15:00Z</published><id>http://www.umbrant.com/blog/2012/twitter_jvm_tuning.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;A presentation by Attila Szegedi titled &lt;a href=&quot;http://www.infoq.com/presentations/JVM-Performance-Tuning-twitter&quot;&gt;&quot;Everything I Ever Learned about JVM Performance Tuning @twitter&quot;&lt;/a&gt; has been floating around for a few months. I&#39;ve restructured much of the content into a set of notes. This covers the basics of memory allocation and garbage collection in Java, the different garbage collectors available in HotSpot and how they can be tuned, and finally some anecdotes from Attila&#39;s experiences at&amp;nbsp;Twitter.&lt;/p&gt;
&lt;p&gt;I&#39;m still fuzzy on some things, so it&#39;s not ground truth. If more experienced people weigh in, I&#39;ll fix things up. The very informative hour-long presentation is still highly&amp;nbsp;recommended.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;The Price of an&amp;nbsp;Object&lt;/h3&gt;
&lt;p&gt;Java, as an object-oriented language, naturally results in the creation of a lot of objects. This is one of the things you give up as opposed to a language like C; even basic data-wrapper objects are much heavier weight than structs. The minimal object costs &lt;strong&gt;24 bytes&lt;/strong&gt;: 16B of object overhead, plus 8B for the pointer to that object. This includes arrays: an empty array also incurs that 24B, plus 4B for the array length, then per-item costs after that (the whole array needs only one&amp;nbsp;pointer).&lt;/p&gt;
&lt;p&gt;Primitive types don&#39;t require the 16B of object overhead, and can thus have a much more compact representation. However, beware &lt;em&gt;autoboxing&lt;/em&gt;: many provided data structures will automatically convert your nice compact &lt;code&gt;int&lt;/code&gt; into a big fat &lt;code&gt;Integer&lt;/code&gt;. Using plain old arrays of primitive types can be the best choice. It&#39;s definitely better than allocating lots of tiny objects that end up being mostly&amp;nbsp;overhead.&lt;/p&gt;
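To see the boxing cost concretely, here is a back-of-envelope using the sizes quoted above. These constants are the talk&#39;s rough numbers, not measurements; actual JVM layouts vary:

```python
# Rough cost of storing 1M ints, using the sizes quoted above:
# 16B object header, 8B reference, 4B int, objects padded to 8B
# (so an Integer is 16B header + 4B payload, padded to 24B).
n = 1_000_000
int_array = 24 + 4 * n               # one array object, ints packed inline
integer_array = 24 + 8 * n + 24 * n  # array + n references + n Integer objects

assert int_array < 4.1e6             # ~4MB
assert integer_array / int_array > 7 # roughly 8x the memory
```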
&lt;p&gt;As a side note not mentioned by Attila, also beware the use of the Java &lt;code&gt;String&lt;/code&gt; class, since it can &lt;strong&gt;double&lt;/strong&gt; your in-memory storage costs. Java internally uses UTF-16 for its strings, which is a 2-byte character encoding. Compare this to the more common UTF-8 or ASCII encodings, which are both 1-byte. This glosses over the details (UTF-16 and UTF-8 are variable-length encodings, and can be up to 4 bytes per character), but this doubling holds true for the common&amp;nbsp;case.&lt;/p&gt;
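A quick illustration of the doubling for plain ASCII text, using Python only to show the byte counts (Java&#39;s internal char array behaves like the UTF-16 encoding here):

```python
s = "hello world"                # plain ASCII text, the common case
utf8 = s.encode("utf-8")         # 1 byte per ASCII character
utf16 = s.encode("utf-16-le")    # 2 bytes per character, like Java chars

assert len(utf8) == 11
assert len(utf16) == 22          # exactly double for ASCII content
```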
&lt;p&gt;There are two more twists here. First, Java pads out all objects to the nearest 8-byte boundary, which fattens up objects a bit more. This isn&#39;t all bad though, since it aids the second twist: pointer compression. Beneath 32GB of heap, the JVM will actually only use 4B per pointer instead of 8B. Why is 4B enough for 32GB? Since everything is padded out to 8B, the pointer can just be left shifted by three bits before doing the normal byte addressing. However, this means if you want a heap bigger than 32GB, you need to jump up a lot; Attila says&amp;nbsp;48GB.&lt;/p&gt;
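The arithmetic behind pointer compression can be sketched directly; this is an illustration of the idea, not how HotSpot implements it:

```python
# With 8-byte alignment, the low 3 bits of every object address are
# zero, so the JVM can store address >> 3 in a 32-bit slot and shift
# it back on every use ("compressed oops").
SHIFT = 3  # log2 of the 8-byte alignment

def compress(addr):
    assert addr % 8 == 0, "objects are always 8-byte aligned"
    return addr >> SHIFT

def decompress(oop):
    return oop << SHIFT

# A 32-bit compressed oop can therefore address 2**32 * 8 bytes:
assert (2**32 << SHIFT) == 32 * 2**30   # exactly 32GB
addr = 123_456_789 * 8                  # some aligned address under 32GB
assert decompress(compress(addr)) == addr
```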
&lt;h3&gt;JVM Memory&amp;nbsp;Management&lt;/h3&gt;
&lt;p&gt;The JVM heap is split up into two &lt;em&gt;generations&lt;/em&gt; which are garbage collected at different rates. All objects are allocated in the &lt;em&gt;young generation&lt;/em&gt;; more specifically, they are allocated within &lt;em&gt;Eden&lt;/em&gt; in the young generation. As objects survive GC rounds, they get copied to the two successive &lt;em&gt;survivor&lt;/em&gt; regions within the young generation, before ultimately being &lt;em&gt;tenured&lt;/em&gt; to the &lt;em&gt;old generation&lt;/em&gt;, which gets GC&#39;d less frequently than the young&amp;nbsp;generation.&lt;/p&gt;
&lt;p&gt;This means that Java optimizes for the common case of short-lived trash that can get quickly collected in the young generation. Eden allocation and garbage collection are also really cheap. Allocation is treated kind of like stack allocation; creating a new object usually just requires bumping the pointer that defines the end of Eden. Garbage collection is also simple; live objects get copied out to the first survivor region, and then the Eden pointer gets reset back to the start. Trash doesn&#39;t need to be explicitly zeroed out, so deallocation is &quot;free&quot;. Note that this efficiency is true only for small objects; larger objects (megabytes) are allocated in a different and more expensive way, directly into the old&amp;nbsp;generation.&lt;/p&gt;
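The bump-pointer idea can be shown with a toy allocator. This is purely illustrative of the mechanism, nothing like the actual HotSpot implementation:

```python
class Eden:
    """Toy bump-pointer allocator: allocation is one pointer bump, and
    collection just resets the pointer (survivors are assumed to have
    been copied out elsewhere first)."""

    def __init__(self, size):
        self.size = size
        self.top = 0

    def alloc(self, nbytes):
        nbytes = (nbytes + 7) & ~7        # pad to 8-byte alignment
        if self.top + nbytes > self.size:
            return None                   # Eden full: time for a young GC
        addr, self.top = self.top, self.top + nbytes
        return addr

    def collect(self):
        self.top = 0                      # trash needs no explicit freeing

eden = Eden(1024)
assert eden.alloc(24) == 0
assert eden.alloc(30) == 24   # 30 pads up to 32
assert eden.alloc(8) == 56
eden.collect()
assert eden.alloc(8) == 0     # whole region reusable after collection
```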
&lt;p&gt;Garbage collection happens when Eden fills up. This might sound bad since young GCs are stop-the-world and happen erratically, but young generation GC time is proportional to the number of live young objects, which is usually small compared to the amount of trash. Concurrent GC can happen in the background, avoiding that nasty stop-the-world pause, but only for the old generation, and it&#39;s not perfect. More on this&amp;nbsp;later.&lt;/p&gt;
&lt;h3&gt;Garbage Collector&amp;nbsp;Tuning&lt;/h3&gt;
&lt;p&gt;Attila reiterates multiple times that the more memory you can give the young generation, the better, since allocation and deallocation are so cheap. I think he then backpedaled a bit though, because really big young generations can lead to long pauses while the live objects are copied around. What you want is a young generation big enough to hold active and tenuring objects, and for long-lived objects to quickly tenure and reach the old generation. However, you also don&#39;t want survivors to get forced into the old generation early by memory pressure on the young&amp;nbsp;generation.&lt;/p&gt;
&lt;p&gt;With this in mind, let&#39;s talk about the different garbage collectors available in HotSpot. They can be divided into two categories, throughput and&amp;nbsp;latency:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Throughput: scheduled to run when the JVM runs out of memory. Stop-the-world operation.&lt;ul&gt;
&lt;li&gt;SerialGC: Single-threaded garbage collector. Sun probably wrote this one first, and it&#39;s around for legacy&amp;nbsp;reasons.&lt;/li&gt;
&lt;li&gt;ParallelGC and ParallelOldGC: Multi-threaded garbage collectors. ParallelOldGC is actually better than ParallelGC since it cleans both the young and old generations (rather than just the young&amp;nbsp;generation).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Latency: scheduled to run periodically by the JVM when it has spare cycles. Can still result in stop-the-world if the GC can&#39;t clean fast enough and memory runs out.&lt;ul&gt;
&lt;li&gt;ConcMarkSweepGC (CMS): Concurrent and tries to be &quot;low pause&quot;. CMS has a number of caveats though. It kicks in when allocated memory passes a threshold, meaning you need to overprovision memory by 25-33% to give it a buffer to allocate with while it cleans. It also doesn&#39;t compact memory, so you can get fragmentation that leads to stop-the-world pauses. As stated earlier, it also only cleans the old generation, and uses a throughput collector for the young&amp;nbsp;generation.&lt;/li&gt;
&lt;li&gt;G1GC: undocumented black magic that Attila had no experience with, and explicitly didn&#39;t cover in his presentation. I found &lt;a href=&quot;http://www.oracle.com/technetwork/java/javase/tech/g1-intro-jsp-135488.html&quot;&gt;this link to the Oracle documentation&lt;/a&gt; though; it&#39;s supposed to be a better replacement for&amp;nbsp;CMS.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are a lot of options here, so Attila breaks it down into some simple&amp;nbsp;heuristics.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Look for ways to reduce the application&#39;s memory consumption. Less memory pressure means less garbage&amp;nbsp;collection.&lt;/li&gt;
&lt;li&gt;Try a throughput collector with adaptive sizing turned on, which lets the JVM figure out the best sizes for the different generations. If this works, great! &lt;code&gt;-XX:+PrintHeapAtGC&lt;/code&gt; can be helpful&amp;nbsp;here.&lt;/li&gt;
&lt;li&gt;Next, try ConcMarkSweepGC. &lt;code&gt;-verbose:gc&lt;/code&gt; and &lt;code&gt;-XX:+PrintGCDetails&lt;/code&gt; are useful here. This is a situation where you might adjust the young generation to reduce the pause from young gen GC time, but also make sure that the survivor regions aren&#39;t filling&amp;nbsp;up.&lt;/li&gt;
&lt;/ol&gt;
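For reference, heuristics 2 and 3 translate into command lines along these lines. This is an illustrative sketch for HotSpot of that era; `app.jar` and the size values are placeholders you would tune for your own heap:

```shell
# Step 2: throughput collector with adaptive generation sizing
java -XX:+UseParallelOldGC -XX:+UseAdaptiveSizePolicy \
     -XX:+PrintHeapAtGC -jar app.jar

# Step 3: CMS, with GC logging and an explicit young generation size
java -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails \
     -Xmn512m -XX:SurvivorRatio=8 -jar app.jar
```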
&lt;p&gt;If you&#39;re interested, a quick search turned up two links to official Oracle documentation for the HotSpot 1.4.2 garbage collectors. It&#39;s a bit dated, but I think most of it still&amp;nbsp;applies.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://java.sun.com/docs/hotspot/gc1.4.2/&quot;&gt;Tuning Garbage Collection with the 1.4.2 Java Virtual&amp;nbsp;Machine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://java.sun.com/docs/hotspot/gc1.4.2/faq.html&quot;&gt;FAQ about Garbage Collection in the Hotspot Java Virtual&amp;nbsp;Machine&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Programming&amp;nbsp;Anecdotes&lt;/h3&gt;
&lt;p&gt;These are miscellaneous tips pulled from the talk. Most of them boil down to using less memory, which is a much better solution than trying to tune the GC. There were some out-of-the-box solutions, and some Twitter-specific things&amp;nbsp;too.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Get used to profiling your code, especially third party libraries.&lt;ul&gt;
&lt;li&gt;For instance, Guava sucks up 2KB each time you make a map by default to handle concurrency&amp;nbsp;cases.&lt;/li&gt;
&lt;li&gt;Don&#39;t use Thrift&#39;s RPC classes as your domain objects, deserialize rather than keeping them around. They have extra overhead compared to normal&amp;nbsp;objects.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Normalize your data when possible. If some data is shared between multiple objects, have them all point to the same instance instead of each having their own&amp;nbsp;copy.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Beware of worker thread pools sharing connections to many storage servers; this can result in &lt;code&gt;m * n&lt;/code&gt; cached connection objects if you&#39;re using thread&amp;nbsp;locals.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It&#39;s better to instead use fewer threads and asynchronous&amp;nbsp;I/O.&lt;/li&gt;
&lt;li&gt;Also, don&#39;t be afraid of just creating a new object on demand. It&#39;s cheap in Eden, just a single pointer&amp;nbsp;bump!&lt;/li&gt;
&lt;li&gt;Synchronized objects are another alternative to thread&amp;nbsp;locals.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you&#39;re having trouble packing it all into one JVM, try using&amp;nbsp;multiple!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;Twitter had a service that would have a terrible GC pause every three days. Solution: just bounce the machine after less than three&amp;nbsp;days.&lt;/li&gt;
&lt;li&gt;Don&#39;t write your own memory manager, and stop if you find yourself doing something ugly with byte buffers.&lt;ul&gt;
&lt;li&gt;The one notable exception is if you need it for a very limited and simple case. Cassandra has its own slab allocator with fixed size slabs that are flushed to disk when memory is full. This is so simple that it&#39;s&amp;nbsp;okay.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Oracle told Twitter that they don&#39;t actually get that many complaints about the garbage collector, since people have figured out how to engineer around&amp;nbsp;it.&lt;/li&gt;
&lt;li&gt;You should never have to call the garbage collector yourself in&amp;nbsp;code.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;This is far from a complete discussion on memory management in Java, but it&#39;s got some easy and immediately applicable findings, and I hope these notes help direct further reading. I want to give Bill Pugh&#39;s super detailed &lt;a href=&quot;http://www.cs.umd.edu/~pugh/java/memoryModel/&quot;&gt;&quot;The Java Memory Model&quot;&lt;/a&gt; a read, and there are a few Java performance tuning books on my Amazon wishlist&amp;nbsp;too.&lt;/p&gt;
&lt;p&gt;Again, leave a comment if something is unclear or incorrect, and I&#39;ll do my best to fix it&amp;nbsp;up.&lt;/p&gt;

   </content></entry><entry><title>Year in review: 2011 (personal)</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2012/year_in_review_2011_personal.html"/><updated>2012-01-05T21:44:00Z</updated><published>2012-01-05T21:44:00Z</published><id>http://www.umbrant.com/blog/2012/year_in_review_2011_personal.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I like to take some time every once in a while to think about what I&#39;ve done that I&#39;m proud of, what I&#39;ve learned, and what I want to do. With the beginning of a new year comes the perfect opportunity to reflect on my life in the year&amp;nbsp;past.&lt;/p&gt;
&lt;p&gt;I&#39;ve split it into two separate blog posts, one &lt;a href=&quot;/blog/2012/year_in_review_2011_professional.html&quot;&gt;professional&lt;/a&gt; (meaning research and grad school life) and one &lt;a href=&quot;year_in_review_2011_personal.html&quot;&gt;personal&lt;/a&gt; (meaning hobbies, self-improvement, life goals). This post covers the latter; my personal life in&amp;nbsp;2011.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;p&gt;This post is also split up into different sections. The first is again accomplishments, then I cover time management, my 3 main hobbies these days, and then misc other personal&amp;nbsp;stuff.&lt;/p&gt;
&lt;h3&gt;Accomplishments&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Got a bugfix patch into Hadoop. This happened at one of the Cloudera hackathons, I showed up and fixed something off JIRA, and now people are using it. It&#39;s not quite the Linux kernel patch on my &lt;a href=&quot;/bucket.html&quot;&gt;bucket list&lt;/a&gt;, but it&#39;s a&amp;nbsp;step.&lt;/li&gt;
&lt;li&gt;Bootstrapped myself as a guitarist. Literally just bought one and started playing last January, and it&#39;s worked out&amp;nbsp;okay.&lt;/li&gt;
&lt;li&gt;Became a better public speaker. I credit the two hours of teaching section every week. I&#39;m not ready to cross it off the bucket list yet (might never), but I think I&#39;ve&amp;nbsp;improved.&lt;/li&gt;
&lt;li&gt;Can order Indian food reasonably well. Credit here goes to Zaika Tuesdays last&amp;nbsp;semester.&lt;/li&gt;
&lt;li&gt;I got my RSI under control. I was actually worried in the summer that I&#39;d have to drop out of grad school and find a new profession unrelated to a keyboard, but it turned out okay after physical&amp;nbsp;therapy.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Time&amp;nbsp;management&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Overall, I&#39;m reasonably happy with how much I got done. I&#39;d estimate my productivity these days at somewhere around 70-80%, and there&#39;s always going to be some slop&amp;nbsp;there.&lt;/li&gt;
&lt;li&gt;Balance. I spent last semester either working furiously or burnt out, and that&#39;s not optimal. Being burnt out basically gives you an excuse to not do anything for an indefinite amount of time, and I hate making excuses to&amp;nbsp;myself.&lt;/li&gt;
&lt;li&gt;To add to the previous point, I think having regular, recurring events each week will help me stay&amp;nbsp;sane.&lt;/li&gt;
&lt;li&gt;Intentionally scheduling my meetings early was a great way of forcing myself to work almost a&amp;nbsp;9-to-5.&lt;/li&gt;
&lt;li&gt;Getting enough sleep actually happened last semester, and it was glorious. This also goes back to not making excuses to myself (about underachieving because I&#39;m&amp;nbsp;tired).&lt;/li&gt;
&lt;li&gt;I need to make conscious decisions to do less of some things and more of others, since time is&amp;nbsp;precious.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Hobbies&lt;/h3&gt;
&lt;h4&gt;Cooking&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Cooking has tapered off as an interest, as I&#39;ve achieved some level of optimality in the &quot;effort expended vs. output achieved&quot; metric. I don&#39;t have hours every week to spend cooking up new&amp;nbsp;stuff.&lt;/li&gt;
&lt;li&gt;I still think it&#39;d be good to do something new once a week, or once a month. I&#39;m going to avoid anything elaborate or complicated though, I&#39;m past that&amp;nbsp;phase.&lt;/li&gt;
&lt;li&gt;I do want to keep cooking club going this year though, since it fell flat as soon as I got swamped with teaching and research last semester. It fulfills a social niche that isn&#39;t met by weekend parties, and it&#39;s nice to have some continuity with the same group of&amp;nbsp;people.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Guitar&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Guitar-wise, I need to be more structured about how I practice. Focus on repetition until mastery, not just noodling around for an&amp;nbsp;hour.&lt;/li&gt;
&lt;li&gt;I want to make it a goal to play more with other people, since doing it over break was a lot of&amp;nbsp;fun.&lt;/li&gt;
&lt;li&gt;I&#39;m also going to make it a point to improv 12-bar blues as part of my practice&amp;nbsp;sessions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Bicycling&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Bicycling was something I missed a bunch when I was working furiously, it was a really great way of starting off my&amp;nbsp;Saturdays.&lt;/li&gt;
&lt;li&gt;I want to find some new rides, since doing 3 Bears is getting a little&amp;nbsp;repetitive.&lt;/li&gt;
&lt;li&gt;The cycling list died out a bit when I stopped organizing rides; I&#39;d like to get it to the point where it&#39;s stable without&amp;nbsp;me.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Personal&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;I&#39;m not convinced I want to stick with the PhD thing all the way through yet. Grad school hasn&#39;t exactly been smooth sailing for me, but it seems to be trending upwards. This will be an important year in making a decision, since the master&#39;s is a good drop-out&amp;nbsp;point.&lt;/li&gt;
&lt;li&gt;Balance again. I still do believe in work-life integration, but when you&#39;re working 12-16 hours 6 days a week and eating out for every meal because of it, it&#39;s gone too&amp;nbsp;far.&lt;/li&gt;
&lt;li&gt;I need to write in my journal more, and write more blog posts about things I find&amp;nbsp;interesting.&lt;/li&gt;
&lt;li&gt;Strangely, I think I need to spend more money. When you&#39;re living on a grad student stipend, that little bit you could save has a high marginal utility if spent&amp;nbsp;instead.&lt;/li&gt;
&lt;li&gt;I&#39;m looking forward to TGIF with my cube from now on, the one time we did it in December before everyone left was&amp;nbsp;great.&lt;/li&gt;
&lt;/ul&gt;

   </content></entry><entry><title>Year in review: 2011 (professional)</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2012/year_in_review_2011_professional.html"/><updated>2012-01-05T21:43:00Z</updated><published>2012-01-05T21:43:00Z</published><id>http://www.umbrant.com/blog/2012/year_in_review_2011_professional.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I like to take some time every once in a while to think about what I&#39;ve done that I&#39;m proud of, what I&#39;ve learned, and what I want to do. With the beginning of a new year comes the perfect opportunity to reflect on my life in the year&amp;nbsp;past.&lt;/p&gt;
&lt;p&gt;I&#39;ve split it into two separate blog posts, one &lt;a href=&quot;/blog/2012/year_in_review_2011_professional.html&quot;&gt;professional&lt;/a&gt; (meaning research and grad school life) and one &lt;a href=&quot;year_in_review_2011_personal.html&quot;&gt;personal&lt;/a&gt; (meaning hobbies, self-improvement, life goals). This post covers the former; my life as a grad student in&amp;nbsp;2011.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;p&gt;First, things I&#39;ve done in the past year that I&#39;m proud of. Next, meta-comments related to research. Finally, a section on teaching, since I TA&#39;d CS162 this past semester (my first teaching&amp;nbsp;experience).&lt;/p&gt;
&lt;h3&gt;Accomplishments&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Getting an advisor. This was really a major milestone in my grad career, and only happened as of about 4 months&amp;nbsp;ago.&lt;/li&gt;
&lt;li&gt;Publishing a paper. CrowdDB won best demo at VLDB, and PACMan got into&amp;nbsp;NSDI.&lt;/li&gt;
&lt;li&gt;Picking a good problem and starting a project I can call my own. Prior stuff was all sort of handed to me, so this is another major&amp;nbsp;milestone.&lt;/li&gt;
&lt;li&gt;Survived teaching for the first time. I feel pretty good about my own contributions to CS162; it was a lot of blood, sweat, and tears, but I love the students and all my fellow&amp;nbsp;TAs.&lt;/li&gt;
&lt;li&gt;Worked harder for a prolonged period than I ever have before. This was for Jellyfish, my class project for Ion&#39;s Cloud Computing course. Good to have done it, don&#39;t wish to repeat&amp;nbsp;it.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Research-related&amp;nbsp;nuggets&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Math is really, really useful. I regret not having learned it better in undergrad, since I&#39;m pretty much resigned to taking the stat sequence here at&amp;nbsp;Berkeley.&lt;/li&gt;
&lt;li&gt;Queuing theory and control theory are basically black magic, and I don&#39;t think I&#39;ll ever grok them rigorously. Fortunately, heuristics and approximations oftentimes work, and are the common &quot;git-er-done&quot; systems&amp;nbsp;solution.&lt;/li&gt;
&lt;li&gt;Machine learning is an almost mythical solution to some problems (from my perspective). This falls in with my above two points; often I feel like there&#39;s a preexisting approach to the problem I&#39;m staring at, but even when I know vaguely what it is, it&#39;s inaccessible to mere&amp;nbsp;mortals.&lt;/li&gt;
&lt;li&gt;Start out by doing stupidly simple and unrealistic experiments to make sure you understand what&#39;s going on. We made the mistake of going directly to a very realistic workload, and got overwhelmed by the complexity. Ion very correctly identified some basic experiments that let us test our mental model of the&amp;nbsp;system.&lt;/li&gt;
&lt;li&gt;Time spent on infrastructure is not time wasted. We spent months automating our experiments, setting up clusters, scripting the graph workflow, generalizing functionality, all in the background while trying to make sense of the data. When it finally clicked, all that infrastructure allowed us to run the graphs we wanted in half a&amp;nbsp;day.&lt;/li&gt;
&lt;li&gt;Look for fundamental tradeoffs in your problem space (an Ion nugget). It can really clarify the right design&amp;nbsp;choice.&lt;/li&gt;
&lt;li&gt;Be glad when things are hard, because it means you&#39;ve chosen a good&amp;nbsp;problem.&lt;/li&gt;
&lt;li&gt;Ask questions. Yes, you might look dumb, but if you don&#39;t ask, it&#39;ll only hold you back from actually understanding what&#39;s going&amp;nbsp;on.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Teaching-related&amp;nbsp;nuggets&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;I still like teaching and interacting with students, even after what has been noted as one of the more trying CS classes ever offered at UC&amp;nbsp;Berkeley.&lt;/li&gt;
&lt;li&gt;Properly preparing an hour of material takes at least two hours, and that&#39;s if you actually know as much about the topic as you think you&amp;nbsp;do.&lt;/li&gt;
&lt;li&gt;Encouraging participation during section is paramount. I think rewarding interaction with lollipops worked pretty&amp;nbsp;well.&lt;/li&gt;
&lt;li&gt;Professors are fallible, and have their own weaknesses like anyone. These weaknesses will very rarely affect their performance as a researcher, but can crop up during&amp;nbsp;teaching.&lt;/li&gt;
&lt;li&gt;Teaching can suck up time like nothing else. There&#39;s always pressure from students and instructors to spend more time on the class, and you really have to draw a line&amp;nbsp;somewhere.&lt;/li&gt;
&lt;/ul&gt;

   </content></entry><entry><title>Paper review: Facebook Haystack</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2012/haystack_review.html"/><updated>2012-01-03T15:44:00Z</updated><published>2012-01-03T15:44:00Z</published><id>http://www.umbrant.com/blog/2012/haystack_review.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;This is a review of Facebook&#39;s Haystack storage system, used to store the staggering number of photos that are uploaded to Facebook every day. Facebook Photos started out with an NFS appliance, but was forced to move to a custom solution for reasons of cost, scale, and performance. Haystack is an engineering solution that applies well-known techniques from GFS and log-structured filesystems to their distributed, append-only, key-value blob situation. Metadata management is somewhat novel, as is their CDN&amp;nbsp;integration.&lt;/p&gt;
&lt;p&gt;The paper, &quot;Finding a needle in Haystack: Facebook&#39;s photo storage&quot; by Beaver et al., was published at OSDI&amp;nbsp;&#39;10.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Main&amp;nbsp;ideas&lt;/h3&gt;
&lt;p&gt;Facebook&#39;s design requirements break down as&amp;nbsp;follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Efficient random disk access. Anything that hits Haystack missed in the CDN cache, and there are too many photos to fit it all in memory. Thus, there has to be at least one random disk seek; Haystack makes it just one (most of the&amp;nbsp;time).&lt;/li&gt;
&lt;li&gt;Efficient use of storage space. The normal &quot;one file per image&quot; approach blew up the number of disk accesses required to serve an image, and it also required huge amounts of extraneous metadata for things like permissions and filenames that don&#39;t&amp;nbsp;matter.&lt;/li&gt;
&lt;li&gt;Append-only write semantics. Once a photo is written, it cannot be modified. Application &quot;overwrites&quot; are handled by deleting and adding a new photo with the same key. I assume that this is pretty&amp;nbsp;rare.&lt;/li&gt;
&lt;li&gt;Fault-tolerance. True of any distributed&amp;nbsp;system.&lt;/li&gt;
&lt;li&gt;Scalability and elasticity.&amp;nbsp;Ditto.&lt;/li&gt;
&lt;li&gt;Simplicity. Always good if you can get&amp;nbsp;it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Efficient disk access and use of storage space are handled by essentially keeping metadata in memory and collapsing many images together into a single file that is preallocated on disk, called a &quot;physical volume&quot;. These physical volumes are on the order of a hundred gigabytes, and function basically like segments (if you think back to the memory management part of your OS class); each photo is referenced as an offset and length within a file. As long as a server can keep this offset+length metadata and the inodes of the huge files in memory, it can do almost every photo read in a single random seek (with the corner case of falling on an extent boundary requiring two). There are also checksums and flags and other metadata that are stored in the file&amp;nbsp;on-disk.&lt;/p&gt;
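&lt;p&gt;To make that concrete, here&#39;s a toy Python sketch of the idea (the record layout and all names here are my own invention, not Haystack&#39;s): photos are appended to one big volume file, an in-memory dict maps each key to an (offset, length) pair, and serving a photo is a single seek plus&amp;nbsp;read.&lt;/p&gt;

```python
import os
import struct

class ToyVolume:
    """Toy sketch of a Haystack-style physical volume: one big
    append-only file plus an in-memory key-to-(offset, length) index.
    The record layout is invented purely for illustration."""

    HEADER = struct.Struct("!QI")  # hypothetical header: 64-bit key, 32-bit data length

    def __init__(self, path):
        self.path = path
        self.index = {}  # in memory: photo key maps to (offset, length)
        open(path, "ab").close()  # create the volume file if needed

    def put(self, key, data):
        # Writes are sequential appends, like a log-structured filesystem.
        with open(self.path, "ab") as f:
            start = f.tell()
            f.write(self.HEADER.pack(key, len(data)) + data)
        self.index[key] = (start + self.HEADER.size, len(data))

    def get(self, key):
        # With the index in memory, a photo read is one seek and one read.
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def rebuild_index(self):
        # Recovery: scan the log sequentially and replay the headers
        # (Haystack speeds this up with a checkpoint plus the file tail).
        self.index = {}
        with open(self.path, "rb") as f:
            while True:
                header = f.read(self.HEADER.size)
                if len(header) != self.HEADER.size:
                    break
                key, length = self.HEADER.unpack(header)
                self.index[key] = (f.tell(), length)
                f.seek(length, os.SEEK_CUR)
```

&lt;p&gt;The real on-disk records also carry flags (like &quot;deleted&quot;) and checksums, per the paper, but the seek-count argument is the&amp;nbsp;same.&lt;/p&gt;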
&lt;p&gt;Recovery of the memory metadata is done through the use of a checkpoint file, that is then updated by examining the end of each file. This works because new photos are added sequentially, like in a log-structured filesystem. Recovering the &quot;is deleted&quot; flag is done lazily when a read request is made, by checking the flag&#39;s state on-disk (which will be right, due to all writes being synchronous). Synchronous writes and append-only semantics make consistency a&amp;nbsp;non-issue.&lt;/p&gt;
&lt;p&gt;Haystack combines multiple physical volumes into a single &quot;logical volume&quot;, which is how photos are actually accessed. Mapping from logical to physical is again done via an in-memory structure. I think this is how they do geographic replication, by mirroring writes across all the physical volumes in a logical volume, which has to contain volumes in multiple&amp;nbsp;locations.&lt;/p&gt;
&lt;p&gt;Photo writes are optimized pretty heavily. Naturally, they are batched, but there is also some interplay with the Haystack cache (which acts like an internal CDN). Machines are marked as either write-enabled or read-only. Write-enabled machines get to keep their data in the Haystack cache, to reduce read load that would otherwise move the disk head excessively. The cache also does prefetching of new photos, since they&#39;re very likely to be accessed soon. Read-only machines with a lot of deleted photos can be compacted and (I assume) toggled back to write-enabled to receive more&amp;nbsp;photos.&lt;/p&gt;
&lt;p&gt;A few more misc&amp;nbsp;takeaways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;XFS worked the best for making the large 100GB files used for physical&amp;nbsp;volumes&lt;/li&gt;
&lt;li&gt;RAID is also used underneath, for added&amp;nbsp;availability&lt;/li&gt;
&lt;li&gt;Many layers of cache yield diminishing&amp;nbsp;returns&lt;/li&gt;
&lt;li&gt;The photo URL on FB starts off pointing at a CDN, but gets stripped down successively as it goes further into Haystack. Ex:&amp;nbsp;http://CDN/HayCache/HayMachine/VolumeAndPhoto.&lt;/li&gt;
&lt;li&gt;Centralized master maintains the logical-to-physical mapping and load balances writes across logical volumes. Seems easy to keep in memory and replicate if&amp;nbsp;necessary.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Analysis&lt;/h3&gt;
&lt;p&gt;This really was a great application to build a new system, since existing stuff wouldn&#39;t do it as well, and the requirements made it relatively easy as distributed systems go. It pulls heavily from GFS with the centralized master and many data node approach, and also uses LFS concepts of a sequentially-written log and keeping filesystem metadata in memory. The &quot;write-enabled&quot; vs. &quot;read-only&quot; business is essentially adding journalling to a distributed filesystem, which of course is just a mini-version of the ideas in LFS. Using giant 100GB files means they were able to make their own super-simple user-level filesystem without writing an actual filesystem, a move I applaud for practical&amp;nbsp;reasons.&lt;/p&gt;
&lt;p&gt;I can&#39;t say there was much &quot;aha&quot; content in the paper though. I normally like industrial papers because of the real-world experiences, but this was a straightforward implementation paper. Their system bears many resemblances to GFS, and being tailored for a single usecase allowed them to greatly simplify the design. I&#39;m a bit disappointed, in that this is a clearly impressive system, but didn&#39;t relay any interesting&amp;nbsp;tidbits.&lt;/p&gt;

   </content></entry><entry><title>Apache Hadoop committer</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2013/apache_hadoop_committer.html"/><updated>2013-08-03T12:52:00Z</updated><published>2013-08-03T12:52:00Z</published><id>http://www.umbrant.com/blog/2013/apache_hadoop_committer.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;A quick post celebrating that I recently was made a &lt;a href=&quot;http://hadoop.apache.org/who.html&quot;&gt;committer&lt;/a&gt; on the &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;Apache Hadoop&lt;/a&gt; project. I owe a big thanks to everyone who&#39;s reviewed my patches and helped me along the way (especially my colleagues ATM, Todd, and Colin here at&amp;nbsp;Cloudera).&lt;/p&gt;
&lt;p&gt;My very first patch was &lt;a href=&quot;https://issues.apache.org/jira/browse/HDFS-1952&quot;&gt;HDFS-1952&lt;/a&gt; in May 2011, via a Hadoop hackathon hosted at Cloudera. It was the most promising &lt;a href=&quot;https://issues.apache.org/jira/issues/?jql=labels%20%3D%20newbie%20AND%20project%20%3D%22Hadoop%20HDFS%22&quot;&gt;newbie HDFS JIRA&lt;/a&gt; on the list, and I still remember all the basic issues I had checking out the repo, setting up Eclipse, using Ant, and generating the diff. Two years later, these things have gotten easier&amp;nbsp;:)&lt;/p&gt;
&lt;p&gt;Here&#39;s to many more contributions in the&amp;nbsp;future!&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;


   </content></entry><entry><title>Grad school four months out</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2013/grad_school_4_months_out.html"/><updated>2013-05-12T18:02:00Z</updated><published>2013-05-12T18:02:00Z</published><id>http://www.umbrant.com/blog/2013/grad_school_4_months_out.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;Here&#39;s my account of leaving the PhD program at Berkeley to work at Cloudera. My experience might not be representative or generalize beyond my own situation, but I&#39;m writing this because a number of people have asked me about the differences between grad school and industry. Choosing to leave Berkeley was a very personal decision, but fortunately I&#39;m happy with how it&#39;s turned&amp;nbsp;out.&lt;/p&gt;
&lt;p&gt;This also serves as my &quot;Year in review: 2012&quot; post, since this was the major change in my life last&amp;nbsp;year.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Background&lt;/h3&gt;
&lt;p&gt;Last summer, I interned with Cloudera and had a very positive experience. I clicked with the team, enjoyed working on open source and the product, and felt that I could turn some of my research into shipping production code. So, at the end of my internship, I asked for an offer and got it. Before making a decision, I talked with my advisor and my grad student friends who had previously worked in industry, but I was fairly sure I was leaving. Berkeley has a very lenient policy when it comes to this, in that all you need is your advisor&#39;s signature to leave and later your advisor&#39;s signature to return. Theoretically, my progress toward the PhD remains good forever, so I always have the option of going back (assuming Ion or another professor will take me&amp;nbsp;:).&lt;/p&gt;
&lt;p&gt;Cloudera graciously let me defer by a semester while I finished my masters, which I really appreciate, and then gave me another few weeks off in January so I could relax and go on long bike rides before starting work proper. Logistically, both Berkeley and Cloudera made the transition very easy (but no such favors from the ridiculous SF housing market,&amp;nbsp;ouch!).&lt;/p&gt;
&lt;h3&gt;Day-to-day&lt;/h3&gt;
&lt;p&gt;Day-to-day work is quite&amp;nbsp;different.&lt;/p&gt;
&lt;p&gt;In grad school, I worked fewer median hours per day (probably 7) but with high variance and more than 5 days per week. Class or paper deadlines can mean consecutive 16 hour days with no weekends, with my record being three weeks of that in a row. Not something I enjoyed, but I&#39;m glad it&#39;s something I did. On the flip side, the entire lab would empty out for a few days after a grueling deadline. I&#39;d also sometimes skip out early on a Friday if the weather was particularly nice. Taking a longer vacation consisted of telling your advisor and booking your flights. As long as it avoided any major deadlines, it was easy to take off weeks at a&amp;nbsp;time.&lt;/p&gt;
&lt;p&gt;A typical day in grad school would be getting into the lab around 10:30 or 11 (normal), talking about big research ideas with my cubicle mates for a while, and then going to some lunch seminar (either a student presentation or an external researcher). These talks were usually high quality and engaging; academics are trained to give talks. After that, I&#39;d do some work in the afternoon (coding, reading papers, homework, more chatting about research ideas), then probably another meeting in the evening (research group, advisor, project group). Group meetings were often scheduled late since things were so hectic during the day. It wasn&#39;t uncommon to get takeout to bring back to the lab for a meeting. The first two years of grad school, your time is sliced quite finely because of classes and TA responsibilities in addition to research. It was hard to find solid blocks of time every day to do work. This gets much better in year 3 and&amp;nbsp;onward.&lt;/p&gt;
&lt;p&gt;In industry, I consistently work 5 days a week, 8-9 hours per day. Initially I was working about 8 hours per week more than this, but I was getting burned out and that wasn&#39;t doing anyone any favors. I find I&#39;m more productive on a regular schedule. Software progresses during moments of clarity, and for me, the way to maximize those moments is to stay rested and happy. One thing I do miss though is the ability to take random days off. Vacation days are now a scarce commodity and have to be planned months in advance. My commute is also about 20 minutes longer each way, which adds&amp;nbsp;up.&lt;/p&gt;
&lt;p&gt;One major Cloudera-specific bonus is that engineers get to work from home on Wednesdays. This is just enough flexibility to let me go to the dentist or get my hair cut when it&#39;s not too busy, and it&#39;s nice having a break in the middle of the week where you can code&amp;nbsp;interruption-free.&lt;/p&gt;
&lt;p&gt;A typical day for me at Cloudera is getting in around 10 or 10:30 (normal) and clearing out my email inbox. There&#39;s a lot more email that has to be sifted: bug reports, user mailing lists, JIRA traffic, on-going discussion threads. That takes about 0.5-1 hour, so I can get some coding in before lunch. Lunch is a quick affair, about 30 mins, thanks to catering delivered right to the office. There&#39;s a high probability of at least one meeting per day, otherwise it&#39;s free development time (coding, testing, email, JIRA). I normally take a break around 3 to avail myself of the free snacks and drinks. I like to leave around 6 or 7, but I&#39;ll stay longer if I have a nice flow going. After I go home though, I&#39;m usually tired and just cook dinner and&amp;nbsp;relax.&lt;/p&gt;
&lt;h3&gt;Interpersonal&lt;/h3&gt;
&lt;p&gt;Socially, grad school is like a more grown-up continuation of undergrad. Most people are in their mid-twenties, and everyone shares the qualities of being smart, motivated, and hard-working. But, it&#39;s hardly a monoculture in terms of background. People come from all over geographically, and many have worked before coming back to school. Computer science is also a very broad field; I know barely anything about graphics, AI, and theory, so it&#39;s always interesting to hear from people in those&amp;nbsp;areas.&lt;/p&gt;
&lt;p&gt;I made a lot of friends in grad school. You end up working closely with a lot of different people through group projects, and there&#39;s a lot of bonding from shared suffering around deadlines. Collaboration is heavily encouraged at Berkeley, so you meet a lot of people through talks, meetings, and random hallway chats. There&#39;s also a new crop of graduate students every year, so more people to&amp;nbsp;meet.&lt;/p&gt;
&lt;p&gt;Thus, grad school provides a nice social structure. You end up interacting with a lot of active, interesting&amp;nbsp;people.&lt;/p&gt;
&lt;p&gt;Cloudera is definitely a contrast. The biggest change is that I&#39;m interacting with far fewer people on a day-to-day basis. Fortunately I really like my coworkers, but I&#39;m interacting face-to-face with pretty much the same ~5 people every day. I&#39;ll note this is far better than other places I&#39;ve worked, where I&#39;d only &lt;em&gt;see&lt;/em&gt; 3 people the entire day, but it&#39;s not the same level of collaboration and&amp;nbsp;fraternization.&lt;/p&gt;
&lt;p&gt;I think this is just part of transitioning to working life. For a company our size, I don&#39;t think you can do better. Cloudera has put substantial effort into instilling a flat, open culture from the top down. We have a cubicle-free workspace and a single lunch room where everyone (CEOs, VPs, sales, engineering) eats. Our CEO also makes a real effort to make us all feel onboard through all-hands meetings and weekly company-wide updates, and he speaks with every new employee&amp;nbsp;one-on-one.&lt;/p&gt;
&lt;p&gt;Ultimately, this comes down to the nature of the work and differences in priorities. Software development is a more well-defined and thus less collaborative process than research. I spend comparatively more time mangling code and less time hashing out ideas on a whiteboard. Professional developers also tend to go home after 8 hours since they have responsibilities at home, while grad students generally don&#39;t and are more willing to hang&amp;nbsp;out.&lt;/p&gt;
&lt;h3&gt;Problem scope and&amp;nbsp;purpose&lt;/h3&gt;
&lt;p&gt;This gets down to why I decided to leave grad school in the first place: to be one hop closer to real-world problems, and to get perspective on what is actually important. I went to grad school directly from undergrad, so I didn&#39;t have any prior industry experience besides internships. So, to find cool research problems to work on, I&#39;d talk to people from Microsoft, or Google, or Facebook, hear what they thought was important, then try and turn that into a research paper. The main issue was that I never crisply understood all the use-cases and requirements for the systems I was building. I felt weird making assertions about how people wanted to use something, when I had never operated or used it in the real&amp;nbsp;world!&lt;/p&gt;
&lt;p&gt;This is why I chose to join Cloudera, and why Cloudera is basically the best place in the world for me to work. We&#39;re growing fast and very customer-driven when deciding what to work on. Customer needs directly guide new features and are distilled into design-time requirements. I&#39;m also forced to think more generally about the full breadth of use-cases because of the variety of our downstream users. Yes, this can be frustrating and can pollute the purity of your design, but I can be assured that I&#39;m solving a real, pressing need. I&#39;m fortunate in that I get to work on many of the same problems I was interested in at Berkeley. The separation between industry and academia isn&#39;t that large in&amp;nbsp;systems.&lt;/p&gt;
&lt;p&gt;Of course, it&#39;s not all unicorns and rainbows. While new feature development is fun and great, I spend probably 70% of my time working on bug fixes, backports, customer support tickets, and other miscellany that does not exist in grad school. There are more unplanned interruptions and my todo list is constantly backlogged. It requires strong focus and triaging skills to prioritize what&#39;s important and get things&amp;nbsp;done.&lt;/p&gt;
&lt;p&gt;I&#39;ll add that it&#39;s been a life ambition of mine to work on an open-source software platform. Originally, I thought I&#39;d go into operating systems (i.e. Linux kernel development), but I really believe it when I say Hadoop is an operating system for datacenters. That&#39;s just the coolest thing to me. We&#39;re making previously unthinkable amounts of computational resources accessible to the average programmer, solving problems that were once economically&amp;nbsp;infeasible.&lt;/p&gt;
&lt;h3&gt;Fin&lt;/h3&gt;
&lt;p&gt;There is no simple takeaway where I state that one is better than the other. I like working at Cloudera, but there&#39;s more for me to learn about research&amp;nbsp;too.&lt;/p&gt;
&lt;p&gt;The purpose of writing this post was actually two-fold. Yes, to serve as a reference for others making a similar choice, but also as a way for me to introspect and assess the outcome of my decision. External to this discussion of industry vs. academia, I&#39;m still learning how to maximize my own happiness and productivity, which happens regardless of where I&amp;nbsp;am.&lt;/p&gt;
&lt;p&gt;So, read and draw your own conclusions. I&#39;d be happy if even a single person read this and found it useful in their decision-making. I&#39;m also more than willing to talk one-on-one if you want advice; just leave a comment or shoot me an&amp;nbsp;email.&lt;/p&gt;

   </content></entry><entry><title>Hadoop 101 slides</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2013/hadoop-101-overview.html"/><updated>2013-05-02T20:25:00Z</updated><published>2013-05-02T20:25:00Z</published><id>http://www.umbrant.com/blog/2013/hadoop-101-overview.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I gave a guest lecture on the Hadoop stack last week at Tapan Parikh&#39;s &lt;a href=&quot;http://blogs.ischool.berkeley.edu/i206s13/&quot;&gt;INFO 206: Distributed Computing Applications and Infrastructure&lt;/a&gt; course at Berkeley. I took a more academic approach than most, talking about the original motivating problem of Google search before moving into a deep dive of HDFS and MapReduce and an overview of the rest of the Hadoop&amp;nbsp;ecosystem.&lt;/p&gt;
&lt;p&gt;A couple students came up afterwards to say they enjoyed the talk, so I think it was&amp;nbsp;well-received.&lt;/p&gt;
&lt;p&gt;Slides: &lt;a href=&quot;/presentations/hadoop-overview-awang-2013.pptx&quot;&gt;PPTX with animations&lt;/a&gt; and &lt;a href=&quot;/presentations/hadoop-overview-awang-2013.pdf&quot;&gt;PDF&lt;/a&gt;.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;


   </content></entry><entry><title>Highly-available audio in HDFS</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2013/hdfs-player.html"/><updated>2013-04-01T18:00:00Z</updated><published>2013-04-01T18:00:00Z</published><id>http://www.umbrant.com/blog/2013/hdfs-player.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;Here on the HDFS team at &lt;a href=&quot;http://www.cloudera.com/&quot;&gt;Cloudera&lt;/a&gt;, we believe in &lt;a href=&quot;http://en.wikipedia.org/wiki/Eating_your_own_dog_food&quot;&gt;eating our own dogfood&lt;/a&gt;. Since we value our (substantial) MP3 collections quite dearly, it&#39;s only natural to store them in a high performance, highly-available, enterprise-quality distributed filesystem like HDFS. Today, I&#39;m announcing the next generation in aural HDFS enjoyment: listening to music directly from the Namenode web&amp;nbsp;UI.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;p&gt;It&#39;s easy to use existing music players with HDFS through &lt;a href=&quot;http://wiki.apache.org/hadoop/MountableHDFS&quot;&gt;FUSE-DFS&lt;/a&gt; or Cloudera&#39;s &lt;a href=&quot;https://github.com/cloudera/hdfs-nfs-proxy&quot;&gt;NFS proxy&lt;/a&gt;, but what if you want a zero-configuration way of listening to your tunes &lt;em&gt;now&lt;/em&gt;? This is the goal of my new pet project, &lt;code&gt;hdfs-player.js&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://gist.github.com/umbrant/5289176&quot;&gt;Source&amp;nbsp;code&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://ted.mielczarek.org/code/mozilla/bookmarklet.html&quot;&gt;Bookmarklet&amp;nbsp;maker&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Copy paste the source code into the bookmarklet maker, drag the bookmarklet into your toolbar, browse to the directory containing your music in the web UI, and prepare for excitement. Transform this boring old web&amp;nbsp;UI:&lt;/p&gt;
&lt;p&gt;&lt;img width=&quot;559px&quot; alt=&quot;Before&quot; src=&quot;hdfs-player/nn_before.png&quot;/&gt;&lt;/p&gt;
&lt;p&gt;Into this spectacular&amp;nbsp;sight:&lt;/p&gt;
&lt;p&gt;&lt;img width=&quot;559px&quot; alt=&quot;After&quot; src=&quot;hdfs-player/nn_after.png&quot;/&gt;&lt;/p&gt;
&lt;p&gt;Use Chrome for MP3 playback; Firefox only supports Ogg and&amp;nbsp;WAV.&lt;/p&gt;
&lt;p&gt;Happy listening,&amp;nbsp;everyone!&lt;/p&gt;

   </content></entry><entry><title>Paper review: DRAM errors in the wild</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2013/dram_errors_in_the_wild.html"/><updated>2013-02-05T22:00:00Z</updated><published>2013-02-05T22:00:00Z</published><id>http://www.umbrant.com/blog/2013/dram_errors_in_the_wild.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;Today, I&#39;m looking at an excellent study by Schroeder et al., &lt;a href=&quot;http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf&quot;&gt;&quot;DRAM errors in the wild: A Large-Scale Field Study&quot;&lt;/a&gt;. This is the definitive paper on the subject, covering two years, thousands of machines, and millions of DIMM hours. Memory errors are particularly important in the context of growing cluster sizes; one-in-a-million errors become common at&amp;nbsp;scale.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;ECC and memory&amp;nbsp;errors&lt;/h3&gt;
&lt;p&gt;Schroeder and her Google co-authors lead with an overview of DRAM errors and the countermeasures present in today&#39;s hardware. Almost all server-grade memory is &lt;a href=&quot;http://en.wikipedia.org/wiki/ECC_memory&quot;&gt;ECC&lt;/a&gt;, meaning it can detect double-bit errors and correct single-bit errors (via something like a (7,4) Hamming code). More advanced &lt;a href=&quot;http://en.wikipedia.org/wiki/Chipkill&quot;&gt;&quot;chipkill&quot;&lt;/a&gt; schemes can also correct some multi-bit errors. These errors are detected on read; uncorrectable errors (UEs) normally result in a system reboot, while correctable errors (CEs) are fixed up on-the-go. Some systems also have a &lt;em&gt;hardware scrubber&lt;/em&gt;, which periodically checks and rewrites errors (at the rate of 1GB every 45 minutes). This is important since it can prevent correctable errors from accumulating and becoming&amp;nbsp;uncorrectable.&lt;/p&gt;
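&lt;p&gt;As a toy illustration of how single-bit correction works (my own sketch, not from the paper), here&#39;s a (7,4) Hamming code in Python: three parity bits each cover an overlapping subset of the four data bits, and recomputing the parities on read yields a &quot;syndrome&quot; that is exactly the 1-based position of a single flipped&amp;nbsp;bit:&lt;/p&gt;

```python
def hamming74_encode(d):
    """Encode 4 data bits (a list of 0/1) into a 7-bit codeword,
    with parity bits at positions 1, 2, and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Recompute the parities; the syndrome is the 1-based position of
    a single flipped bit (0 means no error). Returns a fixed codeword."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the single-bit error in place
    return c
```

&lt;p&gt;Real SECDED ECC adds an overall parity bit on top of this, so that double-bit errors are detected rather than&amp;nbsp;miscorrected.&lt;/p&gt;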
&lt;p&gt;Errors are also divided into hard and soft errors. &lt;em&gt;Soft errors&lt;/em&gt; are the famed cosmic ray flipping a bit; a random, one-off fault caused by the environment. These are the errors that countermeasures like checksums and hardware scrubbers are designed for. &lt;em&gt;Hard errors&lt;/em&gt; are structural, and more difficult to deal with. In practice, these emerge as &quot;stuck bits&quot; which can&#39;t be rewritten and fixed, and are caused by things like hardware faults or buggy&amp;nbsp;firmware.&lt;/p&gt;
&lt;h3&gt;Failure&amp;nbsp;rates&lt;/h3&gt;
&lt;p&gt;The biggest takeaway from their study is that correctable errors are far more common than previously thought. One third of machines experienced a correctable error in a year. Breaking it down, there&#39;s a 1.29% incidence per DIMM-year of a CE. Another important note is that errors are strongly correlated by node; 20% of machines cause 90% of errors. These errors are also strongly correlated by age; a DIMM that experiences 1 error in a month typically experiences 10-100 errors the following month. I think this aligns with the aging hypothesis: the authors found that hard errors were much more likely than soft&amp;nbsp;errors.&lt;/p&gt;
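&lt;p&gt;A quick back-of-envelope in Python shows why these per-DIMM rates matter at scale (the fleet size is my own example, not a number from the&amp;nbsp;paper):&lt;/p&gt;

```python
# Rates reported in the paper: 1.29% of DIMMs see a correctable error
# (CE) per year, and 0.22% see an uncorrectable error (UE) per year.
CE_PER_DIMM_YEAR = 0.0129
UE_PER_DIMM_YEAR = 0.0022

def p_at_least_one(p_per_dimm, n_dimms):
    # Probability that at least one DIMM in the fleet is affected in a
    # year. This treats DIMMs as independent, which the paper shows they
    # are not (errors cluster on bad nodes), so it's only a rough estimate.
    return 1.0 - (1.0 - p_per_dimm) ** n_dimms

fleet = 10000  # hypothetical fleet: e.g. 1,250 servers with 8 DIMMs each
print(round(CE_PER_DIMM_YEAR * fleet))  # about 129 CE-affected DIMMs per year
print(round(UE_PER_DIMM_YEAR * fleet))  # about 22 UE-affected DIMMs per year
print(p_at_least_one(UE_PER_DIMM_YEAR, fleet))  # a UE somewhere is near-certain
```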
&lt;p&gt;One surprising data point was that they did not find temperature and error rates to be correlated, once normalized for utilization. Error rates were correlated with utilization though, and utilization is correlated with temperature. I hypothesize that more heavily utilized machines are using more memory and doing more reads, and thus are more prone to errors (else, the random bit flip might just happen in an unused portion of memory and never be&amp;nbsp;encountered).&lt;/p&gt;
&lt;p&gt;The uncorrectable error rates plateau over time because Google aggressively decommissions machines that experience UEs. All said, their trace shows a 0.22% incidence per DIMM-year of a UE. Based on this, I don&#39;t think there&#39;s much danger of undetected errors (should be extremely&amp;nbsp;rare).&lt;/p&gt;
&lt;p&gt;In terms of technology, they found no differences between DDR1, DDR2, and FBDIMM. Furthermore, they did not find significant differences between manufacturers. Chipkill is seen as an extremely beneficial technology though, since hardware platforms with chipkill experienced far fewer uncorrectable&amp;nbsp;errors.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Memory errors are scary, and they happen with some frequency. However, the situation seems to be manageable. The study is a strong advocate for ECC memory, chipkill, and hardware scrubbers. Getting proper monitoring to quickly detect and decommission machines with uncorrectable errors is also important. Overall, this paper was a very readable and thorough analysis of memory errors, and will probably remain the gold standard on the topic for some&amp;nbsp;time.&lt;/p&gt;

   </content></entry><entry><title>Bucket list: Cycling a Century</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2013/bucket_list_century.html"/><updated>2013-01-18T19:28:00Z</updated><published>2013-01-18T19:28:00Z</published><id>http://www.umbrant.com/blog/2013/bucket_list_century.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I&#39;ve been taking a little time off in between transitioning from grad life at Berkeley to working full-time at Cloudera. I decided to use some of this vacation time to check off a bucket list item: bicycling an imperial century (100 miles). Here&#39;s my experience, and advice for anyone who wants to do the&amp;nbsp;same.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;p&gt;The route I did is called &lt;a href=&quot;http://inl.org/cycling/rides/the-cheese-factory-century/&quot;&gt;The Cheese Factory Century&lt;/a&gt;. It&#39;s basically a tour of the North Bay; across the Golden Gate Bridge, out to Tiburon, up to San Rafael, Novato, then looping back through Lucas Valley and back across the Bridge. Strava only recorded 4.4k ft of climbing, and there weren&#39;t any major climbs. I particularly enjoyed Lucas Valley; cows, rolling green hills, picturesque lakes, and not much traffic. Unfortunately I didn&#39;t have time to take any pictures, but it&#39;s pretty enough to warrant another trip. Total rolling time was 7:16, but including breaks about 9 hours. I was impressed with the route markers and availability of bike lanes; NorCal really gets this&amp;nbsp;right.&lt;/p&gt;
&lt;p&gt;This is by far the longest ride I&#39;ve ever done, which hopefully is encouraging to others. My previous distance PR was 60 miles, and adding another 40 miles didn&#39;t really change things that much physically. Last week I did a 50mi test ride of part of the route and still felt pretty good afterwards, so I decided to tackle the big&amp;nbsp;kahuna.&lt;/p&gt;
&lt;p&gt;The real challenge is in your head. I rode about 50 miles before lunch without major difficulties, but around mile 60 the self-pity really started kicking in. Miles 70-85 were probably the hardest. At that point, everything hurts, and there&#39;s not much you can do about it. Padded bike shorts only help so much, your arms ache from having held your body up for hours on end, one knee develops a twinge, and your quads feel like they&#39;re on the verge of cramping. I spent those miles ineffectually readjusting body position and grip and trying to think happy&amp;nbsp;thoughts.&lt;/p&gt;
&lt;p&gt;Around mile 85 I was out of Lucas Valley and starting to recognize the territory (important morale boost) but those were still slow, slow miles. It was terrible to see tenths coming off my average speed, even after having banked 6 hours of effort. I stopped caring about that though around mile 95 when I realized I was going to make it, and seeing my GPS tick over to 100 right after crossing the GGB was a magical&amp;nbsp;moment.&lt;/p&gt;
&lt;h3&gt;Advice&lt;/h3&gt;
&lt;p&gt;If you want to try this yourself, the standard long-ride advice applies, just more&amp;nbsp;so.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Check your bike and check yourself. There were times when I was 15 miles from civilization, and it would have been really bad to have a breakdown or spill. My test ride let me assess my physical condition, and also helped me learn the route (really&amp;nbsp;important).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Wear layers. If you&#39;re traversing 100 miles, you&#39;re going to both bake in the sun and freeze in the shade. Make sure you can adjust without&amp;nbsp;stopping.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Bring lots of provisions. I ate a Clif bar, a pack of Shot Bloks, and a pack of Honey Stingers while riding. I also had 2x24oz bottles filled with watered-down Gatorade, which I refilled periodically. Still, both bottles were dry when I got home, so the extra capacity was important. Having bonked hard on a ride before, I made sure to eat and drink preemptively and often. I also had a brownie and coffee in Tiburon, and had a massive meatball sub and Gatorade for lunch in&amp;nbsp;Novato.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Start early. I started just before 9AM, but better would have been 8 or even 7:30. I realized before lunch that I was going to run out of daylight. This was definitely a motivator for miles 85+ (chasing the last rays of the sun), and I managed to catch sunset right as I crossed the GGB around&amp;nbsp;5:30PM.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Stay in the drops. I observed maybe a ~2mph boost from being in the drops, which is&amp;nbsp;huge.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Keep pedaling. It&#39;s the only way to eat up the miles. A corollary is to avoid stopping. Personally, there was also a huge psychological benefit to having a simple mantra like this to focus on when the pain set in, e.g. miles&amp;nbsp;70+.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;It&#39;s done, and I&#39;m really glad I set out to do this. It was totally different from my other athletic experiences. Crew races only last 5 or 6 minutes, and while there is a big psychological component, it&#39;s a totally different kind of hurt. My other long rides were also with groups, while this one I did solo and was much longer. Spending 9 hours in the saddle forces you to confront your physical and mental limitations again and again. This is one of the reasons I like cycling though: when it comes down to it, there&#39;s just you, your bike, and the road in front of you. No one else is going to pedal your bike for you, and the sooner that sinks in, the sooner you get up the&amp;nbsp;hill.&lt;/p&gt;

   </content></entry><entry><title>Bucket list: Catch a fish and eat it</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2014/bucket_list_fish.html"/><updated>2014-12-30T17:03:00Z</updated><published>2014-12-30T17:03:00Z</published><id>http://www.umbrant.com/blog/2014/bucket_list_fish.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I checked off one of my &lt;a href=&quot;/bucket.html&quot;&gt;bucket list&lt;/a&gt; items yesterday: catching a fish, cleaning it, and eating&amp;nbsp;it.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;p&gt;This was the last day of a family vacation in Port St. Lucie in Florida.
My original plans to go deep sea fishing fell through, so I went to the surprisingly well-stocked local Walmart to pick up some freshwater gear.
I was lucky enough to nab a healthy-looking 15&quot; largemouth bass with a silver Mepps spinner from the lake behind our&amp;nbsp;timeshare.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;bucket_list_fish/fish.jpg&quot; width=&quot;559px&quot; alt=&quot;Largemouth bass&quot;/&gt;&lt;/p&gt;
&lt;p&gt;My mom, being Korean, knows a thing or two about preparing a whole fish, so I undertook the cleaning and cooking under her supervision.
We had it for dinner, and suffice to say, it was&amp;nbsp;delicious.&lt;/p&gt;

   </content></entry><entry><title>Paper review: Facebook f4</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2014/f4_facebook_warm_blob_storage.html"/><updated>2014-10-29T17:20:00Z</updated><published>2014-10-29T17:20:00Z</published><id>http://www.umbrant.com/blog/2014/f4_facebook_warm_blob_storage.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;It&#39;s been a while since I did one of these!
I did a previous review of &lt;a href=&quot;/blog/2012/haystack_review.html&quot;&gt;Facebook Haystack&lt;/a&gt;, which was designed as an online &lt;a href=&quot;https://en.wikipedia.org/wiki/Binary_large_object&quot;&gt;blob&lt;/a&gt; storage system.
f4 is a sister system that works in conjunction with Haystack, and is intended for storage of warm rather than hot blobs.
As is usual for Facebook, they came up with a system that is both eminently practical and tailored for their exact use&amp;nbsp;case.&lt;/p&gt;
&lt;p&gt;This paper, &quot;f4: Facebook&#39;s Warm BLOB Storage System&quot; by Muralidhar et al., was published at OSDI&amp;nbsp;&#39;14.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Background&lt;/h3&gt;
&lt;p&gt;Haystack is very good at what it was designed to do: fast random access to write-once blobs. In short, it writes out all these objects log-structured to large 100GB files on the local filesystem, and maintains an in-memory index of blob locations so it can serve up a blob with at most a single disk&amp;nbsp;seek.&lt;/p&gt;
&lt;p&gt;The downside of Haystack is that it&#39;s not very space efficient.
Files are replicated both at the node-level because of &lt;a href=&quot;https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_6&quot;&gt;RAID-6&lt;/a&gt; and also geographically three times, leading to a total replication factor of 3.6x.
f4 improves upon this by using erasure coding, which drops the replication factor to 2.1x.
Considering that Facebook has 65PB of warm blobs, we&#39;re looking at tens of PBs in savings (meaning millions of&amp;nbsp;dollars).&lt;/p&gt;
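To make the savings concrete, here is a back-of-envelope sketch; only the 65PB figure and the 3.6x/2.1x replication factors come from the paper, the helper name is my own:

```python
# Raw footprint at Haystack's 3.6x replication vs f4's 2.1x
# erasure-coded footprint, for 65 PB of logical warm data.
def raw_footprint(logical_pb, replication_factor):
    """Raw storage consumed for a given logical data size."""
    return logical_pb * replication_factor

warm_blobs_pb = 65  # logical warm data at Facebook, per the paper
haystack_raw = raw_footprint(warm_blobs_pb, 3.6)  # 234.0 PB
f4_raw = raw_footprint(warm_blobs_pb, 2.1)        # 136.5 PB
savings = haystack_raw - f4_raw                   # 97.5 PB of raw disk
print(f"Raw savings: {savings:.1f} PB")
```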
&lt;p&gt;However, the downside of erasure coding is worsened request rate and failure recovery.
With erasure coding, there&#39;s only a single data replica that can serve read requests.
Failure recovery is more expensive since it requires reading the other data and parity blocks in the stripe.
Meanwhile, client reads require doing online erasure coding, unless they fail over to another&amp;nbsp;datacenter.&lt;/p&gt;
&lt;p&gt;It&#39;s important to note that f4 was never intended to replace Haystack, but to complement it.
They are both fronted by the same CDN, caching, and routing layers, and likely expose similar if not identical APIs.
Haystack is great for hot data, and f4 is great for warm data; the key is determining where a given blob&amp;nbsp;belongs.&lt;/p&gt;
&lt;h3&gt;Determining&amp;nbsp;hotness&lt;/h3&gt;
&lt;p&gt;Facebook&#39;s blobs tend to be accessed frequently when they&#39;re first uploaded, after which access rates drop off exponentially.
There are a couple different types of blobs, e.g. photos, videos, attachments, and each has different access rates and drop-offs.
They chose to look at a nifty metric over time: 99th percentile IOPS/TB.
Based on synthetic benchmarks, they knew f4&#39;s 4TB drives could handle 80 IOPS with acceptable latency.
This meant migrating a type of blob made sense when its IOPS/TB fell below&amp;nbsp;20.&lt;/p&gt;
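The threshold arithmetic above can be sketched in a few lines; the constants come from the paper, while the function and variable names are mine:

```python
# A 4 TB drive that can sustain 80 random IOPS at acceptable latency
# works out to a migration threshold of 80 / 4 = 20 IOPS/TB.
DRIVE_CAPACITY_TB = 4
DRIVE_MAX_IOPS = 80
THRESHOLD_IOPS_PER_TB = DRIVE_MAX_IOPS / DRIVE_CAPACITY_TB  # 20.0

def should_migrate_to_f4(p99_iops_per_tb):
    """A blob type is warm once its 99th-percentile load fits on f4 drives."""
    return THRESHOLD_IOPS_PER_TB > p99_iops_per_tb

print(should_migrate_to_f4(12.5))  # True: warm enough for f4
print(should_migrate_to_f4(45.0))  # False: still hot, keep in Haystack
```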
&lt;p&gt;The other component of hotness is deletion rate.
This is important for log-structured systems since compaction (rewriting the file) is required to reclaim space from deleted items.
Fortunately, deletions exhibited the same sort of drop-off as request&amp;nbsp;rate.&lt;/p&gt;
&lt;p&gt;Profile photos, it turned out, do not exhibit a strong drop off, and are never moved to f4.
Photos ended up being hot for about 3 months, and everything else was only hot for one&amp;nbsp;month.&lt;/p&gt;
&lt;h3&gt;Storage&amp;nbsp;format&lt;/h3&gt;
&lt;p&gt;Haystack and f4 use the same concept of a &lt;em&gt;volume&lt;/em&gt; of blobs.
The volume&#39;s data file is log-structured and contains a bunch of appended blobs.
The volume&#39;s index file tells you how to find the blobs within the data file, without scanning the entire data file.
Once a volume grows to about 100GB, it&#39;s &lt;em&gt;locked&lt;/em&gt; and the data and index file are immutable.
At this point, the volume is a candidate for migration to&amp;nbsp;f4.&lt;/p&gt;
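Here's a minimal sketch of the volume concept, assuming only what's described above (an append-only data file plus an in-memory index so a read costs at most one seek); the class and method names are my own:

```python
import io

# Toy volume: blobs appended to a log-structured data file, with an
# index mapping blob id to (offset, length) within that file.
class Volume:
    def __init__(self):
        self.data = io.BytesIO()  # stands in for the ~100GB data file
        self.index = {}           # blob_id -> (offset, length)
        self.locked = False       # set once the volume fills up

    def append(self, blob_id, payload):
        assert not self.locked, "locked volumes are immutable"
        offset = self.data.tell()
        self.data.write(payload)
        self.index[blob_id] = (offset, len(payload))

    def read(self, blob_id):
        offset, length = self.index[blob_id]  # in-memory lookup
        self.data.seek(offset)                # the single "disk seek"
        return self.data.read(length)

v = Volume()
v.append("photo1", b"jpeg bytes")
assert v.read("photo1") == b"jpeg bytes"
```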
&lt;p&gt;One interesting note is that f4 is totally immutable, not even supporting deletes.
Through a clever trick though, it does support logical deletes.
Each blob is encrypted with a unique encryption key which is stored in an external database.
By deleting the encryption key, the blob is effectively also deleted even though the storage space is not reclaimed.
The thinking is that the delete rate is low enough that this is desirable to simplify the system.
As it turns out, only 7% of data in f4 is deleted in this manner, which isn&#39;t too bad compared to the savings from erasure coding, and considering the immense amount of data that would likely have to be&amp;nbsp;rewritten.&lt;/p&gt;
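The delete-by-key trick can be sketched as follows. This is my own illustration, not f4's code: I use a toy XOR cipher and in-memory dicts standing in for the real encryption scheme, external key database, and immutable volume:

```python
import os

key_db = {}      # stands in for the external key database
blob_store = {}  # stands in for the immutable f4 volume

def xor(data, key):
    """Toy stream cipher, purely for illustration."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def put(blob_id, payload):
    key = os.urandom(16)
    key_db[blob_id] = key
    blob_store[blob_id] = xor(payload, key)

def get(blob_id):
    return xor(blob_store[blob_id], key_db[blob_id])

def delete(blob_id):
    del key_db[blob_id]  # ciphertext stays on disk, but is now garbage

put("b1", b"hello")
assert get("b1") == b"hello"
delete("b1")
assert "b1" in blob_store   # space is not reclaimed...
assert "b1" not in key_db   # ...but the blob is logically deleted
```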
&lt;h3&gt;Erasure&amp;nbsp;coding&lt;/h3&gt;
&lt;p&gt;f4 uses Reed-Solomon (10,4) encoding, which means 10 data and 4 parity blocks in a stripe, and can thus tolerate losing up to 4 blocks before they lose the entire stripe.
They also use XOR encoding for geographic replication, doing essentially XOR (2,1) encoding across three datacenters.
So, their replication factor is 1.4 for the RS (10,4), 1.5 for the XOR (2,1), for a total of 2.1x.
Before introducing XOR encoding, they were doing simple 2x replication, so some of their data is still encoded at&amp;nbsp;2.8x.&lt;/p&gt;
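The replication factors above fall straight out of the coding parameters; a quick check:

```python
# Effective replication factor of an (n, k)-style code is
# (data + parity) / data blocks.
def code_factor(data_blocks, parity_blocks):
    return (data_blocks + parity_blocks) / data_blocks

local = code_factor(10, 4)  # RS(10,4): 14/10 = 1.4 within a datacenter
geo = code_factor(2, 1)     # XOR(2,1): 3/2 = 1.5 across datacenters
assert round(local * geo, 9) == 2.1  # f4's current footprint
assert round(local * 2, 9) == 2.8    # older data: RS(10,4) plus plain 2x geo
```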
&lt;p&gt;Because single data files are so large, f4 can use a large block size of 1GB, and use the ~100 blocks to form stripes.
There&#39;s no need to stripe at a finer granularity like QFS.
It&#39;s not clear whether they inline the parity blocks or write them to a side file, but I think it&#39;d be nifty to bulk-import the data and index files directly from Haystack and then just add the parity file&amp;nbsp;later.&lt;/p&gt;
&lt;h3&gt;System&amp;nbsp;architecture&lt;/h3&gt;
&lt;p&gt;Architecturally, f4 bears a strong resemblance to HDFS, which isn&#39;t surprising since it&#39;s built on top of it.
Their name node is highly available and maintains the mapping from blocks to the storage nodes that hold them (the data node equivalent).
Storage nodes store and serve the actual data files and index files for volumes.
In a diversion from HDFS though, the name node also distributes this mapping to the storage nodes, and in fact clients issue reads and writes directly to storage nodes.
This is great because it better distributes load, but does not actually save an RPC since the client still needs to go to the storage node that has the desired&amp;nbsp;blob.&lt;/p&gt;
&lt;p&gt;A lot of the additional functionality was also built out as external services that can run on storage-less, CPU-heavy&amp;nbsp;machines.&lt;/p&gt;
&lt;p&gt;When online reconstruction is required to serve a client request, this task is handled by a &lt;em&gt;backoff node&lt;/em&gt; which issues offset reads to the other blocks in the stripe, and reconstructs from the first ones that come back.
This involves only recovering a single blob, not the entire&amp;nbsp;block.&lt;/p&gt;
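To illustrate range-limited reconstruction, here's a sketch using single-parity XOR in place of Reed-Solomon (real f4 uses RS(10,4); XOR just keeps the example short). The key point survives the simplification: only the byte range covering the requested blob is read from each surviving block, not whole 1GB blocks. All names are my own:

```python
from functools import reduce

def xor_bytes(chunks):
    """XOR equal-length byte strings together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def reconstruct_range(surviving_blocks, offset, length):
    """Rebuild [offset, offset+length) of the missing block from the others."""
    ranges = [blk[offset:offset + length] for blk in surviving_blocks]
    return xor_bytes(ranges)

# Stripe of two data blocks plus their XOR parity.
d0, d1 = b"AAAAABBBBB", b"CCCCCDDDDD"
parity = bytes(x ^ y for x, y in zip(d0, d1))
# Suppose d1's node is down and a client wants bytes 5..10 of d1:
assert reconstruct_range([d0, parity], 5, 5) == d1[5:10]
```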
&lt;p&gt;&lt;em&gt;Rebuilder nodes&lt;/em&gt; are the counterpart to backoff nodes, and handle background erasure coding.
They throttle themselves heavily to avoid affecting foreground client requests.
These nodes are actually also responsible for probing for node failures, and report failures to &lt;em&gt;coordinator nodes&lt;/em&gt;, which, as it sounds, coordinate recovery tasks. These coordinators also handle fixing up placement violations, if multiple blocks from a single stripe end up on the same&amp;nbsp;rack.&lt;/p&gt;
&lt;p&gt;f4 basically glues a new set of soft-state coordinators and workers onto HDFS, rather than baking the functionality into the existing NN and DN.
These services likely still require talking to the NN, but this is okay since NNs are not heavily loaded since client load is being handled by storage nodes. This is not true of us, so performance is a real concern, and we typically shy away from the operational complexity of new services since most of our customers are not as sophisticated as Facebook&#39;s ops&amp;nbsp;team.&lt;/p&gt;
&lt;h3&gt;Hardware&amp;nbsp;notes&lt;/h3&gt;
&lt;p&gt;The big overall theme here is separation of storage, compute, and cache.
Storage nodes get to focus only on storage, since the compute-heavy tasks are handled by the new separate services, and the entire blob store is already fronted by multiple levels of caching (so no need for buffer cache).
Storage nodes are thus very CPU light, and only need enough memory to be able to fit the blob index and block-node mapping into memory.
The index size seems to dominate, and they mention potentially storing it on&amp;nbsp;SSDs.&lt;/p&gt;
&lt;p&gt;One other fun note is that they wanted to further downsize their CPUs to save on power, but were unable to do so without sacrificing memory capacity.
Intel probably doesn&#39;t want to cannibalize their high-end server market.
It&#39;s also hard to find support for ECC memory in lower-end&amp;nbsp;processors.&lt;/p&gt;
&lt;p&gt;They run with a single cluster configuration: 30 4TB drives per 2U host, 14 racks of 15 hosts per cell.
Total unformatted capacity per cell is thus 25PB.
14 is the minimum number of racks for RS (10,4) encoding to protect against rack failures. Running with the minimum means that after a single rack failure you can&#39;t get back up to full strength, but since they&#39;re also doing geographic replication this is kind of&amp;nbsp;okay.&lt;/p&gt;
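The 25PB figure follows directly from the configuration:

```python
# Unformatted capacity per f4 cell, from the hardware configuration
# quoted in the paper.
drives_per_host = 30
tb_per_drive = 4
hosts_per_rack = 15
racks_per_cell = 14

cell_tb = drives_per_host * tb_per_drive * hosts_per_rack * racks_per_cell
assert cell_tb == 25_200  # roughly 25 PB per cell
```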
&lt;p&gt;Somewhat obvious, but their network must be fast and plentiful, or else they wouldn&#39;t be doing erasure coding at all.
It&#39;s great to get additional confirmation that this can be done at&amp;nbsp;scale.&lt;/p&gt;
&lt;p&gt;Finally, doing the above trace-driven IOPS/TB analysis also let them do hardware provisioning based on their SLOs.
They provisioned f4 such that the weekly peak load on any drive is less than the maximum IOPS it can deliver.
Those are some pretty strong guarantees on high-percentile&amp;nbsp;latency.&lt;/p&gt;
&lt;h3&gt;Other misc&amp;nbsp;notes&lt;/h3&gt;
&lt;p&gt;Constraints leveraged to simplify the design&amp;nbsp;space:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One&amp;nbsp;workload&lt;/li&gt;
&lt;li&gt;One node, rack, and cluster&amp;nbsp;configuration&lt;/li&gt;
&lt;li&gt;One file size (~100GB), and it&#39;s nice and&amp;nbsp;large&lt;/li&gt;
&lt;li&gt;Only have full 1GB&amp;nbsp;blocks&lt;/li&gt;
&lt;li&gt;No appends or&amp;nbsp;deletions&lt;/li&gt;
&lt;li&gt;Fixed erasure coding&amp;nbsp;scheme&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From the HDFS developer&#39;s&amp;nbsp;perspective:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Having the block-to-node mapping on their storage nodes is interesting, but probably only possible since their metadata is relatively small. 25PB / 1GB = 25 million blocks, which is pretty manageable compared to some clusters we&amp;nbsp;see.&lt;/li&gt;
&lt;li&gt;Since their recovery nodes are external to the NN, they probably have some way of writing a new block directly to a storage node without going through the NN, or a concat-like API that lets them slide a new block into an existing&amp;nbsp;file.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;It&#39;s great to see a paper that opens with a data-driven analysis of their target workload.
They clearly spent a lot of time gathering traces, doing analysis, and running synthetic workloads.
The end result was a system that works well in production, in tandem with their other&amp;nbsp;systems.&lt;/p&gt;
&lt;p&gt;Although f4 is specifically tailored for Facebook&#39;s environment, I find it very heartening that they built it on top of their forked version of HDFS.
We&#39;re currently working on &lt;a href=&quot;https://issues.apache.org/jira/browse/HDFS-7285&quot;&gt;erasure coding for upstream HDFS&lt;/a&gt;, and I&#39;m sure our design will differ substantially from f4, but identifying why we make different design choices will be interesting in and of&amp;nbsp;itself.&lt;/p&gt;

   </content></entry><entry><title>In-memory Caching in HDFS: Lower latency, same great taste</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2014/in_memory_caching_in_hdfs.html"/><updated>2014-05-03T18:27:00Z</updated><published>2014-05-03T18:27:00Z</published><id>http://www.umbrant.com/blog/2014/in_memory_caching_in_hdfs.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;My coworker Colin McCabe and I recently gave a talk at &lt;a href=&quot;http://hadoopsummit.org/amsterdam/&quot;&gt;Hadoop Summit Amsterdam&lt;/a&gt; titled &quot;In-memory Caching in HDFS: Lower latency, same great taste.&quot; I&#39;m very pleased with how this feature turned out, since it was approximately a year-long effort going from initial design to production system. Combined with Impala, we showed up to a 6x performance improvement by running on cached data, and that number will only improve with time. &lt;a href=&quot;/presentations/in-memory-caching-in-hdfs-awang-cmccabe-2014-03-26.pptx&quot;&gt;Slides&lt;/a&gt; and &lt;a href=&quot;https://www.youtube.com/watch?v=t4AyNZMCSTw&amp;amp;list=PLSAiKuajRe2kIxG-WKmTOZNlDpSgYwEBs&amp;amp;index=10&quot;&gt;video of our presentation&lt;/a&gt; are available&amp;nbsp;online.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;


   </content></entry><entry><title>Two engineering principles</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2014/two_engineering_principles.html"/><updated>2014-01-08T13:15:00Z</updated><published>2014-01-08T13:15:00Z</published><id>http://www.umbrant.com/blog/2014/two_engineering_principles.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I received two interesting pieces of advice at the AMP Lab retreat this past week, which concisely state some of my favorite software engineering&amp;nbsp;principles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Don&#39;t be a zealot. Understand in technical detail &lt;em&gt;why&lt;/em&gt; a given language, framework, or design should be preferred, not because of technological fascination or fanboy-ism. The canonical examples here are programming language flamewars, e.g. Java vs.&amp;nbsp;C++.&lt;/li&gt;
&lt;li&gt;Ruthlessly optimize for your requirements. This means first, carefully defining said requirements, but then being completely unafraid to buck conventional wisdom if it&#39;s not a good match. This often means intentionally pruning out features, even common ones implemented by other&amp;nbsp;systems.&lt;/li&gt;
&lt;/ol&gt;
&lt;!-- Hyde::Excerpt::End --&gt;


   </content></entry><entry><title>Mesos, Omega, Borg: A Survey</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2015/mesos_omega_borg_survey.html"/><updated>2015-05-27T13:57:00Z</updated><published>2015-05-27T13:57:00Z</published><id>http://www.umbrant.com/blog/2015/mesos_omega_borg_survey.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;Google recently unveiled one of their crown jewels of system infrastructure: &lt;a href=&quot;http://research.google.com/pubs/pub43438.html&quot;&gt;Borg&lt;/a&gt;, their cluster scheduler.
This prompted me to re-read the &lt;a href=&quot;https://www.cs.berkeley.edu/~alig/papers/mesos.pdf&quot;&gt;Mesos&lt;/a&gt; and &lt;a href=&quot;http://research.google.com/pubs/pub41684.html&quot;&gt;Omega&lt;/a&gt; papers, which deal with the same topic.
I thought it&#39;d be interesting to do a compare and contrast of these systems.
Mesos gets credit for the groundbreaking idea of two-level scheduling, Omega improved upon this with an analogy from databases, and Borg can sort of be seen as the culmination of all these&amp;nbsp;ideas.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Background&lt;/h3&gt;
&lt;p&gt;Cluster schedulers have existed long before big data. There&#39;s a rich literature on scheduling on 1000s of cores in the HPC world, but their problem domain is simpler than what is addressed by &lt;em&gt;datacenter schedulers&lt;/em&gt;, meaning Mesos/Borg and their ilk. Let&#39;s compare and contrast on a few&amp;nbsp;dimensions.&lt;/p&gt;
&lt;h4&gt;Scheduling for&amp;nbsp;locality&lt;/h4&gt;
&lt;p&gt;Supercomputers separate storage and compute and connect them with an approximately full-bisection bandwidth network that goes at close to memory speeds (GB/s). This means your tasks can get placed anywhere on the cluster without worrying much about locality, since all compute nodes can access data equally quickly. There are a few hyper-optimized applications that optimize for the network topology, but these are very&amp;nbsp;rare.&lt;/p&gt;
&lt;p&gt;Data center schedulers &lt;em&gt;do&lt;/em&gt; care about locality, and in fact this is the whole point of GFS and MapReduce co-design. Back in the 2000s, network bandwidth was comparatively much more expensive than disk bandwidth. So, there was a huge economic savings by scheduling your computation tasks on the same node that held the data. This is a major scheduling constraint; whereas before you could put the task anywhere, now it needs to go on one of the three data&amp;nbsp;replicas.&lt;/p&gt;
&lt;h4&gt;Hardware&amp;nbsp;configuration&lt;/h4&gt;
&lt;p&gt;Supercomputers are typically composed of homogeneous nodes, i.e. they all have the same hardware specs. This is because supercomputers are typically purchased in one shot: a lab gets $x million dollars for a new one, and they spend it all upfront. Some HPC applications are optimized for the specific CPU models in a supercomputer. New technology like GPUs or co-processors are rolled out as a new&amp;nbsp;cluster.&lt;/p&gt;
&lt;p&gt;In the big data realm, clusters are primarily storage constrained, so operators are continually adding new racks with updated specs to expand cluster capacity. This means it&#39;s typical for nodes to have different CPUs, memory capacities, number of disks, etc. Also toss in special additions like SSDs, GPUs, shingled drives. A single datacenter might need to support a broad range of applications, and all of this again imposes additional scheduling&amp;nbsp;constraints.&lt;/p&gt;
&lt;h4&gt;Queue management and&amp;nbsp;scheduling&lt;/h4&gt;
&lt;p&gt;When running an application on a supercomputer, you specify how many nodes you want, the queue you want to submit your job to, and how long the job will run for. Queues place different restrictions on how many resources you can request and how long your job can run for. Queues also have a priority or reservation based system to determine ordering. Since the job durations are all known, this is a pretty easy box packing problem. If the queues are long (typically true) and there&#39;s a good mix of small jobs to backfill the space leftover from big jobs (also typical), you can achieve extremely high levels of utilization. I like to visualize this in 2D, with time as X and resource usage as&amp;nbsp;Y.&lt;/p&gt;
&lt;p&gt;As per the previous, datacenter scheduling is a more general problem. The &quot;shape&quot; of resource requests can be quite varied, and there are more dimensions. Jobs also do not have a set duration, so it&#39;s hard to pre-plan queues. Thus we have more sophisticated scheduling algorithms, and the performance of the scheduler thus becomes&amp;nbsp;important.&lt;/p&gt;
&lt;p&gt;Utilization as a general rule is going to be worse (unless you&#39;re Google; more on that later), but one benefit over HPC workloads is that MapReduce and similar can be incrementally scheduled instead of gang scheduled. In HPC, you wait until all N nodes that you requested are available, then run all your tasks at once. MR can instead run its tasks in multiple waves, meaning it can still effectively use bits of leftover resources. A single MR job can also ebb and flow based on cluster demand, which avoids the need for preemption or resource reservations, and also helps with fairness between multiple&amp;nbsp;users.&lt;/p&gt;
&lt;h3&gt;Mesos&lt;/h3&gt;
&lt;p&gt;Mesos predates YARN, and was designed with the problems of the original MapReduce in mind. Back then, Hadoop clusters could run only a single application: MapReduce. This made it difficult to run applications that didn&#39;t conform to a map phase followed by a reduce phase. The biggest example here is Spark. Previously, you&#39;d have to install a whole new set of workers and masters for Spark, which would sit alongside your MapReduce workers and masters. Hardly ideal from a utilization perspective, since they were typically statically&amp;nbsp;partitioned.&lt;/p&gt;
&lt;p&gt;Mesos addresses this problem by providing a generalized scheduler for all cluster applications. MapReduce and Spark became simply different applications using the same underlying resource sharing framework. The simplest approach would be to write a centralized scheduler, but that has a number of&amp;nbsp;drawbacks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;API complexity. We need a single API that is a superset of all known framework scheduler APIs. This is difficult by itself. Expressing resource requests will also become very&amp;nbsp;complicated.&lt;/li&gt;
&lt;li&gt;Performance. 10&#39;s of thousands of nodes and millions of tasks is a lot, especially if the scheduling problem is&amp;nbsp;complex.&lt;/li&gt;
&lt;li&gt;Code agility. New schedulers and new frameworks are constantly being written, with new&amp;nbsp;requirements.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Instead, Mesos introduces the idea of &lt;em&gt;two-level scheduling&lt;/em&gt;. Mesos delegates the per-application scheduling work to the applications themselves, while Mesos still remains responsible for resource distribution between applications and enforcing overall fairness. This means Mesos can be pretty thin, 10K lines of&amp;nbsp;code.&lt;/p&gt;
&lt;p&gt;Two-level scheduling happens through a novel API called &lt;em&gt;resource offers&lt;/em&gt;, where Mesos periodically offers some resources to the application schedulers. This sounds backwards at first (the request goes from the master to the application?), but it&#39;s actually not that strange. In MR1, the TaskTracker workers are the source of truth as to what&#39;s running on a node. When a TT heartbeats in saying that a task has completed, the JobTracker then chooses something else to run on that TaskTracker. Scheduling decisions are triggered by what&#39;s essentially a resource offer from the worker. In Mesos, the resource offer comes from the Mesos master instead of the slave, since Mesos is managing the cluster. Not that&amp;nbsp;different.&lt;/p&gt;
&lt;p&gt;Resource offers act as time-bounded leases for some resources. Mesos offers resources to an application based on policies like priority or fair share. The app then computes how it uses them, and tells Mesos what resources from the offer it wants. This gives the app lots of flexibility, since it can choose to run a portion of tasks now, wait for a bigger allocation later (gang scheduling), or size its tasks differently to fit what&#39;s available. Since offers are time-bounded, it also incentivizes applications to schedule&amp;nbsp;quickly.&lt;/p&gt;
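The offer flow above can be sketched as a toy model of two-level scheduling: the master offers resources, and the framework scheduler claims only what it needs and declines the rest. This is my own illustration, not the Mesos API:

```python
# First level: the master decides which framework gets an offer and
# how big it is (here, trivially, all free cpus).
def master_make_offer(free_cpus):
    return {"cpus": free_cpus}

# Second level: the framework decides what to do with the offer.
class FrameworkScheduler:
    def __init__(self, task_cpus, pending_tasks):
        self.task_cpus = task_cpus
        self.pending = pending_tasks

    def on_offer(self, offer):
        """Launch as many pending tasks as fit; decline the remainder."""
        launchable = min(self.pending, offer["cpus"] // self.task_cpus)
        self.pending -= launchable
        used = launchable * self.task_cpus
        return used, offer["cpus"] - used  # (accepted, declined)

fw = FrameworkScheduler(task_cpus=2, pending_tasks=3)
used, declined = fw.on_offer(master_make_offer(free_cpus=10))
assert (used, declined) == (6, 4)  # ran 3 tasks, gave back 4 cpus
```

A gang-scheduling framework could instead decline the whole offer and wait for a bigger one, which is exactly the flexibility the offer API buys.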
&lt;p&gt;Some concerns and how they were&amp;nbsp;addressed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Long tasks hogging resources. Mesos lets you reserve some resources for short tasks, killing them after a time limit. This also incentivizes using short tasks, which is good for&amp;nbsp;fairness.&lt;/li&gt;
&lt;li&gt;Performance isolation. Use Linux Containers&amp;nbsp;(cgroups).&lt;/li&gt;
&lt;li&gt;Starvation of large tasks. It&#39;s difficult to get sole access to a node, since some other app with smaller tasks will snap it up. The fix is having a &lt;em&gt;minimum offer size&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Unaddressed / unknown&amp;nbsp;resolution:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gang scheduling. I think this is impossible to do with high utilization without either knowing task lengths or preempting. Incrementally hoarding resources works with low utilization, but can result in&amp;nbsp;deadlock.&lt;/li&gt;
&lt;li&gt;Cross-application preemption is also hard. The resource offer API has no way of saying &quot;here are some low-priority tasks I could kill if you want them&quot;. Mesos depends on tasks being short to achieve&amp;nbsp;fairness.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Omega&lt;/h3&gt;
&lt;p&gt;Omega is sort of a successor to Mesos, and in fact shares an author. Since the paper uses simulated results for its evaluation, I suspect it never went into production at Google, and the ideas were rolled into the next generation of Borg. Rewriting the API is probably too invasive of a change, even for&amp;nbsp;Google.&lt;/p&gt;
&lt;p&gt;Omega takes the resource offers one degree further. In Mesos, resource offers are &lt;em&gt;pessimistic&lt;/em&gt; or &lt;em&gt;exclusive&lt;/em&gt;. If a resource has been offered to an app, the same resource won&#39;t be offered to another app until the offer times out. In Omega, resource offers are &lt;em&gt;optimistic&lt;/em&gt;. Every application is offered all the available resources on the cluster, and conflicts are resolved at commit time. Omega&#39;s resource manager is essentially just a relational database of all the per-node state with different types of optimistic concurrency control to resolve conflicts. The upside of this is vastly increased scheduler performance (full parallelism) and better&amp;nbsp;utilization.&lt;/p&gt;
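Here's a toy model of that optimistic scheme, assuming only what the paragraph describes: every scheduler works from a snapshot of the full cell state and commits claims transactionally, losing if another scheduler got there first. Names are my own:

```python
# Shared cell state with per-machine version numbers for
# optimistic concurrency control.
class CellState:
    def __init__(self, machines):
        self.state = {m: (0, None) for m in machines}  # machine -> (version, owner)

    def snapshot(self):
        return dict(self.state)

    def commit(self, scheduler, claims, snapshot):
        """Atomically claim machines; fail on any version conflict."""
        for m in claims:
            if self.state[m][0] != snapshot[m][0]:
                return False  # conflict: caller retries against fresh state
        for m in claims:
            version, _ = self.state[m]
            self.state[m] = (version + 1, scheduler)
        return True

cell = CellState(["m1", "m2"])
snap_a = cell.snapshot()
snap_b = cell.snapshot()  # both schedulers see everything as free
assert cell.commit("batch", ["m1"], snap_a)        # first commit wins
assert not cell.commit("service", ["m1"], snap_b)  # loser must retry
assert cell.commit("service", ["m2"], snap_b)      # no conflict on m2
```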
&lt;p&gt;The downside of all this is that applications are in a free-for-all where they are allowed to gobble up resources as fast as they want, and even preempt other users. This is okay for Google because they use a priority-based system, and can go yell at their internal users. Their workload broadly falls into just two priority bands: high-priority &lt;em&gt;service&lt;/em&gt; jobs (HBase, webservers, long-lived services) and low-priority &lt;em&gt;batch&lt;/em&gt; jobs (MapReduce and similar). Applications are allowed to preempt lower-priority jobs, and are also trusted to stay within their cooperatively enforced limits on # of submitted jobs, amount of allocated resources, etc. I think Yahoo has said otherwise about being able to go yell at users (it&#39;s certainly not scalable), but it works somehow at&amp;nbsp;Google.&lt;/p&gt;
&lt;p&gt;Most of the paper talks about how this optimistic allocation scheme works with conflicts, which is always the question. There are a few high-level&amp;nbsp;notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Service jobs are larger, and have more rigorous placement requirements for fault-tolerance (spread across&amp;nbsp;racks).&lt;/li&gt;
&lt;li&gt;Omega can probably scale up to 10s but not 100s of schedulers, due to the overhead of distributing the full cluster&amp;nbsp;state.&lt;/li&gt;
&lt;li&gt;Scheduling times of a few seconds are typical. They also compare up to 10s and 100s of seconds, which is where the benefits of two-level scheduling really kick in. Not sure how common this is, maybe for service&amp;nbsp;jobs?&lt;/li&gt;
&lt;li&gt;Typical cluster utilization is about&amp;nbsp;60%.&lt;/li&gt;
&lt;li&gt;Conflicts are rare enough that OCC works in practice. They were able to go up to 6x their normal batch workload before the scheduler fell&amp;nbsp;apart.&lt;/li&gt;
&lt;li&gt;Incremental scheduling is very important. Gang-scheduling is significantly more expensive to implement due to increased conflicts. Apparently most applications can do incremental okay, and can just do a couple partial allocations to get up to their total desired&amp;nbsp;amount.&lt;/li&gt;
&lt;li&gt;Even for complicated schedulers (10s per-job overheads), Omega can still schedule a mixed workload with reasonable wait&amp;nbsp;times.&lt;/li&gt;
&lt;li&gt;Experimenting with a new MapReduce scheduler was empirically easy with&amp;nbsp;Omega.&lt;/li&gt;
&lt;/ul&gt;
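&lt;p&gt;As a toy illustration of the shared-state OCC idea (my own sketch, not Google&#39;s code): each scheduler reads the cluster state, picks a machine, and commits optimistically; the commit fails and is retried if another scheduler claimed that machine&#39;s resources&amp;nbsp;first.&lt;/p&gt;

```python
import random

class Cell:
    """Shared cluster state, visible to every scheduler."""
    def __init__(self, machines, cpus_per_machine):
        self.free = {m: cpus_per_machine for m in range(machines)}
        self.version = {m: 0 for m in range(machines)}

    def try_commit(self, machine, cpus, seen_version):
        # OCC: the claim only succeeds if the machine's state is
        # unchanged since the scheduler took its snapshot.
        if self.version[machine] != seen_version:
            return False              # conflict; caller retries
        if self.free[machine] >= cpus:
            self.free[machine] -= cpus
            self.version[machine] += 1
            return True
        return False

def schedule(cell, cpus, rng, retries=10):
    # One scheduler's retry loop over snapshots of the shared state.
    for _ in range(retries):
        snapshot = dict(cell.version)
        candidates = [m for m, f in cell.free.items() if f >= cpus]
        if not candidates:
            return None
        m = rng.choice(candidates)
        if cell.try_commit(m, cpus, snapshot[m]):
            return m
    return None
```

&lt;p&gt;With rare conflicts the retry loop is cheap; the paper&#39;s finding is that this stays true even at several times their normal batch&amp;nbsp;load.&lt;/p&gt;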
&lt;h4&gt;Open&amp;nbsp;questions&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;At some point, optimistic concurrency control breaks down because of a high conflict rate and the duplicated work from retries. It seems like they won&#39;t run into this in practice, but I wonder if there are worst-case scenarios with oddly-shaped tasks. Is this affected by the mix of service and batch jobs? Is this something that is tuned in&amp;nbsp;practice?&lt;/li&gt;
&lt;li&gt;Is a lack of global policies really acceptable? Fairness, preemption,&amp;nbsp;etc.&lt;/li&gt;
&lt;li&gt;What&#39;s the scheduling time like for different types of jobs? Have people written very complicated&amp;nbsp;schedulers?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Borg&lt;/h3&gt;
&lt;p&gt;This is a production experience paper. It&#39;s the same workload as Omega since it&#39;s also Google, so many of the metapoints are the&amp;nbsp;same.&lt;/p&gt;
&lt;h4&gt;High-level&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Everything runs within Borg, including the storage systems like CFS and&amp;nbsp;BigTable.&lt;/li&gt;
&lt;li&gt;Median cluster size is 10K nodes, though some are much&amp;nbsp;bigger.&lt;/li&gt;
&lt;li&gt;Nodes can be very&amp;nbsp;heterogeneous.&lt;/li&gt;
&lt;li&gt;Linux process isolation is used (essentially containers), since Borg predates modern virtual machine infrastructure. Efficiency and launch time were&amp;nbsp;important.&lt;/li&gt;
&lt;li&gt;All jobs are statically linked&amp;nbsp;binaries.&lt;/li&gt;
&lt;li&gt;A very complicated, very rich resource specification language is&amp;nbsp;available.&lt;/li&gt;
&lt;li&gt;Running jobs can be rolling-updated, both configuration and binary. This sometimes requires a task restart, so fault-tolerance is&amp;nbsp;important.&lt;/li&gt;
&lt;li&gt;Support for &quot;graceful stop&quot; via SIGTERM before the final kill via SIGKILL. The soft kill is optional, and cannot be relied on for&amp;nbsp;correctness.&lt;/li&gt;
&lt;/ul&gt;
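&lt;p&gt;The soft-then-hard kill pattern is easy to sketch (my own code, not Borg&#39;s): send SIGTERM, wait out a grace period, then SIGKILL. Since the soft kill can be skipped, tasks can&#39;t depend on it for&amp;nbsp;correctness.&lt;/p&gt;

```python
import signal
import subprocess

def graceful_stop(proc, grace_seconds=5.0):
    """Ask nicely with SIGTERM, then force-kill after the grace period."""
    proc.terminate()                  # SIGTERM: please clean up and exit
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                   # SIGKILL: not optional anymore
        proc.wait()
    return proc.returncode
```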
&lt;h4&gt;Allocs&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Resource allocation is separated from process liveness. An &lt;em&gt;alloc&lt;/em&gt; can be used for task grouping or to hold resources across task&amp;nbsp;restarts.&lt;/li&gt;
&lt;li&gt;An &lt;em&gt;alloc set&lt;/em&gt; is a group of allocs on multiple machines. Multiple jobs can be run within a single&amp;nbsp;alloc.&lt;/li&gt;
&lt;li&gt;This is actually a pretty common pattern! Multi-process is useful to separate concerns and&amp;nbsp;development.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Priorities and&amp;nbsp;quotas&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Two priority bands: high and low for service and&amp;nbsp;batch.&lt;/li&gt;
&lt;li&gt;Higher-priority jobs can preempt lower-priority&amp;nbsp;jobs.&lt;/li&gt;
&lt;li&gt;High-priority jobs cannot preempt each other, which prevents cascading livelock&amp;nbsp;situations.&lt;/li&gt;
&lt;li&gt;Quotas are used for admission control. Users pay more for quota at higher&amp;nbsp;priorities.&lt;/li&gt;
&lt;li&gt;Also provide a &quot;free&quot; tier that runs at lowest priority, to encourage high utilization and backfill&amp;nbsp;work.&lt;/li&gt;
&lt;li&gt;This is a simple and easy to understand&amp;nbsp;system!&lt;/li&gt;
&lt;/ul&gt;
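&lt;p&gt;A sketch of the preemption rule (names and shapes are mine, not Borg&#39;s API): a pending task may evict strictly lower-priority work to make room, and prod tasks never evict each other, which is what prevents the cascading&amp;nbsp;livelock.&lt;/p&gt;

```python
PROD, BATCH, FREE = 2, 1, 0   # priority bands, highest first

def victims_for(pending_prio, running, needed_cpus):
    # running: list of (priority, cpus) tasks on one machine.
    # Returns the tasks to evict, or None if preemption can't help.
    evictable = [t for t in running if pending_prio > t[0]]
    evictable.sort(key=lambda t: t[0])    # evict the cheapest band first
    chosen, freed = [], 0
    for t in evictable:
        if freed >= needed_cpus:
            break
        chosen.append(t)
        freed += t[1]
    return chosen if freed >= needed_cpus else None
```

&lt;p&gt;Note that a FREE-tier task can never return victims, and a PROD task never appears in another PROD task&#39;s evictable&amp;nbsp;set.&lt;/p&gt;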
&lt;h4&gt;Scheduling&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Two phases to scheduling: finding feasible nodes, then scoring these nodes for final&amp;nbsp;placement.&lt;/li&gt;
&lt;li&gt;Feasibility is heavily determined by task&amp;nbsp;constraints.&lt;/li&gt;
&lt;li&gt;Scoring is mostly determined by system properties, like best-fit vs. worst-fit, job mix, failure domains, locality,&amp;nbsp;etc.&lt;/li&gt;
&lt;li&gt;Once final nodes are chosen, Borg will preempt to fit if&amp;nbsp;necessary.&lt;/li&gt;
&lt;li&gt;Typical scheduling time is around 25s, mostly spent localizing dependencies: downloading the binaries accounts for 80% of it, so locality matters. Torrent and tree protocols are used to distribute&amp;nbsp;binaries.&lt;/li&gt;
&lt;/ul&gt;
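&lt;p&gt;The two phases are easy to sketch; the machine fields and the worst-fit score below are my illustration, not Borg&#39;s actual scoring&amp;nbsp;function.&lt;/p&gt;

```python
def place(task, machines):
    # Phase 1: feasibility, driven by the task's hard constraints.
    feasible = [m for m in machines
                if m["free_cpus"] >= task["cpus"]
                and (not task["needs_ssd"] or m["has_ssd"])]
    if not feasible:
        return None
    # Phase 2: score the feasible set. Worst-fit (most CPUs left over)
    # spreads load; best-fit vs. worst-fit is one of the policy
    # trade-offs the paper discusses.
    return max(feasible, key=lambda m: m["free_cpus"] - task["cpus"])
```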
&lt;h4&gt;Scalability&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Centralization has not turned out to be an insurmountable performance&amp;nbsp;bottleneck.&lt;/li&gt;
&lt;li&gt;10s of thousands of nodes, 10K tasks per minute scheduling&amp;nbsp;rate.&lt;/li&gt;
&lt;li&gt;Typical Borgmaster uses 10-14 cores and 50GB of&amp;nbsp;RAM.&lt;/li&gt;
&lt;li&gt;Architecture has become more and more multi-process over time, with reference to Omega and two-level&amp;nbsp;scheduling.&lt;/li&gt;
&lt;li&gt;Single master Borgmaster, but some responsibilities are still sharded: state updates from workers, read-only&amp;nbsp;RPCs.&lt;/li&gt;
&lt;li&gt;Some obvious optimizations: cache machine scores, compute feasibility once per task type, don&#39;t attempt global optimality when making scheduling&amp;nbsp;decisions.&lt;/li&gt;
&lt;li&gt;The primary argument against bigger cells is isolation from operator errors and failure propagation; the architecture itself keeps scaling&amp;nbsp;fine.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Utilization&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Their primary metric was &lt;em&gt;cell compaction&lt;/em&gt;, or the smallest cluster that can still fit a set of tasks. Essentially bin&amp;nbsp;packing.&lt;/li&gt;
&lt;li&gt;Big gains from the following: not segregating workloads or users, having big shared clusters, fine-grained resource&amp;nbsp;requests.&lt;/li&gt;
&lt;li&gt;Optimistic overcommit on a per-Borglet basis. Borglets do resource estimation, and backfill non-prod work. If the estimation is incorrect, kill off the non-prod work. Memory is the inelastic&amp;nbsp;resource.&lt;/li&gt;
&lt;li&gt;Sharing does not drastically affect CPI (cycles per instruction, their proxy for CPU interference), but I wonder about the effect on&amp;nbsp;storage.&lt;/li&gt;
&lt;/ul&gt;
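&lt;p&gt;Cell compaction can be approximated with a simple packer; first-fit decreasing below is my stand-in, since the paper doesn&#39;t publish the exact&amp;nbsp;algorithm.&lt;/p&gt;

```python
def fits(tasks, machines, capacity):
    # First-fit decreasing: place the biggest tasks first, each on the
    # first machine with room.
    free = [capacity] * machines
    for t in sorted(tasks, reverse=True):
        for i in range(machines):
            if free[i] >= t:
                free[i] -= t
                break
        else:
            return False
    return True

def compacted_size(tasks, capacity):
    # Smallest machine count that still packs every task; one machine
    # per task is a trivial upper bound when each task fits a machine.
    n = len(tasks)
    while n > 1 and fits(tasks, n - 1, capacity):
        n -= 1
    return n
```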
&lt;h4&gt;Lessons&amp;nbsp;learned&lt;/h4&gt;
&lt;p&gt;The issues listed here are pretty much fixed in Kubernetes, their public, open-source container&amp;nbsp;scheduler.&lt;/p&gt;
&lt;p&gt;Bad:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Would be nice to schedule multi-job workflows rather than single jobs, for tracking and management. This also requires more flexible ways of referring to components of a workflow. This is solved by attaching arbitrary key-value pairs to each task and allowing users to query against&amp;nbsp;them.&lt;/li&gt;
&lt;li&gt;One IP per machine. This leads to port conflicts on a single machine and complicates binding and service discovery. This is solved by Linux namespaces, IPv6,&amp;nbsp;SDN.&lt;/li&gt;
&lt;li&gt;Complicated specification language. Lots of knobs to turn, which makes it hard to get started as a casual user. Some work on automatically determining resource&amp;nbsp;requirements.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Good:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Allocs are great! Allows helper services to be easily placed next to the main&amp;nbsp;task.&lt;/li&gt;
&lt;li&gt;Baking in services like load balancing and naming is very&amp;nbsp;useful.&lt;/li&gt;
&lt;li&gt;Metrics, debugging, web UIs are very important so users can solve their own&amp;nbsp;problems.&lt;/li&gt;
&lt;li&gt;Centralization scales up well, but need to split it up into multiple processes. Kubernetes does this from the start, meaning a nice clean API between the different scheduler&amp;nbsp;components.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Closing&amp;nbsp;remarks&lt;/h3&gt;
&lt;p&gt;It seems like YARN will need to draw from Mesos and Omega to scale up to the 10K node scale. YARN is still a centralized scheduler, which is the strawman for comparison in Mesos and Omega. Borg specifically mentions the need to shard to&amp;nbsp;scale.&lt;/p&gt;
&lt;p&gt;Isolation is very important to achieve high utilization without compromising SLOs. This can surface at the application layer, where apps themselves need to be designed to be latency-tolerant. Think tail-at-scale request replication in BigTable. Ultimately it comes down to hardware spend vs. software spend. Running at lower utilization sidesteps this problem. Or, you can tackle it head-on through OS isolation mechanisms, resource estimation, and tuning your workload and schedulers. At Google-scale, there&#39;s enough hardware that it makes sense to hire a bunch of kernel developers. Fortunately they&#39;ve done the work for us&amp;nbsp;:)&lt;/p&gt;
&lt;p&gt;I wonder also if the Google workload assumptions apply more generally. Priority bands, reservations, and preemption work well for Google, but our customers almost all use the fair share scheduler. Yahoo uses the capacity scheduler. Twitter uses the fair scheduler. I haven&#39;t heard of any demand or usage of a priority + reservation&amp;nbsp;scheduler.&lt;/p&gt;
&lt;p&gt;Finally, very few of our customers run big shared clusters as envisioned at Google. We have customers with thousands of nodes, but this is split up into pods of hundreds of nodes. It&#39;s also still common to have separate clusters for separate users or applications. Clusters are also typically homogeneous in terms of hardware. I think this will begin to change though, and&amp;nbsp;soon.&lt;/p&gt;

   </content></entry><entry><title>Transparent encryption in HDFS</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2015/transparent_encryption_in_hdfs.html"/><updated>2015-05-27T12:34:00Z</updated><published>2015-05-27T12:34:00Z</published><id>http://www.umbrant.com/blog/2015/transparent_encryption_in_hdfs.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I went on a little European roadshow last month, presenting my recent work on transparent encryption in HDFS at &lt;a href=&quot;http://2015.hadoopsummit.org/brussels/&quot;&gt;Hadoop Summit Brussels&lt;/a&gt; and &lt;a href=&quot;http://strataconf.com/big-data-conference-uk-2015/public/schedule/detail/39913&quot;&gt;Strata Hadoop World London&lt;/a&gt;.
I&#39;ll also be giving the same talk this fall at Strata Hadoop World NYC, which will possibly be the biggest audience I&#39;ve ever spoken in front&amp;nbsp;of.&lt;/p&gt;
&lt;p&gt;Slides: &lt;a href=&quot;/presentations/transparent_encryption_hdfs_clamb_awang-2015-04-28.pptx&quot;&gt;pptx&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Video: &lt;a href=&quot;https://www.youtube.com/watch?v=rTiEcRwy5G4&quot;&gt;Hadoop Summit Brussels&amp;nbsp;(youtube)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you have access to O&#39;Reilly, there should be a higher quality video available&amp;nbsp;there.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;


   </content></entry><entry><title>The Next Generation of Apache Hadoop</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2016/next_generation_of_apache_hadoop.html"/><updated>2016-08-25T21:44:00Z</updated><published>2016-08-25T21:44:00Z</published><id>http://www.umbrant.com/blog/2016/next_generation_of_apache_hadoop.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;Apache Hadoop turned ten this year. To celebrate, Karthik and I gave a talk at USENIX ATC &#39;16 about open problems to solve in Hadoop&#39;s second decade.
This was an opportunity to revisit our academic roots and get a new crop of graduate students interested in the real distributed systems problems we&#39;re trying to solve in&amp;nbsp;industry.&lt;/p&gt;
&lt;p&gt;This is a huge topic and we only had a 25 minute talk slot, so we were pitching problems rather than solutions.
However, we did have some ideas in our back pocket, and the hallway track and birds-of-a-feather we hosted afterwards led to a lot of good&amp;nbsp;discussion.&lt;/p&gt;
&lt;p&gt;Karthik and I split up the content thematically, which worked really well.
I covered scalability, meaning sharded filesystems and federated resource management.
Karthik addressed scheduling (unifying batch jobs and long-running services) and utilization (overprovisioning, preemption,&amp;nbsp;isolation).&lt;/p&gt;
&lt;p&gt;I&#39;m hoping to give this talk again in longer form, since I&#39;m proud of the&amp;nbsp;content.&lt;/p&gt;
&lt;p&gt;Slides: &lt;a href=&quot;/presentations/atc_2016.pptx&quot;&gt;pptx&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.usenix.org/conference/atc16/technical-sessions/presentation/kambatla&quot;&gt;USENIX site with PDF slides and&amp;nbsp;audio&lt;/a&gt;&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;p&gt;Talking big ideas like this with Karthik also made me nostalgic for graduate school.
Karthik is one of the most impressive people I know; I thought he&#39;d left graduate school for Cloudera like me, but he&#39;s actually been working on his PhD nights and weekends!
While we were prepping this presentation for ATC, he was also working on a submission for SoCC, and is apparently close to&amp;nbsp;graduating.&lt;/p&gt;

   </content></entry><entry><title>Distributed Testing</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2016/distributed_testing.html"/><updated>2016-08-25T20:46:00Z</updated><published>2016-08-25T20:46:00Z</published><id>http://www.umbrant.com/blog/2016/distributed_testing.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;I gave a presentation titled &lt;a href=&quot;https://apachebigdata2016.sched.org/event/0b593f209a2ca2cfef189069db3a085c?iframe=no&quot;&gt;Happier Developers and Happier Software through Distributed Testing&lt;/a&gt; at Apache Big Data 2016, which detailed how our distributed unit testing framework has decreased the runtime of Apache Hadoop&#39;s unit test suite by &lt;strong&gt;60x&lt;/strong&gt; from 8.5 hours to about 8 minutes, and the substantial productivity improvements that are possible when developers can easily run and interact with the test&amp;nbsp;suite.&lt;/p&gt;
&lt;p&gt;The infrastructure is general enough to accommodate any software project.
We wrote frontends for both C++/gtest and&amp;nbsp;Java/Maven.&lt;/p&gt;
&lt;p&gt;This effort started as a Cloudera hackathon project that Todd and I worked on two years ago, and I&#39;m very glad we got it across the line.
Furthermore, it&#39;s also &lt;a href=&quot;https://github.com/cloudera/dist_test&quot;&gt;open-source&lt;/a&gt;, and we&#39;d love to see it rolled out to more&amp;nbsp;projects.&lt;/p&gt;
&lt;p&gt;Slides: &lt;a href=&quot;/presentations/distributed_testing_apache_big_data_2016.pptx&quot;&gt;pptx&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Source-code: &lt;a href=&quot;https://github.com/cloudera/dist_test&quot;&gt;cloudera/dist_test&lt;/a&gt;&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;


   </content></entry><entry><title>Windows Azure Storage</title><author><name>Andrew Wang</name></author><link href="http://www.umbrant.com/blog/2016/windows_azure_storage.html"/><updated>2016-02-04T21:23:00Z</updated><published>2016-02-04T21:23:00Z</published><id>http://www.umbrant.com/blog/2016/windows_azure_storage.html</id><content type="html">
       

&lt;!-- Hyde::Excerpt::Begin --&gt;

&lt;p&gt;What makes this paper special is that it is one of the only published papers about a production &lt;em&gt;cloud&lt;/em&gt; blobstore. The 800-pound gorilla in this space is Amazon S3, but I find Windows Azure Storage (WAS) the more interesting system since it provides strong consistency, additional features like append, and serves as the backend for not just WAS Blobs, but also WAS Tables (structured data access) and WAS Queues (message delivery). It also occupies a different design point than hash-partitioned blobstores like Swift and&amp;nbsp;Rados.&lt;/p&gt;
&lt;p&gt;This paper, &quot;&lt;a href=&quot;http://blogs.msdn.com/b/windowsazurestorage/archive/2011/11/20/windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency.aspx&quot;&gt;Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency&lt;/a&gt;&quot; by Calder et al., was published at SOSP&amp;nbsp;&#39;11.&lt;/p&gt;
&lt;!-- Hyde::Excerpt::End --&gt;

&lt;h3&gt;Background&lt;/h3&gt;
&lt;p&gt;Most people are familiar with filesystems. Filesystems have a hierarchical namespace composed of a nested directory tree, where directories contain files. Directories and files can have various bits of metadata attached to them like permissions, modtime, ACLs, xattrs, etc. Directories are used to logically group files together, and there are commands that work on entire recursive directory trees (mv, rm -r, chown -r,&amp;nbsp;etc).&lt;/p&gt;
&lt;p&gt;Blobstores are like filesystems, except simpler. A unifying characteristic of blobstores is that they do not provide a hierarchical namespace. Instead, you get multiple flat namespaces (which S3 calls &lt;em&gt;buckets&lt;/em&gt; and WAS &lt;em&gt;partitions&lt;/em&gt;), in which you can store blobs. Blobstores also provide fewer features than filesystems. It&#39;s typical not to support operations like rename or per-blob permissions, and to have preconditions around IO (e.g. S3 requires a full-blob checksum at upload time, and allows no random writes or appends) which push complexity to the&amp;nbsp;application-level.&lt;/p&gt;
&lt;p&gt;You might read this and think that blobstores sound terrible, but there&#39;s a very good reason for throwing away these features: horizontal scalability. It&#39;s difficult to shard a hierarchical namespace, and it&#39;s even harder to support operations like directory rename, so blobstores punt on these problems. As a result, you have a system that architecturally has infinite&amp;nbsp;scale.&lt;/p&gt;
&lt;h3&gt;Overview&lt;/h3&gt;
&lt;p&gt;In the datacenter, WAS is composed of &lt;em&gt;stamps&lt;/em&gt;, which are sets of 10-20 racks of servers. This is what others might call a &lt;em&gt;cell&lt;/em&gt; or &lt;em&gt;pod&lt;/em&gt;, it&#39;s used as a unit of deployment and management. There are many stamps per datacenter, and many datacenters which are geographically distributed for&amp;nbsp;fault-tolerance.&lt;/p&gt;
&lt;p&gt;Users have &lt;em&gt;accounts&lt;/em&gt; and all of the data in an account is stored on a single stamp. Accounts are another unit of management, and are migrated between stamps based on&amp;nbsp;load.&lt;/p&gt;
&lt;p&gt;WAS is a very layered system, so let&#39;s take it from the bottom up, starting with how it works within a single stamp, then talking about how multiple stamps are glued together into a global&amp;nbsp;namespace.&lt;/p&gt;
&lt;h3&gt;Stream&amp;nbsp;Layer&lt;/h3&gt;
&lt;p&gt;The bottom-most layer in the stack is the &lt;em&gt;stream layer&lt;/em&gt;. The stream layer exposes a flat namespace of append-only logs called &lt;em&gt;streams&lt;/em&gt;. Streams are composed of &lt;em&gt;extents&lt;/em&gt;, which are the unit of replication, are about &lt;em&gt;1GB&lt;/em&gt; in size, and are stored as files on a local filesystem. Only the last extent in the stream can be appended to. Extents in turn are composed of &lt;em&gt;blocks&lt;/em&gt;, which are variable-length up to 4MB in size, and are the unit of a client read or write. Blocks are also the unit of checksumming, so the entire block is read at read time to verify the&amp;nbsp;checksum.&lt;/p&gt;
&lt;p&gt;Architecturally, this looks a lot like HDFS. There is a Paxos-replicated &lt;em&gt;stream master&lt;/em&gt; which maintains a mapping of streams-to-extents and extents-to-nodes. It chooses which nodes to use for incoming writes, routes reads to the correct nodes, and re-replicates extents when nodes fail. The stream master needs to keep these mappings in memory, and it&#39;s designed for approximately 100k streams and 50 million total&amp;nbsp;extents.&lt;/p&gt;
&lt;p&gt;Notably, the stream master does not track the extent-to-block mapping, which would not fit on a single machine. Instead, this is handled by the extent nodes, which maintain an index of the block offsets alongside the extent&amp;nbsp;file.&lt;/p&gt;
&lt;p&gt;The stream layer uses chain replication when writing an extent, the same method as HDFS but with some differences I&#39;ll be highlighting. Chain replication is nice since it&#39;s less complicated than Paxos replication and you can get better throughput by pipelining. The master is off the hot-path during data writes; the writer goes directly to the three chosen extent nodes. One of these nodes is the &lt;em&gt;primary&lt;/em&gt; and pushes whole-block writes down the pipeline and acks the client when complete. These block writes are atomic (and there is even an atomic multi-block append). Combined, this has the nice property of allowing concurrent writers to an extent, since the primary orders the incoming block appends and serves all reads to the extent while it&#39;s being written to. HDFS does not have atomic appends, but does allow applications to block for data to be durable (hflush/hsync) which provides similar properties if you use a length-prefixed record format with a checksum, and roll a new file on&amp;nbsp;failure.&lt;/p&gt;
&lt;p&gt;Talking more about failures: in WAS, failures during a write are handled by sealing the extent and starting a new one. Sealing the extent requires agreeing on its length, which is coordinated with the stream master. The SM asks the remaining nodes for the length of the extent, and uses the smallest length. This is safe since writes are only ack&#39;d to the client after they are fully-replicated. Longer lengths are also okay, since stream clients are required to handle duplicate blocks in a stream. Once sealed, this is the final length of the extent. If a version of the extent with a different length appears, it is safely&amp;nbsp;discarded.&lt;/p&gt;
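&lt;p&gt;The seal protocol reduces to a min over replica lengths (function names below are mine): it&#39;s safe because a write is only ack&#39;d once it&#39;s on every replica, so the shortest replica still has every acknowledged&amp;nbsp;block.&lt;/p&gt;

```python
def seal_extent(replica_lengths):
    # Lengths in bytes reported by each reachable extent node; the
    # smallest one becomes the final, sealed length.
    return min(replica_lengths)

def reconcile(replica_lengths, sealed_length):
    # Replicas longer than the sealed length drop their unsealed tail.
    # Nothing acknowledged is lost, and clients already tolerate
    # duplicate blocks in a stream.
    return [min(l, sealed_length) for l in replica_lengths]
```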
&lt;p&gt;HDFS does something more complicated called &lt;em&gt;pipeline recovery&lt;/em&gt; to try and keep writing rather than rolling to a new HDFS block. This is because we want to produce fewer, larger blocks to reduce NameNode memory usage, and for a long time HDFS did not support variable-length blocks (for no really good reason). HDFS pipeline recovery has been described as &quot;an informally-specified implementation of two-phase commit&quot;, and you can read all about it (and other recovery processes) in an excellent &lt;a href=&quot;http://blog.cloudera.com/blog/2015/02/understanding-hdfs-recovery-processes-part-1/&quot;&gt;series&lt;/a&gt; of &lt;a href=&quot;https://blog.cloudera.com/blog/2015/03/understanding-hdfs-recovery-processes-part-2/&quot;&gt;blog posts&lt;/a&gt; written by my colleague Yongjun&amp;nbsp;Zhang.&lt;/p&gt;
&lt;p&gt;The stream layer also implements background erasure coding of sealed extents, as well as latency-levelling by a similar mechanism to Jeff Dean&#39;s &lt;a href=&quot;http://research.google.com/pubs/pub40801.html&quot;&gt;The Tail at Scale&lt;/a&gt; work. They also do some ops tricks to further improve latency, like allocating a separate SSD to buffer writes and doing deadline IO&amp;nbsp;scheduling.&lt;/p&gt;
&lt;h3&gt;Partition&amp;nbsp;Layer&lt;/h3&gt;
&lt;p&gt;Now, we move onto the co-designed user of the stream layer: the &lt;em&gt;partition layer&lt;/em&gt;, which maintains the user-visible constructs like blobs, tables, and&amp;nbsp;queues.&lt;/p&gt;
&lt;p&gt;The partition layer is a range-partitioned distributed database. These ranges can be split and merged and moved around based on&amp;nbsp;load.&lt;/p&gt;
&lt;p&gt;There is a table for each of the three user-visible constructs, a table that describes the schema of these three tables, and finally a table of the mapping of ranges to servers (like the meta table in HBase). The primary key for the three &lt;em&gt;object tables&lt;/em&gt; is a compound key of &lt;em&gt;(account, partition, object name)&lt;/em&gt; (user-level identifiers), and other columns describe what stream, extent, offset, and length have the corresponding data for that object. Since it&#39;s a range-partitioned distributed database that uses LSM-trees under the hood, I&#39;m going to point to &lt;a href=&quot;http://getkudu.io/&quot;&gt;Kudu&lt;/a&gt; and &lt;a href=&quot;https://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt; as similar systems if you want to learn more. The split and merge process looks the same as HBase, and implementation details like the memstore, bloom filters, and row caching are present in all of these&amp;nbsp;systems.&lt;/p&gt;
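&lt;p&gt;A range lookup over the compound key can be sketched with a sorted partition map and binary search; the compound key is from the paper, the bisect lookup is my&amp;nbsp;illustration.&lt;/p&gt;

```python
import bisect

class PartitionMap:
    def __init__(self, assignments):
        # assignments: sorted list of (start_key, server) pairs; a range
        # owns all keys from its start up to the next range's start.
        self.starts = [k for k, _ in assignments]
        self.servers = [s for _, s in assignments]

    def lookup(self, account, partition, name):
        key = (account, partition, name)
        i = bisect.bisect_right(self.starts, key) - 1
        return self.servers[i]
```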
&lt;p&gt;Each range uses a couple streams to maintain its state. The two important ones are a commit stream (a WAL) and a row data stream (checkpoints of WAL mutations, HFiles in HBase parlance) which are used to maintain the LSM tree. They also implement a BLOB type which writes blob data into a side stream to avoid the write amplification of LSM trees, instead using pointers and efficient stream concat operations to avoid rewriting&amp;nbsp;data.&lt;/p&gt;
&lt;p&gt;One interesting point is that on a per-stamp basis, they see 75 splits and merges and 200 partition moves every day. That&#39;s a lot more than I would have guessed for HBase, but since the partition layer doesn&#39;t worry about storage locality, moving a partition is cheap. An efficient stream concat operation means you can avoid rewriting data when doing&amp;nbsp;merges.&lt;/p&gt;
&lt;h3&gt;Front-end&amp;nbsp;service&lt;/h3&gt;
&lt;p&gt;The front-end service is a proxy that interacts with WAS on users&#39; behalf. Front-ends are stateless, perform authentication/authorization, and cache the partition map for faster lookups. This is pretty standard in web&amp;nbsp;services.&lt;/p&gt;
&lt;h3&gt;Location&amp;nbsp;service&lt;/h3&gt;
&lt;p&gt;Gluing together stamps is the &lt;em&gt;location service&lt;/em&gt;, which sits above the partition layer and maintains a global namespace of accounts-to-stamps. We now have the full picture of how to find an object given its (account, partition, name) tuple. The location service tells us which stamp has the account, the partition layer at that stamp translates partition and name to (stream, extent, offset, length), and then the stream layer translates the (extent, offset, length) into a block on disk (the read&amp;nbsp;unit).&lt;/p&gt;
&lt;p&gt;The location service is responsible for moving accounts between stamps for load balancing, and also for &lt;em&gt;inter-stamp replication&lt;/em&gt; for disaster fault-tolerance. An account has a primary stamp and some number of secondaries. The location service links them together, and the partition layer in the primary stamp asynchronously replicates to the secondaries. This async process is pretty fast, they say on average it takes 30s to replicate&amp;nbsp;changes.&lt;/p&gt;
&lt;p&gt;In the case of a disaster, the location service flips DNS and VIP over to point to a secondary cluster. This does mean that something like 30s of data will be lost, since replication is asynchronous. These events do happen too. It&#39;s not just meteor strikes, I&#39;ve heard some funny stories about hunters shooting down power lines or utility crews cutting network&amp;nbsp;cables.&lt;/p&gt;
&lt;h3&gt;Questions and&amp;nbsp;comments&lt;/h3&gt;
&lt;h4&gt;How many hops /&amp;nbsp;IOs?&lt;/h4&gt;
&lt;p&gt;It&#39;s interesting to try and count how many network hops and IOs are required to do operations in this system. Let&#39;s look at a typical read&amp;nbsp;request.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Client has the DNS cached so goes to a front-end at the primary stamp. 1 hop, 0&amp;nbsp;IOs.&lt;/li&gt;
&lt;li&gt;Front-end has the partition map cached, so goes to the partition server for that range. 1 hop, 0&amp;nbsp;IOs.&lt;/li&gt;
&lt;li&gt;Partition server looks in the row data stream for the extent/offset/length of the blob. This is an LSM tree, so requires 0-N lookups to the stream layer (let&#39;s assume 1), and the blob layer is 1 lookup. For each stream lookup:&lt;ol&gt;
&lt;li&gt;Go to the stream master to figure out which extent nodes have the extent. 1 hop, 0&amp;nbsp;IOs.&lt;/li&gt;
&lt;li&gt;Go to the extent node to read the data. This requires looking in the index and then the actual data. 1 hop, ~3 IOs. This depends on caching of file handles and the filesystem, so an&amp;nbsp;estimate.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;My math says this is a total of 6 hops and 6 IOs to do a read. How about writing a small&amp;nbsp;blob?&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Client has the DNS cached so goes to a front-end at the primary stamp. 1 hop, 0&amp;nbsp;IOs.&lt;/li&gt;
&lt;li&gt;Front-end has the partition map cached, so goes to the partition server for that range. 1 hop, 0&amp;nbsp;IOs.&lt;/li&gt;
&lt;li&gt;Partition server writes to the WAL stream and the side blob stream. For each write:&lt;ol&gt;
&lt;li&gt;Go to the stream master to find the primary extent node for the extent. 1 hop, 0&amp;nbsp;IOs.&lt;/li&gt;
&lt;li&gt;Go to the primary extent node, write the block. This gets pushed to the two other nodes in the pipeline. Appending to an open file is ~2 IOs (one for the data, one to update the file length in the inode). I&#39;ll assume the index is in memory while the extent is unsealed, and checksums are stored inline, else this blows up. 3 hops, 6&amp;nbsp;IOs.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;At this point the write can be&amp;nbsp;ACK&#39;d.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Add it up, this is ten hops and 12 IOs. This is multiplied by the number of stamp replicas, and again by some small factor for LSM rewrites, and by 2 since all writes are&amp;nbsp;journaled.&lt;/p&gt;
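&lt;p&gt;Redoing the tallies above as code, under the same assumptions (one LSM lookup, extent index and checksums&amp;nbsp;cached):&lt;/p&gt;

```python
HOP = (1, 0)                      # one network hop, no disk IO

def total(steps):
    # Sum (hops, ios) pairs component-wise.
    return tuple(sum(x) for x in zip(*steps))

# Read: front-end, partition server, then 2 stream-layer lookups
# (1 for the LSM row, 1 for the blob), each a stream-master hop plus
# an extent-node hop with ~3 IOs.
stream_read = [(1, 0), (1, 3)]
read_path = [HOP, HOP] + stream_read + stream_read

# Write: front-end, partition server, then 2 stream writes (WAL plus
# blob side stream), each a stream-master hop plus a 3-node pipeline
# doing ~2 IOs per replica.
stream_write = [(1, 0), (3, 6)]
write_path = [HOP, HOP] + stream_write + stream_write
```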
&lt;h4&gt;Account&amp;nbsp;limits&lt;/h4&gt;
&lt;p&gt;S3 lets you store an &lt;a href=&quot;http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html&quot;&gt;unlimited number of items in a bucket&lt;/a&gt; with no performance implications. This is not the case in WAS, which limits accounts to &lt;a href=&quot;https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storage-limits&quot;&gt;500TB usage, 20K IOPS, and 10-30Gbps network throughput&lt;/a&gt;. Remember also that an account can have multiple partitions! This is because accounts are assigned to a stamp, so an account is limited by the capacity of that one&amp;nbsp;stamp.&lt;/p&gt;
&lt;p&gt;Forcing users to partition manually is distinctly worse than the unrestricted buckets provided by S3. Their reasoning was that pinning to a single stamp gives better performance isolation and lets users choose in what geographic region their data is stored. I don&#39;t understand why you couldn&#39;t do the same thing with range-partitioned buckets. With system-managed partitioning, you could get even better stamp utilization by splitting/merging an account across&amp;nbsp;stamps.&lt;/p&gt;
&lt;p&gt;I&#39;m told that Azure Data Lake is the next thing coming down the pipe, is built on top of WAS, and handles sharding across multiple accounts to get around these limits. At this point I begin to wonder; WAS is already a very layered system, and indirections do come at a&amp;nbsp;cost.&lt;/p&gt;
&lt;h4&gt;Centralized&amp;nbsp;control&lt;/h4&gt;
&lt;p&gt;They chose not to use hash-partitioning since they wanted ranged listing and more control over placement for isolation. This is a criticism of hash-partitioning I share. It&#39;s really convenient to split/move partitions around based on load. Range partitioning has the issue of hotspotting writes to sequential keys, but there are tricks like reversing or hashing the key to work around&amp;nbsp;this.&lt;/p&gt;
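&lt;p&gt;The key-salting trick looks something like this (a sketch; the bucket count and key format are arbitrary choices of mine): prefix the key with a few bytes of its hash so sequential writes spread across ranges, at the cost of cheap ordered scans over the raw&amp;nbsp;key.&lt;/p&gt;

```python
import hashlib

def salted_key(key, buckets=16):
    # Deterministic salt derived from the key itself, so reads can
    # recompute the prefix without extra state.
    digest = hashlib.md5(key.encode()).hexdigest()
    salt = int(digest, 16) % buckets
    return "%02d/%s" % (salt, key)
```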
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;There are lots of things to like about WAS. They built a number of useful user abstractions on top of a single storage system which provides strong consistency, a fuller feature set than S3, and geo-replication. WAS also has nice provisions to improve latency, increase utilization, and make operations&amp;nbsp;easier.&lt;/p&gt;
&lt;p&gt;I view this as a practical way of stitching together multiple clusters. This is something we support in HDFS via federation, and MapR does something similar with its concept of volumes. It&#39;s also probably true that most users can live just fine with 500TB of storage. I just find it somewhat dissatisfying that WAS rids itself of a hierarchical namespace, but doesn&#39;t exploit that fact to the fullest extent to expose a truly scale-out&amp;nbsp;system.&lt;/p&gt;

   </content></entry></feed>
