<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Alex Feinberg</title>
 <link href="http://afeinberg.github.com/atom.xml" rel="self"/>
 <link href="http://afeinberg.github.com/"/>
 <updated>2011-07-25T14:20:37-07:00</updated>
 <id>http://afeinberg.github.com/</id>
 <author>
   <name>Alex Feinberg</name>
   <email>alex@strlen.net</email>
 </author>

 
 <entry>
   <title>Reliability, availability and scale - an interlude</title>
   <link href="http://afeinberg.github.com/2011/06/25/reliability-availability-scale-interlude.html"/>
   <updated>2011-06-25T00:00:00-07:00</updated>
   <id>http://afeinberg.github.com/2011/06/25/reliability-availability-scale-interlude</id>
   <content type="html">&lt;h1&gt;Reliability, availability and scale &amp;#8211; an interlude&lt;/h1&gt;
&lt;h2&gt;An interlude&lt;/h2&gt;
&lt;p&gt;My &lt;a href=&quot;http://afeinberg.github.com/2011/06/17/replication-atomicity-and-order-in-distributed-systems.html&quot;&gt;last post on distributed systems&lt;/a&gt; was dense with concepts. Before continuing with more discussion, let&amp;#8217;s take a quick detour and define several frequently used, but often confused, terms in distributed computing.&lt;/p&gt;
&lt;p&gt;The term scalability is often conflated with other related, important concepts. See for example an article by 37Signals &lt;a href=&quot;http://37signals.com/svn/archives2/dont_scale_99999_uptime_is_for_walmart.php&quot;&gt;&amp;#8220;Don’t scale: 99.999% uptime is for Wal-Mart&amp;#8221;&lt;/a&gt; &amp;#8212; in the article, the notions of scalability and an availability &lt;span class=&quot;caps&quot;&gt;SLA&lt;/span&gt; (which are typically stated as percentages) are used as if they were interchangeable.&lt;/p&gt;
&lt;p&gt;However, as we&amp;#8217;ll see in this post, meeting one or more of these related non-functional (i.e., ones which often come &lt;em&gt;after&lt;/em&gt; the core functionality has been implemented) requirements does not imply meeting the others.&lt;/p&gt;
&lt;p&gt;The non-functional requirements (or &amp;#8220;ilities&amp;#8221;) will be separated into three &amp;#8220;buckets&amp;#8221;: reliability, availability and scalability. It&amp;#8217;s very difficult to agree on what these terms mean, but based on systems engineering practice, here&amp;#8217;s the way that I approach it.&lt;/p&gt;
&lt;h2&gt;Reliability&lt;/h2&gt;
&lt;p&gt;In the previous post, the term &amp;#8220;reliability&amp;#8221; was used informally and the term &amp;#8220;fault tolerance&amp;#8221; was used more formally, e.g., in discussion of fault tolerance properties of algorithms. Rigorously speaking, fault tolerance is only a part of the reliability story: in a fault tolerant multi-component system, it is sufficient that failure of one component doesn&amp;#8217;t cause failure of other components. A system that continues to function in a degraded state is fault tolerant, but unless the full functionality of the previous state can be restored, it&amp;#8217;s not fully reliable. In other words, a reliable system must be fault tolerant, but a fault tolerant system is not necessarily reliable.&lt;/p&gt;
&lt;h3&gt;Recovery&lt;/h3&gt;
&lt;p&gt;&amp;#8220;Recovery&amp;#8221; refers to restoring full functionality (defined to be the previous state in this context) when a failure occurs. Recovery is not often an explicitly stated goal, and is sometimes not included in formal definitions of reliability. However, recovery is an important consideration in the discipline of deploying and maintaining production systems. Certain design choices (e.g., not maintaining a transaction log) can hurt a system&amp;#8217;s recovery profile despite helping its scalability and availability.&lt;/p&gt;
&lt;p&gt;&amp;#8220;&lt;span class=&quot;caps&quot;&gt;MTTR&lt;/span&gt;&amp;#8221; stands for &lt;a href=&quot;http://en.wikipedia.org/wiki/Mean_time_to_recovery&quot;&gt;&lt;em&gt;Mean Time To Recovery&lt;/em&gt;&lt;/a&gt;: the average time from when a failure is encountered to when the previous state is restored, i.e., a system&amp;#8217;s recovery time.&lt;/p&gt;
&lt;h2&gt;Availability&lt;/h2&gt;
&lt;p&gt;In Tanenbaum and van Steen&amp;#8217;s &lt;em&gt;Distributed Systems: Principles and Paradigms&lt;/em&gt;, availability is defined as&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[The] property that a system is ready to be used immediately. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is one that will most likely be working at a given instant in time.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here we see two definitions &amp;#8212; the first sentence defines availability at a specific point in time, while the rest of the paragraph gives a way to characterize the &lt;em&gt;overall&lt;/em&gt; availability of a system. Enterprise vendors frequently talk about the high availability of their solutions; however, this can mean different things.&lt;/p&gt;
&lt;p&gt;For example, a system that goes down for a minute in the case of failure and then recovers can still be marketed as &amp;#8220;highly available&amp;#8221;: this could be honest marketing if the system is designed such that the failures are rare, i.e., the &lt;a href=&quot;http://en.wikipedia.org/wiki/MTBF&quot;&gt;&lt;strong&gt;&lt;span class=&quot;caps&quot;&gt;MTBF&lt;/span&gt;&lt;/strong&gt;&lt;/a&gt; is particularly high in relation to &lt;span class=&quot;caps&quot;&gt;MTTR&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;Recently, the trend has been to build systems that either maintain availability in the face of failure or recover it quickly, rather than systems with especially high &lt;span class=&quot;caps&quot;&gt;MTBF&lt;/span&gt;. This systems engineering view is well summarized by John Allspaw in &lt;a href=&quot;http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/&quot;&gt;&amp;#8220;&lt;span class=&quot;caps&quot;&gt;MTTR&lt;/span&gt; is more important than &lt;span class=&quot;caps&quot;&gt;MTBF&lt;/span&gt; (for most types of F)&amp;#8221;&lt;/a&gt;.&lt;/p&gt;
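&lt;p&gt;As a rough sketch, long-run availability can be estimated as &lt;span class=&quot;caps&quot;&gt;MTBF&lt;/span&gt; / (&lt;span class=&quot;caps&quot;&gt;MTBF&lt;/span&gt; + &lt;span class=&quot;caps&quot;&gt;MTTR&lt;/span&gt;), which makes Allspaw&amp;#8217;s point concrete; the numbers below are purely illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def availability(mtbf_hours, mttr_hours):
    # Long-run availability: the fraction of time the system is up.
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Fails roughly once a month, but recovers within one minute:
print(availability(30 * 24, 1 / 60.0))   # ~0.99998, better than four nines
# Fails roughly once a year, but takes a full day to recover:
print(availability(365 * 24, 24))        # ~0.9973, not even three nines
&lt;/code&gt;&lt;/pre&gt;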
&lt;p&gt;For the purpose of this blog, a &amp;#8220;median&amp;#8221; definition will be used: a system is highly available if, in the case of failure, it can still respond within a reasonable (acceptable to the end-user) timeout.&lt;/p&gt;
&lt;h2&gt;Scalability&lt;/h2&gt;
&lt;p&gt;Scalability is a property of systems that are able to handle an increase in requests without performance degradation, e.g., in terms of latency and/or throughput. In the context of a distributed system, scalability requires that requests are handled in parallel by multiple nodes.&lt;/p&gt;
&lt;p&gt;Note that there are multiple ways to distribute load across nodes. With a stateless system (or a system whose state can fit within a single machine&amp;#8217;s main memory), a simple way to increase scalability would be to use a high degree of replication (replicating the full service instance, allowing each replica to serve both reads and writes) and round-robin requests across multiple machines. In a system where state &lt;em&gt;does not&lt;/em&gt; fit in a single machine&amp;#8217;s main memory, scalability generally requires partitioning the data, i.e., a &lt;em&gt;shared-nothing&lt;/em&gt; architecture.&lt;/p&gt;
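&lt;p&gt;A minimal sketch of the two routing strategies (the node names and the modulo-based placement below are illustrative assumptions, not a recommendation):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib
from itertools import cycle

nodes = ['node-a', 'node-b', 'node-c']   # hypothetical cluster

# Stateless (or fully replicated) service: any node can serve any request.
round_robin = cycle(nodes)

def route_stateless(_request):
    return next(round_robin)

# Stateful, shared-nothing service: a key is always routed to the node
# that owns its partition.
def route_by_key(key, num_partitions=12):
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    partition = int(digest, 16) % num_partitions
    return nodes[partition % len(nodes)]

print(route_stateless('any request'))   # node-a, then node-b, ...
print(route_by_key('user:42'))          # always the same node for this key
&lt;/code&gt;&lt;/pre&gt;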
&lt;h3&gt;Soft state&lt;/h3&gt;
&lt;p&gt;In addition to stateful and stateless services, there are services that maintain soft state. &amp;#8220;Soft state&amp;#8221; is loosely defined as state with relaxed consistency semantics that is not critical to the core of the service, although it may be required for optional functionality (Chiappa, &lt;a href=&quot;http://mercury.lcs.mit.edu/~jnc/tech/hard_soft.html&quot;&gt;&amp;#8220;Soft and Hard State&amp;#8221;&lt;/a&gt;). There are several options for where the soft state could be stored: in the memory of local machines (which frequently implies using sticky sessions) or in a separate system, e.g., a distributed cache. The former may imply certain scalability and availability characteristics, e.g., the possibility of hot spots at the load balancer and the need for sessions to be restarted when service nodes fail; in the latter case, the availability and scalability properties of the separate stateful system carry over to the service itself.&lt;/p&gt;
&lt;h3&gt;Elasticity&lt;/h3&gt;
&lt;p&gt;Elasticity is a concern closely related to scalability: the ability to add or remove resources (in our case, nodes) to change a system&amp;#8217;s capacity without downtime. A scalable system may not always be elastic, e.g., if adding a node requires taking the system down, manually moving data around, reconfiguring the system, and then starting the system up again. In other words, a scalable system without elasticity would take a hit to its availability whenever nodes need to be added or removed.&lt;/p&gt;
&lt;h2&gt;Case study: a shared nothing database&lt;/h2&gt;
&lt;p&gt;Now that we&amp;#8217;ve looked at these concepts in the abstract, let&amp;#8217;s use an example: a shared nothing database. A shared nothing architecture means the nodes in the system don&amp;#8217;t share memory or disk: data resides independently on the nodes, which communicate over a network (Stonebraker, &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.5370&quot;&gt;The Case for Shared Nothing&lt;/a&gt;). The space of all possible primary keys is partitioned (a frequently used synonym for partitioning, especially when done at the application level, is &amp;#8220;sharding&amp;#8221;) using either hashing or range-based partitioning, such that one or more partitions can be assigned to a primary physical location.&lt;/p&gt;
&lt;p&gt;Since data is spread across several nodes, assuming a uniform key and request distribution, the system scales linearly across multiple nodes. It can also be made elastic by using consistent hashing and/or virtual partitions. For availability and reliability, different types of replication can be used, placing the data at multiple physical locations.&lt;/p&gt;
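&lt;p&gt;One way (among several) to get that elasticity is consistent hashing with virtual partitions; the sketch below uses an arbitrary hash function and virtual-partition count purely for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import bisect
import hashlib

def _point(value):
    return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)

class Ring(object):
    # Consistent hash ring with virtual partitions; a sketch, not production code.
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((_point('%s#%d' % (node, i)), node)
                           for node in nodes for i in range(vnodes))
        self.points = [point for point, _ in self.ring]

    def owner(self, key):
        # The first virtual partition clockwise from the key's position owns it.
        idx = bisect.bisect(self.points, _point(key)) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(['node-a', 'node-b', 'node-c'])
print(ring.owner('user:42'))
# Adding 'node-d' later only remaps the keys whose vnode ranges it takes over.
&lt;/code&gt;&lt;/pre&gt;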
&lt;p&gt;In case of independent failures, partitioning also provides &lt;em&gt;fault isolation&lt;/em&gt;: provided the system knows how to serve results from a partial dataset, only the partitions held by the failed nodes are affected.&lt;/p&gt;
&lt;h3&gt;What&amp;#8217;s next?&lt;/h3&gt;
&lt;p&gt;We&amp;#8217;re now left with an important series of questions, related to maintenance or recovery of availability (including maintaining latency) for the affected partitions in case of various failure and high-load scenarios.&lt;/p&gt;
&lt;p&gt;Various approaches and the systems that take them will be discussed in the next post: &amp;#8220;Alternatives to total transactional replication&amp;#8221;. As this detour ends and the journey continues, pay attention to how the various theoretical approaches and real-world systems work in situations such as:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Providing availability under failure. This shouldn&amp;#8217;t be seen as a simple either/or trade-off, but rather as a sliding scale, ranging from responses to simpler failures (non-correlated failures of individual nodes) to more complex ones (correlated failures, potentially of a majority of nodes, and split-brain scenarios)&lt;/li&gt;
	&lt;li&gt;Adding a new node to either expand capacity (elasticity) or take the place of a node that has failed (recovery), or recovering a node from a temporary failure&lt;/li&gt;
	&lt;li&gt;Handling high write throughput and contention&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The next post will also look at the impact (or, at times, non-impact) of scalability, availability and reliability (non-functional requirements) upon functional requirements such as support for ordered operations and atomicity.&lt;/p&gt;
&lt;h3&gt;Contributions&lt;/h3&gt;
&lt;p&gt;Thanks to Ted Nyman (&lt;a href=&quot;http://twitter.com/#!/tnm&quot;&gt;@tnm&lt;/a&gt;), Jeff Hodges (&lt;a href=&quot;http://twitter.com/jmhodges&quot;&gt;@jmhodges&lt;/a&gt;), Justin Sheehy (&lt;a href=&quot;http://twitter.com/justinsheehy&quot;&gt;@justinsheehy&lt;/a&gt;), &lt;a href=&quot;http://danweinreb.org/blog/&quot;&gt;Daniel Weinreb&lt;/a&gt;, &lt;a href=&quot;http://www.cs.berkeley.edu/~palvaro/&quot;&gt;Peter Alvaro&lt;/a&gt;, Dave Fayram (&lt;a href=&quot;http://twitter.com/KirinDave&quot;&gt;@KirinDave&lt;/a&gt;), &lt;a href=&quot;http://anil.recoil.org/&quot;&gt;Anil Madhavapeddy&lt;/a&gt;, &lt;a href=&quot;http://neilconway.org/&quot;&gt;Neil Conway&lt;/a&gt;  and C. Scott Andreas (&lt;a href=&quot;https://twitter.com/#!/cscotta&quot;&gt;@cscotta&lt;/a&gt;) for proof-reading and editing this post.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Replication, atomicity and order in distributed systems</title>
   <link href="http://afeinberg.github.com/2011/06/17/replication-atomicity-and-order-in-distributed-systems.html"/>
   <updated>2011-06-17T00:00:00-07:00</updated>
   <id>http://afeinberg.github.com/2011/06/17/replication-atomicity-and-order-in-distributed-systems</id>
   <content type="html">&lt;h1&gt;Replication, atomicity and order in distributed systems&lt;/h1&gt;
&lt;p&gt;Distributed systems are an increasingly important topic in Computer Science. The difficulty and immediate applicability of the topic are what make distributed systems rewarding to study and build.&lt;/p&gt;
&lt;p&gt;The goal of this post (and future posts on this topic) is to help the reader develop a basic toolkit they could use to reason about distributed systems. The toolkit should help the reader see the well known patterns in the specific problems they&amp;#8217;re solving, identify the cases where others have already solved the problems they&amp;#8217;re facing, and understand the cases where solving one hundred percent of the problem may not be worth the effort.&lt;/p&gt;
&lt;h2&gt;Leaving a Newtonian universe&lt;/h2&gt;
&lt;p&gt;For the most part, a single machine is a Newtonian universe: that is, we have a single frame of reference. As a result, we can impose a total &lt;em&gt;Happened-Before&lt;/em&gt; order on events i.e., we can &lt;em&gt;always&lt;/em&gt; tell that one event happened before another event. Communication can happen over shared memory, access to which can be synchronized through locks and memory barriers&lt;sup class=&quot;footnote&quot;&gt;&lt;a href=&quot;#fn1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p&gt;When we move to a client-server architecture, message passing is required. In the case of a single server (with one or more clients), we can still maintain an illusion of a Newtonian universe: &lt;span class=&quot;caps&quot;&gt;TCP&lt;/span&gt; (the transport layer used by popular application protocols) guarantees that packets will be delivered to the server in the order sent by the client. As we&amp;#8217;ll later see, this guarantee can be used as a powerful primitive upon which more complex guarantees can be built.&lt;/p&gt;
&lt;p&gt;However, there are reasons why we no longer want to run an application on a single server: in recent times it has become the consensus that &lt;strong&gt;reliability&lt;/strong&gt;, &lt;strong&gt;availability&lt;/strong&gt; and &lt;strong&gt;scalability&lt;/strong&gt; are best obtained using multiple machines. Mission critical applications must at least maintain reliability and availability; in the case of consumer (and even many enterprise) web applications, with success often come scalability challenges. Thus, it&amp;#8217;s inevitable that we leave Newton&amp;#8217;s universe and enter Einstein&amp;#8217;s&lt;sup class=&quot;footnote&quot;&gt;&lt;a href=&quot;#fn2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p class=&quot;footnote&quot; id=&quot;fn1&quot;&gt;&lt;sup&gt;1&lt;/sup&gt; This is not to belittle the fascinating challenges of building parallel shared memory systems: the topic is merely very well covered and outside of this post. I highly recommend &lt;em&gt;The Art of Multiprocessor Programming&lt;/em&gt; (by Maurice Herlihy) and &lt;em&gt;Java Concurrency In Practice&lt;/em&gt; (Goetz, Lea et al) to those interested in shared memory concurrency.&lt;/p&gt;
&lt;p class=&quot;footnote&quot; id=&quot;fn2&quot;&gt;&lt;sup&gt;2&lt;/sup&gt; The comparison with theory of relativity is not original: Leslie Lamport and &lt;a href=&quot;http://blogs.msdn.com/b/pathelland/&quot;&gt;Pat Helland&lt;/a&gt; have used this comparison. Several concepts in distributed systems such as Vector Clocks and Lamport Timestamps are explicitly inspired by relativity.&lt;/p&gt;
&lt;h2&gt;Intuitive formulation of the problem&lt;/h2&gt;
&lt;p&gt;Suppose we have a group of (physical or logical) nodes: perhaps replicas of a partition (aka a &lt;em&gt;shard&lt;/em&gt;) of a &lt;a href=&quot;http://en.wikipedia.org/wiki/Shared%20nothing&quot;&gt;shared nothing&lt;/a&gt; database, a group of workstations collaborating on a document or a set of servers running a stateful business application for one specific customer. Another group of nodes (which may or may not overlap with the first group of nodes) is sending messages to the first group. In the case of a collaborative editor, a sample message could be &amp;#8220;insert this line into paragraph three of the document&amp;#8221;. Naturally, we would like these messages delivered to all available machines in the first group.&lt;/p&gt;
&lt;p&gt;The question is: how do we ensure that, after the messages are delivered to all machines, the machines end up in the same state? In the case of our collaborative editor application, suppose Bob is watching Alice type over her shoulder, sees her type &amp;#8220;The&amp;#8221;, and then types &amp;#8220;quick brown fox&amp;#8221;: we&amp;#8217;d like all instances of the collaborative editor to say &amp;#8220;The quick brown fox&amp;#8221; and not &amp;#8220;quick brown fox The&amp;#8221;; nor do we want messages delivered multiple times, &lt;em&gt;e.g.,&lt;/em&gt; not &amp;#8220;The The quick brown fox&amp;#8221; and especially not &amp;#8220;The quick brown fox The&amp;#8221;!&lt;/p&gt;
&lt;p&gt;We&amp;#8217;d like (or, in many cases, require) that if one of the servers goes down, the accumulated state is not lost (reliability). We&amp;#8217;d also like to be able to view the state in the case of server failures (read availability) as well as continue sending messages (write availability). When a node fails, we&amp;#8217;d also like to be able to add a new node to take its place (restoring its state from other replicas). Ideally, we&amp;#8217;d like the latter process to be as dynamic as possible.&lt;/p&gt;
&lt;p&gt;All of this should have reasonable performance guarantees. In the case of the collaborative editor, we&amp;#8217;d like characters to appear on the screen seemingly immediately after they are typed; in the case of the shared nothing database, we&amp;#8217;d like to reason about performance not too differently from how we reason about single node database performance i.e., determined (in terms of both &lt;em&gt;throughput&lt;/em&gt; and &lt;em&gt;latency&lt;/em&gt;) primarily by the &lt;span class=&quot;caps&quot;&gt;CPU&lt;/span&gt;, memory, disks and ethernet. In many cases we&amp;#8217;d like our distributed systems to even perform better than analogous single node systems (by allowing operations to be spread across multiple nodes), especially under high load.&lt;/p&gt;
&lt;p&gt;The problem, however, is that &lt;em&gt;these goals are often contradictory&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;State machines, atomic multicast and consensus&lt;/h2&gt;
&lt;p&gt;An approach commonly used to implement this sort of behavior is &lt;a href=&quot;http://en.wikipedia.org/wiki/State%20machine%20replication&quot;&gt;state machine replication&lt;/a&gt;. This was first proposed by Leslie Lamport (also known as the author of LaTeX), in the paper &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks&quot;&gt;&lt;em&gt;Time, Clocks and the Ordering of Events in a Distributed System&lt;/em&gt;&lt;/a&gt;. The idea is that if we model each node in a distributed system as a state machine, and send the same input (messages) in the same order to each state machine, we will end up in the same final state.&lt;/p&gt;
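&lt;p&gt;A toy sketch of the idea, assuming the ordered log of commands has already been agreed upon (which is precisely the hard part discussed next):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class CounterStateMachine(object):
    # A deterministic state machine: the same commands, applied in the same
    # order, always produce the same final state.
    def __init__(self):
        self.value = 0

    def apply(self, command):
        op, amount = command
        if op == 'add':
            self.value += amount
        elif op == 'set':
            self.value = amount
        return self.value

# Agreeing on this order across nodes is what atomic broadcast / consensus
# provides; here the ordered log is simply given.
log = [('set', 10), ('add', 5), ('add', -3)]

replicas = [CounterStateMachine() for _ in range(3)]
for command in log:
    for replica in replicas:
        replica.apply(command)

assert len({replica.value for replica in replicas}) == 1   # all replicas agree
&lt;/code&gt;&lt;/pre&gt;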
&lt;p&gt;This leads to our next question: how do we ensure that the same messages are sent to each machine, in the same order? This problem is known as &lt;a href=&quot;http://en.wikipedia.org/wiki/Atomic%20broadcast&quot;&gt;&lt;em&gt;atomic broadcast&lt;/em&gt;&lt;/a&gt; or more generally &lt;em&gt;atomic multicast&lt;/em&gt;. We should take special care to distinguish this from the &lt;a href=&quot;http://en.wikipedia.org/wiki/IP%20multicast&quot;&gt;IP multicast protocol&lt;/a&gt;, which makes no guarantees about the order or reliability of messages: &lt;span class=&quot;caps&quot;&gt;UDP&lt;/span&gt;, rather than &lt;span class=&quot;caps&quot;&gt;TCP&lt;/span&gt;, is layered on top of it.&lt;/p&gt;
&lt;p&gt;A better way to view atomic multicast is as a special case of the publish-subscribe pattern (used by message queuing systems such as &lt;a href=&quot;http://activemq.apache.org&quot;&gt;ActiveMQ&lt;/a&gt;, &lt;a href=&quot;http://www.rabbitmq.com&quot;&gt;RabbitMQ&lt;/a&gt; and &lt;a href=&quot;http://sna-projects.com/kafka/&quot;&gt;Kafka&lt;/a&gt;, as well as by &lt;a href=&quot;http://en.wikipedia.org/wiki/Virtual%20synchrony&quot;&gt;Virtual Synchrony&lt;/a&gt; based systems such as &lt;a href=&quot;http://www.jgroups.org/&quot;&gt;JGroups&lt;/a&gt; and &lt;a href=&quot;http://www.spread.org/&quot;&gt;Spread&lt;/a&gt;&lt;sup class=&quot;footnote&quot;&gt;&lt;a href=&quot;#fn3&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;).&lt;/p&gt;
&lt;p&gt;A generalization of this problem is the distributed transaction problem: how do we ensure that either all the nodes execute the exact same transaction (executing all operations in the same order), or none do?&lt;/p&gt;
&lt;p&gt;Traditionally, the &lt;a href=&quot;http://en.wikipedia.org/wiki/Two%20phase%20commit&quot;&gt;two phase commit&lt;/a&gt; (2PC) algorithm has been used for distributed transactions. The problem with two phase commit is that it isn&amp;#8217;t fault tolerant: if the coordinator node fails, the process is blocked until the coordinator is repaired (&lt;a href=&quot;http://research.microsoft.com/apps/pubs/default.aspx?id=64636&quot;&gt;Consensus on Transaction Commit&lt;/a&gt;).&lt;/p&gt;
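&lt;p&gt;A minimal sketch of the coordinator&amp;#8217;s side of two phase commit; the participant methods are hypothetical, and the comments mark the point at which a coordinator crash leaves prepared participants blocked:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def two_phase_commit(coordinator_log, participants, tx):
    # Participants exposing prepare/commit/abort is an illustrative
    # assumption, not any real library's API.

    # Phase 1: ask every participant to prepare and vote.
    votes = [p.prepare(tx) for p in participants]
    decision = 'commit' if all(votes) else 'abort'

    # The coordinator durably logs its decision before telling anyone.
    # If it crashes here, prepared participants stay blocked, holding
    # locks, until the coordinator recovers.
    coordinator_log.append((tx, decision))

    # Phase 2: broadcast the decision to every participant.
    for p in participants:
        if decision == 'commit':
            p.commit(tx)
        else:
            p.abort(tx)
    return decision
&lt;/code&gt;&lt;/pre&gt;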
&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Consensus%20(computer%20science)&quot;&gt;&lt;em&gt;Consensus&lt;/em&gt;&lt;/a&gt; algorithms solve the problem of how multiple nodes can arrive at a commonly accepted value in the presence of failures. We can use a consensus algorithm to build fault tolerant distributed commit protocols by (this is somewhat of an over-simplification) having the nodes &amp;#8220;decide&amp;#8221; whether a transaction has been committed or aborted.&lt;/p&gt;
&lt;p class=&quot;footnote&quot; id=&quot;fn3&quot;&gt;&lt;sup&gt;3&lt;/sup&gt; Virtual synchrony (making asynchronous systems appear as synchronous) is itself a research topic that is closely related to and at times complemented by consensus work. Ken Birman&amp;#8217;s group at Cornell has done a great deal of work on it. Unfortunately, it was difficult to work much of this fascinating research into a high level blog post.&lt;/p&gt;
&lt;h3&gt;Theoretic impossibility, practical possibility&lt;/h3&gt;
&lt;p&gt;The problem is that it&amp;#8217;s impossible to construct a fault tolerant consensus algorithm that is guaranteed to terminate within a time bound in an asynchronous system lacking a common clock: this is known (after Fischer, Lynch and Paterson) as the &lt;a href=&quot;http://portal.acm.org/citation.cfm?doid=3149.214121&quot;&gt;&lt;span class=&quot;caps&quot;&gt;FLP&lt;/span&gt; impossibility result&lt;/a&gt;. Eric Brewer&amp;#8217;s &lt;a href=&quot;http://en.wikipedia.org/wiki/CAP%20theorem&quot;&gt;&lt;span class=&quot;caps&quot;&gt;CAP&lt;/span&gt; theorem&lt;/a&gt; (a &lt;a href=&quot;http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/&quot;&gt;well covered&lt;/a&gt; &lt;a href=&quot;http://codahale.com/you-cant-sacrifice-partition-tolerance/&quot;&gt;topic&lt;/a&gt;) can be argued to be an elegant and intuitive re-statement of the &lt;span class=&quot;caps&quot;&gt;FLP&lt;/span&gt; result.&lt;/p&gt;
&lt;p&gt;In practice, however, consensus algorithms can be constructed with reasonable liveness properties. The impossibility result does imply that consensus should be limited in its applications.&lt;/p&gt;
&lt;p&gt;One thing to note is that consensus protocols can typically handle simple or clean failures (failures of a minority of nodes), at the cost of greater latency: handling more complex scenarios (e.g., split-brain), where a &lt;a href=&quot;http://en.wikipedia.org/wiki/Quorum%20(Distributed%20Systems)&quot;&gt;quorum&lt;/a&gt; can&amp;#8217;t be reached, is more difficult.&lt;/p&gt;
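&lt;p&gt;A trivial illustration of the majority-quorum condition (not tied to any particular protocol):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def has_quorum(total_nodes, reachable_nodes):
    # A strict majority is required, so at most one side of a network
    # partition can ever make progress.
    majority = total_nodes // 2 + 1
    return reachable_nodes &amp;gt;= majority

print(has_quorum(5, 3))   # True: a minority of two failed nodes is tolerated
print(has_quorum(5, 2))   # False: the minority side cannot reach consensus
&lt;/code&gt;&lt;/pre&gt;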
&lt;h4&gt;Paxos and &lt;span class=&quot;caps&quot;&gt;ZAB&lt;/span&gt; (Chubby and ZooKeeper)&lt;/h4&gt;
&lt;p&gt;The &lt;a href=&quot;http://en.wikipedia.org/wiki/Paxos%20algorithm&quot;&gt;Paxos&lt;/a&gt; Consensus and Commit protocols are well known and are seeing greater production use. A detailed discussion of these algorithms is outside the scope of this post, but it should be mentioned that practical Paxos implementations have somewhat modified the algorithms to allow for greater liveness and performance.&lt;/p&gt;
&lt;p&gt;Google&amp;#8217;s &lt;a href=&quot;http://labs.google.com/papers/chubby.html&quot;&gt;Chubby&lt;/a&gt; service is a practical example of a Paxos based system. Chubby provides a file system-like interface and is meant to be used for locks, leases and leader elections. One example of use of Chubby (that will be discussed in further detail in the next post) is assigning mastership of partitions in a distributed database to individual nodes.&lt;/p&gt;
&lt;p&gt;Apache &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;ZooKeeper&lt;/a&gt; is another practical example of a system built on a Paxos-like distributed commit protocol. In this case, the consensus problem is slightly modified: rather than assume a purely asynchronous network, the &lt;span class=&quot;caps&quot;&gt;TCP&lt;/span&gt; ordering guarantee is &lt;a href=&quot;http://portal.acm.org/citation.cfm?id=1529978&quot;&gt;taken advantage of&lt;/a&gt;. Like Chubby, ZooKeeper exposes a file-system like &lt;span class=&quot;caps&quot;&gt;API&lt;/span&gt; and is frequently used for leader election, cluster membership services, service discovery and assigning ownership to partitions in shared nothing stateful distributed systems.&lt;/p&gt;
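&lt;p&gt;As an illustration, here is roughly what the ephemeral-sequential-node election recipe looks like, sketched with the kazoo Python client; the connection string and paths are placeholders, and watches and error handling are omitted:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')   # placeholder ensemble address
zk.start()
zk.ensure_path('/election')

# Each candidate creates an ephemeral, sequential znode; the lowest sequence
# number is the leader, and its znode disappears if that process dies.
my_node = zk.create('/election/candidate-', b'', ephemeral=True, sequence=True)

candidates = sorted(zk.get_children('/election'))
if my_node.split('/')[-1] == candidates[0]:
    print('acting as leader')
else:
    print('following; watch the candidate just ahead of us for deletion')
&lt;/code&gt;&lt;/pre&gt;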
&lt;h3&gt;Limitations of total transactional replication&lt;/h3&gt;
&lt;p&gt;A question arises: why is transactional replication only used for applications such as cluster membership, leader elections and lock managers? Why aren&amp;#8217;t these algorithms used for building distributed applications e.g., databases themselves? Wouldn&amp;#8217;t we all like a fully transactional, fault tolerant, multi-master distributed database? Wouldn&amp;#8217;t we like message queues that promise to deliver exactly the same messages, to exactly the same nodes, in exactly the same order, delivering each message exactly once at the exact same time?&lt;/p&gt;
&lt;p&gt;The above mentioned &lt;span class=&quot;caps&quot;&gt;FLP&lt;/span&gt; impossibility result provides one limitation of these systems: many practical systems require tight latency guarantees even in the face of machine and network failures. &lt;a href=&quot;http://research.microsoft.com/apps/pubs/default.aspx?id=68247&quot;&gt;&lt;em&gt;The Dangers of Replication and a Solution&lt;/em&gt;&lt;/a&gt; also discusses scalability issues, such as increases in network traffic and potential deadlocks, in what the authors call &amp;#8220;anywhere-anytime-anyway transactional replication&amp;#8221;.&lt;/p&gt;
&lt;p&gt;In the case of Chubby and ZooKeeper, this is less of an issue: in a well designed distributed system, cluster membership and partition ownership changes are less frequent than updates themselves (much lower throughput, less of a scalability challenge) and are less sensitive to latency. Finally, by limiting our interaction with consensus based systems, we are able to limit the impact of scenarios of where consensus can&amp;#8217;t be reached due to machine, software or network failures.&lt;/p&gt;
&lt;h2&gt;What&amp;#8217;s next?&lt;/h2&gt;
&lt;p&gt;The next post will look at common alternatives to total transactional replication as well as several (relatively recent) papers and systems that &lt;em&gt;do&lt;/em&gt; apply some transactional replication techniques at scale.&lt;/p&gt;</content>
 </entry>
 
 
</feed>