<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Peter Bailis: Highly Available, Seldom Consistent</title>
 <link href="http://bailis.org/blog/feed/" rel="self"/>
 <link href="http://bailis.org/blog"/>
 <updated>2016-05-01T10:02:22-07:00</updated>
 <id>http://bailis.org/blog/</id>
 <author>
   <name>Peter Bailis</name>
   <email>pbailis@cs.berkeley.edu</email>
 </author>

 
 <entry>
   <title>How To Make Fossils Productive Again</title>
   <link href="http://bailis.org/blog//how-to-make-fossils-productive-again/"/>
   <updated>2016-04-30T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//how-to-make-fossils-productive-again</id>
   <content type="html">&lt;p&gt;At NorCal Database Day 2016, I served on a panel titled &lt;a href=&quot;https://sites.google.com/site/norcaldbday2016/home/presentation-abstracts&quot;&gt;40+ Years of Database Research: Do We Have an Identity Crisis?&lt;/a&gt; What follows is a loose transcript of my talk, which I enjoyed writing and delivering and which I hope you enjoy reading.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;The title of my short talk today is “How Can We Make Fossils Productive Again?” and the central question of this panel is “Does the Database Community Have an Identity Crisis?”&lt;/p&gt;

&lt;p&gt;I say: “Sure! We’ve got a crisis. We’ve got an identity crisis. We’ve got a mid-life crisis. Our community is in all sorts of crises.”&lt;/p&gt;

&lt;p&gt;At the heart of our crises is the fact that this is literally the golden age for data. Things have never been better for data. Everyone wants a piece of data. Data is all over Silicon Valley, and Big Data, and Machine Learning, and all of these awesome tools, and yet much (or even maybe most) of the greatest, cutting-edge data-related research is not happening in the core database community. Much of the highest-impact data-related research appears in venues like OSDI and SOSP or NIPS and ICML. It’s like we’ve somehow lost control and are suddenly no longer the kings and queens of data. We’ve missed out on producing much of the seminal (and earliest) research in fields including Big Data, MapReduce and Hadoop, NoSQL, Cloud Computing, Graphs, Deep Neural Networks, and “Big ML.”&lt;/p&gt;

&lt;p&gt;(It’s not like these results are just coming out of the machine learning or statistics communities; they’re coming out of adjacent communities such as “systems.”)&lt;/p&gt;

&lt;p&gt;So, what’s up? Why &lt;em&gt;are&lt;/em&gt; we in crisis?&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Let’s consider a brief history of database research:&lt;/p&gt;

&lt;p&gt;Back in the 1970s, we were the cool kids! Codd came out with &lt;a href=&quot;https://scholar.google.com/scholar?cluster=1624408330930846885&quot;&gt;his relational model&lt;/a&gt;, and the earliest database researchers got super excited and said: “we’re going to run behind this crazy idea.” Back in the 1970s, we were the revolutionaries! Remember that the relational model wasn’t always &lt;em&gt;the&lt;/em&gt; way to do data management. &lt;a href=&quot;http://www.redbook.io/ch2-importantdbms.html&quot;&gt;System R&lt;/a&gt; was a batshit project where someone said “we’re going to take this &lt;em&gt;theory&lt;/em&gt; paper and throw 25+ people behind it and build an actual system,” and along the way at least two Turing awards and a number of groundbreaking concepts came out of it. Today, it’s easy to take this crazy bet for granted, but remember: we went all in on a really cutting-edge, guerrilla research agenda, and it paid off.&lt;/p&gt;

&lt;p&gt;Today, it’s 46 years later, and we have four Turing awards, many hundreds of billions of dollars of revenue, and tons of wonderful research to show for our big bet. However, despite these many successes, as this panel’s abstract suggests, we’re also at risk of ossification, at risk of becoming reactionary &lt;em&gt;fossils&lt;/em&gt;. The community narrative often runs along the lines of: “What is this ‘new’ stuff? What is this ‘NoSQL/Big Data/etc.’? This doesn’t look like a relational database or like conventional database research! This doesn’t fit my idea of a ‘database’!” In fact, many amazing ideas explored in database research over the last fifteen years were largely ignored because they were of the form: “here’s a great idea from [some interesting field like signal processing], but I have to &lt;em&gt;cram&lt;/em&gt; it inside of SQL!” Or, alternatively, we say: “Bah! We did this in the 1980s. Strong reject”; and then, just a few years later: “Oh wow, industry cares about this problem &lt;em&gt;and&lt;/em&gt; it turns out we overlooked a critical difference from the past. Let’s start publishing!”&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;So, as the panel abstract suggests, we may in fact be at risk of becoming reactionary fossils…&lt;/p&gt;

&lt;p&gt;…which brings us to the title of my talk: how do we make fossils productive again?&lt;/p&gt;

&lt;p&gt;The answer is simple: set them on fire and use them as fuel.&lt;/p&gt;

&lt;p&gt;We take the great ideas this community has developed and leverage them to push onward, upward, forward.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;How exactly are we going to use these fossils as fuel?&lt;/p&gt;

&lt;p&gt;I believe our role as a community is to discover and define the future standards for how data-intensive platforms and tools should look and operate.&lt;/p&gt;

&lt;p&gt;No other community has greater claim to data. We own the core idioms for reasoning about and dealing with data. We invented (or were instrumental in developing) critical concepts, including declarativity, query processing, materialized views, transaction processing, data storage, consistency, and scalability.&lt;/p&gt;

&lt;p&gt;As a result, we deserve and owe it to ourselves to bring these concepts to new domains, applications, and platforms. It’s time to reclaim our roots as systems revolutionaries, powered by the principles underlying our historical successes.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;I’m not going to tell you what to work on (and, in fact, I already spoke earlier today about what I think you should work on next: &lt;a href=&quot;http://arxiv.org/pdf/1603.00567.pdf&quot;&gt;analytic monitoring&lt;/a&gt;). Instead, as a young, irreverent professor (or, rather, professor-to-be-in-32-hours), I’ll offer four pieces of advice:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Kill the reference architecture and rethink our conception of “database.”&lt;/strong&gt; The article titled &lt;a href=&quot;https://scholar.google.com/scholar?cluster=11466590537214723805&quot;&gt;“Architecture of a Database System”&lt;/a&gt; should be considered harmful. If a system doesn’t have a buffer pool, it can still be a database, and, in fact, I’d prefer not to read any more papers on “databases” that have buffer pools. Instead, I’d prefer you shock me with your radical, new (and useful!) conception of a data management platform.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Solve new, emerging, real problems outside traditional relational database systems.&lt;/strong&gt; We’re in Northern California. This is NorCal DB Day. As far as I’m concerned, this is the center of the universe for data. If you go talk to real users or go and talk to people outside CS working at universities, it’s astounding to see the number of data-related tools that have yet to be written that don’t look anything like existing database engines. [This advice also holds for just about anyone with an Internet connection.]&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Use data-intensive tools, both the tools that you’re building and the tools that others have built.&lt;/strong&gt; Most data-intensive tools are &lt;em&gt;awful&lt;/em&gt; to use. NoSQL and Hadoop partly came out of the fact that setting up data infrastructure sucks. It’s often impossible to load data if you don’t know your schema, which sucks. Data warehouses also cost a lot of money, which sucks. Treat these things that suck as inspiration. The world isn’t just some giant tire fire; rather, the world is waiting to be filled with beautiful data management tools that you can design, build, and work to understand in your research.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Do bold, weird, and hard projects and actually follow through.&lt;/strong&gt; Just publishing a paper isn’t enough. Today, our prerogative as researchers goes beyond simply publishing one paper, or some CIDR vision paper that remains a vision unrealized. As a community, we have an opportunity to build real projects and products—especially via open source—that take on lives of their own beyond the basics of research and can also inform our research agenda going forward. Let’s recognize that opportunity and encourage the ambition it requires.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;p&gt;There are a number of recent projects that illustrate this advice. Three examples:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://people.ucsc.edu/~palvaro/&quot;&gt;Peter Alvaro&lt;/a&gt;’s work on &lt;a href=&quot;http://people.ucsc.edu/~palvaro/molly.pdf&quot;&gt;Molly and Lineage-Driven Fault Injection&lt;/a&gt; takes classic techniques from data provenance and applies them to understanding and debugging distributed systems and distributed systems failures. This research isn’t just a SIGMOD paper; Peter gave a &lt;a href=&quot;http://www.infoq.com/presentations/failure-test-research-netflix&quot;&gt;keynote in London&lt;/a&gt; about a month ago with his industrial collaborator describing how they productionized this work at Netflix, where it finds real bugs. Theory + practice, applied.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://cs.stanford.edu/people/chrismre/&quot;&gt;Chris Ré&lt;/a&gt;’s work on &lt;a href=&quot;http://deepdive.stanford.edu&quot;&gt;DeepDive&lt;/a&gt; illustrates how to build dark data extraction engines that match or exceed human quality. Like MacroBase, DeepDive is a new kind of data processing system that’s providing value by leveraging core concepts from databases such as probabilistic inference and weakly-consistent query processing. And it works on real problems, like combating human trafficking.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A recent project I wish the database community had done is &lt;a href=&quot;http://download.tensorflow.org/paper/whitepaper2015.pdf&quot;&gt;TensorFlow&lt;/a&gt; at Google, a distributed, scale-out Torch/Theano/Caffe-like model training framework. This is work we should have done in our community. [&lt;a href=&quot;http://www.cs.umd.edu/~thodrek/&quot;&gt;Theo&lt;/a&gt; interrupts and says Chris’s group is solving this problem. Since Chris is a &lt;a href=&quot;https://www.macfound.org/fellows/943/&quot;&gt;genius&lt;/a&gt;, I believe Theo.]&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;p&gt;Before I end, does anyone know who this person is?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;http://www.bailis.org/blog/post_data/2016-04-30/le_corbusier.jpg&quot; width=&quot;350&quot; alt=&quot;Le Corbusier&quot; /&gt;&lt;/p&gt;

&lt;p&gt;[No one knows.]&lt;/p&gt;

&lt;p&gt;This is one of my role models. This is Le Corbusier, a great architect who pioneered many of the ideas and designs in modern architecture that we know and love today. &lt;/p&gt;

&lt;p&gt;In the early 1920s, Le Corbusier and friends faced a similar problem to the one the database community faces today. Le Corbusier and friends felt that the practice of architecture had ossified and had become stifled by tradition, that the field was essentially dead.&lt;/p&gt;

&lt;p&gt;In response, Le Corbusier wrote a &lt;a href=&quot;https://en.wikipedia.org/wiki/Toward_an_Architecture&quot;&gt;beautiful book&lt;/a&gt; that translates to &lt;em&gt;Towards A New Architecture&lt;/em&gt;. Le Corbusier wrote, in effect, that “yes, the Parthenon is perhaps the most beautiful instance, the perfect example of a particular standard of architecture. The Parthenon may have achieved the platonic ideal of the standard of architecture we’ve previously established. But there are &lt;em&gt;many possible standards&lt;/em&gt; to acknowledge, each dependent on need and use. Standards are established &lt;em&gt;by experiment&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;To reiterate, and to quote directly, Le Corbusier showed us that “a standard is definitely established by experiment.” Le Corbusier went on to revolutionize our conceptions of architecture and design. Le Corbusier wasn’t always right, but he catalyzed the field.&lt;/p&gt;

&lt;p&gt;Today, relational database engines and the status quo in data management reflect an impressive standard. But there are many possible standards, suitable for different needs and uses. Today, it’s our time to experiment, to establish new standards.&lt;/p&gt;

&lt;p&gt;My name is Peter Bailis, and I’m putting my shoulder to the wheel.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>You Can Do Research Too</title>
   <link href="http://bailis.org/blog//you-can-do-research-too/"/>
   <updated>2016-04-24T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//you-can-do-research-too</id>
   <content type="html">&lt;p&gt;I was recently discussing gatekeeping and the process of getting started in CS research with a close friend. I feel compelled to offer a note.&lt;/p&gt;

&lt;p&gt;As a practicing academic researcher, I’m personally thrilled by the degree of excitement regarding CS research today in the broader technical community. Reading papers and doing research have always been favorite activities for me, and it’s tremendously heartening to see organizations like Papers We Love and its many members sharing their excitement as well. Research is a very human way to engage with our curiosity, and curiosity deserves cultivation, celebration, and sharing.&lt;/p&gt;

&lt;p&gt;To someone interested in learning more about research, I’d offer the following words of encouragement and advice:&lt;/p&gt;

&lt;h4 id=&quot;no-one-is-born-a-researcher&quot;&gt;No one is born a researcher&lt;/h4&gt;

&lt;p&gt;There is no prescribed way that a researcher has to look, act, or be. One of my closest colleagues started off doing technical support during the first dot-com boom with only an undergraduate degree in literature and no background in Computer Science. Today, my colleague is a tenure-track professor doing work I deeply respect and admire. Two other colleagues who are faculty at top-tier departments started their careers after emigrating as refugees, and each did their undergraduate work at non-traditional institutions. Another colleague recently started a Ph.D. after spending several years working closely with researchers while in industry, without an undergraduate degree. The CS academy is highly homogeneous, with a long way to go. But if you look closely enough, you may be surprised to find someone who looks and feels more like you than you might otherwise think exists.&lt;/p&gt;

&lt;h4 id=&quot;pedigree-and-privilege&quot;&gt;Pedigree and privilege&lt;/h4&gt;

&lt;p&gt;Granted, pedigree and privilege make many things much easier. But pedigree and privilege are not strictly necessary to do great research, and they are certainly not by themselves sufficient to do great research. Among other things, great research comes from curiosity, creativity, hard work, determination, some amount of brilliance, and many failures.&lt;/p&gt;

&lt;h4 id=&quot;some-concrete-suggestions&quot;&gt;Some concrete suggestions&lt;/h4&gt;

&lt;ol&gt;
  &lt;li&gt;Read papers. &lt;/li&gt;
  &lt;li&gt;Discuss them with friends, or strangers.&lt;/li&gt;
  &lt;li&gt;Go to your local Papers We Love meetup, or start your own chapter.&lt;/li&gt;
  &lt;li&gt;Try implementing what you read. Open sourcing and sharing your code is even better. There’s a good chance your code is better written than the code used for the paper.&lt;/li&gt;
  &lt;li&gt;If you get confused, check Wikipedia. Almost everyone reads Wikipedia, including professors. In addition, consider checking out a book from your local library. Textbooks are great guides to new and unfamiliar areas.&lt;/li&gt;
  &lt;li&gt;Ask yourself: how would I improve this research if I did it myself?&lt;/li&gt;
  &lt;li&gt;Ask yourself: how could I produce a follow-on project?&lt;/li&gt;
  &lt;li&gt;Blog about what you’ve learned and about questions like these.&lt;/li&gt;
  &lt;li&gt;Spend some time trying to improve and follow on to your favorite paper(s). Write some code. Run an experiment. Or try to find a big-O improvement.&lt;/li&gt;
  &lt;li&gt;Email authors if you have questions (or want to express your excitement about their work). If they happen not to respond, don’t worry; research makes for busy schedules, and I’d bet that your polite enthusiasm may have made their day a little bit brighter.&lt;/li&gt;
  &lt;li&gt;Attend research seminars. Your local university should have at least one public talk series; if not, watch talks online.&lt;/li&gt;
  &lt;li&gt;If you’re in school, make sure to attend office hours. Ask questions. Enroll in a graduate-level seminar, ideally one focused on reading and discussing papers.&lt;/li&gt;
  &lt;li&gt;If you’re not in school, still ask questions. Seek out a community that can help you find answers. If you don’t know of a community, I suggest reaching out on social media like Twitter.&lt;/li&gt;
  &lt;li&gt;Don’t get discouraged.&lt;/li&gt;
  &lt;li&gt;Trust in your curiosity.&lt;/li&gt;
  &lt;li&gt;Be respectful, but be bold.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&quot;some-final-thoughts&quot;&gt;Some final thoughts&lt;/h4&gt;

&lt;p&gt;Make no mistake: getting started in research is hard. The above steps aren’t enough. Getting started in research requires perseverance and will over time require many people to make investments in you and your success. But you can be the first person to make an investment—by learning, educating yourself, doing hard things, and beginning to develop your skills. You have much more agency and power than you may believe.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Lean Research</title>
   <link href="http://bailis.org/blog//lean-research/"/>
   <updated>2016-02-20T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//lean-research</id>
   <content type="html">&lt;p&gt;Recently, my favorite questions for myself and my students have been: what hypothesis are you currently testing, what is your goal in testing it, are you testing it as efficiently as possible?&lt;/p&gt;

&lt;p&gt;A research project ought to consist of multiple hypotheses, at multiple scales. You want at least one vaguely defined hypothesis regarding the space you’re working in; this could take years to test and could become a dissertation. You also want a working hypothesis that you can test in a few weeks or less. In-between, you probably want some intermediate hypotheses that are testable in a few months to a year. Generally, nearer-term hypotheses ought to be more concrete, while longer-term hypotheses ought to be more loosely held. Seek both.&lt;/p&gt;

&lt;p&gt;These hypotheses matter because, in systems research, great work is often achieved via rapid iteration, by repeated formulation and testing of smaller hypotheses in service of a set of larger goals. Failing fast allows you to hone your understanding of a problem and continually evolve your set of facts and beliefs about it. You should expect your hypotheses to change over time while courageously pursuing and refining them at multiple levels.&lt;/p&gt;

&lt;p&gt;The alternative is to sink an enormous amount of time and energy into a project that may or may not go anywhere. Given that time is the most scarce, least fungible resource in research, it’s in your interest to fail fast, learning along the way.&lt;/p&gt;

&lt;p&gt;Complaints about a lack of “long-term” research often conflate short-term research taste with short-term execution. This is a mistake. Many projects never have a chance of real success because they don’t aim high enough. However, it’s possible to aim high while iterating quickly. Many great projects I admire were the result of a series of small steps, executed in service of a larger, worthwhile vision. It’s easy to focus only on big outcomes and forget this process.&lt;/p&gt;

&lt;p&gt;So, if your hypotheses are wrong, will you know as soon as possible? If you’re right, will the grand payoff be worthwhile? As an advisor, one of my goals is to encourage audacious projects while helping break them down into manageable, intermediate steps.&lt;/p&gt;

&lt;p&gt;Ultimately: &lt;em&gt;think big in the long term, and think small in the short term. Have a loosely held plan for what comes between the two.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Call this “lean research.”&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>I Loved Graduate School</title>
   <link href="http://bailis.org/blog//i-loved-graduate-school/"/>
   <updated>2016-01-01T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//i-loved-graduate-school</id>
   <content type="html">&lt;p&gt;It’s increasingly in vogue to complain about graduate school and the state of higher education. Now that I’ve officially graduated, I want to come out and say, simply: I loved graduate school. Pursuing a Ph.D. was one of the best decisions I’ve ever made. Here’s why:&lt;/p&gt;

&lt;h4 id=&quot;growth&quot;&gt;Growth&lt;/h4&gt;

&lt;p&gt;Getting a Ph.D. is an extreme growth experience. In four to six (or more) years, you are expected to become a world expert in a topic, produce original results, and communicate them to your field. Becoming an expert in anything is hard. But consider this: by the time you’re done with your dissertation, you’ll have answered questions no one ever answered (or possibly asked) before. Now, the questions may be narrowly scoped, and maybe only you or your dissertation committee care about the answers. But those possibilities don’t diminish the fact that you’ve made a difference, an original contribution.&lt;/p&gt;

&lt;p&gt;If you haven’t seen it, read Matt Might’s &lt;a href=&quot;http://matt.might.net/articles/phd-school-in-pictures/&quot;&gt;Illustrated Guide to a Ph.D.&lt;/a&gt; It’ll take you less than five minutes. Seriously, &lt;a href=&quot;http://matt.might.net/articles/phd-school-in-pictures/&quot;&gt;read it now&lt;/a&gt; if you haven’t already. &lt;/p&gt;

&lt;p&gt;I don’t know of any other opportunity where you, as a person of any age, whether 18 or 85, are given the freedom and the opportunity to so boldly venture into the unknown, for the sake of the unknown. To finish a Ph.D., you have to: &lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Understand the state of the art in your field (by learning the fundamentals, seminal results, and latest developments, and learning to read research papers).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Perform original research (by learning the methodology of your field, often by building your own instruments and/or learning to use existing ones, formulating new research questions, designing experiments, analyzing the results, and repeating).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Communicate the results to your field (by writing papers, or at least a dissertation, probably by giving several talks along the way, and defending the relevance of your contributions to your field and peers).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This all requires developing a highly diverse skill set in a very short time period. This is hard. It’s one of the hardest things I know to do. Getting a Ph.D. requires creativity, courage, a hefty dose of stubbornness, and a lot of hard work. People will help you, as they helped me, by providing wisdom and guidance and training, but it’s ultimately up to you to do the original research. Again, at the end of the day, you are responsible for answering a question (or several questions) that literally no one else has answered before.&lt;/p&gt;

&lt;p&gt;As a result, I personally found the Ph.D. process to be immensely rewarding. My favorite memories of graduate school are of staying up late, jittery and unwilling to sleep because I desperately wanted to know the answer to a question that came up earlier in the day. I wanted to learn, and it was up to me to find the answer. Without exaggeration, I found (and still find) the questioning and answering part of the research process – really, a core aspect of research – to be exhilarating.&lt;/p&gt;

&lt;p&gt;Beyond this core process, handling all of the other aspects of graduate school requires growth in different dimensions: writing, public speaking, strategy, management, and sometimes politics. I was constantly being stretched. After a big submission or presentation, I’d often look back and be surprised at how far I’d come in the preceding months. Looking back on several years, I’m proud of what I’ve learned and what I’ve worked to accomplish. No dissertation is perfect, but I think many feel similarly about what their dissertations represent in terms of their own personal growth.&lt;/p&gt;

&lt;h4 id=&quot;community&quot;&gt;Community&lt;/h4&gt;

&lt;p&gt;As I’ve hinted, along the way towards a Ph.D., you’re surrounded by other bright, courageous, curious individuals, including students, research staff, and faculty. Getting to nerd out all day on deeply technical ideas with other deeply technical people who share your intellectual passions is insanely fun. When I was writing my dissertation acknowledgements, the word that came to mind to describe my time with colleagues in graduate school was “joyful.” I met several of my best friends in graduate school. Maybe because no one else in my program was studying the specific topic that I chose to study, I rarely felt like I was in a competition with the other students. Departments and academic communities have personalities, but they also have people, many of whom are just as excitable as you and are excited about the same topics that you are.&lt;/p&gt;

&lt;h4 id=&quot;freedom&quot;&gt;Freedom&lt;/h4&gt;

&lt;p&gt;Another amazing aspect of graduate school is that, provided you can find someone (or some department) to support you, you can work on almost anything you want. I think it’s wonderful that people can do dissertations on any number of topics, however practical or theoretical or obscure. The latitude afforded by academic freedom is one of the most beautiful parts of our modern social contract. Funding can be a challenge, but, nevertheless, there are few other institutions that permit such freedom today.&lt;/p&gt;

&lt;p&gt;In computer science, we’re especially fortunate to have a huge number of interesting problems at the intersection of cutting-edge research and issues in current practice. There’s good reason why I thank 25 industry practitioners by name in my dissertation acknowledgments: they provided feedback, insight, and inspiration during my dissertation research. It was so much fun to work on research, present it to people who materially cared about the results, then improve the research as a result. At least in computer science, I think this process is much rarer than it needs to be.&lt;/p&gt;

&lt;h4 id=&quot;opportunity-and-opportunity-cost&quot;&gt;Opportunity and Opportunity Cost&lt;/h4&gt;

&lt;p&gt;Make no mistake: graduate school has a high opportunity cost. In computer science, you’re likely giving up over $100,000 per year to be in school. In expectation, in monetary terms, a Ph.D. in Computer Science is not worth your time.&lt;/p&gt;

&lt;p&gt;That said, with a Ph.D. in Computer Science, you can do many things:&lt;/p&gt;

&lt;p&gt;The academic job market is extremely challenging. However, the market in computer science is much better than in many other fields. As I was told early on in my career, you can’t bet on an academic position by any stretch, but you can try.&lt;/p&gt;

&lt;p&gt;You can do research in an industrial lab, such as Microsoft Research, VMware Research, Samsung Research, IBM Research, or HP Research.&lt;/p&gt;

&lt;p&gt;You can also start a company, like Andy Konwinski, Patrick Wendell, Reynold Xin, and Matei Zaharia (Databricks), Haoyuan Li (Tachyon Nexus), Ben Hindman (Mesosphere), Sean Kandel (Trifacta), Kuang Chen (Captricity), Joey Gonzalez (Dato), Adam Oliner (Kuro Labs, acquired by Salesforce), and many others did over the past few years (or, a little further back, the students who started Google, VMware, Tableau, and Nicira). If you work in a practical area, you can use your research as the basic technology behind a commercial venture. Moreover, some of your smart, creative colleagues might want to start something too.&lt;/p&gt;

&lt;p&gt;If the entrepreneurial life doesn’t call your name, there will almost certainly be a Google or Facebook that is hiring (along with a number of companies of intermediate sizes), especially if you’re good and you’ve kept up on practical skills. In fact, you might want to work for a larger, established company anyway; there are fun problems to solve there, too.&lt;/p&gt;

&lt;p&gt;More generally, a Ph.D. in Computer Science gives you a skill set that can serve you well in industry, wherever you end up. You may not be the best engineer to start, but your critical reasoning skills, communication skills, and discipline can be real assets. In addition, you might end up on a more specialized or technical team than you might have otherwise.&lt;/p&gt;

&lt;p&gt;Of course, maybe you’d prefer to start a restaurant or become an artisan shoemaker. There’s no one stopping you, just like there was no one stopping you before; you just happen to have a Ph.D. now, and, chances are, you learned a ton along the way.&lt;/p&gt;

&lt;p&gt;Maybe not all of the lessons you learned during your Ph.D. are immediately, explicitly marketable. But was that the point? Especially if you read this post before starting a Ph.D., I hope your answer is “no.”&lt;/p&gt;

&lt;h4 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h4&gt;

&lt;p&gt;I recognize that I’m fortunate. Extremely fortunate. As a young, white male without debt or dependents in one of the most rapidly growing fields in academia, I’m extraordinarily, unbelievably privileged. In addition, I didn’t come this far alone. I had a fantastic team of mentors and advisors beginning in high school and college who believed in me and taught me to believe in myself. More broadly, academia is not a perfect place. Like many other human institutions, academia has systemic flaws that demand greater attention and improvement.&lt;/p&gt;

&lt;p&gt;While acknowledging all of these things, my statement stands: I loved graduate school. N=1 says graduate school can be a challenging, rewarding, and fulfilling experience, and I believe many of my former graduate student friends and colleagues would agree. As a professor, I don’t expect all of my students to love graduate school. Instead, I hope to help cultivate their curiosities, to empower them to ask and answer their own questions. Maybe some of them will love it too.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>NSF Graduate Research Fellowship: N=1 Materials for Systems Research</title>
   <link href="http://bailis.org/blog//nsf-graduate-research-fellowship-materials-for-systems-research/"/>
   <updated>2015-09-03T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//nsf-graduate-research-fellowship-materials-for-systems-research</id>
   <content type="html">&lt;p&gt;The &lt;a href=&quot;http://www.nsfgrfp.org/&quot;&gt;National Science Foundation Graduate Research Fellowship&lt;/a&gt; (NSF GRFP, or “the NSF”) is one of several fellowships for graduate study in the sciences. If you’re a graduate student in STEM or are applying for STEM-related graduate school in the United States, it’s a great idea to apply for the NSF, which is due in &lt;em&gt;late October&lt;/em&gt; this year. In a nutshell, fellowships are useful because they give you flexibility and freedom by providing you with independing funding (and are also fairly prestigious).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://transientneha.blogspot.com/2007/07/how-to-apply-for-nsf-graduate-research.html&quot;&gt;Neha Narula&lt;/a&gt;, &lt;a href=&quot;http://jxyzabc.blogspot.com/2008/08/cs-grad-school-part-3-fellowships.html&quot;&gt;Jean Yang&lt;/a&gt;, and &lt;a href=&quot;http://www.pgbovine.net/fellowship-tips.htm&quot;&gt;Philip Guo&lt;/a&gt; have all already written wonderful guides for applicants in Computer Science. If you aren’t already in graduate school, getting your materials together is a good exercise that’ll make you think hard about why you want to go to graduate school and what you want to do. There’s a lot of overlap between typical fellowship application materials and graduate application materials, so, if you’re also applying to Ph.D. programs, doing fellowship applications saves you work in November and December.&lt;/p&gt;

&lt;p&gt;In addition to adding yet another pointer to the above guides, I wanted to provide my own N=1 example of a successful NSF application. While there are several &lt;a href=&quot;https://www.google.com/search?q=sample+nsf+grfp+research+statement&quot;&gt;good examples&lt;/a&gt; of successful NSF proposals online, I’m not aware of any in software systems or databases. I’ve been meaning to share these for a while but keep forgetting, so, finally, I’m posting my materials from 2011 below:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.bailis.org/blog/post_data/2015-09-03/pbailis-nsf-grfp-proposal.pdf&quot;&gt;Research Proposal&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.bailis.org/blog/post_data/2015-09-03/pbailis-nsf-grfp-personal.pdf&quot;&gt;Personal Statement&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.bailis.org/blog/post_data/2015-09-03/pbailis-nsf-grfp-research.pdf&quot;&gt;Research Experience&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most useful part of these materials is probably the research proposal. As an applicant in systems research, I remember having a difficult time thinking about how to scope my proposal. In retrospect, I think it’s reasonable to propose a project that, if executed, could lead to one or more solid papers in a top-tier venue like SIGMOD or SOSP, with a clear and well-defined idea, or “nugget” of novelty. As a starting point, I think a good heuristic for problem selection is to read the proceedings of the top-tier conferences in your area, figure out some interesting “hot” or open questions, and ask: how could I do better? While it’s intimidating to read the recent research literature and even &lt;em&gt;consider&lt;/em&gt; the possibility of improving it, be bold, and propose something!&lt;/p&gt;

&lt;p&gt;In my proposal, I focused on a problem that was both fascinating to me and very “hot” at the time: many-core operating system scalability. A number of papers proposing new many-core OS architectures were coming out at the time, but none of them were clear winners. When I wrote my proposal, I was in love with the somewhat vintage idea of capabilities, and I thought they would enable an interesting architecture. Were capabilities &lt;em&gt;actually&lt;/em&gt; a good idea for shared-nothing multi-core? Possibly, but a definitive answer would require… research! At the proposal stage, a research project should ask some non-obvious questions, with a clear strategy for evaluating them and a clear payoff for doing so. Although my graduate research ultimately went in a different direction, I would still love to see the questions in this proposal answered!&lt;/p&gt;

&lt;p&gt;In terms of evaluation, my application received relatively positive reviews, and I’m immensely grateful for the support that the NSF has given me during my graduate career. However, as a caveat, there were a few areas I could have improved on. From my reviews: “Some comments on interactions with other students are missing; also, there are no comments on what was involved in regard to the applicant’s TA fellowship. Also no comments are provided as to the impact of the research work proposed.”; “The applicant does not mention any outreach efforts. Involving more broader activities would help strengthen the application.”; “not much information is available to evaluate other broader impacts, including promoting diversity, integration education and research, etc.” Broader impacts criteria have changed since I applied, but, in the context of a research proposal in software systems, they’re challenging to convey and are therefore worth extra effort!&lt;/p&gt;

&lt;p&gt;So, best of luck, and, as Neha &lt;a href=&quot;http://transientneha.blogspot.com/2007/07/how-to-apply-for-nsf-graduate-research.html&quot;&gt;points out&lt;/a&gt;, not applying gives you a 0% chance!&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Worst-Case Distributed Systems Design</title>
   <link href="http://bailis.org/blog//worst-case-distributed-systems-design/"/>
   <updated>2015-02-03T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//worst-case-distributed-systems-design</id>
   <content type="html">&lt;p&gt;Designing distributed systems that handle worst-case scenarios gracefully can—perhaps surprisingly—improve average-case behavior as well.&lt;/p&gt;

&lt;p&gt;When designing &lt;a href=&quot;http://henryr.github.io/cap-faq/&quot;&gt;CAP “AP”&lt;/a&gt; (or &lt;a href=&quot;http://www.bailis.org/papers/ca-vldb2015.pdf&quot;&gt;coordination-free&lt;/a&gt;) systems, we typically analyze their behavior under worst-case environmental conditions. We often make &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.8727&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;statements of the form&lt;/a&gt; “in the event of arbitrary loss of network connectivity between servers, every request received by a non-failing server in the system will result in a response.” These claims pertain to behavior under worst-case scenarios (“arbitrary loss of network connectivity”), which is indeed useful (e.g., DBAs don’t get paged when a network partition strikes). However, taken at face value, this language can lead to confusion (e.g., “in my experience, &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2655736&quot;&gt;network failures don’t occur&lt;/a&gt;, so why bother?”). More importantly, this language tends to obscure a more serious benefit of this kind of distributed systems design.&lt;/p&gt;

&lt;p&gt;When we design coordination-free systems that don’t have to communicate in the worst case, we’re designing systems that don’t have to communicate in the average case, either. If my servers don’t have to communicate to guarantee a response in the event of network partitions, they &lt;em&gt;also&lt;/em&gt; don’t have to pay the cost of a round-trip within a datacenter or, even better, across geographically distant datacenters. I can add more servers and make use of them—without placing additional load on my existing cluster! I’m able to achieve effectively indefinite scale-out—simply because my servers never &lt;em&gt;have&lt;/em&gt; to communicate on the fast-path.&lt;/p&gt;

&lt;p&gt;This notion of worst-case-improves-average-case is particularly interesting because designing for the worst case doesn’t always work out so nicely. For example, when I bike to my lab, I put on a helmet to guard against collisions, knowing that my helmet will help in &lt;em&gt;some&lt;/em&gt; but not &lt;em&gt;all&lt;/em&gt; situations. But my helmet is no real match for a true worst case—say, a large meteor, or maybe just an eighteen-wheeler. To adequately defend myself against an eighteen-wheeler, I’d need more serious protection that’d undoubtedly encumber my bicycling. By handling the worst case, I lose in the average case. In fact, this pattern of worst-case-degrades-average-case is common, particularly in the real world: consider automotive design, building architecture, and processor design (e.g., for thermal, voltage, and process variations).&lt;sup id=&quot;fnref:design&quot;&gt;&lt;a href=&quot;#fn:design&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; Often, there’s a pragmatic trade-off between how much we’re willing to pay to handle extreme conditions and how much we’re willing to pay in terms of average-case performance.&lt;/p&gt;

&lt;p&gt;So why do distributed systems exhibit this pattern? One possibility is that networks are unpredictable and, in the field, pretty terrible to work with. Despite exciting advances in networking research, we still don’t have reliable SLAs from our networks. A line of research in distributed computing has asked what we &lt;em&gt;could&lt;/em&gt; do if we had better-behaved networks (e.g., with bounded delay)—but we (still) don’t yet.&lt;sup id=&quot;fnref:forward&quot;&gt;&lt;a href=&quot;#fn:forward&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; Given the inability to (easily and practically) distinguish between message delays, link failures, and (both permanent and transient) server failures, we do well by assuming the worst. Essentially, the defining feature of our distributed systems—the network—encourages and rewards us for minimizing our reliance on it.&lt;/p&gt;

&lt;p&gt;Over time, I’ve gained a greater appreciation for the subtle power of this worst-case thinking. It’s often instrumental in determining the &lt;em&gt;fundamental&lt;/em&gt; overheads of a given design rather than superficial (albeit important) differences in implementations or engineering quality.&lt;sup id=&quot;fnref:examples&quot;&gt;&lt;a href=&quot;#fn:examples&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; It’s a clean (and often elegant) way to reason about system behavior and is a useful tool for systems architects. We’d do well by paying more attention to this pattern while fixating less on failures. Although formulations such as &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;Abadi’s PACELC&lt;/a&gt; are a step in the right direction, &lt;a href=&quot;http://www.bailis.org/blog/bridging-the-gap-opportunities-in-coordination-avoiding-databases/&quot;&gt;the connections between latency, availability and scale-out performance&lt;/a&gt; deserve more attention.&lt;/p&gt;

&lt;h4 id=&quot;asides&quot;&gt;Asides&lt;/h4&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:design&quot;&gt;
      &lt;p&gt;There’s an interesting related challenge (design theme/meme?) regarding how to design systems that behave sanely in the worst case but take advantage of the gap between average-case and worst-case conditions (when it exists). I used to hang out with computer architects, and there’s some cool work on this topic in their community (e.g., &lt;a href=&quot;http://web.eecs.umich.edu/~tnm/trev_test/papersPDF/2005.01.Opportunities%20and%20challenges%20for%20better%20than%20worst-case%20design.pdf&quot;&gt;“Better than Worst-Case design”&lt;/a&gt;). One of my favorite papers is called &lt;a href=&quot;http://www.cs.utah.edu/~rajeev/cs7810/papers/diva99.pdf&quot;&gt;DIVA&lt;/a&gt; and employs a dirt-simple, in-order co-processor to verify the computations of a less reliable primary processor, which employs all sorts of microarchitectural bells and whistles to speed up the primary computation. DIVA only “pays” for errors (worst-case, due to, say, noise, process variation, bugs in the processor design) when they occur (and are subsequently caught by the checker, which is much easier to verify and build correctly). (Fun fact, from Footnote 1: “This [single-author paper] was performed while the author was (un)employed as an independent consultant in late May and June of 1999.”) Some of my favorite algorithms (selfishly, e.g.,  &lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;RAMP&lt;/a&gt;) have this flavor, where we’re only penalized when a bad thing occurs (in the case of RAMP-Fast, readers only pay—using a second RTT with servers to repair missing values—when there is actually a race). &lt;a href=&quot;#fnref:design&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:forward&quot;&gt;
      &lt;p&gt;Lest I come off as grumpy: I am actually very excited about this possibility, but it’s a challenging chicken-and-egg problem. Which comes first: better hardware, or applications that could make use of the hardware if the hardware existed? I think we can go either way, but I’ll note that (in the latter scenario), there are troves of brilliant ideas in the distributed computing and database literature that are lying in wait for better hardware. As one of my favorite examples, Barbara Liskov’s PODC 1991 keynote on &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=112601&quot;&gt;“Practical Uses of Synchronized Clocks in Distributed Systems”&lt;/a&gt; has a bunch of great ideas (and associated papers). Now that Barbara’s former student has managed to convince Google to add atomic clocks to their datacenters, these algorithms can be efficiently implemented at scale. I think academics have a serious role to play in this conversation, and this type of co-design is on my research agenda. But there’s probably lower-hanging fruit for practitioners today. &lt;a href=&quot;#fnref:forward&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:examples&quot;&gt;
      &lt;p&gt;Three scenarios where this kind of analysis has useful implications:&lt;/p&gt;

      &lt;p&gt;a.) Whenever someone claims their coordinated (e.g., serializable) system achieves unilateral performance or availability with a coordination-free design, we know they’re either misguided, are &lt;a href=&quot;http://www.bailis.org/blog/without-conflicts-serializability-is-free/&quot;&gt;not exercising an interesting workload&lt;/a&gt;, and/or are simply being &lt;a href=&quot;http://smalldatum.blogspot.com/2014/06/benchmarketing.html&quot;&gt;misleading&lt;/a&gt;. In these cases, we know that, regardless of how the system is implemented, because they’re choosing to implement a fundamentally “hard” problem (e.g., serializability, atomic commitment, or maybe just a consensus register) that requires coordination to prevent some “bad” execution, there’s going to be a cost associated. This is effectively the converse of our thesis: if you have to pay in the worst case, you (may) have to pay in the average case. Moreover (and perhaps most importantly), we don’t have to point fingers and argue about someone’s code or implementation—in many cases, there is literally &lt;em&gt;no way&lt;/em&gt; to implement a better system, and we can prove that there’s no use in trying to do so. (Of course, there’s plenty of fun [and money] in building fast databases, but the point is that we can separate out what’s fundamentally new—which also happens to be a key criterion for good research—and what’s simply built better.)&lt;/p&gt;

      &lt;p&gt;b.)  “Scale out” versus “scale up” systems design is &lt;a href=&quot;http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html&quot;&gt;a hot topic&lt;/a&gt;—should we build (often more sophisticated) single-node systems or build simpler systems that are geared toward horizontal scalability? These two are often opposed, and proponents of each strategy will argue that the other is misguided for some particular workload or system implementation. Yet, couched in the terms of worst-case systems design, advances in single-node performance simply serve to increase the capacity of each node within a cluster. Making each node in a coordination-free system two times faster via “scale up” techniques makes the whole cluster two times faster. While the conversation is typically “scale out” &lt;em&gt;versus&lt;/em&gt; “scale up,” if we’re coordination-free, we get to choose “scale out” &lt;em&gt;while&lt;/em&gt; “scaling up.” This may seem obvious, but consider the fact that speeding up single-node performance does &lt;em&gt;not&lt;/em&gt; always, for example, improve the throughput of distributed transaction processing if (ha!) you are bottlenecked on the network.&lt;/p&gt;

      &lt;p&gt;c.) As hardware changes, the constants associated with communication costs may change—thus affecting the relative benefit of our worst-case designs—but the fundamental gains likely won’t. For example, for decades we’ve been running shared-everything architectures on single-node servers. However, there’s mounting evidence that a shared-nothing approach is desirable in a multi- and (especially) many-core setting. Andy Pavlo and friends recently did &lt;a href=&quot;http://www.cs.cmu.edu/~pavlo/static/papers/p209-yu.pdf&quot;&gt;a study of OLTP scalability to 1000 cores&lt;/a&gt; (in simulation, with some folks at MIT). They found that—regardless of implementation—serializability tanks in performance (due to a variety of factors, depending on the algorithm, but—I’d argue—ultimately due to the fundamental costs of providing serializability). In the distributed setting, today’s datacenter networks make serializable transaction performance an easy target to beat—yet, even as network performance improves and performance bottlenecks lessen in severity, we’ll still see benefits. (An argument I like to make is that, despite DRAM accesses being ridiculously fast—at least relative to remote memory accesses or RPCs—we still spend a lot of time devising cache-friendly algorithms—albeit not for every application, but for many that matter.) &lt;a href=&quot;#fnref:examples&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>When Does Consistency Require Coordination?</title>
   <link href="http://bailis.org/blog//when-does-consistency-require-coordination/"/>
   <updated>2014-11-12T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//when-does-consistency-require-coordination</id>
   <content type="html">&lt;p&gt;My coauthors and I recently &lt;a href=&quot;http://www.vldb.org/pvldb/vol8/p185-bailis.pdf&quot;&gt;published a paper&lt;/a&gt; (that will appear at &lt;a href=&quot;http://www.vldb.org/2015/&quot;&gt;VLDB 2015&lt;/a&gt;) answering one of my longest standing research questions: when does consistency require coordination? It’s well known that many “strong” properties like “ACID” &lt;a href=&quot;http://www.bailis.org/blog/linearizability-versus-serializability/&quot;&gt;serializability and linearizability&lt;/a&gt; are &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2462076&quot;&gt;not achievable&lt;/a&gt; without coordination, or synchronous communication between concurrent operations. But why is it that we can still implement &lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/crdt-tr2011.pdf&quot;&gt;reliable distributed counters&lt;/a&gt; and &lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/dynamo-sosp2007.pdf&quot;&gt;shopping carts&lt;/a&gt; that don’t lose writes, build indexes that are &lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;“consistent” with base data&lt;/a&gt;, and ensure useful properties like &lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/baseball-cacm2013.pdf&quot;&gt;read-your-writes&lt;/a&gt;—all without coordination? Why do some operations require coordination while others don’t, and what’s the fundamental difference at play?&lt;/p&gt;

&lt;p&gt;In the paper, we present a property, called &lt;strong&gt;invariant confluence&lt;/strong&gt; (I-confluence), that precisely answers this question. I-confluence is necessary and sufficient for safe, coordination-free, available, and convergent execution (think &lt;a href=&quot;http://henryr.github.io/cap-faq/&quot;&gt;CAP “AP”&lt;/a&gt;, without breaking your app). That is, if I-confluence holds, there exists a coordination-free execution strategy that preserves these properties. If it does not hold, no such strategy exists, so don’t waste your time looking for a coordination-free algorithm or concurrency control mechanism that’ll work in all cases.&lt;/p&gt;

&lt;p&gt;The intuition behind I-confluence is pretty simple. We capture “consistency” (or &lt;a href=&quot;http://www.bailis.org/blog/safety-and-liveness-eventual-consistency-is-not-safe/&quot;&gt;safety&lt;/a&gt;) using &lt;em&gt;invariants&lt;/em&gt;, or declarative correctness criteria about database states (e.g., no user has a negative account balance). Then, given a user’s transactions and a &lt;em&gt;merge&lt;/em&gt; procedure used to reconcile divergent states (e.g., last-write-wins, set union, or a convergent datatype), we can check for I-confluence. Simply put, under I-confluence: &lt;strong&gt;all local commit decisions must be globally invariant-preserving&lt;/strong&gt;. If I commit an operation on my local copy of database state, I must make sure that no other concurrent operation would invalidate my commit decision upon merge. If no such state exists, I don’t have to coordinate to commit. That’s it.&lt;/p&gt;
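&lt;p&gt;To make that check concrete, here is a toy sketch (mine, not from the paper; the function names and the integer-counter model are illustrative assumptions) of testing invariant confluence for a non-negative balance invariant under an additive merge:&lt;/p&gt;

```python
# Toy sketch of an invariant-confluence check for a single counter.
# Assumptions (illustrative, not from the paper's formalism): states are
# integers, transactions are deltas, and divergent replicas reconcile
# with an additive merge against a common base state.

def invariant(state):
    """Example invariant: no negative account balance."""
    return state >= 0

def merge(s1, s2, base):
    """Additive merge: apply both replicas' net changes to the common base."""
    return base + (s1 - base) + (s2 - base)

def locally_safe(base, delta):
    """A replica commits only if its local post-state satisfies the invariant."""
    return invariant(base + delta)

def i_confluent(base, deltas):
    """Every pair of locally committable operations must remain
    invariant-preserving after merge."""
    for d1 in deltas:
        for d2 in deltas:
            if locally_safe(base, d1) and locally_safe(base, d2):
                if not invariant(merge(base + d1, base + d2, base)):
                    return False
    return True

# Deposits alone pass the check under a non-negativity invariant...
assert i_confluent(base=10, deltas=[+5, +3])
# ...but concurrent withdrawals do not: each is locally safe,
# yet the merged state can go negative.
assert not i_confluent(base=10, deltas=[-7, -8])
```

&lt;p&gt;In this sketch, withdrawals fail the check (so they require coordination), while deposits pass it, matching the intuition above.&lt;/p&gt;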

&lt;p&gt;“That’s it? That sounds too simple to be true!” you protest.&lt;/p&gt;

&lt;p&gt;It’s true, &lt;em&gt;formalizing&lt;/em&gt; this property requires some care,&lt;sup id=&quot;fnref:formalism&quot;&gt;&lt;a href=&quot;#fn:formalism&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; but the basic idea &lt;em&gt;is&lt;/em&gt; pretty simple.&lt;sup id=&quot;fnref:rewriting&quot;&gt;&lt;a href=&quot;#fn:rewriting&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; The key is to specify correctness in terms of invariants rather than reads and writes.&lt;sup id=&quot;fnref:abstraction&quot;&gt;&lt;a href=&quot;#fn:abstraction&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; Without knowledge of what “correctness” means to your app (e.g., the invariants used in I-confluence), the &lt;a href=&quot;http://www.eecs.harvard.edu/~htk/publication/1979-sigmod-kung-papadimitriou.pdf&quot;&gt;best you can do&lt;/a&gt; to preserve correctness under a read/write model is serializability.&lt;sup id=&quot;fnref:otherproperties&quot;&gt;&lt;a href=&quot;#fn:otherproperties&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; (Hopefully your database &lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;offers it&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;“So what?” you ask.&lt;/p&gt;

&lt;p&gt;We applied I-confluence to a range of integrity constraints found in database systems, including foreign key, uniqueness, and row-level check constraints as well as some invariants over abstract datatypes.&lt;sup id=&quot;fnref:analysis&quot;&gt;&lt;a href=&quot;#fn:analysis&quot; class=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; The results show that, while coordination can’t always be avoided, in many cases, it &lt;em&gt;can&lt;/em&gt; (even when serializability would indicate otherwise). By applying these results to the industry standard TPC-C Benchmark, &lt;strong&gt;we achieve a 25-fold improvement over the prior best result&lt;/strong&gt; (12.7M New-Order transactions per second on 200 servers—I’ll write about the details in another blog post, but, for now, the paper has details). We’ve also been looking at open source applications and applying I-confluence analysis (the subject of another paper and post) and have seen similar wins. Our experience indicates I-confluence works.&lt;/p&gt;

&lt;p&gt;Consistency doesn’t always require coordination. In fact, in practice, it seldom does. I’m excited to finally have &lt;a href=&quot;http://www.vldb.org/pvldb/vol8/p185-bailis.pdf&quot;&gt;a piece of work&lt;/a&gt; that exactly characterizes this trade-off, backed by some serious experimental data that shows why it matters.&lt;/p&gt;

&lt;h4 id=&quot;notes&quot;&gt;Notes&lt;/h4&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:formalism&quot;&gt;
      &lt;p&gt;The paper provides actual formalism, which, if you’re a footnote reader, you should consider checking out. In a little more detail, the set of states reachable via invariant-preserving executions must be closed under merge. We also formalize coordination-freedom, availability, and convergence more carefully. If you absolutely love formalism, you can also check out the &lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;extended version&lt;/a&gt; of the paper on the arXiv. &lt;a href=&quot;#fnref:formalism&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:rewriting&quot;&gt;
      &lt;p&gt;When I was doing this work, I was convinced someone else had come up with a property like this already. There’s a huge amount of work on semantics-based concurrency control (mostly from the 1980s) and even more on concurrent program analysis. But, for a bunch of reasons, we actually found the basic property we were seeking in the literature on rule-based rewriting systems rather than database concurrency control or classic program analysis. In 2007, some very smart folks working on rule-based rewriting wanted a more expressive way to talk about “correct” rewrites without resorting to a formulation based on determinism, or confluence (see below), so &lt;a href=&quot;http://ww2.cs.mu.oz.au/~pjs/papers/iclp07.pdf&quot;&gt;they came up with the idea of using invariants&lt;/a&gt; to talk about the validity of intermediate and final steps in the rewriting process. Our use of I-confluence is, if you squint, similar in nature to what they proposed (e.g., treat transactions as rules, and merge as “meet”), and so we adopted the name. I’d love to spend more time digging into that literature and taking greater advantage of rule-based rewriting techniques in I-confluence analysis.&lt;/p&gt;

      &lt;p&gt;To go a little deeper on the rest of the literature: we have a pretty extensive related work section in the paper, but there were two main problems I ran into in the literature. First, much of the work on semantics-based concurrency control in databases was helpful but wasn’t directly focused on &lt;em&gt;necessary&lt;/em&gt; conditions for coordination-free execution. Second, in the program analysis literature, many techniques assume linearizable update to shared state. For example, if you declare pre- and post-conditions (which we don’t), you can often determine the safety of executing each of these operations in a linearizable manner. Not only is this linearizability already a non-starter in a coordination-free environment, but &lt;em&gt;also&lt;/em&gt; if you’re in a shared-memory setting, any writes to the same item conflict and can cause terrible problems in many analysis techniques. In contrast, in a distributed setting (or even single-node environment with the ability to rewrite/curry writes to local program variables, then play with them later), writes to the same data item can be reconciled via potentially complex logic (e.g., merge). I’d be shocked if you couldn’t find some encoding of merge and the general I-confluence rules in these program analysis techniques, but given that the rewriting literature made this basically “free”, we stuck with that. &lt;a href=&quot;#fnref:rewriting&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:abstraction&quot;&gt;
      &lt;p&gt;Another (more subtle) benefit along these lines is that I-confluence allows us to reason about arbitrary, abstract implementations and deployments of database systems rather than any particular system implementation or deployment. We effectively “lift” the partitioning arguments typified by Gilbert and Lynch’s &lt;a href=&quot;http://broadsoft.googlecode.com/svn-history/r208/docs/km/Architecture/Brewer_s_conjecture_and_the_feasibility_of_consistent__available__partition-tolerant_web_services.pdf&quot;&gt;proof of the CAP Theorem&lt;/a&gt; to the level of arbitrary program logic (rather than read/write logic executed in parallel on different, adversarially partitioned servers). If that’s inscrutable, compare the paper’s I-confluence proofs to those in the &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;HAT paper&lt;/a&gt; or those used by Gilbert and Lynch. This higher-level abstraction is, in our experience, more intuitive than reasoning about “partition” behavior—and also applies to single-node systems as well. &lt;em&gt;Basically, instead of having to cook up a scenario where servers become partitioned, forcing “inconsistency,” we can rely on a basic property of application logic instead.&lt;/em&gt; &lt;a href=&quot;#fnref:abstraction&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:otherproperties&quot;&gt;
      &lt;p&gt;Those of you who live and breathe this stuff (like me) will probably be asking: “what about commutativity?” or “what about monotonicity?”&lt;/p&gt;

      &lt;p&gt;Commutativity, or, in effect, order insensitivity of operations, is useful and is usually sufficient for safe, concurrent execution. But it’s not always necessary for correctness: for example, reads don’t commute with writes, but, if all we care about is that no one writes the value 0xDEADBEEF, who cares about reads? We provide some more examples in the paper, and so do &lt;a href=&quot;http://web.mit.edu/amdragon/www/pubs/commutativity-sosp13.pdf&quot;&gt;Clements et al.&lt;/a&gt; Bonus: commutativity &lt;a href=&quot;http://www.bailis.org/blog/data-integrity-and-problems-of-scope/&quot;&gt;isn’t always safe&lt;/a&gt;. For example, if we use a commutative counter to ensure we don’t “lose” the effects of increment or decrement operations, there’s no guarantee that our counter will obey the invariant that the counter is non-negative. (You might argue that this wasn’t really a “commutative” operation, at which point we could have a fun conversation.)&lt;/p&gt;
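&lt;p&gt;A minimal sketch of the counter example above (hypothetical code, not from any real system): the operations commute, so every ordering converges to the same final state, yet that deterministic state still violates the non-negativity invariant:&lt;/p&gt;

```python
# Toy illustration that commutativity does not imply invariant safety.
# Two replicas each apply a decrement to a shared counter; increments and
# decrements commute (order-insensitive), so no update is "lost" --
# yet the non-negativity invariant is still violated.
from functools import reduce
import itertools

def apply_ops(initial, ops):
    """Apply a sequence of increment/decrement deltas to a counter."""
    return reduce(lambda state, op: state + op, ops, initial)

initial = 1
ops = [-1, -1]  # two concurrent, individually valid decrements

# Commutative: every ordering of the operations yields the same final state.
results = {apply_ops(initial, perm) for perm in itertools.permutations(ops)}
assert len(results) == 1

# ...but that single converged state violates "counter >= 0".
final = results.pop()
assert final == -1
```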

      &lt;p&gt;Monotonicity, or, in effect, “grow-only” program logic, is also useful and is necessary and sufficient to guarantee &lt;em&gt;confluence&lt;/em&gt;, or determinism, of outcomes despite nondeterministic ordering of program inputs. This underlies languages like &lt;a href=&quot;http://cidrdb.org/cidr2011/Papers/CIDR11_Paper35.pdf&quot;&gt;Bloom&lt;/a&gt;, &lt;a href=&quot;http://db.cs.berkeley.edu/papers/socc12-blooml.pdf&quot;&gt;Bloom^L&lt;/a&gt;, and &lt;a href=&quot;https://www.cs.indiana.edu/~rrnewton/papers/2013-FHPC_LVars.pdf&quot;&gt;LVars&lt;/a&gt;. Determinism is useful, but it’s not necessary for all forms of correctness. As a classic example, consensus protocols allow &lt;em&gt;any&lt;/em&gt; proposed value to be accepted insofar as all processes agree on that value—the outcome is, moreover, nondeterministic due to network delays in many common protocols like Paxos. That is, Paxos is not confluent (nor monotonic), but it’s still useful. In the coordination-free domain, the read operation in our commutativity example above wouldn’t be monotonic (we &lt;em&gt;could&lt;/em&gt; do a blocking threshold read, which might work in some cases, but then we’ll have to block). Moreover, if we only &lt;em&gt;decremented&lt;/em&gt; a shared counter, the decrement operations would be monotonic, but, if we had the same non-negative invariant, we could end up with a deterministic (negative) but incorrect state. So, in short, confluence and invariant preservation are both useful properties, but we care about the latter in this paper.&lt;/p&gt;

      &lt;p&gt;The silver lining here is that—despite false positives for coordination (and, depending on your definitions, false negatives)—both commutativity and monotonicity are in some cases easier to analyze. Commutativity, monotonicity, and I-confluence are all undecidable properties. But the first two depend only on the program logic, while I-confluence requires users to &lt;em&gt;also&lt;/em&gt; specify invariants. (However, note that these invariants are really only one-per-database—or one conjunction per database—rather than one per operation, as in Hoare-style program analyses.) The open source applications we’ve been studying actually declare these surprisingly often, but there’s still a trade-off between user participation and precision in determining coordination requirements. &lt;a href=&quot;#fnref:otherproperties&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:analysis&quot;&gt;
      &lt;p&gt;If you’re interested in &lt;em&gt;how&lt;/em&gt; we actually apply the test, there are details in the paper. In brief, our current approach is to pre-classify invariant-operation pairs (by manual proof—check out the &lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;arXiv&lt;/a&gt; for a full walk-through; these aren’t too hard) and then use simple static analysis to check pairs in code (I hesitate to really call this “program analysis”).&lt;/p&gt;

      &lt;p&gt;My current goal is to use I-confluence as a basis for principled systems optimizations like &lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;RAMP&lt;/a&gt;, the TPC-C results, and some of the open source work I’ve been doing. That said, I see a huge amount of low-hanging fruit in improving and further automating this analysis. &lt;a href=&quot;#fnref:analysis&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Data Integrity and Problems of Scope</title>
   <link href="http://bailis.org/blog//data-integrity-and-problems-of-scope/"/>
   <updated>2014-10-20T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//data-integrity-and-problems-of-scope</id>
   <content type="html">&lt;p&gt;Mutable state in distributed systems can cause all sorts of headaches, including data loss, corruption, and unavailability. Fortunately, there are a range of techniques—including exploiting commutativity and immutability—that can help reduce the incidence of these events without requiring much overhead. However, these techniques are only useful when applied correctly. When applied &lt;em&gt;incorrectly&lt;/em&gt;, applications are still subject to data loss and corruption. In my experience, (the unfortunately common) incorrect application of these techniques is often due to problems of &lt;em&gt;scope&lt;/em&gt;. What do I mean by scope? Let’s look at two examples:&lt;/p&gt;

&lt;h4 id=&quot;commutativity-gone-wrong&quot;&gt;1.) Commutativity Gone Wrong&lt;/h4&gt;

&lt;p&gt;For our purposes, commutativity informally means that the result of executing two operations is independent of the order in which the operations were executed. Consider a counter: if I increment by one and you also increment by one, the end result (&lt;code&gt;counter += 2&lt;/code&gt;) is the same regardless of the order in which we execute our operations! We can exploit this reorderability to build more scalable counters.&lt;/p&gt;

&lt;p&gt;However, we’re not off the hook yet. While our &lt;em&gt;operations&lt;/em&gt; on the &lt;em&gt;individual&lt;/em&gt; counter were commutative, does this mean that our &lt;em&gt;programs&lt;/em&gt; that use this counter are “correct”? If we’re Twitter and we want to build retweet counters, then commutative counters are probably a fine choice. But what if the counter is storing a bank account balance?&lt;sup id=&quot;fnref:bank&quot;&gt;&lt;a href=&quot;#fn:bank&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; Or if it’s monitoring the amount of available inventory for a given item in a warehouse? While two individual decrement operations are commutative, two account withdrawal operations (or two item order placement operations) are not. By looking at individual data items instead of how the data is used by our programs, we risk data corruption. When analyzing commutativity, myopia is dangerous.&lt;/p&gt;
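&lt;p&gt;To make this concrete, here’s a minimal Python sketch of the failure mode above. The &lt;code&gt;decrement&lt;/code&gt; and &lt;code&gt;withdraw_allowed&lt;/code&gt; helpers are hypothetical, not from any real system: two replicas each validate a withdrawal against the same snapshot, then merge their perfectly commutative decrements into a balance that violates the non-negative invariant.&lt;/p&gt;

```python
import operator

def decrement(balance, amount):
    # Counter-level decrement: commutative, so either application
    # order yields the same final value.
    return balance - amount

def withdraw_allowed(balance, amount):
    # Program-level guard: a withdrawal is valid only if funds remain.
    # operator.ge(a, b) tests whether a is at least b.
    return operator.ge(balance, amount)

balance = 100

# Two replicas each validate a withdrawal against the same snapshot...
assert withdraw_allowed(balance, 70)  # replica A's local check passes
assert withdraw_allowed(balance, 60)  # replica B's local check passes

# ...then exchange their decrements. The decrements commute: both
# merge orders agree on the final value...
merged = decrement(decrement(balance, 70), 60)
assert merged == decrement(decrement(balance, 60), 70) == -30

# ...but the merged state violates the non-negative invariant.
assert not withdraw_allowed(merged, 0)
```

&lt;p&gt;The counter did exactly what it promised—order independence—but the &lt;em&gt;program&lt;/em&gt; around it still produced a corrupt state.&lt;/p&gt;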

&lt;h4 id=&quot;immutability-gone-wrong&quot;&gt;2.) Immutability Gone Wrong&lt;/h4&gt;

&lt;p&gt;As a second example, consider the recent trend of making all data immutable (e.g., the &lt;a href=&quot;http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html&quot;&gt;Lambda architecture&lt;/a&gt;). The idea here is that we can collect a grow-only set of facts and subsequently run queries over those facts (often via a mix of batch and streaming systems). There’s no way to corrupt individual data items as, once they’re created, they can’t change. If I can name a fact that’s been created, I can always get a “consistent” read of that item from the underlying store holding my facts. Sounds great, right?&lt;/p&gt;

&lt;p&gt;Again, it’s easy to miss the forest for the trees. Simply because individual writes are immutable doesn’t mean that program outcomes are somehow “automatically correct.” Specifically, the outputs of the functions we’re computing over these facts are—in many cases—sensitive to order and are “mutable” with respect to the underlying facts. Pretend we use immutable data to store a ledger of deposits and withdrawals. In our bank account example, simply making individual deposit and withdrawal entries immutable doesn’t solve our above problems: we can insert two “immutable” withdrawal requests and get a negative final balance!&lt;sup id=&quot;fnref:immutability&quot;&gt;&lt;a href=&quot;#fn:immutability&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
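&lt;p&gt;Here’s a small Python sketch of that ledger scenario (the &lt;code&gt;ledger&lt;/code&gt; representation and &lt;code&gt;balance&lt;/code&gt; function are illustrative, not any real system’s API): every fact is immutable once written, yet the derived balance still goes negative when two clients append withdrawals based on the same stale view.&lt;/p&gt;

```python
# A grow-only ledger of immutable facts: each entry is a frozen
# (kind, amount) tuple that is never updated in place.
ledger = [("deposit", 100)]

def balance(facts):
    # Derived view over the facts; its output is "mutable" with
    # respect to the underlying immutable set.
    total = 0
    for kind, amount in facts:
        total = total + amount if kind == "deposit" else total - amount
    return total

# Two clients each read balance() == 100, decide a 70-unit withdrawal
# is safe, and append an immutable fact recording it.
snapshot = balance(ledger)
assert snapshot == 100
ledger.append(("withdraw", 70))   # client A's fact
ledger.append(("withdraw", 70))   # client B's fact

# No individual fact was corrupted, yet the derived balance is negative.
assert balance(ledger) == -40
```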

&lt;h4 id=&quot;the-real-problem&quot;&gt;The Real Problem&lt;/h4&gt;

&lt;p&gt;The crux of the problem is that, if used to provide data integrity, these properties &lt;em&gt;must&lt;/em&gt; be applied at the appropriate scope.&lt;sup id=&quot;fnref:conway&quot;&gt;&lt;a href=&quot;#fn:conway&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; Simply making individual operations commutative or immutable doesn’t guarantee that their composition with other operations in your application &lt;em&gt;also&lt;/em&gt; will be, nor will this somehow automatically lead to correct behavior. Analyzing operations (or even groups of operations) in isolation—divorced from the &lt;em&gt;context&lt;/em&gt; in which they are being used—is often unsafe. If you’re building an application and want to use these properties to guarantee application correctness, you likely need to analyze your whole program behavior.&lt;/p&gt;

&lt;h4 id=&quot;to-be-fair&quot;&gt;To Be Fair…&lt;/h4&gt;

&lt;p&gt;To be fair, commutativity and immutability—even applied at too fine a scope—do have merits. They may lead to a lower likelihood of data loss and overwrites than simply using mutable state without coordination. Many of the issues involved in &lt;a href=&quot;http://technet.microsoft.com/en-us/library/aa213029(v=sql.80).aspx&quot;&gt;“Lost Updates”&lt;/a&gt; go away. It’s often easier to &lt;a href=&quot;http://www.eecs.berkeley.edu/~brewer/cs262b/update-conflicts.pdf&quot;&gt;detect integrity violations&lt;/a&gt; and issue compensating actions in a no-overwrite system. And remember, some programs are actually safe under arbitrary composition of commutative and/or immutable operations with program logic. Just not all programs.&lt;/p&gt;

&lt;p&gt;These are good design patterns that help simplify many of the challenges we started with. But naïvely applying design patterns isn’t a substitute for understanding your program behavior.&lt;/p&gt;

&lt;p&gt;In a sense, these problems of scope are a natural side-effect of good software engineering. As systems builders, we design and implement abstractions that can be re-used across multiple programs. We have to make decisions about what information we need from our applications and how to take advantage of that information.&lt;sup id=&quot;fnref:research&quot;&gt;&lt;a href=&quot;#fn:research&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; In many cases, it’s unrealistic for us to require our programmers to submit their entire program for commutativity analysis on every request (e.g., &lt;code&gt;increment(key: String, program: JARFILE)&lt;/code&gt;!). We have to make compromises, and exposing operation-commutative counters is a reasonable compromise—as long as users understand the implications and are wise enough to guard against any undesirable behavior.&lt;/p&gt;

&lt;p&gt;These problems of scope crop up in traditional database systems as well. Consider a serializable database—say, the best that money can(’t) buy, with no bugs, zero latency, infinite throughput, and guaranteed availability. If your bank application fails to wrap its multi-statement increment and decrement operations in a transaction, the database will still corrupt your data. Implicit in the serializable transaction API is the &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/gray/papers/thetransactionconcept.pdf&quot;&gt;assumption&lt;/a&gt; that your transactions will preserve data integrity on their own. If you scope your transactions incorrectly, your serializable database won’t help you.&lt;/p&gt;

&lt;h4 id=&quot;conclusions-and-some-references&quot;&gt;Conclusions and Some References&lt;/h4&gt;

&lt;p&gt;Where’s the silver lining? If you reason about your application’s correctness at the correct scope (often, the application level), you’ll be fine. There are a number of powerful ways to achieve coordination-free execution without compromising correctness—if used correctly. Figure out what “correctness” means to you (e.g., state an invariant), and analyze your applications and operations at the appropriate granularity.&lt;/p&gt;

&lt;p&gt;If you’re interested, many people are thinking about these issues. My colleagues and I &lt;a href=&quot;http://db.cs.berkeley.edu/papers/socc13-consistency.pdf&quot;&gt;wrote a short paper&lt;/a&gt; on reasoning about data integrity at different layers of abstraction, from read/write to object and whole program. Some smart people from MIT &lt;a href=&quot;http://am.csail.mit.edu/papers/commutativity:sosp13.pdf&quot;&gt;wrote a paper&lt;/a&gt; last year on how to reason about commutativity of interfaces, and my collaborators have &lt;a href=&quot;http://cidrdb.org/cidr2011/Papers/CIDR11_Paper35.pdf&quot;&gt;developed ways to guarantee determinism&lt;/a&gt; of outcomes via whole-program &lt;em&gt;monotonicity&lt;/em&gt; analysis in the Bloom language (see also &lt;a href=&quot;#fn:conway&quot;&gt;note three&lt;/a&gt;). Finally, &lt;a href=&quot;http://www.bailis.org/papers/ca-vldb2015.pdf&quot;&gt;some of my own recent work&lt;/a&gt; examines when &lt;em&gt;any&lt;/em&gt; coordination is strictly required to guarantee arbitrary application integrity (that is, when we can preserve application “consistency” without coordination and when we &lt;em&gt;must&lt;/em&gt; coordinate).&lt;/p&gt;

&lt;p&gt;With a better understanding of what is and is not possible—along with tools that embody this understanding and systems that exploit it—we can build much more robust and scalable systems without compromising safety or usability for end users.&lt;/p&gt;

&lt;h4 id=&quot;notes&quot;&gt;Notes&lt;/h4&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:bank&quot;&gt;
      &lt;p&gt;Yes, I used the bank example, and, indeed, &lt;a href=&quot;http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why-banks-are-base-not-acid-availability.html&quot;&gt;banks deal with this problem&lt;/a&gt; via compensating actions and other techniques. But the point remains: otherwise commutative decrements on bank account balances aren’t enough to prevent—on their own, without other program logic—data integrity errors like negative balances. &lt;a href=&quot;#fnref:bank&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:immutability&quot;&gt;
      &lt;p&gt;In short, I can’t safely make decisions based on the output of the functions I’m computing without making additional guarantees on the underlying set of facts. This is actually a pretty well studied (albeit challenging) problem in database systems. We’d call the “Lambda Architecture” an instance of “incremental materialized view maintenance” (or “continuous query processing”; via a combination of stream processing and batch processing). This has been, at various times, a hot topic in the research community and even in practice. For example, &lt;a href=&quot;http://cs.stanford.edu/people/widom/&quot;&gt;Jennifer Widom&lt;/a&gt;’s group at Stanford spent a considerable amount of time in the early 2000s &lt;a href=&quot;http://ilpubs.stanford.edu:8090/758/1/2003-67.pdf&quot;&gt;understanding the relationship&lt;/a&gt; between streams and relations. &lt;a href=&quot;http://web.cecs.pdx.edu/~maier/&quot;&gt;Dave Maier&lt;/a&gt; and collaborators developed &lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/punctuations-tkde2003.pdf&quot;&gt;punctuated stream semantics&lt;/a&gt; that’d allow you to “seal” time and actually make definitive statements about a streaming output that, by definition, won’t change in the future. And &lt;a href=&quot;http://www.cs.berkeley.edu/~franklin/&quot;&gt;Mike Franklin&lt;/a&gt; and friends built a &lt;a href=&quot;http://www.cisco.com/web/about/ac49/ac0/ac1/ac259/truviso.html&quot;&gt;company&lt;/a&gt; to commercialize earlier stream processing research from &lt;a href=&quot;http://db.lcs.mit.edu/madden/html/TCQcidr03.pdf&quot;&gt;Berkeley’s TelegraphCQ project&lt;/a&gt;. Not surprisingly, the semantics of continuous queries became a &lt;a href=&quot;http://www.eecs.berkeley.edu/~franklin/Papers/sigmod10krishnamurthy.pdf&quot;&gt;serious issue in practice&lt;/a&gt;. &lt;a href=&quot;#fnref:immutability&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:conway&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;http://www.neilconway.org/&quot;&gt;Neil Conway&lt;/a&gt; and the &lt;a href=&quot;http://www.bloom-lang.net/&quot;&gt;Bloom&lt;/a&gt; team’s &lt;a href=&quot;http://db.cs.berkeley.edu/papers/socc12-blooml.pdf&quot;&gt;work on Bloom^L&lt;/a&gt; contains one of the first—and perhaps most lucid—discussions of what they call the “scope dilemma.” As part of the paper, they show that the composition of individual commutative replicated data types can yield non-commutative and, generally, incorrect behavior. Their later work (driven by &lt;a href=&quot;http://www.cs.berkeley.edu/~palvaro/&quot;&gt;Peter Alvaro&lt;/a&gt;) on &lt;a href=&quot;http://db.cs.berkeley.edu/papers/icde14-blazes.pdf&quot;&gt;Blazes&lt;/a&gt; examines problems of composition of monotonic and non-monotonic program modules across dataflows. &lt;a href=&quot;#fnref:conway&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:research&quot;&gt;
      &lt;p&gt;These same issues apply to research, too. For example, the distributed computing community has standard abstractions like registers and consensus objects in part because these abstractions allow researchers to talk about, compare, and work on a common set of concepts.&lt;/p&gt;

      &lt;p&gt;Moreover, research papers are not immune to making these sorts of errors. I actually started writing this post in response to a paper (on a particular distributed computing application) that we read in a campus reading group last week that abused a notion of “commutativity and associativity” and sparked a discussion along these lines. (I’m sure I’ve made these mistakes in some of my own papers.) To be clear, I’m not necessarily advocating for greater formalization of these concepts—you can formalize anything (and sometimes it’s even useful to do so)—but am instead advocating for greater care in thinking about when these concepts can be successfully applied. &lt;a href=&quot;#fnref:research&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Linearizability versus Serializability</title>
   <link href="http://bailis.org/blog//linearizability-versus-serializability/"/>
   <updated>2014-09-24T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//linearizability-versus-serializability</id>
   <content type="html">&lt;p&gt;Linearizability and serializability are both important properties about interleavings of operations in databases and distributed systems, and it’s easy to get them confused. This post gives a short, simple, and hopefully practical overview of the differences between the two.&lt;/p&gt;

&lt;h4 id=&quot;linearizability-single-operation-single-object-real-time-order&quot;&gt;Linearizability: single-operation, single-object, real-time order&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf&quot;&gt;Linearizability&lt;/a&gt; is a guarantee about single operations on single objects.&lt;/em&gt; It provides a real-time (i.e., wall-clock) guarantee on the behavior of a set of single operations (often reads and writes) on a single object (e.g., distributed register or data item).&lt;/p&gt;

&lt;p&gt;In plain English, under linearizability, writes should appear to be instantaneous. Imprecisely, once a write completes, all later reads (where “later” is defined by wall-clock start time) should return the value of that write or the value of a later write. Once a read returns a particular value, all later reads should return that value or the value of a later write.&lt;/p&gt;

&lt;p&gt;Linearizability for read and write operations is synonymous with the term “atomic consistency” and is the “C,” or “consistency,” in Gilbert and Lynch’s &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;proof of the CAP Theorem&lt;/a&gt;. We say linearizability is &lt;em&gt;composable&lt;/em&gt; (or “local”) because, if operations on each object in a system are linearizable, then all operations in the system are linearizable.&lt;/p&gt;

&lt;h4 id=&quot;serializability-multi-operation-multi-object-arbitrary-total-order&quot;&gt;Serializability: multi-operation, multi-object, arbitrary total order&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Serializability is a guarantee about transactions, or groups of one or more operations over one or more objects.&lt;/em&gt; It guarantees that the execution of a set of transactions (usually containing read and write operations) over multiple items is equivalent to &lt;em&gt;some&lt;/em&gt; serial execution (total ordering) of the transactions.&lt;/p&gt;

&lt;p&gt;Serializability is the traditional “I,” or isolation, in &lt;a href=&quot;http://sites.fas.harvard.edu/~cs265/papers/haerder-1983.pdf&quot;&gt;ACID&lt;/a&gt;. If users’ transactions each preserve application correctness (“C,” or consistency, in ACID), a serializable execution also preserves correctness. Therefore, serializability is a mechanism for guaranteeing database correctness.&lt;sup id=&quot;fnref:mechanism&quot;&gt;&lt;a href=&quot;#fn:mechanism&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Unlike linearizability, serializability does not—by itself—impose any real-time constraints on the ordering of transactions. Serializability is also not composable. Serializability does not imply any kind of deterministic order—it simply requires that &lt;em&gt;some&lt;/em&gt; equivalent serial execution exists.&lt;/p&gt;

&lt;h4 id=&quot;strict-serializability-why-dont-we-have-both&quot;&gt;Strict Serializability: Why don’t we have both?&lt;/h4&gt;

&lt;p&gt;Combining serializability and linearizability yields &lt;em&gt;strict serializability&lt;/em&gt;: transaction behavior is equivalent to some serial execution, and the serial order corresponds to real time. For example, say I begin and commit transaction T1, which writes to item &lt;em&gt;x&lt;/em&gt;, and you later begin and commit transaction T2, which reads from &lt;em&gt;x&lt;/em&gt;. A database providing strict serializability for these transactions will place T1 before T2 in the serial ordering, and T2 will read T1’s write. A database providing serializability (but not strict serializability) could order T2 before T1.&lt;sup id=&quot;fnref:implementation&quot;&gt;&lt;a href=&quot;#fn:implementation&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
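&lt;p&gt;A tiny Python sketch (purely illustrative—it just enumerates serial orders by hand) makes the T1/T2 example concrete: serializability alone admits either total order, so T2 may legally miss T1’s write, while strict serializability pins the serial order to real time.&lt;/p&gt;

```python
from itertools import permutations

def run_serial(order):
    # Execute the transactions one after another in the given order.
    # T1 writes x = 1; T2 reads x. Initially, x = 0.
    x = 0
    observed = None
    for txn in order:
        if txn == "T1":
            x = 1              # T1: write(x, 1)
        else:
            observed = x       # T2: read(x)
    return observed

# Serializability: ANY total order is acceptable, so T2 may
# observe either the old value (0) or T1's write (1).
serializable_reads = {run_serial(o) for o in permutations(["T1", "T2"])}
assert serializable_reads == {0, 1}

# Strict serializability: T1 committed before T2 began in real time,
# so the only admissible order is T1 then T2, and T2 must see the write.
assert run_serial(("T1", "T2")) == 1
```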

&lt;p&gt;As &lt;a href=&quot;http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf&quot;&gt;Herlihy and Wing&lt;/a&gt; note, “linearizability can be viewed as a special case of strict serializability where transactions are restricted to consist of a single operation applied to a single object.”&lt;/p&gt;

&lt;h4 id=&quot;coordination-costs-and-real-world-deployments&quot;&gt;Coordination costs and real-world deployments&lt;/h4&gt;

&lt;p&gt;Neither linearizability nor serializability is achievable without coordination. That is, we can’t provide either guarantee with availability (i.e., CAP “AP”) under an asynchronous network.&lt;sup id=&quot;fnref:hardness&quot;&gt;&lt;a href=&quot;#fn:hardness&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;In practice, your database is &lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;unlikely to provide serializability&lt;/a&gt;, and your multi-core processor is &lt;a href=&quot;http://preshing.com/20120930/weak-vs-strong-memory-models/&quot;&gt;unlikely to provide linearizability&lt;/a&gt;—at least by default. As the above theory hints, achieving these properties requires a lot of expensive coordination. So, instead, real systems often use cheaper-to-implement and often &lt;a href=&quot;http://www.bailis.org/blog/understanding-weak-isolation-is-a-serious-problem/&quot;&gt;harder-to-understand&lt;/a&gt; models. This trade-off between efficiency and programmability represents a fascinating and challenging design space.&lt;/p&gt;

&lt;h4 id=&quot;a-note-on-terminology-and-more-reading&quot;&gt;A note on terminology, and more reading&lt;/h4&gt;

&lt;p&gt;One of the reasons these definitions are so confusing is that linearizability hails from the distributed systems and concurrent programming communities, and serializability comes from the database community. Today, almost everyone uses &lt;em&gt;both&lt;/em&gt; distributed systems and databases, which often leads to overloaded terminology (e.g., “consistency,” “atomicity”).&lt;/p&gt;

&lt;p&gt;There are many more precise treatments of these concepts. I like &lt;a href=&quot;http://link.springer.com/book/10.1007%2F978-3-642-15260-3&quot;&gt;this book&lt;/a&gt;, but there is plenty of free, concise, and (often) accurate material on the internet, such as &lt;a href=&quot;https://www.cs.rochester.edu/~scott/458/notes/04-concurrent_data_structures&quot;&gt;these notes&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;notes&quot;&gt;Notes&lt;/h4&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:mechanism&quot;&gt;
      &lt;p&gt;But it’s not the only mechanism!&lt;/p&gt;

      &lt;p&gt;Granted, serializability is (more or less) the most &lt;em&gt;general&lt;/em&gt; means of maintaining database correctness. In what’s becoming one of my favorite “underground” (i.e., relatively poorly-cited) references, &lt;a href=&quot;http://en.wikipedia.org/wiki/H._T._Kung&quot;&gt;H.T. Kung&lt;/a&gt; and &lt;a href=&quot;http://en.wikipedia.org/wiki/Christos_Papadimitriou&quot;&gt;Christos Papadimitriou&lt;/a&gt; dropped a paper in SIGMOD 1979 on “&lt;a href=&quot;http://www.eecs.harvard.edu/~htk/publication/1979-sigmod-kung-papadimitriou.pdf&quot;&gt;An Optimality Theory of Concurrency Control for Databases&lt;/a&gt;.” In it, they essentially show that, if all you have are transactions’ syntactic modifications to database state (e.g., read and write) and &lt;em&gt;no&lt;/em&gt; information about application logic, serializability is, in some sense, “optimal”: in effect, a schedule that is not serializable might modify the database state in a way that produces inconsistency for some (arbitrary) notion of correctness that is not known to the database.&lt;/p&gt;

      &lt;p&gt;However, if you &lt;em&gt;do&lt;/em&gt; know more about your user’s notions of correctness (say, you &lt;em&gt;are&lt;/em&gt; the user!), you can often do a lot more in terms of concurrency control and can circumvent many of the fundamental overheads imposed by serializability. Recognizing when you don’t need serializability (and subsequently exploiting this fact) is the best way I know to “beat CAP.” &lt;a href=&quot;#fnref:mechanism&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:implementation&quot;&gt;
      &lt;p&gt;Note that some implementations of serializability (such as two-phase locking with long write locks and long read locks) actually provide strict serializability. As &lt;a href=&quot;http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf&quot;&gt;Herlihy and Wing&lt;/a&gt; point out, other implementations (such as some MVCC implementations) may not.&lt;/p&gt;

      &lt;p&gt;So, why didn’t the early papers that defined serializability call attention to this real-time ordering? In some sense, real time doesn’t really matter: all serializable schedules are equivalent in terms of their power to preserve database correctness! However, there are some weird edge cases: for example, returning NULL in response to every read-only transaction is serializable (provided we start with an empty database) but rather unhelpful.&lt;/p&gt;

      &lt;p&gt;One tantalizingly plausible theory for this omission is that, back in the 1970s when serializability theory was being invented, everyone was running on single-site systems anyway, so linearizability essentially “came for free.” However, I believe this theory is unlikely: for example, database pioneer &lt;a href=&quot;http://en.wikipedia.org/wiki/Phil_Bernstein&quot;&gt;Phil Bernstein&lt;/a&gt; was already looking at distributed transaction execution in his SDD-1 system &lt;a href=&quot;http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA131789&quot;&gt;as early as 1977&lt;/a&gt; (and there are &lt;a href=&quot;http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=4567884&quot;&gt;older references&lt;/a&gt; yet). Even in this early work, Bernstein (and company) are careful to stress that “there may in fact be &lt;em&gt;several&lt;/em&gt; such equivalent serial orderings” [emphasis theirs]. To further put this theory to rest, Papadimitriou makes clear in his seminal &lt;a href=&quot;https://www.cs.purdue.edu/homes/bb/cs542-06Spr-bb/SCDU-Papa-79.pdf&quot;&gt;1979 JACM&lt;/a&gt; article that he’s familiar with problems inherent in a distributed setting. (If you ever want to be blown away by the literature, look at how much of the foundational work on concurrency control was done by the early 1980s.) &lt;a href=&quot;#fnref:implementation&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:hardness&quot;&gt;
      &lt;p&gt;For distributed systems nerds: linearizability for reads and writes is, in a formal sense, “easier” to achieve than serializability. This is probably deserving of another post (encouragement appreciated!), but here’s some intuition: terminating atomic register read/write operations &lt;a href=&quot;http://www.cse.huji.ac.il/course/2004/dist/p124-attiya.pdf&quot;&gt;are achievable&lt;/a&gt; in a fail-stop model. Yet atomic commitment—which is needed to execute multi-site serializable transactions (think: AC is to 2PC as consensus is to Paxos)—is not: the &lt;a href=&quot;http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/past/03F/notes/fischer.pdf&quot;&gt;FLP result&lt;/a&gt; says consensus is unachievable in a fail-stop model (hence &lt;em&gt;with One Faulty Process&lt;/em&gt;), and (non-blocking) atomic commitment is &lt;a href=&quot;http://link.springer.com/chapter/10.1007/BFb0022140&quot;&gt;“harder” than consensus&lt;/a&gt; (&lt;a href=&quot;http://infoscience.epfl.ch/record/83471/files/1596162953p115-delporte.pdf&quot;&gt;see also&lt;/a&gt;). Also, keep in mind that linearizability for read-modify-write &lt;a href=&quot;http://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf&quot;&gt;is harder than&lt;/a&gt; linearizable read/write. (linearizable read/write &amp;lt; consensus &amp;lt; atomic commitment) &lt;a href=&quot;#fnref:hardness&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>MSR Silicon Valley Systems Projects I Have Loved</title>
   <link href="http://bailis.org/blog//msr-silicon-valley-systems-projects-i-have-loved/"/>
   <updated>2014-09-19T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//msr-silicon-valley-systems-projects-i-have-loved</id>
   <content type="html">&lt;p&gt;Microsoft &lt;a href=&quot;http://www.zdnet.com/microsoft-to-close-microsoft-research-lab-in-silicon-valley-7000033838/&quot;&gt;confirmed yesterday&lt;/a&gt; that it’s shuttering its &lt;a href=&quot;http://research.microsoft.com/en-us/labs/siliconvalley/&quot;&gt;Silicon Valley lab&lt;/a&gt;, home to 75 brilliant Computer Science researchers including 2013 Turing Award winner &lt;a href=&quot;http://en.wikipedia.org/wiki/Leslie_Lamport&quot;&gt;Leslie Lamport&lt;/a&gt;. Others have &lt;a href=&quot;https://plus.google.com/115237092509505721130/posts/H8yGEkmeQc6&quot;&gt;more&lt;/a&gt; and  &lt;a href=&quot;http://mybiasedcoin.blogspot.com/2014/09/on-academia-vs-industry-msr-svc-closing.html&quot;&gt;wiser&lt;/a&gt; things to say about this decision. However, I want to highlight some of the fantastic work that’s come out of MSR Silicon Valley in the recent past in my research area of databases and distributed systems:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/en-us/projects/dryad/&quot;&gt;Dryad&lt;/a&gt; and &lt;a href=&quot;http://research.microsoft.com/en-us/projects/dryadlinq/&quot;&gt;DryadLINQ&lt;/a&gt;&lt;/strong&gt; were hugely influential systems that challenged the then-popular notions that high-performance distributed dataflow had to be &lt;em&gt;i.)&lt;/em&gt; limited to simple map and reduce tasks and &lt;em&gt;ii.)&lt;/em&gt; hard to program. Ideas such as Dryad’s general-purpose DAG execution model, flexible and lightweight data transfers, and lineage-based recovery model can be found in almost every later distributed dataflow system, from Microsoft SCOPE to Apache Tez and Spark. DryadLINQ provided language-integrated access to the Dryad engine, setting the stage for abstractions like RDDs and the return of automatic and online query optimizers. Both are now &lt;a href=&quot;https://github.com/MicrosoftResearch/Dryad&quot;&gt;open source&lt;/a&gt; on GitHub.&lt;/p&gt;

    &lt;p&gt;Papers: &lt;em&gt;&lt;a href=&quot;http://research.microsoft.com/pubs/63785/eurosys07.pdf&quot;&gt;Dryad&lt;/a&gt; @ EuroSys 2007; &lt;a href=&quot;http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf&quot;&gt;DryadLINQ&lt;/a&gt; @ OSDI 2008; &lt;a href=&quot;http://research.microsoft.com/pubs/185714/Optimus.pdf&quot;&gt;Optimus&lt;/a&gt; @ EuroSys 2013&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/pubs/201100/naiad_sosp2013.pdf&quot;&gt;Naiad&lt;/a&gt;&lt;/strong&gt; is a more recent system for streaming, cyclic, distributed dataflow. Like Dryad, think MapReduce but for arbitrary task graphs, combining low-latency incremental processing with large-scale batch operation. At Naiad’s core is a new dataflow abstraction called &lt;a href=&quot;http://bigdataatsvc.wordpress.com/2013/09/18/an-introduction-to-timely-dataflow/&quot;&gt;timely dataflow&lt;/a&gt; that allows cyclic computations to safely proceed in parallel. The team’s also been working on some exciting extensions such as &lt;a href=&quot;http://bigdataatsvc.wordpress.com/2014/05/08/graphlinq-a-graph-library-for-naiad/&quot;&gt;graph processing&lt;/a&gt;. This research won Best Paper at SOSP 2013, one of the highest honors in the systems community, and is &lt;a href=&quot;http://microsoftresearch.github.io/Naiad/&quot;&gt;open source&lt;/a&gt; on GitHub. The team’s earlier work on &lt;a href=&quot;http://bigdataatsvc.wordpress.com/2013/04/18/an-introduction-to-differential-dataflow/&quot;&gt;differential dataflow&lt;/a&gt; illustrated the potential of this efficient fixpoint processing.&lt;/p&gt;

    &lt;p&gt;Papers: &lt;em&gt;&lt;a href=&quot;http://research.microsoft.com/pubs/201100/naiad_sosp2013.pdf&quot;&gt;Naiad&lt;/a&gt; @ SOSP 2013; &lt;a href=&quot;http://research.microsoft.com/pubs/176693/differentialdataflow.pdf&quot;&gt;Differential Dataflow&lt;/a&gt; @ CIDR 2013&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/pubs/157204/corfumain-final.pdf&quot;&gt;CORFU&lt;/a&gt;&lt;/strong&gt; (Clusters of Raw Flash Units) is a system that exposes a cluster of servers loaded with flash drives as a high-throughput shared log abstraction. This is, in itself, a challenging distributed systems problem that the team solved elegantly via a mix of fast sequencing and clever protocol design. However, CORFU’s power is perhaps better demonstrated by Tango, a system the researchers built on top. Tango demonstrates how to build fault-tolerant, high-performance distributed data structures such as trees, maps, and serializable transactions by making efficient use of the shared log abstraction. This architecture is not only creative, but it’s a great use of modern hardware with excellent empirical results to boot.&lt;/p&gt;

    &lt;p&gt;Papers: &lt;em&gt;&lt;a href=&quot;http://research.microsoft.com/pubs/157204/corfumain-final.pdf&quot;&gt;CORFU&lt;/a&gt; @ NSDI 2012; &lt;a href=&quot;http://research.microsoft.com/pubs/199947/Tango.pdf&quot;&gt;Tango&lt;/a&gt; @ SOSP 2013&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;MSR Silicon Valley has been on the bleeding edge of distributed data serving. Doug Terry (of &lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/bayou-sosp1995.pdf&quot;&gt;Bayou&lt;/a&gt; fame) and others have built a system called &lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/en-us/projects/capcloud/&quot;&gt;Pileus&lt;/a&gt;&lt;/strong&gt; that allows fine-grained control and SLAs in geo-replicated storage systems. Marcos Aguilera and others have similarly been working on fast, wide-area transaction execution, from snapshot isolation in &lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/en-us/people/aguilera/walter-sosp2011.pdf&quot;&gt;Walter&lt;/a&gt;&lt;/strong&gt; to serializability (leveraging one of my favorite ideas in concurrency control: transaction chopping) via &lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/en-us/people/aguilera/transaction-chains-sosp2013.pdf&quot;&gt;transaction chains&lt;/a&gt;&lt;/strong&gt;. Both of these lines of work are highly relevant to the ongoing redesign of large-scale cloud databases; it’s great to see services like Microsoft Azure DocumentDB &lt;a href=&quot;http://azure.microsoft.com/en-us/documentation/articles/documentdb-consistency-levels/&quot;&gt;adopt ideas like Pileus’s tunable consistency&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Papers: &lt;a href=&quot;http://research.microsoft.com/pubs/201390/PileusSOSP2013.pdf&quot;&gt;Pileus&lt;/a&gt; @ SOSP 2013; &lt;a href=&quot;http://www.cs.cornell.edu/courses/cs6452/2012sp/papers/psi-sosp11.pdf&quot;&gt;Walter&lt;/a&gt; @ SOSP 2011; &lt;a href=&quot;http://research.microsoft.com/en-us/people/aguilera/transaction-chains-sosp2013.pdf&quot;&gt;Lynx&lt;/a&gt; @ SOSP 2013&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s too much to list. But here’s a few more anyway:  &lt;a href=&quot;https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gunda.pdf&quot;&gt;Nectar&lt;/a&gt; (OSDI 2010) automates the caching and reuse of intermediate results in data-parallel compute systems. &lt;a href=&quot;http://research.microsoft.com/pubs/201110/sosp13-dandelion-final.pdf&quot;&gt;Dandelion&lt;/a&gt; (SOSP 2013) provides a language-integrated automated runtime for running applications on both data-parallel compute clusters and GPUs. &lt;a href=&quot;http://www.sigops.org/sosp/sosp09/papers/isard-sosp09.pdf&quot;&gt;Quincy&lt;/a&gt; (SOSP 2009) pioneered the study of fair cluster scheduling algorithms. Dahlia Malkhi has done and continues to do amazing work on distributed algorithms (e.g., &lt;a href=&quot;http://arxiv.org/pdf/1402.2701v1.pdf&quot;&gt;PODC 2014&lt;/a&gt;) in addition to working on systems projects such as CORFU. And, of course, Leslie Lamport’s recent work on TLA+ continues to make waves—for example, &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf&quot;&gt;at Amazon&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
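&lt;p&gt;As a flavor of the Tango approach described above, here’s a toy sketch of a map materialized by replaying a shared log. The class and method names are hypothetical; this illustrates the architecture, not the real CORFU/Tango client API.&lt;/p&gt;

```python
# A toy model of Tango's core idea: a map whose writes are appends to a single
# shared, totally ordered log, and whose reads replay the log tail.
# All names here are hypothetical, not the real CORFU/Tango client API.

class SharedLog:
    """Stands in for a CORFU cluster: one append-only, totally ordered log."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)          # position assigned in total order
        return len(self.entries) - 1

    def read_from(self, pos):
        return self.entries[pos:]

class LogBackedMap:
    """A Tango-style object: state is materialized by replaying the log."""
    def __init__(self, log):
        self.log = log
        self.state = {}
        self.applied = 0                    # log position we have replayed to

    def put(self, key, value):
        self.log.append((key, value))       # the write IS the log append

    def get(self, key):
        # "Sync" before reading: apply every entry appended since last replay.
        for k, v in self.log.read_from(self.applied):
            self.state[k] = v
        self.applied = len(self.log.entries)
        return self.state.get(key)

log = SharedLog()
m1, m2 = LogBackedMap(log), LogBackedMap(log)   # two clients sharing one log
m1.put("x", 1)
m2.put("x", 2)
print(m1.get("x"), m2.get("x"))                 # both see the final value: 2 2
```

&lt;p&gt;Because every write is a log append, any data structure built this way inherits the log’s total order, which is what makes Tango-style serializable transactions over the log possible.&lt;/p&gt;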

&lt;p&gt;You’ve probably heard of many of the brilliant folks behind this work before: Doug Terry, Dahlia Malkhi, Martin Abadi, Michael Isard, Derek Murray, Frank McSherry, Marcos Aguilera, Yuan Yu, and the list goes on. And, as I’ve said, there have been many others and many other exciting projects (and entire groups outside distributed systems) at MSR Silicon Valley.&lt;/p&gt;

&lt;p&gt;Fortunately, MSR still has other branches—for example, many of the researchers studying core database issues are in Redmond. However, the above projects help illustrate why MSR Silicon Valley was such a research powerhouse and a welcome industrial neighbor to the west.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Understanding Weak Isolation Is a Serious Problem</title>
   <link href="http://bailis.org/blog//understanding-weak-isolation-is-a-serious-problem/"/>
   <updated>2014-09-16T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//understanding-weak-isolation-is-a-serious-problem</id>
   <content type="html">&lt;p&gt;Modern transactional databases &lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;overwhelmingly don’t operate&lt;/a&gt; under textbook “ACID” isolation, or serializability. Instead, these databases—like Oracle 11g and SAP HANA—offer weaker guarantees, like Read Committed isolation or, if you’re lucky, &lt;a href=&quot;http://en.wikipedia.org/wiki/Snapshot_isolation&quot;&gt;Snapshot Isolation&lt;/a&gt;. There’s a good reason for this phenomenon: weak isolation is faster—often much faster—and incurs fewer aborts than serializability. Unfortunately, the exact behavior of these different isolation levels is difficult to understand and is highly technical. One of 2008 Turing Award winner Barbara Liskov’s Ph.D. students &lt;a href=&quot;http://pmg.csail.mit.edu/papers/adya-phd.pdf&quot;&gt;wrote an entire dissertation on the topic&lt;/a&gt;, and, even then, the definitions we have still aren’t perfect and can vary between databases.&lt;/p&gt;

&lt;h4 id=&quot;the-core-problem-a-monstrous-abstraction-in-most-every-database&quot;&gt;The core problem: a monstrous abstraction in (most) every database&lt;/h4&gt;

&lt;p&gt;Despite the ubiquity of weak isolation, I haven’t found a database architect, researcher, or user who’s been able to offer an explanation of when, and, probably more importantly, &lt;em&gt;why&lt;/em&gt; isolation models such as Read Committed are sufficient for correct execution. It’s reasonably well known that these weak isolation models represent “ACID in practice,” but I don’t think we have any real understanding of how so many applications are seemingly (!?) okay running under them. (If you haven’t seen these models before, they’re a little weird. For example, Read Committed isolation generally prevents users from reading uncommitted or non-final writes but allows a number of bad things to happen, like lost updates during concurrent read-modify-write operations. Why is this apparently okay for many applications?) In the research community and in our classrooms, we’ve historically contented ourselves with studying serializability and, to a certain extent, Snapshot Isolation. Understanding weak isolation as deployed in real-world databases is a wide open problem with serious consequences.&lt;sup id=&quot;fnref:vldb&quot;&gt;&lt;a href=&quot;#fn:vldb&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
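&lt;p&gt;To make that read-modify-write race concrete, here’s a minimal, deterministic simulation. Plain Python variables stand in for database state; no real database or isolation level is involved, only the interleaving that Read Committed fails to prevent.&lt;/p&gt;

```python
# A lost update, simulated deterministically. Two "transactions" each perform
# a read-modify-write deposit on the same account. Read Committed only forbids
# reading uncommitted data; nothing stops both from reading the same committed
# value before either writes.

balance = 100

# Both transactions read the committed balance (100) before either commits.
t1_read = balance
t2_read = balance

balance = t1_read + 20   # T1 commits its deposit: 120
balance = t2_read + 50   # T2 commits, overwriting T1: 150

print(balance)  # 150: T1's deposit is lost (a serializable execution gives 170)
```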

&lt;p&gt;To put this problem in perspective, there’s a flood of &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2462076&quot;&gt;interesting new research&lt;/a&gt; that attempts to better understand programming models like eventual consistency. And, as you’re probably aware, there’s an ongoing and often lively debate between &lt;a href=&quot;http://cacm.acm.org/blogs/blog-cacm/99512-why-enterprises-are-uninterested-in-nosql/fulltext&quot;&gt;transactional adherents&lt;/a&gt; and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional adherents and the research community as a whole have effectively ignored weak isolation—even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.&lt;sup id=&quot;fnref:distributed&quot;&gt;&lt;a href=&quot;#fn:distributed&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h4 id=&quot;why-might-weak-isolation-actually-work-and-how-did-we-get-into-this-mess&quot;&gt;Why might weak isolation actually work, and how did we get into this mess?&lt;/h4&gt;

&lt;p&gt;I can offer a few guesses as to why weak isolation works:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Non-serializable isolation “anomalies” are really just different kinds of race conditions. To have a race, you need concurrency. For low-traffic and low-contention applications, it’s possible that anomalies don’t arise (e.g., the read-modify-write race from above might not affect applications because there simply aren’t concurrent transactions).&lt;/li&gt;
  &lt;li&gt;When anomalies do arise, it’s possible that the read-write anomalies don’t translate into application data corruption. Just because a read/write race occurs doesn’t mean all outcomes are necessarily wrong (e.g., two writers might perform the exact same read-modify-write operations with the same outcome regardless of order).&lt;/li&gt;
  &lt;li&gt;It’s possible data is actually (occasionally) corrupted, and apps just don’t care. Or, when they do, they manually issue the customer an IOU and/or send them an overdraft notice.&lt;/li&gt;
&lt;/ol&gt;
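&lt;p&gt;The second guess is the easiest to illustrate: some write-write races are benign because the racing updates are idempotent or commutative, so every interleaving yields the same state. A toy sketch:&lt;/p&gt;

```python
# Guess 2 above, made concrete: a write-write race whose outcome is the same
# under every interleaving. Two sessions both mark an order "shipped"; the
# update is idempotent, so the race never corrupts application state.
# (An in-memory sketch, not tied to any particular database.)

order = {"status": "pending"}

def mark_shipped(row):
    row["status"] = "shipped"   # same write regardless of what was read

mark_shipped(order)     # session A commits
mark_shipped(order)     # session B commits, racing with A
print(order["status"])  # "shipped" under either commit order
```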

&lt;p&gt;However, these are still conjectures. With a few exceptions,&lt;sup id=&quot;fnref:addrefs&quot;&gt;&lt;a href=&quot;#fn:addrefs&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; we’re in uncharted territory.&lt;sup id=&quot;fnref:open&quot;&gt;&lt;a href=&quot;#fn:open&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Why haven’t we already solved this problem? One possibility: as a community tradition, database architects have prided themselves on top-down design of systems that fulfill beautiful interfaces. For example, the relational database revolution in the 1970s was fomented by the introduction of an amazing abstraction: relational algebra. Serializability is a similarly powerful abstraction that vastly simplifies end-user programming. In contrast, weak isolation is grotesque. Its development has been overwhelmingly &lt;em&gt;mechanism-driven&lt;/em&gt; rather than &lt;em&gt;policy-&lt;/em&gt; or &lt;em&gt;application-driven&lt;/em&gt;. Do you know how Jim Gray et al. invented Read Committed back in 1975? Realizing that serializability via two-phase locking was expensive, the System R gang &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/gray/papers/Granularity%20of%20Locks%20and%20Degrees%20of%20Consistency%20RJ%201654.pdf&quot;&gt;changed their long read locks to short read locks&lt;/a&gt;. No joke. (Aside: look at that typesetting!) What about Snapshot Isolation? In the early 1990s, companies like Interbase and Microsoft started shipping databases with a fancy new multi-version concurrency control mechanism. It took &lt;a href=&quot;http://arxiv.org/pdf/cs/0701157.pdf&quot;&gt;another paper&lt;/a&gt; by Gray and friends to define what these systems had actually implemented.&lt;/p&gt;

&lt;h4 id=&quot;towards-a-better-understanding-and-better-system-design&quot;&gt;Towards a better understanding and better system design&lt;/h4&gt;

&lt;p&gt;Understanding weak isolation is not “just” an academic problem—it’s a problem database users face every day. If a user wants to make responsible use of weak isolation, she first needs to learn the particulars of each isolation model (which receive only partial treatment even in the best textbooks). Second, she has to manually translate the set of low-level read/write anomalies that define each model to her application logic. &lt;a href=&quot;http://db.cs.berkeley.edu/papers/socc13-consistency.pdf&quot;&gt;This is hard.&lt;/a&gt; With effort and a lot of education, it &lt;em&gt;can&lt;/em&gt; be done. But remember, if our user’s database doesn’t support serializability and she cares about correctness, it &lt;em&gt;must&lt;/em&gt; be done.  Is this good design? Is this the best we can do? Exposing models that benefit the systems builder rather than the end user is, in my opinion, antithetical to the database tradition and to empathetic, user-centric design.&lt;/p&gt;

&lt;p&gt;I think there are at least three fronts for making progress on this problem:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;As a community, I think it’s time to pay attention to and give existing weak isolation models the deep treatment they deserve. We’ve a few hints already. For example, my &lt;a href=&quot;http://www.bailis.org/pubs.html#coord-avoid&quot;&gt;dissertation work&lt;/a&gt; (and plenty of excellent &lt;a href=&quot;http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA200979&quot;&gt;related work&lt;/a&gt;) helps identify and apply conditions under which we can race and preserve correctness (and concurrent execution). But there’s much work to be done in specifically examining &lt;em&gt;existing&lt;/em&gt; and &lt;em&gt;widely-deployed&lt;/em&gt; models. There are numerous Ph.D. dissertations to be written in this space and, more importantly, serious potential for impact on practice.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As we develop new systems, we can avoid making the same mistakes as our architectural ancestors by focusing on applications, not mechanisms. I’d personally welcome a moratorium on writing papers on new isolation, data consistency, and read/write transaction models unless there’s a clear and specific set of motivating, &lt;em&gt;application-driven&lt;/em&gt; use cases.&lt;sup id=&quot;fnref:ra&quot;&gt;&lt;a href=&quot;#fn:ra&quot; class=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s real promise in moving beyond the read-write interface of traditional isolation models entirely and building concurrency control systems that operate at a higher level of abstraction (holy grail: the application level). This helps both systems and users. Systems can exploit greater concurrency and therefore provide higher performance and availability. Users don’t have to think about data races. This is a topic for another post, but work on &lt;a href=&quot;http://db.cs.berkeley.edu/papers/cidr11-bloom.pdf&quot;&gt;Bloom/CALM&lt;/a&gt;, &lt;a href=&quot;http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf&quot;&gt;CRDTs&lt;/a&gt;, and &lt;a href=&quot;http://www.bailis.org/pubs.html#coord-avoid&quot;&gt;I-confluent coordination avoidance&lt;/a&gt; hints at what’s achievable by reasoning about applications, not read/write histories.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
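&lt;p&gt;As a taste of that third direction, here’s a state-based G-counter, one of the simplest CRDTs. It’s a textbook sketch (not code from any of the linked papers): replicas increment independently and converge by merging, with no coordination at all.&lt;/p&gt;

```python
# A state-based G-counter CRDT: each replica increments only its own slot,
# and merge takes the element-wise max. Replicas converge to the same value
# in any merge order, with no coordination.

class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1           # only touch our own slot

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # two increments at replica A
b.increment()                  # one concurrent increment at replica B
a.merge(b); b.merge(a)         # anti-entropy in either direction
print(a.value(), b.value())    # both replicas read 3
```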

&lt;p&gt;We can do better, and there’s a ton of interesting research and design to be done. Go!&lt;/p&gt;

&lt;h4 id=&quot;notes&quot;&gt;Notes&lt;/h4&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:vldb&quot;&gt;
      &lt;p&gt;There’s actually &lt;em&gt;some&lt;/em&gt; research (see below) on this topic, but I don’t think we have a satisfying answer yet—and, from the work I’ve seen, we definitely don’t have an answer to explain the prevalence of these models &lt;em&gt;in practice&lt;/em&gt;. I don’t think that there’s a clear, unambiguous one, either: given no knowledge about your program semantics, any model other than serializable isolation can corrupt data. However, as an example of a paper I’d like to read, a &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;PBS-style&lt;/a&gt; white-box probabilistic analysis of Postgres, MySQL, or another RDBMS could be enlightening.&lt;/p&gt;

      &lt;p&gt;We had a fun discussion about this topic in the &lt;a href=&quot;http://www.vldb.org/2014/program/lib/FullProgram.html#D3F1530T1700R4&quot;&gt;session on transaction processing&lt;/a&gt; at VLDB this year. Again, no one was able to come up with a great answer to explain the prevalence of these models. However, there was a considerable amount of excitement in the room (perhaps also surprising given the number of papers on serializability and Snapshot Isolation)—much more than I’ve previously seen in the database community. &lt;a href=&quot;#fnref:vldb&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:distributed&quot;&gt;
      &lt;p&gt;To be clear, I think we should work on &lt;em&gt;both&lt;/em&gt; distributed data consistency and weak isolation. In fact, the solutions may be similar. For example, Read Committed is, according to the usual definitions, not much stronger than what most eventually consistent stores provide (at the minimum, there are few—if any—guarantees about the orderings of concurrent transaction execution). So, many approaches to coordination-free or coordination-avoiding distributed implementations may indeed apply to weak isolation, &lt;a href=&quot;http://www.bailis.org/pubs.html#coord-avoid&quot;&gt;even on a single server&lt;/a&gt;. The main difference I see (and what I’m agitating for in this post) is that, while there’s a considerable amount of interest in examining often &lt;em&gt;even more esoteric&lt;/em&gt; read/write data consistency models in the distributed setting, there’s a stunning lack of work on what many more people today are already running. It’s possible we can kill two birds with one stone by, say, moving beyond the read-write abstraction, which causes all sorts of problems once we drop serializability.&lt;/p&gt;

      &lt;p&gt;Also, it’s pretty funny to discuss non-serializable “weak” distributed data consistency models such as eventual consistency as if their inherent usability challenges are foreign to traditional data management systems (that, as I’ve discussed, often aren’t serializable either). &lt;a href=&quot;#fnref:distributed&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:addrefs&quot;&gt;
      &lt;p&gt;A few examples:&lt;/p&gt;

      &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;http://www.vldb.org/pvldb/2/vldb09-185.pdf&quot;&gt;“Quantifying Isolation Anomalies”&lt;/a&gt; by Fekete, Goldrei, and Asenjo, &lt;em&gt;VLDB 2009&lt;/em&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;http://arxiv.org/pdf/0909.1788.pdf&quot;&gt;“Building on Quicksand”&lt;/a&gt; by Helland and Campbell, &lt;em&gt;CIDR 2009&lt;/em&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/isolation-icde2000.pdf&quot;&gt;“Semantic conditions for correctness at different isolation levels”&lt;/a&gt; by Bernstein and Lewis, &lt;em&gt;ICDE 2000&lt;/em&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;http://www.vldb.org/pvldb/vol7/p181-bailis.pdf&quot;&gt;“Highly Available Transactions: Virtues and Limitations”&lt;/a&gt; by Bailis, Davidson, Fekete, Franklin, Ghodsi, Hellerstein, and Stoica, &lt;em&gt;VLDB 2014&lt;/em&gt;&lt;/li&gt;
      &lt;/ul&gt;
      &lt;p&gt;&lt;a href=&quot;#fnref:addrefs&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:open&quot;&gt;
      &lt;p&gt;Read: a potential high impact research goldmine.&lt;/p&gt;

      &lt;p&gt;Related note: I’m a big fan of papers on new techniques (e.g., program analysis and synthesis, or theory) for increasing the scalability or concurrency of applications. However, I wish we saw more discussion of the &lt;em&gt;applications&lt;/em&gt; themselves. Techniques are valuable in their own right, but the idea of analysis-as-design-tool can be powerful—this &lt;a href=&quot;http://www.pdos.lcs.mit.edu/papers/commutativity:sosp13.pdf&quot;&gt;paper from MIT&lt;/a&gt; does a great job in this regard. For example, was a previously expensive syscall or transaction non-scalable because it was simply implemented in a silly way? Did the new tool you’re describing &lt;em&gt;also&lt;/em&gt; teach you something about the actual intent of the application that might have been obfuscated by its implementation? Did it surprise you? I’m fascinated by techniques for determining &lt;em&gt;if&lt;/em&gt; weak isolation is safe but, even more so, &lt;em&gt;when&lt;/em&gt; (in terms of applications) and &lt;em&gt;why&lt;/em&gt; (in terms of programmer intent) it’s safe (or not). As a shameless plug, we’ve some work in the pipeline looking at a slew of open-source applications in this vein. Open source is great for this stuff—thank goodness for GitHub! &lt;a href=&quot;#fnref:open&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:ra&quot;&gt;
      &lt;p&gt;As a self-serving example, after our &lt;a href=&quot;http://www.vldb.org/pvldb/vol7/p181-bailis.pdf&quot;&gt;HAT research&lt;/a&gt; (on understanding which isolation levels are achievable without coordination/”AP”), we realized there was no existing isolation model that’d efficiently serve the indexing, materialized view, and multi-put requirements in Google’s Megastore, Facebook Tao, LinkedIn Espresso, Yahoo! PNUTS, and a number of open-source databases (you’d have to use Repeatable Read or Snapshot Isolation, both of which require coordination/”CP”). So we devised a new model, called &lt;a href=&quot;https://amplab.cs.berkeley.edu/wp-content/uploads/2014/04/ramp-sigmod2014.pdf&quot;&gt;Read Atomic (RA) isolation&lt;/a&gt; that directly addresses these use cases and is &lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;achievable via high-performance, coordination-free “AP” algorithms&lt;/a&gt;. Does RA further complicate this polluted space of isolation levels? You bet. But, when I talk to an end user, I can tell them “RAMP transactions ensure you don’t have dangling pointers in your global secondary index entries and also preserve foreign key relationships” rather than tell them “RAMP transactions prevent G0 (Write Cycles), G1a (Aborted Reads), G1b (Intermediate Reads), and G1c (Circular Information Flow), PMP (Predicate-Many-Preceders), and Fractured Reads but not G2 (Anti-dependency cycles), G-single, G-SIa (Interference) or G-SIb (Missed Effects).” The second explanation is meaningful (and rightly belongs in the paper!), but the former is immediately use-case driven.&lt;/p&gt;

      &lt;p&gt;(A side benefit of that HAT work along these lines: an existing application running on a coordinated implementation of a HAT isolation model could hypothetically run faster in a distributed/coordination-free manner.) &lt;a href=&quot;#fnref:ra&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Bridging the Gap: Opportunities in Coordination-Avoiding Databases</title>
   <link href="http://bailis.org/blog//bridging-the-gap-opportunities-in-coordination-avoiding-databases/"/>
   <updated>2014-04-22T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//bridging-the-gap-opportunities-in-coordination-avoiding-databases</id>
   <content type="html">&lt;style&gt;

#refs { font-size: 90%; line-height: 1.4; }
#refs p { margin-top: 6pt; margin-bottom: 6pt; }

&lt;/style&gt;

&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt; I recently co-organized the
  &lt;a href=&quot;http://eventos.fct.unl.pt/papec/pages/program&quot;&gt;Principles and Practice of Eventual Consistency&lt;/a&gt;
  workshop at EuroSys, where I also gave a talk on lessons from some
  of our &lt;a href=&quot;http://bailis.org/pubs.html&quot;&gt;recent research&lt;/a&gt;. This post
  contains a summary (in the form of my talk proposal). This is joint work
  with &lt;a href=&quot;http://sydney.edu.au/engineering/it/~fekete/&quot;&gt;Alan Fekete&lt;/a&gt;,
  &lt;a href=&quot;http://www.cs.berkeley.edu/~alig/&quot;&gt;Ali Ghodsi&lt;/a&gt;,
  &lt;a href=&quot;http://www.cs.berkeley.edu/~franklin/&quot;&gt;Mike Franklin&lt;/a&gt;,
  &lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot;&gt;Joe Hellerstein&lt;/a&gt;, and
  &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/&quot;&gt;Ion Stoica&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My slides&lt;/strong&gt; are &lt;a href=&quot;https://speakerdeck.com/pbailis/bridging-the-gap-opportunities-in-coordination-avoiding-database-systems&quot;&gt;available on Speaker Deck&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; &lt;em&gt;Weakly consistent systems surface a controversial tension
between, on the one hand, availability, latency, and performance, and,
on the other, programmability. We propose the concept of coordination
avoidance as a unifying, underlying principle behind the former and
discuss lessons from our recent experiences mitigating the latter.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;trouble-in-paradise&quot;&gt;Trouble in Paradise&lt;/h4&gt;

&lt;p&gt;Faced with the task of operating “always on” services and lacking
sufficient guidance from the literature regarding alternative
distributed designs and algorithms, many Internet service architects and
engineers throughout the 2000s discarded traditional database semantics
and transactional models in favor of weaker and less principled models:
eventual consistency, few if any multi-object, multi-operation (i.e.,
transactional) guarantees, and ad-hoc application-specific
compensation—collectively, much of “NoSQL” &lt;a href=&quot;#1&quot;&gt;[1]&lt;/a&gt;. From a research
perspective, this space of weaker models has proven to be a fertile
area: new (or re-discovered), often esoteric, and almost always nuanced
semantics present many opportunities for new systems design, optimizations,
and formal study.&lt;/p&gt;

&lt;p&gt;Unfortunately, these weaker models come with serious usability
disadvantages: programmability suffers. Understanding the implications
of non-serializable isolation models for end-user applications is
difficult.  Programmers have little practical guidance as to how to choose
an appropriate model for their applications, and understanding the
differences between models effectively requires graduate-level
training in distributed systems and/or database theory &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt;.
Some members within the internet services industry that birthed the
resurgence of interest in these semantics have begun a backlash
against them: one recent and prominent industrial account
unequivocally claims that “designing applications to cope with
concurrency anomalies in their data is…ultimately not worth the
performance gains” &lt;a href=&quot;#15&quot;&gt;[15]&lt;/a&gt;. Statements like these (which, in our
experience, enjoy some popularity among practitioners and considerable
acceptance in the database community &lt;a href=&quot;#17&quot;&gt;[17]&lt;/a&gt;) suggest that, as a
research community centered on weak consistency, we are possibly failing
to communicate and demonstrate the benefits achievable with these
semantics, have underestimated the burden placed on programmers, or a
combination of both.&lt;/p&gt;

&lt;h4 id=&quot;coordination-free-execution-a-unifying-principle-behind-ap-benefits&quot;&gt;Coordination-Free Execution: A Unifying Principle Behind “AP” Benefits&lt;/h4&gt;

&lt;p&gt;In tribute to the CAP Theorem &lt;a href=&quot;#11&quot;&gt;[11]&lt;/a&gt; that widely popularized
these trade-offs, much of the dialogue around weakly consistent models
concerns the availability of operations under failures. Availability
is an important property, but, in our opinion, a sole focus on
availability undervalues the benefits of weak semantics. Daniel Abadi
has successfully argued that, while “availability” is primarily relevant in
the presence of failures, weakly consistent (“AP”) systems can also
offer low latency &lt;a href=&quot;#1&quot;&gt;[1]&lt;/a&gt;. Any replica can respond to any request,
alleviating the need for many communication
delays—&lt;a href=&quot;http://www.bailis.org/blog/communication-costs-in-real-world-networks/&quot;&gt;in our experience&lt;/a&gt;,
average LAN latencies can be up to 720x lower than WAN latencies.&lt;/p&gt;

&lt;p&gt;We would take Abadi’s position even further: weak consistency also
allows aggressive scale-out, even at the level of a single data
item—more servers can be added without communication between
them. Modern, strongly consistent “NewSQL” systems can indeed provide
horizontal scale-out using shared-nothing database replication
techniques popularized in the 1980s &lt;a href=&quot;#16&quot;&gt;[16]&lt;/a&gt;. However, especially
for worst-case accesses, these systems are far from “as scalable as
NoSQL” &lt;a href=&quot;#15&quot;&gt;[15]&lt;/a&gt; systems offering weak isolation. In recent
research, we have examined the throughput penalties associated with
these “strong” models: in modern LAN and WAN networks, distributed
serializable transactions face worst-case throughput limits of 1200
and 12 read-write transactions per item per second, independent of
implementation strategy &lt;a href=&quot;#6&quot;&gt;[6]&lt;/a&gt;. Recent systems [&lt;a href=&quot;#10&quot;&gt;10&lt;/a&gt;,
&lt;a href=&quot;#15&quot;&gt;15&lt;/a&gt;] are no exception: operations over disjoint data items can
proceed concurrently, increasing throughput, but conflicting
operations over non-disjoint data items are limited by network
latency. In contrast, appropriate implementations of weak consistency
face no such overheads.&lt;/p&gt;
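&lt;p&gt;Those worst-case figures follow from simple arithmetic: each conflicting serializable transaction on a given item must wait at least one network round-trip, capping per-item throughput at 1/RTT. The round-trip times below are representative assumptions, not measurements from the paper:&lt;/p&gt;

```python
# Per-item throughput bound for conflicting serializable transactions:
# at most one conflicting commit per network round trip, i.e., 1/RTT.
# The RTT values are representative assumptions, not measured numbers.

def max_tx_per_item_per_sec(rtt_seconds):
    return 1.0 / rtt_seconds

lan_rtt = 0.83e-3   # ~0.83 ms round trip within a datacenter
wan_rtt = 83e-3     # ~83 ms round trip between distant datacenters

print(round(max_tx_per_item_per_sec(lan_rtt)))  # about 1200
print(round(max_tx_per_item_per_sec(wan_rtt)))  # 12
```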

&lt;p&gt;The three properties above—availability, low latency, and scale-out—are
consequences of a more fundamental principle underlying weakly
consistent systems: a lack of synchronous communication, or
&lt;em&gt;coordination&lt;/em&gt;, between concurrent operations. If operations can execute
coordination-free, they can run concurrently, on any available
resources, without communicating with or otherwise stalling concurrent
operations. The cost of coordination is easily and simultaneously cast
in the form of (minority) unavailability, latency (minimum 1 RTT), and
throughput (maximum 1/RTT). Moreover, and more
importantly, the concept of coordination-free execution is portable to
a range of system architectures: whereas traditional formulations of
availability are inherently tied to physical replication,
coordination-freedom is a property of the execution strategy and is
independent of physical deployment or topology. For example, a system
providing clients with snapshot reads can effectively act as a
coordination-free “replicated” system even if implemented by a set of
linearizable multi-versioned masters &lt;a href=&quot;#7&quot;&gt;[7]&lt;/a&gt;. Judicious use of strong
semantics in correct application execution equates to
&lt;em&gt;coordination-avoidance&lt;/em&gt; &lt;a href=&quot;#6&quot;&gt;[6]&lt;/a&gt;: the use of as little coordination
as possible.&lt;/p&gt;

&lt;h4 id=&quot;bridging-the-gap-experiences-applying-coordination-avoidance&quot;&gt;Bridging the Gap: Experiences Applying Coordination Avoidance&lt;/h4&gt;

&lt;p&gt;As F1’s authors highlight above, the decision to consider
coordination-avoiding algorithms or not requires a cost-benefit
judgment &lt;a href=&quot;#8&quot;&gt;[8]&lt;/a&gt;: will performance, availability, or latency
benefits outweigh the cost of ascertaining whether weak models are
sufficient?  In a sober assessment, many applications will likely be
able to (over-)pay for “strong consistency”: single-site operations
are inexpensive &lt;a href=&quot;#16&quot;&gt;[16]&lt;/a&gt;, while improvements in datacenter networks
&lt;a href=&quot;#19&quot;&gt;[19]&lt;/a&gt; lower the cost for non-geo-replicated systems. Yet, a
large class of applications—for example, non-partitionable
applications &lt;a href=&quot;#9&quot;&gt;[9]&lt;/a&gt;, applications with high mutation rates (i.e.,
write contention) &lt;a href=&quot;#18&quot;&gt;[18]&lt;/a&gt;, and geo-replicated applications
&lt;a href=&quot;#20&quot;&gt;[20]&lt;/a&gt;—will continue to be sensitive to extraneous coordination
costs and will likely necessitate further study.&lt;/p&gt;

&lt;p&gt;Identifying and serving this latter class of applications is paramount
to ensuring the future adoption of coordination-avoiding algorithms.
While this represents a difficult task, we offer three examples from our
recent research:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;High-value, well-specified applications are ripe for
optimization.&lt;/strong&gt; We have found success in coordination-avoiding
optimization of high-value workloads. As an example, we recently
combined our recent work on
&lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;RAMP transactions&lt;/a&gt;
with our recent results on
&lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;necessary and sufficient conditions for coordination-free execution&lt;/a&gt;
to attain a 25x improvement in throughput on the traditionally
challenging TPC-C OLTP benchmark [&lt;a href=&quot;#6&quot;&gt;6&lt;/a&gt;, &lt;a href=&quot;#18&quot;&gt;18&lt;/a&gt;]. As the
industry- and academic-standard benchmark for transactional
performance, TPC-C is accompanied by a rigorous specification for
compliance, which acted as a correctness specification in our
coordination analysis and was instrumental in guiding our implementation
strategy. Beyond TPC-C, we have encountered few database workloads and
integrity constraints that require coordination for &lt;em&gt;all&lt;/em&gt; queries:
rather, many queries are amenable to coordination-free execution,
while a handful (like TPC-C New-Order ID assignment) require
coordination for correctness. The resulting
challenge is two-fold: first, identify the operations that &lt;em&gt;do&lt;/em&gt;
require coordination (ideally few or none), and second, determine an
appropriate coordination-free execution plan for those that do not.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Applications on ACID databases (surprisingly often) do not use
ACID transactions.&lt;/strong&gt; Traditional databases also have an equivalent of
weak consistency: although they were not explicitly developed for
distributed environments, databases provide a range of “weak
isolation” models, such as Read Committed and Repeatable Read. Many
“ACID” databases today adopt these weak isolation guarantees by
default
(&lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;only 3 of 18 in a survey we recently performed provided serializability by default&lt;/a&gt;)
and sometimes as the strongest level supported (e.g., Oracle 11G)
&lt;a href=&quot;#5&quot;&gt;[5]&lt;/a&gt;. Existing applications deployed on these systems
necessarily either tolerate or otherwise compensate for these weak
semantics, hinting at opportunities for
optimization. Moreover, while many of these weak semantics are not
typically &lt;em&gt;implemented&lt;/em&gt; in a coordination-free manner (e.g., relying
on locking or validation protocols)—a remnant of their single-node
origins—they are often, in fact, implementable without resorting to
coordination. We recently classified commonly-deployed isolation
models as achievable via
&lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;&lt;em&gt;Highly Available Transactions&lt;/em&gt;&lt;/a&gt;
or not &lt;a href=&quot;#5&quot;&gt;[5]&lt;/a&gt;: existing applications deployed on “HAT” isolation
models are excellent candidates for study in coordination-free
environments.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;ACID databases are not built using ACID transactions.&lt;/strong&gt;
High-performance database internals are maintained using specialized,
highly optimized algorithms that are carefully designed to maximize safe
concurrency &lt;a href=&quot;#12&quot;&gt;[12]&lt;/a&gt; (e.g., in the case of indexing, exotic data
structures like B-link trees &lt;a href=&quot;#13&quot;&gt;[13]&lt;/a&gt;). The database designer does
not use serializable access to the database data structures for at least
two reasons. First, doing so would be prohibitively expensive, and,
second, the expert designer does not need to: she has a well-defined
specification (e.g., secondary index lookup behavior) that she can use
to ensure correctness (without end-user intervention).&lt;/p&gt;

    &lt;p&gt;In a distributed database system, coordination-avoiding techniques
are similarly applicable to internal data structure implementation. As
experts in both databases and fast distributed algorithms, we can
ensure that the anomalies of our weakly consistent but fast algorithms
do not interfere with application-level correctness; we can
encapsulate the side-effects of weak semantics behind well-defined
(and existing) interfaces. Our recent work on
&lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;Read Atomic Multi-Partition (RAMP) transactions&lt;/a&gt;
was developed in this context and, as motivating use cases, focuses on
foreign key constraint maintenance, distributed secondary indexing,
and materialized view maintenance &lt;a href=&quot;#7&quot;&gt;[7]&lt;/a&gt;. Our focus on coordination
avoidance yielded algorithms that consistently outperformed
alternatives, especially those based on synchronous coordination such as
locking.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, we have found success in &lt;em&gt;i.)&lt;/em&gt; focusing on existing
applications and &lt;em&gt;ii.)&lt;/em&gt; incorporating existing specifications to
enable coordination-avoiding execution. Towards these goals, our
ongoing research directly incorporates application-level invariants
(derived from real-world applications and the SQL language) for
analysis under a necessary and sufficient property for
coordination-free execution &lt;a href=&quot;#6&quot;&gt;[6]&lt;/a&gt;. The use of application-level
invariants is key to safely maximizing concurrency without requiring
programmer expertise in weak isolation models. By focusing on
&lt;em&gt;existing&lt;/em&gt; invariants and specifications as above, we can provide
tangible improvements without necessarily affecting end-user experience.&lt;/p&gt;

&lt;h4 id=&quot;a-coordinated-future&quot;&gt;A Coordinated Future&lt;/h4&gt;

&lt;p&gt;The continued success of weakly consistent systems requires a
demonstration of and focus on utility. Delivering an understanding of
exactly &lt;em&gt;why&lt;/em&gt; these weakly consistent semantics can provide (in a
fundamental sense) greater availability, lower latency, and higher
throughput is paramount; our proposed focus on &lt;em&gt;coordination
avoidance&lt;/em&gt; is our attempt at providing a unified answer. By applying
the lens of coordination avoidance to a range of existing,
well-defined and ideally high-value domains, we have the opportunity
to demonstrate exactly &lt;em&gt;when&lt;/em&gt; weak consistency is adequate and,
equally importantly, when it is not. Without a full specification,
language techniques [&lt;a href=&quot;#3&quot;&gt;3&lt;/a&gt;, &lt;a href=&quot;#4&quot;&gt;4&lt;/a&gt;] and library-based optimizations
&lt;a href=&quot;#14&quot;&gt;[14]&lt;/a&gt; are helpful to programmers. However, with a full
specification and increased knowledge of application semantics
[&lt;a href=&quot;#6&quot;&gt;6&lt;/a&gt;, &lt;a href=&quot;#7&quot;&gt;7&lt;/a&gt;], we can fully realize the benefits of coordination
avoidance while further mitigating programmer burden. While
coordination cannot always be avoided, we are bullish on a continued
ability to effectively manage it.&lt;/p&gt;

&lt;h4 id=&quot;references&quot;&gt;References&lt;/h4&gt;

&lt;div id=&quot;refs&quot;&gt;

  &lt;p&gt;&lt;a name=&quot;1&quot;&gt;&lt;/a&gt;[1]
D. J. Abadi. &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;Consistency tradeoffs in modern distributed database system design: CAP is only part of the story&lt;/a&gt;. &lt;em&gt;IEEE Computer&lt;/em&gt;, 45(2):37–42, 2012.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;2&quot;&gt;&lt;/a&gt;[2] P. Alvaro, P. Bailis, N. Conway, and J. M. Hellerstein. &lt;a href=&quot;http://www.bailis.org/papers/consistency-socc2013.pdf&quot;&gt;Consistency
without borders&lt;/a&gt;. In &lt;em&gt;SoCC 2013&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;3&quot;&gt;&lt;/a&gt;[3] P. Alvaro, N. Conway, J. M. Hellerstein, and D. Maier. Blazes:
&lt;a href=&quot;http://www.cs.berkeley.edu/~palvaro/ICDE14_conf_full_205.pdf&quot;&gt;Coordination analysis for distributed programs&lt;/a&gt;. In &lt;em&gt;ICDE
2014&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;4&quot;&gt;&lt;/a&gt;[4] P. Alvaro, N. Conway, J. M. Hellerstein, and W. Marczak. &lt;a href=&quot;http://db.cs.berkeley.edu/papers/cidr11-bloom.pdf&quot;&gt;Consistency
analysis in Bloom: a CALM and collected
approach&lt;/a&gt;. In &lt;em&gt;CIDR 2011&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;5&quot;&gt;&lt;/a&gt;[5] P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and
I. Stoica. &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;Highly Available Transactions: Virtues and Limitations&lt;/a&gt;. In
&lt;em&gt;VLDB 2014&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;6&quot;&gt;&lt;/a&gt;[6] P. Bailis, A. Fekete, M. J. Franklin, A. Ghodsi, J. M. Hellerstein, and
I. Stoica. &lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;Coordination-Avoiding Database Systems&lt;/a&gt;. &lt;em&gt;arXiv:1402.2237&lt;/em&gt;,
2014.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;7&quot;&gt;&lt;/a&gt;[7] P. Bailis, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica.
&lt;a href=&quot;http://www.bailis.org/papers/ramp-sigmod2014.pdf&quot;&gt;Scalable Atomic Visibility with RAMP Transactions&lt;/a&gt;. In &lt;em&gt;SIGMOD
2014&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;8&quot;&gt;&lt;/a&gt;[8] P. Bailis and A. Ghodsi. &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2462076&quot;&gt;Eventual Consistency
Today: Limitations, extensions, and beyond&lt;/a&gt;. &lt;em&gt;ACM Queue&lt;/em&gt;, 11(3), 2013.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;9&quot;&gt;&lt;/a&gt;[9] N. Bronson et al. &lt;a href=&quot;http://www.eecs.harvard.edu/cs261/papers/bronson-2013.pdf&quot;&gt;Tao: Facebook’s distributed data store for
the social graph&lt;/a&gt;. In &lt;em&gt;USENIX ATC 2013&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;10&quot;&gt;&lt;/a&gt;[10] J. C. Corbett et al. &lt;a href=&quot;http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf&quot;&gt;Spanner: Google’s globally-distributed database&lt;/a&gt;. In
&lt;em&gt;OSDI 2012&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;11&quot;&gt;&lt;/a&gt;[11] S. Gilbert and N. Lynch. &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;Brewer’s conjecture and the feasibility of
consistent, available, partition-tolerant web services&lt;/a&gt;. &lt;em&gt;ACM SIGACT News&lt;/em&gt;, 33(2):51–59,
2002.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;12&quot;&gt;&lt;/a&gt;[12] J. Gray and A. Reuter. Transaction Processing:
Concepts and Techniques. Morgan Kaufmann, 1993.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;13&quot;&gt;&lt;/a&gt;[13] P. L. Lehman and S. B. Yao. &lt;a href=&quot;http://www.cs.cmu.edu/~dga/15-712/F07/papers/Lehman81.pdf&quot;&gt;Efficient locking for concurrent operations
on B-trees&lt;/a&gt;. &lt;em&gt;ACM TODS&lt;/em&gt;, 6(4):650–670, 1981.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;14&quot;&gt;&lt;/a&gt;[14] M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. &lt;a href=&quot;http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf&quot;&gt;A
comprehensive study of convergent and commutative replicated data types&lt;/a&gt;.
Technical Report 7506, INRIA, 2011.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;15&quot;&gt;&lt;/a&gt;[15] J. Shute et al. &lt;a href=&quot;http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41344.pdf&quot;&gt;F1: A distributed SQL database that scales&lt;/a&gt;.
In &lt;em&gt;VLDB 2013&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;16&quot;&gt;&lt;/a&gt;[16]
M. Stonebraker. &lt;a href=&quot;http://pdf.aminer.org/000/255/770/the_case_for_shared_nothing.pdf&quot;&gt;The case for shared nothing&lt;/a&gt;.
&lt;em&gt;IEEE Database Engineering Bulletin&lt;/em&gt;, 9(1):4–9, 1986.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;17&quot;&gt;&lt;/a&gt;[17]
M. Stonebraker. &lt;a href=&quot;http://cacm.acm.org/blogs/blog-cacm/99512-why-enterprises-are-uninterested-in-nosql/fulltext&quot;&gt;Why enterprises are uninterested in NoSQL&lt;/a&gt;. &lt;em&gt;ACM
Queue Blog&lt;/em&gt;, September 2010.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;18&quot;&gt;&lt;/a&gt;[18] TPC Council. &lt;a href=&quot;http://www.tpc.org/tpcc/spec/tpcc_current.pdf&quot;&gt;TPC-C Benchmark Revision 5.11&lt;/a&gt;. 2010.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;19&quot;&gt;&lt;/a&gt;[19] Y. Xu, Z. Musgrave, B. Noble, and M. Bailey. &lt;a href=&quot;https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf&quot;&gt;Bobtail: avoiding long
tails in the cloud&lt;/a&gt;. In &lt;em&gt;NSDI 2013&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;20&quot;&gt;&lt;/a&gt;[20] M. Zawirski, A. Bieniusa, V. Balegas, S. Duarte, C. Baquero, M. Shapiro,
and N. Preguiça. &lt;a href=&quot;http://arxiv.org/pdf/1310.3107.pdf&quot;&gt;SwiftCloud: Fault-tolerant geo-replication
integrated all the way to the client machine&lt;/a&gt;.
&lt;em&gt;arXiv:1310.3107&lt;/em&gt;, 2013.&lt;/p&gt;

&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Without Conflicts, Serializability Is Free</title>
   <link href="http://bailis.org/blog//without-conflicts-serializability-is-free/"/>
   <updated>2014-04-14T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//without-conflicts-serializability-is-free</id>
   <content type="html">&lt;p&gt;Common pitches for modern, &lt;a href=&quot;http://en.wikipedia.org/wiki/Serializability&quot;&gt;serializable&lt;/a&gt; databases include claims that they are “as scalable as NoSQL,” they “combine the speed and scale advantages of NoSQL systems with ACID guarantees,” or they demonstrate that “the scalability, fault-tolerance, and performance of NoSQL databases are still achievable with [serializable] transactions.” These claims are somewhat misleading, and here’s why:&lt;/p&gt;

&lt;p&gt;Any two operations on the same data—at least one of which is a write—&lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter2.pdf&quot;&gt;can compromise serializability&lt;/a&gt;, or the illusion of a sequential execution, if executed concurrently. So, when executing transactions that read and write to the same data, a database will &lt;em&gt;have to&lt;/em&gt; stall some of the transactions in order to preserve serializability. Adding more servers won’t necessarily improve throughput: if a workload bottlenecks on read/write synchronization, adding physical resources like extra servers won’t help.&lt;/p&gt;

&lt;p&gt;In contrast, a NoSQL system like Riak or Cassandra offering &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2462076&quot;&gt;“weak” consistency&lt;/a&gt; can avoid these synchronization bottlenecks. Additional servers can process additional requests in parallel, without communicating. This provides availability, low latency, and scalability—even for single-record accesses—allowing literally unbounded throughput. Of course, there’s no free lunch: these scalable systems provide weaker guarantees that can—&lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;but do not always&lt;/a&gt;—compromise application-level consistency.&lt;/p&gt;

&lt;p&gt;However, for operations over disjoint data—that is, for transactions without read-write conflicts—serializable databases &lt;em&gt;can&lt;/em&gt; perform as well as weakly consistent systems. Under these workloads, there’s no need for synchronization between operations, which can safely execute concurrently. This is why I say that &lt;em&gt;without conflicts, serializability is free&lt;/em&gt;.&lt;/p&gt;
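&lt;p&gt;A toy illustration of this point (hypothetical code, not any real database’s scheduler): a conflict check over transaction read/write sets shows which pairs can safely run concurrently and which must be synchronized:&lt;/p&gt;

```python
# Toy conflict detection over transaction read/write sets. Two transactions
# conflict if they touch a common item and at least one of them writes it.

def conflicts(txn_a, txn_b):
    reads_a, writes_a = txn_a
    reads_b, writes_b = txn_b
    # write-write, read-write, or write-read overlap on the same items
    ww = writes_a.intersection(writes_b)
    rw = reads_a.intersection(writes_b)
    wr = writes_a.intersection(reads_b)
    return bool(ww or rw or wr)

t1 = ({"x"}, {"y"})   # reads x, writes y
t2 = ({"z"}, {"w"})   # disjoint data: free to run concurrently with t1
t3 = ({"y"}, {"x"})   # reads y, which t1 writes: must synchronize

print(conflicts(t1, t2))  # False: serializability is "free" here
print(conflicts(t1, t3))  # True:  the database must stall one of them
```

&lt;p&gt;When every pair of in-flight transactions looks like &lt;code&gt;t1&lt;/code&gt; and &lt;code&gt;t2&lt;/code&gt;, a serializable database can execute them in parallel on separate servers without stalling.&lt;/p&gt;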

&lt;p&gt;Will exploiting this disjoint data parallelism result in a quantum leap in distributed database design? Mike Stonebraker &lt;a href=&quot;http://pdf.aminer.org/000/255/770/the_case_for_shared_nothing.pdf&quot;&gt;would probably say “no”&lt;/a&gt;. Database system designs have optimized for data-parallel access patterns for decades: “shared nothing” serializable databases provide excellent programmability and perform well—just not for all workloads.&lt;/p&gt;

&lt;p&gt;Anyone providing strong semantics and claiming absolute performance, latency, or availability parity with “NoSQL” is either confused about database isolation, isn’t running workloads with conflicts, or is just trying to sell you a database. In practice, your mileage may vary: understand your read-write conflict patterns, and plan accordingly.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Scalable Atomic Visibility with RAMP Transactions</title>
   <link href="http://bailis.org/blog//scalable-atomic-visibility-with-ramp-transactions/"/>
   <updated>2014-04-07T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//scalable-atomic-visibility-with-ramp-transactions</id>
   <content type="html">&lt;p&gt;We recently wrote a paper that will appear at
&lt;a href=&quot;http://www.sigmod2014.org/&quot;&gt;SIGMOD&lt;/a&gt; called
&lt;a href=&quot;http://www.bailis.org/papers/ramp-sigmod2014.pdf&quot;&gt;Scalable Atomic Visibility with RAMP Transactions&lt;/a&gt;. This
post introduces RAMP Transactions, explains why you should care,
and briefly describes how they work.&lt;/p&gt;

&lt;h4 id=&quot;executive-summary&quot;&gt;Executive Summary&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;What they are:&lt;/strong&gt; We’ve developed three new algorithms—called Read
  Atomic Multi-Partition (RAMP) Transactions—for ensuring &lt;em&gt;atomic
  visibility&lt;/em&gt; in partitioned (sharded) databases: either all of a
  transaction’s updates are observed, or none are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why they’re useful:&lt;/strong&gt; In addition to general-purpose multi-key updates, atomic
  visibility is required for correctly maintaining foreign key
  constraints, secondary indexes, and materialized views.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why they’re needed:&lt;/strong&gt; Existing protocols like locking couple mutual
  exclusion and atomic visibility. Mutual exclusion in a distributed
  environment can lead to serious performance degradation and availability
  problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How they work:&lt;/strong&gt; RAMP transactions allow readers and writers to
  proceed concurrently. Operations race, but readers autonomously
  detect the races and repair any non-atomic reads. The write protocol
  ensures readers never stall waiting for writes to arrive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why they scale:&lt;/strong&gt; Clients can’t cause other clients to stall (via
  &lt;em&gt;synchronization independence&lt;/em&gt;) and clients only have to contact the
  servers responsible for items in their transactions (via &lt;em&gt;partition
  independence&lt;/em&gt;). As a consequence, there’s no mutual exclusion or
  synchronous coordination across servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The end result:&lt;/strong&gt; RAMP transactions outperform existing approaches
  across a variety of workloads, and, &lt;em&gt;for a workload of 95% reads,
  RAMP transactions scale to over 7 million ops/second on 100 servers
  at less than 5% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where the overhead goes:&lt;/strong&gt; Writes take 2 RTTs and attach either a constant (in
  RAMP-Small and RAMP-Hybrid algorithms) or linear (in RAMP-Fast)
  amount of metadata to each write, while reads take 1-2 RTTs
  depending on the algorithm.&lt;/p&gt;

&lt;h4 id=&quot;background&quot;&gt;Background&lt;/h4&gt;

&lt;p&gt;Together with colleagues at Berkeley and the University of Sydney,
I’ve spent &lt;a href=&quot;http://www.bailis.org/pubs.html&quot;&gt;the last several years&lt;/a&gt;
exploring the limits of coordination-free,
&lt;a href=&quot;http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed&quot;&gt;“AP”&lt;/a&gt;
execution in distributed
databases. &lt;a href=&quot;http://www.bailis.org/pubs.html#coord-avoid&quot;&gt;Coordination-free execution&lt;/a&gt;
is powerful: it guarantees a response from any non-failing servers,
provides low latency, and allows scale out, even at the granularity of
a single database record. Of course, coordination-free execution
permits races between concurrent operations, so the question is: what
useful guarantees are achievable under coordination-free execution?&lt;/p&gt;

&lt;p&gt;We &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;recently&lt;/a&gt; spent
time studying traditional database isolation levels from the
perspective of coordination-free execution. In this work, we realized
that a classic property from database systems was achievable without
coordination—even though almost all databases implement it
using expensive mechanisms!  That is, we realized that we could
provide the atomic visibility property—either all updates should be
visible to another transaction or none are—without resorting to
typical strategies like locking. We introduced an early version of our
algorithms in a paragraph of our
&lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;VLDB 2014 paper&lt;/a&gt; on
&lt;a href=&quot;../hat-not-cap-introducing-highly-available-transactions/&quot;&gt;Highly Available Transactions&lt;/a&gt;
but felt there was more work to be done. The
&lt;a href=&quot;https://news.ycombinator.com/item?id=5781040&quot;&gt;positive reaction&lt;/a&gt; to
&lt;a href=&quot;../non-blocking-transactional-atomicity/&quot;&gt;my post&lt;/a&gt; on our prior
algorithm further inspired us to write a full paper.&lt;/p&gt;

&lt;h4 id=&quot;why-atomic-visibility&quot;&gt;Why Atomic Visibility?&lt;/h4&gt;

&lt;p&gt;It turns out that many use cases require atomic visibility: either all
or none of a transaction’s updates should be visible to other
transactions. For example, if I update a table in my database, any
secondary indexes associated with that table should also be
updated. In the paper, we outline three use cases in detail:
&lt;a href=&quot;http://en.wikipedia.org/wiki/Foreign_key&quot;&gt;foreign key constraints&lt;/a&gt;,
&lt;a href=&quot;http://en.wikipedia.org/wiki/Database_index&quot;&gt;secondary indexing&lt;/a&gt;, and
&lt;a href=&quot;http://en.wikipedia.org/wiki/Materialized_view&quot;&gt;materialized view maintenance&lt;/a&gt;. These
use cases crop up in a surprising number of real-world
applications. For example, the authors of
&lt;a href=&quot;https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson&quot;&gt;Facebook’s Tao&lt;/a&gt;,
&lt;a href=&quot;http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p1135-qiao.pdf&quot;&gt;LinkedIn Espresso&lt;/a&gt;,
and
&lt;a href=&quot;http://www.mpi-sws.org/~druschel/courses/ds/papers/cooper-pnuts.pdf&quot;&gt;Yahoo! PNUTS&lt;/a&gt;
each describe how users can observe inconsistent data due to fast but
non-atomic updates to multiple entities and secondary indexes in their
systems. These systems have chosen scalability over correctness. We
wanted to know: can we provide these use cases with both scalability &lt;em&gt;and&lt;/em&gt;
correctness?&lt;/p&gt;

&lt;p&gt;If you pick up a database textbook today and try to figure out how to
implement atomic visibility, you’re likely to end up with a solution
like two-phase locking or some version of optimistic concurrency
control. This works great on a single machine, but, as soon as your
operations span multiple servers, you’re likely to run into problems:
if an operation holds a lock while waiting on an RPC, any concurrent
lock requests are going to have to wait. Throughput effectively drops
to 1/(Round Trip Time): a few thousand requests per second on a modern
cloud computing network like EC2. To allow partitioned (sharded)
operations to perform well, we’ll have to avoid any kind of blocking,
or synchronous coordination.&lt;/p&gt;

&lt;p&gt;Our thesis in this work is that &lt;em&gt;traditional approaches like locking
 (often unnecessarily) couple atomic visibility and mutual
 exclusion. We can offer the benefits of the former without the
 penalties of the latter.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;ramp-internals-in-brief&quot;&gt;RAMP Internals, In Brief&lt;/h4&gt;

&lt;p&gt;In RAMP transactions, we allow reads and writes to proceed
concurrently over the same data. This provides excellent scalability,
but it introduces a race condition: what if a transaction only
observes a subset of another, concurrent transaction’s writes? We address this
race condition via two mechanisms (for now, assume read-only and
write-only transactions):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Use metadata to detect the race:&lt;/strong&gt; Clients attach metadata to
each write that allows other clients to detect partial reads from
in-flight writes. For example, in the RAMP-Fast algorithm, writers
attach a &lt;a href=&quot;https://news.ycombinator.com/item?id=5781309&quot;&gt;unique&lt;/a&gt;,
per-transaction timestamp and a list of items written in the
transaction. Clients can combine all of the metadata from versions
they have &lt;em&gt;actually&lt;/em&gt; read to determine the highest timestamp that
they &lt;em&gt;should have&lt;/em&gt; read for each item. If any item they read from
servers has a lower timestamp than calculated, it means that the
write was in-flight at the time of the read. The client can
subsequently fetch the right version(s) from the server(s) in a
second (parallel) set of requests and return the resulting set of
versions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Prevent readers from waiting:&lt;/strong&gt; To prevent readers from ever
having to wait for the “right version” to arrive at a partition,
writes proceed in two rounds. The first round places each write on
its respective partition but doesn’t yet make it visible to
readers (i.e., &lt;em&gt;prepares&lt;/em&gt; the write). The second round makes each
write visible to readers (i.e., &lt;em&gt;commits&lt;/em&gt; the write). This means
that, if a reader observes a write from a transaction, all of the
other writes the reader might request are guaranteed to be present
on their respective servers. The reader simply has to identify the
right version (using metadata above) and the server can return it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
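&lt;p&gt;The two mechanisms above can be sketched in a few dozen lines. The following is a simplified, single-process model in the spirit of RAMP-Fast (an illustration only, not the paper’s implementation: it takes globally unique timestamps as given, ignores failures, and folds the partitions into local objects):&lt;/p&gt;

```python
# Simplified sketch of RAMP-Fast-style reads and two-round writes.

class Partition:
    def __init__(self):
        self.versions = {}      # (item, ts) maps to (value, ts, item_set)
        self.last_commit = {}   # item maps to its highest committed ts

    def prepare(self, item, value, ts, item_set):
        # Round 1 of a write: place the version, still invisible to readers.
        self.versions[(item, ts)] = (value, ts, frozenset(item_set))

    def commit(self, item, ts):
        # Round 2 of a write: make the version visible to readers.
        self.last_commit[item] = max(self.last_commit.get(item, 0), ts)

    def get(self, item, required_ts=None):
        ts = required_ts if required_ts is not None else self.last_commit.get(item, 0)
        return self.versions.get((item, ts), (None, 0, frozenset()))

def write_all(partitions, updates, ts):
    item_set = set(updates)
    for item, value in updates.items():   # round 1: prepare everywhere
        partitions[item].prepare(item, value, ts, item_set)
    for item in updates:                  # round 2: commit everywhere
        partitions[item].commit(item, ts)

def ramp_fast_read(partitions, items):
    # First round: read each item's latest committed version.
    result = {item: partitions[item].get(item) for item in items}
    # Use the attached metadata to compute the highest timestamp we
    # *should have* observed for each item in our read set.
    required = {}
    for _value, ts, item_set in result.values():
        for sibling in item_set:
            if sibling in items and ts > required.get(sibling, 0):
                required[sibling] = ts
    # Second round: fetch any (prepared) version the first round missed.
    for item, ts in required.items():
        if ts > result[item][1]:
            result[item] = partitions[item].get(item, required_ts=ts)
    return {item: v[0] for item, v in result.items()}

partitions = {"x": Partition(), "y": Partition()}
write_all(partitions, {"x": "a1", "y": "b1"}, ts=1)
print(ramp_fast_read(partitions, ["x", "y"]))  # {'x': 'a1', 'y': 'b1'}
```

&lt;p&gt;To see the repair in action, prepare a write of &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; at a higher timestamp but commit it only on &lt;code&gt;x&lt;/code&gt;: a reader of both items notices from &lt;code&gt;x&lt;/code&gt;’s metadata that it should also see the newer &lt;code&gt;y&lt;/code&gt;, and fetches the prepared version in its second round rather than blocking.&lt;/p&gt;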

&lt;p&gt;The three RAMP algorithms we develop in the paper offer a trade-off
between metadata size and round trip times (RTTs) required for
reads. RAMP-Fast requires 1 RTT in the race-free case and 2 RTTs in
the event of a race but stores the list of keys written in the
transaction as metadata. RAMP-Small always requires 2 RTTs for reads
but only uses a single timestamp as metadata. RAMP-Hybrid uses a Bloom
filter to compress the RAMP-Fast metadata and requires between 1 and 2
RTTs per read (determined by the Bloom filter false positive
rate). Writes always require 2 RTTs.&lt;/p&gt;

&lt;p&gt;As we show in the paper, RAMP-Fast and RAMP-Hybrid perform especially
well for read-dominated workloads; there’s no overhead on reads that
don’t race, and, for reads that &lt;em&gt;do&lt;/em&gt; race writes, reads take an extra
RTT. The worst-case throughput penalty is about 50% due to extra RTTs
(either by using RAMP-Small or executing write-heavy workloads), and
we observed a linear relationship between write proportion and
throughput penalty. None of the existing algorithms we benchmarked
(and describe in the paper) performed as well.&lt;/p&gt;

&lt;h4 id=&quot;scalability-for-reals&quot;&gt;Scalability, For Reals&lt;/h4&gt;

&lt;p&gt;The term “scalability” is badly abused these days. (It’s almost as
meaningless as the
&lt;a href=&quot;../when-is-acid-acid-rarely/&quot;&gt;use of the term “ACID” in modern databases&lt;/a&gt;.) When
devising the RAMP transaction protocols, we wanted a more rigorous set
of criteria to capture our goals for “scalability” in a partitioned
database. We decided on the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Synchronization independence&lt;/strong&gt; prevents one client’s operations
from causing another’s to stall or fail. This means locks are out
of the question.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Partition independence&lt;/strong&gt; means that each client only has to
contact partitions responsible for items it wants to access.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can (and, in the paper, do) prove that RAMP transactions satisfy
these two criteria. Moreover, when we actually implemented and
experimentally compared our algorithms to existing algorithms that
failed one or both of these properties, the effects were measurable:
algorithms that lacked partition independence required additional
communication (in pathological cases, a lot more), while algorithms
that lacked synchronization independence simply didn’t work well under
high read/write contention. You can see the gory details in Section 5
and read more about these properties in Section 3
&lt;a href=&quot;http://www.bailis.org/papers/ramp-sigmod2014.pdf&quot;&gt;of the paper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note that, by providing the above criteria, we’re disallowing some
useful semantics. For example, synchronization independence
effectively prevents writers from stalling other writers. So, if you
might have update conflicts that you need to resolve up-front, you
shouldn’t use these algorithms and you should expect to incur
throughput penalties due to coordination.&lt;/p&gt;

&lt;h4 id=&quot;for-distributed-systems-nerds&quot;&gt;For Distributed Systems Nerds&lt;/h4&gt;

&lt;p&gt;If you’ve made it this far, you’re probably a distributed systems nerd
(welcome!). As a distributed systems researcher, there are two aspects
of our approach that I think are particularly interesting:&lt;/p&gt;

&lt;p&gt;First, we use an
&lt;a href=&quot;http://www.cs.jhu.edu/~yairamir/cs437/week8/sld011.htm&quot;&gt;atomic commitment protocol (ACP)&lt;/a&gt;
for writes. However, every ACP can
&lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter7.pdf&quot;&gt;(provably) block&lt;/a&gt;
during failures. (Remember, AC is
&lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27.6456&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;harder than consensus&lt;/a&gt;!)
Our observation is that ACPs aren’t harmful on their own; ACPs get a
bad rap because they’re usually used &lt;em&gt;along with&lt;/em&gt; blocking
concurrency control primitives like locking! By employing an ACP with
non-blocking concurrency control and semantics, we allow individual
ACP rounds to stall without blocking non-failed clients. We talk about
(and evaluate the overhead of) unblocking blocked ACP rounds in the paper.&lt;/p&gt;

&lt;p&gt;Second, in the paper, we use “AP” (i.e., coordination-free) algorithms in a
“CP” environment. That is, we can execute RAMP transactions in an
available active-active (multi-master) environment, but we instead
chose to implement them in a single-master-per-partition
system. Master-based systems can also benefit from coordination-free
algorithms; it’s as if each concurrent operation gets to execute on
&lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;its own (logical) replica&lt;/a&gt;, plus
we can make strong guarantees on data recency. For example, after a 
write completes, all later transactions are guaranteed 
&lt;a href=&quot;http://stackoverflow.com/questions/8871633/whats-the-difference-between-safe-regular-and-atomic-registers&quot;&gt;to observe its effects&lt;/a&gt;. This
probably deserves another post, but I see the application of
coordination-free algorithms to otherwise linearizable systems as an exciting
area for further research.&lt;/p&gt;

&lt;h4 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;RAMP Transactions provide a scalable alternative to locking and other
coordination-intensive solutions to atomic visibility. We’ve found
them very useful (e.g.,
&lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;our recent 25x improvement over the prior best on the TPC-C New-Order benchmark&lt;/a&gt;
was implemented using a variant of RAMP-Fast), and there’s a
principled reason for their performance: synchronization and partition
independence. Please check out our
&lt;a href=&quot;http://www.bailis.org/papers/ramp-sigmod2014.pdf&quot;&gt;SIGMOD paper&lt;/a&gt; for
more details, and many thanks for the encouraging feedback so
far. Happy scaling!&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Causality Is Expensive (and What To Do About It)</title>
   <link href="http://bailis.org/blog//causality-is-expensive-and-what-to-do-about-it/"/>
   <updated>2014-02-05T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//causality-is-expensive-and-what-to-do-about-it</id>
   <content type="html">&lt;p&gt;In this post, I briefly motivate the use of causality in distributed
systems, discuss (likely) fundamental lower bounds on metadata
overheads required to capture it, and discuss four strategies for
circumventing these overheads.&lt;/p&gt;

&lt;h3 id=&quot;why-care-about-causality&quot;&gt;Why care about causality?&lt;/h3&gt;

&lt;p&gt;In 1978, Leslie Lamport introduced the important concept of &lt;a href=&quot;http://www.stanford.edu/class/cs240/readings/lamport.pdf&quot;&gt;partial
ordering in distributed
systems&lt;/a&gt;:
given a partial view over global system state, how can we safely say
whether a particular event “happens before” another? Instead of
relying on a total order (e.g., using synchronized clocks) to order
events, Lamport’s proposed &lt;a href=&quot;http://en.wikipedia.org/wiki/Happened-before&quot;&gt;“happens-before”
relation&lt;/a&gt; captures
dependencies between events as a &lt;a href=&quot;http://book.mixu.net/distsys/time.html&quot;&gt;partial
order&lt;/a&gt;: “happens-before”
reflects the order of events within each process as well as the order
of events across processes, captured via message channels. This
formulation conveniently means that reasoning about “happens-before”
does not require synchronous coordination between processes and also
captures the possibility that two events may be completely independent
of one another (i.e., are concurrent; just like &lt;a href=&quot;http://en.wikipedia.org/wiki/Light_cone&quot;&gt;light
cones&lt;/a&gt; in the real
world). Accordingly, “happens-before” is a powerful concept and forms
the basis of &lt;em&gt;causality&lt;/em&gt; in distributed systems, which is used in many
contexts:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Distributed snapshot algorithms (e.g., &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.63.4399&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;consistent
  cuts&lt;/a&gt;)
  and global predicate detection algorithms typically leverage causal
  ordering for efficient execution (e.g., enabling consistent snapshots
  without forcing processes to pause). This is particularly useful in
  debugging.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Version_vector&quot;&gt;Version vectors&lt;/a&gt;
  are used in databases like Dynamo, Riak, and Voldemort to track
  concurrent updates to data and &lt;a href=&quot;http://zoo.cs.yale.edu/classes/cs422/2013/bib/terry95managing.pdf&quot;&gt;manage update
  conflicts&lt;/a&gt;
  without fast-path synchronization between replicas.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Causal_consistency&quot;&gt;Causal
  consistency&lt;/a&gt; and
  &lt;a href=&quot;http://www.info.ucl.ac.be/courses/SINF2345/2010-2011/slides/5b-causal-broadcast-hand.pdf&quot;&gt;causal
  broadcast&lt;/a&gt;
  provide databases and messaging systems with ordering guarantees
  that respect Lamport’s “happens-before” relation. This means, for
  example, that replies on Twitter won’t be seen without their parent
  Tweets. These two use cases in particular have recently seen a
  resurgence of interest in the research community.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a theoretical construct and, increasingly, in real-world
distributed systems, causality is important. I’ll defer a full
description and discussion of causality to the expansive literature
(&lt;a href=&quot;http://aqualab.cs.northwestern.edu/component/attachments/download/302&quot;&gt;here’s&lt;/a&gt;
a survey, and &lt;a href=&quot;http://www.vs.inf.ethz.ch/publ/papers/holygrail.pdf&quot;&gt;here’s my
favorite&lt;/a&gt;—check
out that subtitle!). Instead, I want to ask a specific question: what
does causality cost?&lt;/p&gt;

&lt;h3 id=&quot;causality-is-expensive&quot;&gt;Causality is expensive&lt;/h3&gt;

&lt;p&gt;To use causal ordering, we need some way to &lt;em&gt;capture&lt;/em&gt; it via a data
structure or other piece of information. There are a variety of
techniques for doing so in the literature that you may have heard of,
like &lt;a href=&quot;http://en.wikipedia.org/wiki/Vector_clock&quot;&gt;vector clocks&lt;/a&gt; (note
that the related &lt;a href=&quot;http://en.wikipedia.org/wiki/Lamport_timestamps&quot;&gt;Lamport
clocks&lt;/a&gt; don’t allow
us to distinguish between “concurrent” and “earlier” events). If
you’re familiar with vector clocks, you’ll know that each process in
the system requires a position in the data structure; this means that,
with N processes, each vector clock takes up O(N) space.&lt;/p&gt;
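&lt;p&gt;To make that O(N) cost concrete, here is a minimal vector clock sketch in Python (illustrative only; the names and API are mine, not from any particular system):&lt;/p&gt;

```python
# Minimal vector clock: one integer slot per process, so O(N) space,
# and the full vector rides on every message.
class VectorClock:
    def __init__(self, num_processes, pid):
        self.clock = [0] * num_processes   # one entry per process
        self.pid = pid

    def local_event(self):
        self.clock[self.pid] += 1

    def send(self):
        # sending is an event; the whole O(N) vector is the metadata
        self.local_event()
        return list(self.clock)

    def receive(self, other):
        # merge element-wise, then count the receive as a local event
        self.clock = [max(a, b) for a, b in zip(self.clock, other)]
        self.local_event()

def happens_before(a, b):
    # a happens-before b iff b dominates a element-wise (and a differs from b)
    return a != b and all(y >= x for x, y in zip(a, b))

def concurrent(a, b):
    # neither dominates: the events are causally independent
    return not happens_before(a, b) and not happens_before(b, a)
```

&lt;p&gt;Unlike a scalar Lamport clock, comparing two vectors tells us whether one event happened before the other or whether they were concurrent; the price is the per-message vector of size N.&lt;/p&gt;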

&lt;p&gt;This leads to a tough question: how much space is &lt;em&gt;required&lt;/em&gt; in order
to capture causality? This is a difficult question to answer, but it’s
fascinating to think about and has serious implications for our above
use cases. Fortunately, Bernadette Charron-Bost thought seriously
about this problem, and, in 1991, published &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=117606&quot;&gt;a surprising
result&lt;/a&gt;; the actual paper is
fairly hairy, but Schwarz and Mattern &lt;a href=&quot;http://www.vs.inf.ethz.ch/publ/papers/holygrail.pdf&quot;&gt;summarize
well&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Is there a way to find a “better” timestamping algorithm based on
smaller time vectors which truly characterizes causality? As it
seems, the answer is negative. Charron-Bost showed…that causality
can be characterized only by vector timestamps of size N.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wow. Charron-Bost’s result seems to imply that we can’t use less than
O(N) metadata! For small numbers of processes, this isn’t so bad, but,
if we scale to hundreds or thousands of nodes, &lt;em&gt;each&lt;/em&gt; message (or, in
a database, operation) is going to require a lot of metadata. Schwarz
and Mattern (do you recognize Mattern from earlier?) continue:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It is not immediately evident that — for a more sophisticated type
 of vector order than &amp;lt; — a smaller vector could not suffice to
 characterize causality, although the result of Charron-Bost seems to
 indicate that this is rather unlikely…A definite theorem about the
 size of vector clocks would require some statement about the minimum
 amount of information that has to be contained in timestamps in
 order to define a partial order of dimension N on them. Finding
 such an information theoretical proof is still an open problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, we don’t have a definitive proof, but, in all likelihood, we’re
not going to do better. Moreover, in the now 23 years following this
result, we haven’t seen anyone do better.&lt;/p&gt;

&lt;p&gt;Intuitively, I think of the lower bound as follows: if I’m a process,
and I want to perform an event, I need some way to distinguish my new
event from all of the prior events that I’ve performed. This hints
that I’ll need some sort of unique marker for my event—as in a
vector clock, I can use a local timestamp that I increment on every
event (which requires O(log(events)) space). Now, if &lt;em&gt;every&lt;/em&gt; other
process simultaneously wants to perform a new event, then we’ll
collectively need N timestamps. We can’t coalesce these timestamps,
since they’re due to unique events, so this puts us at (at least) O(N)
metadata! Some &lt;a href=&quot;http://research.microsoft.com/pubs/201602/replDataTypesPOPL13-complete.pdf&quot;&gt;recent
results&lt;/a&gt;
from Microsoft Research and the CRDT team show similar bounds for
vector-based data structures.&lt;/p&gt;

&lt;h3 id=&quot;and-what-to-do-about-it&quot;&gt;…and what to do about it&lt;/h3&gt;

&lt;p&gt;There are many optimizations that reduce the overhead of causal
tracking in the best case, but these &lt;a href=&quot;http://en.wikipedia.org/wiki/Murphy&#39;s_law&quot;&gt;worst-case
overheads&lt;/a&gt; are too costly
for many modern services running at scale. (Perhaps surprisingly, many
modern implementations are even more expensive, with worst-case
metadata overheads that are linear in the number of events or the
number of keys in a database.) If you’re interested, we wrote a paper
a while ago about &lt;a href=&quot;http://www.bailis.org/papers/explicit-socc2012.pdf&quot;&gt;how bad this overhead can
become&lt;/a&gt;
(&lt;a href=&quot;http://vimeo.com/51578973&quot;&gt;voiceover&lt;/a&gt;) for causally consistent
databases backing modern internet services.&lt;/p&gt;

&lt;p&gt;Can we do anything to avoid these overheads?&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Restrict the set of participants:&lt;/em&gt; To reduce the O(N) factor, we
  can reduce N, or the number of processes across which we track
  causal information. For example, if we’re building a distributed
  database, we can simply track causality across replicas of each
  data item instead of causality across all servers. This sacrifices
  causal guarantees across data items but allows us to detect update
  conflicts for a single data item and is exactly the strategy adopted
  by &lt;a href=&quot;http://en.wikipedia.org/wiki/Version_vector&quot;&gt;version
  vectors&lt;/a&gt;. In most
  systems, the number of replicas for an item is much smaller than the
  number of servers in the system (e.g., 3 vs. 100), so this is a
  substantial reduction &lt;em&gt;in practice&lt;/em&gt;. (Carlos Baquero has a &lt;a href=&quot;http://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/&quot;&gt;good
  post&lt;/a&gt;
  on this distinction.)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Explicitly specify relevant relationships:&lt;/em&gt; The above discussion
  assumes that all events matter equally; in practice, this isn’t
  necessarily the case. On Twitter, when a user posts a reply to a
  Tweet, the causal relationship between the reply and the parent
  Tweet is—from a UX perspective—more important than the
  relationship between all of the Tweets the user read at login and
  her new reply. Effectively, if traditional forms of causality (i.e.,
  &lt;em&gt;potential causality&lt;/em&gt;) treat all &lt;em&gt;possible&lt;/em&gt; (transitive) influences
  equally, what if we could &lt;em&gt;explicitly&lt;/em&gt; specify which partial orders
  matter? In our Twitter example, tracking this &lt;em&gt;explicit causality&lt;/em&gt;
  would only require a metadata overhead of O(1) for the “reply-to”
  relationship. The trade-off is that (like foreign key dependencies
  in database systems), the user now has to specify her causal
  dependencies manually at write time; our
  &lt;a href=&quot;http://www.bailis.org/papers/explicit-socc2012.pdf&quot;&gt;paper&lt;/a&gt; I
  mentioned earlier describes this strategy in greater detail.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Reduce availability:&lt;/em&gt; The problem with reducing the set of
  participants or using explicit causality is that we will necessarily
  throw away some causal dependencies. The upshot is that we were able
  to reduce metadata while preserving availability. An alternative
  strategy is to attempt to compress causality by restricting
  availability: if we bound the number of processes that can
  simultaneously perform operations to a constant factor K, we only
  need K entries in our vector at any given time (i.e., to perform an
  operation, a process must “reserve” a spot in the vector, then
  “catch up” to the current vector position in the causal
  history—by, say, processing the events created and received by the
  prior occupant of the position).  Under this strategy, metadata size
  determines maximum concurrency; in the limit, with K=1, we have a
  total order on events (close—if not identical
  to—&lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;linearizability&lt;/a&gt;). With
  this strategy, we’ve reduced metadata overhead by sacrificing availability and
  forcing some processes to effectively “share” causal dependencies.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Drop happens-before entirely:&lt;/em&gt; If we don’t want to suffer
  metadata overheads, require programmer intervention, or sacrifice
  availability, we can always use a weaker partial order (i.e., weaker
  but still available model). For example, if, in a database, we
  simply want each user to read her writes, we don’t (necessarily)
  need any metadata and can simply use &lt;a href=&quot;http://www.bailis.org/blog/stickiness-and-client-server-session-guarantees/&quot;&gt;sticky routing
  policies&lt;/a&gt;. Vanilla
  &lt;a href=&quot;http://www.bailis.org/blog/safety-and-liveness-eventual-consistency-is-not-safe/&quot;&gt;eventual
  consistency&lt;/a&gt;
  is even cheaper. Of course, this
  &lt;a href=&quot;http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks&quot;&gt;strategy&lt;/a&gt;
  can &lt;a href=&quot;http://aphyr.com/posts/294-call-me-maybe-cassandra/&quot;&gt;clearly
  compromise&lt;/a&gt;
  application consistency because we lose the ability to distinguish
  between concurrent writes and overwrites to the same item, but, on
  the plus side, it doesn’t get much cheaper!&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
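&lt;p&gt;As a concrete illustration of the second strategy, here is a toy replica that enforces only an explicit “reply-to” dependency, carrying O(1) metadata per write (a sketch; the data model and names are mine):&lt;/p&gt;

```python
# Explicit causality: each write carries at most one parent id
# (O(1) metadata) instead of an O(N) vector; the replica delays a
# write only until its declared parent is visible.
class Replica:
    def __init__(self):
        self.visible = {}   # id -> payload, shown to readers
        self.pending = []   # writes whose parent hasn't arrived yet

    def apply(self, write_id, payload, reply_to=None):
        self.pending.append((write_id, payload, reply_to))
        self._drain()

    def _drain(self):
        # reveal any pending write whose explicit dependency is satisfied
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                write_id, payload, parent = entry
                if parent is None or parent in self.visible:
                    self.visible[write_id] = payload
                    self.pending.remove(entry)
                    progress = True
```

&lt;p&gt;A reply that arrives before its parent Tweet stays pending until the parent is applied; every other ordering relationship is deliberately ignored, which is precisely the trade-off described above.&lt;/p&gt;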

&lt;p&gt;It’s also important to remember that, regardless of the model we
  choose, if we want true “availability”, we necessarily &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;lose the
  ability to make many useful
  guarantees&lt;/a&gt;, like
  preventing concurrent updates. There’s no free lunch, but, given
  that not all “weak” models are created equal (at least in terms of
  metadata cost), sometimes it makes sense to drop full causal
  ordering across all events and all processes and settle for
  enforcing a less costly alternative.&lt;/p&gt;

&lt;h3 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h3&gt;

&lt;p&gt;Causality is an immensely powerful concept in distributed systems, but
it’s unlikely that we’ll discover a more compact, sub-linear
representation that is sufficient to characterize it. I have no doubt
that causality will remain important for debugging and reasoning about
global states of distributed computations and am excited by the recent
work in causally consistent distributed systems (full disclosure: I
spent &lt;a href=&quot;http://www.bailis.org/papers/bolton-sigmod2013.pdf&quot;&gt;some time on
this&lt;/a&gt; earlier in
my Ph.D.). As researchers, it’s our job to push the envelope, and
understanding the compromises required in light of the (likely)
fundamental trade-offs I’ve described is a worthwhile
exercise. However, given the worst-case overheads of causality
tracking—at least in real-world deployments—and lack of a more
compact counterexample, I’m more bullish on the four alternatives I’ve
outlined.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Stickiness and Client-Server Session Guarantees</title>
   <link href="http://bailis.org/blog//stickiness-and-client-server-session-guarantees/"/>
   <updated>2014-01-13T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//stickiness-and-client-server-session-guarantees</id>
   <content type="html">&lt;h4 id=&quot;session-guarantees&quot;&gt;Session Guarantees&lt;/h4&gt;

&lt;p&gt;One of the most common consistency requirements I encounter in modern
web services is a guarantee called “read your writes” (RYW): each
user’s reads should reflect that user’s prior writes. This means that,
for example, once I successfully post a Tweet, I’ll be able to read it
after a page refresh. Without RYW, I have no idea whether my update
succeeded or was lost, and I might end up posting &lt;em&gt;again&lt;/em&gt;, resulting
in a second update.&lt;/p&gt;

&lt;p&gt;RYW is part of a larger set of &lt;a href=&quot;http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/SessionGuaranteesBayou.pdf&quot;&gt;“session
guarantees”&lt;/a&gt;
developed in the 1990s (and &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=1466448&quot;&gt;popularized by Werner Vogels&lt;/a&gt;; &lt;a href=&quot;http://pages.cs.wisc.edu/~cs739-1/papers/consistencybaseball.pdf&quot;&gt;also useful&lt;/a&gt;). These session guarantees are useful for at least two
reasons. First, they capture intuitive requirements for end-user
behavior: RYW and other guarantees, like “monotonic reads” (roughly
requiring that time doesn’t appear to go backwards) are easy to
understand and, as we saw above, often desirable for human-facing services. Second,
session guarantees are cheap: while stronger models like
&lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;linearizability&lt;/a&gt; (“C”
in &lt;a href=&quot;http://henryr.github.io/cap-faq/&quot;&gt;CAP&lt;/a&gt;) provide every session
guarantee and then some, they are notoriously expensive—usually
requiring unavailability during partial failure and &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;increased
latency&lt;/a&gt;. What
early systems like
&lt;a href=&quot;http://zoo.cs.yale.edu/classes/cs422/2013/bib/terry95managing.pdf&quot;&gt;Bayou&lt;/a&gt;
discovered is that there are implementation techniques for achieving
session guarantees without paying these costs.&lt;/p&gt;

&lt;h4 id=&quot;session-guarantees-and-availability&quot;&gt;Session Guarantees and Availability&lt;/h4&gt;

&lt;p&gt;Interestingly (and as the subject of this post), most—but not
all—session guarantees are achievable with CAP-style
availability. &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;Gilbert and Lynch’s proof of the CAP
Theorem&lt;/a&gt;
defines availability by requiring that every non-failing server
guarantees a response to each request despite arbitrary network
partitions between the servers. So, if we want to build an available
system providing the monotonic reads session guarantee, we can ensure
that read operations only return writes when the writes are present on
all servers. This ensures that, regardless of which server a client
connects to, it won’t be forced to read older data and “go back in time.”&lt;/p&gt;
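&lt;p&gt;That read path can be sketched in a few lines (illustrative; versions are modeled as integers, and the replica interface is mine):&lt;/p&gt;

```python
# Available monotonic reads: a read returns only the newest write
# present on ALL replicas, so no server the client contacts later can
# force it to "go back in time."
class Replica:
    def __init__(self):
        self.writes = {}   # key -> list of (version, value) pairs

    def write(self, key, version, value):
        self.writes.setdefault(key, []).append((version, value))

    def newest(self, key):
        return max(self.writes.get(key, [(0, None)]))

def monotonic_read(replicas, key):
    # the safe answer is the minimum, across replicas, of each
    # replica's newest version: every server has at least this write
    return min(r.newest(key) for r in replicas)
```

&lt;p&gt;The cost is visibility: a write that never reaches some non-failed replica never becomes readable anywhere.&lt;/p&gt;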

&lt;p&gt;Via a classic partitioning argument, we can see that RYW is not
achievable under the stringent CAP availability model. We can partition a
client C away from all but one server S and require C to perform a
write. If our implementation is available, S should eventually
acknowledge the write as successful. If C reads from S, it’ll achieve
RYW. But, what if we partition C away from S and allow it to only
communicate with server T? If we require C to perform a read, T will
have to respond, and C will not read its prior write. This demonstrates
that it’s not possible to guarantee RYW for arbitrary read/write
operations in an available manner.&lt;/p&gt;
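&lt;p&gt;The partitioning argument can be replayed as a toy simulation (a sketch with my own names, not a real protocol):&lt;/p&gt;

```python
# Toy replay of the RYW partitioning argument: client C writes at
# server S, is then partitioned away from S, and reads from server T,
# which never received the write.
S = {}   # server S's copy of the data
T = {}   # server T's copy; the partition blocks replication from S

def write(server, key, value):
    server[key] = value
    return "ack"             # an available server must acknowledge

def read(server, key):
    return server.get(key)   # an available server must respond

# C writes x=1 at S (acknowledged), then is rerouted to T, where the
# read misses C's own write.
```

&lt;p&gt;Any implementation in which T must answer the read without hearing from S exhibits exactly this violation.&lt;/p&gt;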

&lt;p&gt;In our recent work on &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;Highly Available
Transactions&lt;/a&gt;, we
performed a similar analysis for each of the session guarantees, and
found that all but RYW are achievable with availability (see Figure 2
and Table 3 on page 8; perhaps surprisingly, this also means that
causal consistency is not available in a client-server model). The
question is: if RYW isn’t available, why does it still seem to be
cheaper than, say, linearizability?&lt;/p&gt;

&lt;h4 id=&quot;sticky-availability-and-mechanisms&quot;&gt;Sticky Availability and Mechanisms&lt;/h4&gt;

&lt;p&gt;To understand why RYW is still “cheap” but not quite as cheap as other
session guarantees, we formalized a new model of availability. RYW is
indeed achievable (as Vogels points out) if clients stay connected
to—or are “sticky” with—a server (really, a complete copy of the
database). This requires a stronger assumption than CAP-style
availability, but it’s still much weaker than, say, requiring that
clients contact a majority of servers. In the &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;HAT
paper&lt;/a&gt;, we formalize
this as “sticky availability” (page 4, Section 4.1).&lt;/p&gt;

&lt;p&gt;In practice, there are two primary ways of achieving sticky
availability. First (and easier) is to rethink the definition of a
“server”: if clients cache their writes (thereby acting as a “server”
in our above model), they’ll be able to read them in the future. By
keeping a local copy of data, the clients trivially maintain
stickiness. Several systems (including systems from both &lt;a href=&quot;http://www.bailis.org/papers/bolton-sigmod2013.pdf&quot;&gt;our
group&lt;/a&gt; and &lt;a href=&quot;http://arxiv.org/pdf/1310.3107.pdf&quot;&gt;Marc
Shapiro’s CRDT group&lt;/a&gt;) leverage
these techniques and can provide unparalleled low latency. The problem
here is that caches can grow large, and, in my experience, it’s
unclear how well caching works for general-purpose applications. Second,
clients can use sticky request routing to ensure their requests always
contact the same servers. In a single datacenter, this can be
difficult, requiring the storage tier’s request routers to know the
identity of the end-user making a request. This is feasible but
potentially requires tight coupling between application logic and the
database. In a multi-datacenter deployment, if each datacenter has a
linearizable cluster (e.g.,
&lt;a href=&quot;http://www-users.cselabs.umn.edu/classes/Fall-2012/csci8980-2/papers/cops.pdf&quot;&gt;COPS&lt;/a&gt;),
users can be assigned to a given region and their requests routed at
the edge—also doable, but with availability penalties.&lt;/p&gt;
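&lt;p&gt;The routing piece of the second mechanism can be as simple as hashing each user to a home server (a sketch with my own names; a real router must also handle failover, which is exactly where the availability penalty appears):&lt;/p&gt;

```python
import hashlib

# Sticky request routing: deterministically map each user to a home
# server so that all of the user's requests land on the same replica.
def home_server(user_id, servers):
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

&lt;p&gt;Because the mapping is deterministic, every tier that sees the user id routes to the same server, preserving stickiness without shared routing state.&lt;/p&gt;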

&lt;p&gt;In my experience, sticky routing is more common, with memcached acting
as a non-durable cache with likely (but not guaranteed)
stickiness. However, I’m not aware of any public accounts of actual
sticky available (but non-linearizable) architectures, hinting that
these approaches may either fall into the realm of what Jim Gray calls
“exotics” or, more optimistically, may simply be on the engineering horizon.&lt;/p&gt;

&lt;h4 id=&quot;why-sticky-availability-deserves-more-study&quot;&gt;Why Sticky Availability Deserves More Study&lt;/h4&gt;

&lt;p&gt;Stickiness only becomes evident in a client-server model. In
traditional models of distributed computing, stickiness is often
guaranteed by default. If we consider a set of communicating processes
(take &lt;a href=&quot;http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/time-clocks.pdf&quot;&gt;Lamport’s paper on
causality&lt;/a&gt;
or even the CAP proof), each process (modulo limits on memory
capacity) can trivially observe its past actions. This model
effectively &lt;em&gt;presumes&lt;/em&gt; the existence of a cache in the form of process
memory. In a classic client-server model, the server is often tasked
with maintaining the client’s state. This is, in most cases,
fundamental to the utility of the system architecture. Yet, as we have
seen above, remembering the past in the client-server
model—especially without stateful clients—is non-trivial! The
difference between these models is subtle but makes a big difference
in practice, and I don’t think the implications have been sufficiently
explored.&lt;/p&gt;

&lt;p&gt;I also find it interesting that most implementations of the available
session guarantees (even those in Bayou) presume sticky
availability. While every session guarantee except RYW is achievable
with availability, they’re often implemented using sticky available
mechanisms! This means that these implementations are “less available”
than they could be, with implications for latency, fault tolerance,
and scalability. The trade-off is between what I’ll call “per-user
visibility” and availability. In our above, available implementation
of monotonic reads, a write might never become visible to any readers
if a non-failed server is permanently partitioned. In contrast, if we
have stickiness, the write can become visible to the writer (and,
possibly, other readers) without sacrificing liveness. I haven’t seen
a system evaluate this trade-off in depth, though many systems
(including the HAT research) have addressed the more general (and
classic) trade-off between global visibility and scalability.&lt;/p&gt;

&lt;p&gt;As a final note, sticky available guarantees still face many of the
same problems as traditional available systems when it comes to
&lt;a href=&quot;http://www.youtube.com/watch?v=_rAdJkAbGls&quot;&gt;maintaining application
correctness&lt;/a&gt;. Mutual
exclusion is unavailable, so how does RYW help applications? The
per-user visibility benefits are useful to end-users, but this is
often a matter of user experience rather than an issue of data
integrity. One benefit is that web services are frequently
single-writer-per-data-item, meaning conflicts are rare or impossible
as long as the single writer observes her updates. But, in general, I
haven’t encountered many constraints on &lt;em&gt;data&lt;/em&gt; that benefit from stickiness.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>On Consistency and Durability</title>
   <link href="http://bailis.org/blog//on-consistency-and-durability/"/>
   <updated>2013-12-10T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//on-consistency-and-durability</id>
   <content type="html">&lt;p&gt;In case you’ve missed it, there’s been a &lt;a href=&quot;https://news.ycombinator.com/item?id=6878005&quot;&gt;great
discussion&lt;/a&gt; about
consistency, availability, and durability on the &lt;a href=&quot;https://groups.google.com/forum/#!topic/redis-db/Oazt2k7Lzz4&quot;&gt;Redis mailing
list&lt;/a&gt;
and &lt;a href=&quot;https://twitter.com/kellabyte/status/410224523602960385&quot;&gt;Twitter&lt;/a&gt;
over the past few days. I wanted to weigh in and specifically address
&lt;a href=&quot;https://news.ycombinator.com/item?id=6880574&quot;&gt;antirez’s point&lt;/a&gt; that&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;While CAP and durability are orthogonal they are very related in actual systems….&lt;/p&gt;
&lt;/blockquote&gt;

&lt;hr /&gt;

&lt;p&gt;We can effectively cast all statements about availability and
consistency into the form:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If operations can contact AF of N correct replicas, the system provides a guaranteed response that is correct with respect to semantics S.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Availability is all about the precondition (AF of N): under what
conditions is a safe response guaranteed? Gilbert and Lynch’s &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;proof
of the CAP
theorem&lt;/a&gt;,
shows that when S means
&lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;linearizability&lt;/a&gt; and N
is greater than 1, AF cannot equal 1. In fact, most
&lt;a href=&quot;http://webee.technion.ac.il/uploads/file/publication/731.pdf&quot;&gt;implementations&lt;/a&gt;
of linearizability use a notion of majorities to pick AF = (N+1)/2.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Now, let’s consider statements about durability:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The effects of operations will survive DF fail-stop server failures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To survive DF failures, we need to contact DF+1 servers. Therefore, we
can provide availability and durability only when enough servers are
online and reachable.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;As stated, the two concepts are remarkably similar, but there’s an
important difference. For semantics like linearizability, AF is
typically a function of N and grows with replication factor. In
contrast, DF is typically constant and independent of replication
factor.&lt;/p&gt;

&lt;p&gt;This brings us to antirez’s point. When N=3 and we want writes to
survive one server failure (DF=1), durability requires contacting
DF+1=2 servers, and majority quorums also require AF=2; they’re the
same! When we want higher durability without having to contact all
servers, N=5 with DF=2 (contacting three servers) is a reasonable
choice, and, again, the durability requirement matches the majority
quorum size of AF=3. For large replication factors, say N=100, the
difference grows: contacting three servers still suffices for DF=2,
while majority quorums require AF=51. But, in practice, replication
factors are often small, so the preconditions for availability when
maintaining both durability and consistency are often equivalent.&lt;/p&gt;
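&lt;p&gt;The arithmetic is easy to write down (a sketch; AF uses the usual majority-quorum size):&lt;/p&gt;

```python
# Preconditions for a guaranteed response, as a function of the
# replication factor n and the number of failures to survive, df.
def majority_af(n):
    # servers a linearizable majority quorum must reach
    return n // 2 + 1

def durability_contacts(df):
    # servers that must acknowledge a write to survive df failures
    return df + 1
```

&lt;p&gt;For N=3 and DF=1 both come out to 2; at N=100, surviving two failures still needs only 3 acknowledgements while a majority quorum needs 51.&lt;/p&gt;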

&lt;p&gt;It’s worth noting that AF=1 and DF=1 &lt;em&gt;is&lt;/em&gt; an option, and it’s
&lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;fast&lt;/a&gt;,
but it will preclude durability in the event of server failures and
also disallows linearizable semantics in the event that you have
multiple active replicas (N &amp;gt; 2).&lt;/p&gt;

&lt;p&gt;The above analysis doesn’t take into account reads, which, in weakly
consistent systems, can contact any non-failing replica, but I think
this sheds some light on the discussion.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Non-blocking Transactional Atomicity</title>
   <link href="http://bailis.org/blog//non-blocking-transactional-atomicity/"/>
   <updated>2013-05-28T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//non-blocking-transactional-atomicity</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: You can perform non-blocking multi-object atomic reads and
 writes across arbitrary data partitions via some simple
 multi-versioning and by storing metadata regarding related items.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;N.B. This is a long post, but it’s comprehensive. Reading &lt;a href=&quot;#putting_it_all_together&quot;&gt;the first third&lt;/a&gt; will give you
  most of the understanding.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edit 4/2014: We wrote a &lt;a href=&quot;http://bailis.org/papers/ramp-sigmod2014.pdf&quot;&gt;SIGMOD paper&lt;/a&gt; on these ideas! Check it
  out or read an &lt;a href=&quot;../scaling-atomic-visibility-with-ramp-transactions/&quot;&gt;updated post&lt;/a&gt; on the new algorithms.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Performing multi-object updates is a common but difficult problem in
real-world distributed systems. When updating two or more items at
once, it’s useful for other readers of those items to observe
atomicity: &lt;em&gt;either all of your updates are visible or none of them
are&lt;/em&gt;.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#atomicity-note&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;
This crops up in a bunch of contexts, from social network graphs
(e.g., &lt;a href=&quot;http://hive.asu.edu/sigmod12/index.php?option=com_community&amp;amp;view=courses&amp;amp;task=viewpresentation&amp;amp;groupid=644&amp;amp;Itemid=0&quot;&gt;Facebook’s Tao
system&lt;/a&gt;,
where bi-directional “friend” relationships are stored in two
uni-directional pointers) to distributed data structures like counters
(e.g., &lt;a href=&quot;http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011&quot;&gt;Twitter’s
Rainbird&lt;/a&gt;
hierarchical aggregator) and secondary indexes (a topic for a future
post). In conversations I’ve had regarding our work on &lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;Highly
Available
Transactions&lt;/a&gt;,
atomic multi-item update, or transactional atomicity, is often the
most-requested feature.&lt;/p&gt;

&lt;h4 id=&quot;existing-techniques-locks-entity-groups-and-fuck-it-mode&quot;&gt;Existing Techniques: Locks, Entity Groups, and “Fuck-it Mode”&lt;/h4&gt;

&lt;p&gt;The state of the art in transactional multi-object update typically
employs one of three strategies. &lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Use locks to update multiple items at once. Grab write locks
on update and read locks for reads and you’ll ensure transactional
atomicity. However, in a distributed environment, the possibility of
partial failure and network latency means locking can lead to a Bad
Time™.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#lock-note&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Co-locate distributed objects you’d like to update together. This
strategy (often called &lt;a href=&quot;http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf&quot;&gt;“entity
groups”&lt;/a&gt;)
makes transactional atomicity easy: locking on a single machine is
fast and not subject to the problems of distributed locking under
partial failure and network latency. Unfortunately, this solution
impacts data layout and distribution and does not work well for data
that is difficult to partition (social networks, anyone?).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Use “fuck-it mode,” whereby you simultaneously update all keys
without any concurrency control and hope readers observe transactional
atomicity. This final option is remarkably common: it scales well and
is applicable to any system, but it doesn’t provide any atomicity
guarantees until the system stabilizes (i.e., converges, or is
eventually consistent).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, I’ll provide a simple alternative (let’s call it
&lt;em&gt;Non-blocking Transactional Atomicity&lt;/em&gt;, or NBTA) that uses
multi-versioning and some extra metadata to ensure transactional
atomicity without the use of locks. Specifically, our solution does
not block readers or writers in the event of arbitrary process failure
and, as long as readers and writers can contact a server for each data
item they want to access, the system can guarantee transactional
atomicity for both reads and writes. At a high level, the key idea is to
avoid performing in-place updates and to use additional metadata in
place of synchronous coordination across replicas.&lt;/p&gt;

&lt;style&gt;
img { width: 80%; height:auto; display: block; margin-left: auto; margin-right: auto; }
.big { width: 35%; height:auto; }
&lt;/style&gt;

&lt;h1 id=&quot;nbta-by-example&quot;&gt;NBTA by Example&lt;/h1&gt;

&lt;p&gt;To illustrate the NBTA algorithm, consider the simple scenario where
there are two servers, one storing item &lt;code&gt;x&lt;/code&gt; and the other storing item
&lt;code&gt;y&lt;/code&gt;, both of which have value &lt;code&gt;0&lt;/code&gt;. Say we have two clients, one of
which wishes to write &lt;code&gt;x=1&lt;/code&gt;, &lt;code&gt;y=1&lt;/code&gt; and another that wants to read &lt;code&gt;x&lt;/code&gt;
and &lt;code&gt;y&lt;/code&gt; together (i.e., &lt;code&gt;x=y=0&lt;/code&gt; or &lt;code&gt;x=y=1&lt;/code&gt;). (We’ll discuss
replication later.)&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;big&quot; src=&quot;../post_data/2013-05-28/1-setup.png&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;good-pending-and-invariants&quot;&gt;&lt;em&gt;good&lt;/em&gt;, &lt;em&gt;pending&lt;/em&gt;, and Invariants&lt;/h4&gt;

&lt;p&gt;Let’s split each server’s storage into two parts: &lt;code&gt;good&lt;/code&gt; and
&lt;code&gt;pending&lt;/code&gt;. We will maintain the invariant that every write stored in
&lt;code&gt;good&lt;/code&gt; will have its transactional sibling writes (i.e., the other
writes originating from the transactionally atomic operation) present
on each of their respective servers, either in &lt;code&gt;good&lt;/code&gt; or
&lt;code&gt;pending&lt;/code&gt;. That is, if &lt;code&gt;x=1&lt;/code&gt; is in &lt;code&gt;good&lt;/code&gt; on the server for &lt;code&gt;x&lt;/code&gt;, then,
in the example above, &lt;code&gt;y=1&lt;/code&gt; will be guaranteed to be in &lt;code&gt;good&lt;/code&gt; or
&lt;code&gt;pending&lt;/code&gt; on the server for &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../post_data/2013-05-28/2-invariant.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To maintain the above invariant, servers first place writes into
&lt;code&gt;pending&lt;/code&gt;. Then, once servers learn (possibly asynchronously) that a
write’s transactional siblings are all in &lt;code&gt;pending&lt;/code&gt; (let’s call this
process “learning that a write is stable”), the servers individually
move their respective writes into &lt;code&gt;good&lt;/code&gt;. One simple strategy for
informing servers that a write is stable is to have the writing client
perform two rounds of communication: the first round places writes
into &lt;code&gt;pending&lt;/code&gt;, then, once all servers have acknowledged writes to
&lt;code&gt;pending&lt;/code&gt;, the client notifies each server that its write is
stable. (If you’re nervous, this isn’t two-phase commit; more on that
&lt;a href=&quot;#so-what-just-happened&quot;&gt;later&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../post_data/2013-05-28/3-rounds.png&quot; /&gt;&lt;/p&gt;
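&lt;p&gt;To make the two-round protocol concrete, here&#39;s a minimal, single-process Python sketch. Names like &lt;code&gt;put_pending&lt;/code&gt; and &lt;code&gt;mark_stable&lt;/code&gt; are hypothetical; this is illustrative pseudocode, not our implementation:&lt;/p&gt;

```python
# Illustrative two-round NBTA write (single-process sketch, hypothetical API).
# Each server keeps two stores: `pending` and `good`.
import uuid

class Server:
    def __init__(self):
        self.pending = {}  # key -> (timestamp, value, sibling keys)
        self.good = {}     # key -> (timestamp, value, sibling keys)

    def put_pending(self, key, ts, value, siblings):
        # Round 1: stage the write, then acknowledge to the client.
        self.pending[key] = (ts, value, siblings)
        return True

    def mark_stable(self, key, ts):
        # Round 2: every sibling is known to be pending somewhere,
        # so move this write from `pending` to `good`.
        if key in self.pending and self.pending[key][0] == ts:
            self.good[key] = self.pending.pop(key)

def atomic_write(servers, updates):
    """servers: key -> Server; updates: key -> value."""
    ts = uuid.uuid4().hex       # unique transaction ID ("timestamp")
    siblings = sorted(updates)  # keys written by this transaction
    for key, value in updates.items():  # round 1
        assert servers[key].put_pending(key, ts, value, siblings)
    for key in updates:                 # round 2
        servers[key].mark_stable(key, ts)
```

&lt;p&gt;Note that the &quot;timestamp&quot; above is only used for identity, so any unique ID suffices; ordered timestamps only matter once we add last-writer-wins semantics later on.&lt;/p&gt;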

&lt;h4 id=&quot;races-and-pointers&quot;&gt;Races and Pointers&lt;/h4&gt;

&lt;p&gt;We’re almost done. If readers read from &lt;code&gt;good&lt;/code&gt;, then they’re
guaranteed to be able to read transactional siblings from other
servers. However, there’s a race condition: what if one server has
placed its write in &lt;code&gt;good&lt;/code&gt; but another still has its transactional
sibling in &lt;code&gt;pending&lt;/code&gt;? We need a way to tell the second server to serve
its read from &lt;code&gt;pending&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../post_data/2013-05-28/4-race.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To handle this race condition, we attach additional information to
each write: a list of transactional siblings. At the start of a
multi-key update, clients generate a unique timestamp for all of their
writes (say, client ID plus local clock or random hash), which they
attach to each write, along with a list of the keys written to in the
transaction. Now, when a client reads from &lt;code&gt;good&lt;/code&gt;, it will have a list
of transactional siblings with the same timestamp. When the client
requests a read from one of those sibling items, the server can fetch
it from either &lt;code&gt;pending&lt;/code&gt; or &lt;code&gt;good&lt;/code&gt;. If a client doesn’t need to read a
specific item, the server can respond with the highest-timestamped
item from &lt;code&gt;good&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../post_data/2013-05-28/5-metadata.png&quot; /&gt;&lt;/p&gt;
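&lt;p&gt;A matching sketch of the read path (again with made-up names): the reader fetches the highest-timestamped write from &lt;code&gt;good&lt;/code&gt;, then uses the attached sibling list and timestamp to request the exact sibling versions, which servers may serve from either &lt;code&gt;good&lt;/code&gt; or &lt;code&gt;pending&lt;/code&gt;:&lt;/p&gt;

```python
# Illustrative NBTA read path (hypothetical names, not our implementation).
# Both stores are multi-versioned: key -> {timestamp: (value, sibling keys)}.

class Server:
    def __init__(self):
        self.pending = {}
        self.good = {}

    def read_latest(self, key):
        """Return (ts, value, siblings) for the highest-timestamped write in `good`."""
        versions = self.good.get(key, {})
        if not versions:
            return None
        ts = max(versions)
        value, siblings = versions[ts]
        return ts, value, siblings

    def read_at(self, key, ts):
        """Return the value written at `ts`, checking `good` and then `pending`."""
        for store in (self.good, self.pending):
            if ts in store.get(key, {}):
                return store[key][ts][0]
        return None  # cannot happen once the first read hit `good`

def atomic_read(servers, key):
    """Read `key` plus all of its transactional siblings, without blocking."""
    first = servers[key].read_latest(key)
    if first is None:
        return {}
    ts, value, siblings = first
    result = {key: value}
    for sib in siblings:
        if sib != key:
            result[sib] = servers[sib].read_at(sib, ts)
    return result
```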

&lt;h4 id=&quot;putting-it-all-together&quot;&gt;Putting it all together&lt;/h4&gt;

&lt;p&gt;We now have an algorithm that guarantees that all writes in a
multi-key update are accessible before revealing them to readers. If a
reader accesses a write, it is guaranteed to be able to access its
transactional siblings without blocking. This way, readers will never
stall waiting for a sibling that hasn’t arrived on its respective
server. To make sure readers can access siblings in both &lt;code&gt;good&lt;/code&gt; and
&lt;code&gt;pending&lt;/code&gt;, we attached additional metadata to each write that can be
used by servers in the event of skews in stable write detection across
servers. If readers or writers fail, there is no effect on other
readers or writers. Any partially written multi-key updates will never
become stable, and servers can optionally guarantee write stability by
performing &lt;code&gt;pending&lt;/code&gt; acknowledgments (i.e., performing the second
phase of the client write) for themselves.&lt;/p&gt;

&lt;h2 id=&quot;it-gets-better&quot;&gt;It Gets Better!&lt;/h2&gt;

&lt;h4 id=&quot;because-optimizations-are-awesome&quot;&gt;…because optimizations are awesome…&lt;/h4&gt;

&lt;p&gt;There are several optimizations and modifications we can make to the
NBTA protocol:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Size of&lt;/em&gt; &lt;code&gt;pending&lt;/code&gt; &lt;em&gt;and&lt;/em&gt; &lt;code&gt;good&lt;/code&gt;&lt;em&gt;:&lt;/em&gt; if users want “last writer wins”
semantics, there’s no need to store more than one write in
&lt;code&gt;good&lt;/code&gt;. However, if we do this, a write’s sibling may have been
overwritten. If we want to prevent readers from reading “forwards
in time” (e.g., read &lt;code&gt;x=0&lt;/code&gt; then &lt;code&gt;y=1&lt;/code&gt; then &lt;code&gt;x=1&lt;/code&gt;, which preserves
the property that once one write becomes visible, all of a
transaction’s writes become visible but does not guarantee a
consistent snapshot across items), then servers can retain items in
&lt;code&gt;good&lt;/code&gt; for a bounded amount of time (e.g., as long as a multi-item
read might take) and/or clients can retry reads in the presence of
overwrites.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Faster writes:&lt;/em&gt; As I alluded to above, it’s not necessary to have
the client perform the second round of communication (which requires
three message delays until visibility). Instead, servers can
directly contact one another once they’ve placed writes in
&lt;code&gt;pending&lt;/code&gt;, requiring only two message delays. Alternatively, clients
can issue the second round of communication asynchronously. However,
to ensure that clients read their writes in these scenarios, they
need to retain metadata until they have (asynchronously) detected
that each write is in &lt;code&gt;good&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Replication:&lt;/em&gt; So far, I’ve only discussed having a single server
for each data item. With “strong consistency” (i.e.,
linearizability) per server, the above algorithm works fine. With
asynchronous, or lazy, replication between servers (e.g., “eventual
consistency”), there are two options. If all clients contact
disjoint sets of servers (e.g., all clients in a datacenter contact
a full set of replicas), then clients only need to update their
local set of servers, and each set of servers can detect when writes
are stable within their groups. However, if clients can connect to
any server, then writes should only become stable once all
respective servers have placed their writes in &lt;code&gt;good&lt;/code&gt; or
&lt;code&gt;pending&lt;/code&gt;. This can take indefinitely long in the presence of
partial failure.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#ha-note&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Read/write transactions:&lt;/em&gt; I’ve discussed read-only and write-only
transactions here, but it’s easy to use these techniques for
general-purpose read/write transactions. The main problem when
aiming for models like ANSI Repeatable Read (i.e., snapshot reads)
is ensuring that reads come from a transactionally atomic set: this
can be done by pre-declaring all reads in the transaction and
fetching all items at the start of the transaction or via fancier
(and more expensive) metadata like vector clocks, which I won’t get
into here.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Metadata sizes:&lt;/em&gt; The metadata required above is linear in the number
of keys written. This is modest in practice, but metadata can also
be dropped once all sibling writes are present in &lt;code&gt;good&lt;/code&gt; (i.e.,
there is no race condition for the transactional writes).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
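&lt;p&gt;As a rough sketch of the &quot;faster writes&quot; optimization above, servers can acknowledge &lt;code&gt;pending&lt;/code&gt; writes directly to each sibling&#39;s server and promote a write to &lt;code&gt;good&lt;/code&gt; once acks for every sibling have arrived. The sketch assumes reliable, in-order delivery; a real implementation would also re-check stability when a delayed write arrives:&lt;/p&gt;

```python
# Sketch of server-to-server stability detection (two message delays
# instead of three); hypothetical structure, not our implementation.

class Server:
    def __init__(self):
        self.peers = {}    # key -> Server responsible for that key
        self.pending = {}  # (key, ts) -> value
        self.good = {}     # key -> (ts, value)
        self.acks = {}     # ts -> set of sibling keys seen in `pending`

    def put_pending(self, key, ts, value, siblings):
        self.pending[(key, ts)] = value
        # Tell every sibling's server (including ourselves) that this
        # write has reached `pending`.
        for sib in siblings:
            self.peers[sib].receive_ack(ts, key, siblings)

    def receive_ack(self, ts, acked_key, siblings):
        seen = self.acks.setdefault(ts, set())
        seen.add(acked_key)
        if seen == set(siblings):
            # Every sibling is pending somewhere: promote our write(s).
            for (key, t), value in list(self.pending.items()):
                if t == ts:
                    self.good[key] = (t, value)
                    del self.pending[(key, t)]

x_srv, y_srv = Server(), Server()
peers = {"x": x_srv, "y": y_srv}
x_srv.peers = peers
y_srv.peers = peers
x_srv.put_pending("x", 1, 10, ["x", "y"])  # x staged; nothing stable yet
y_srv.put_pending("y", 1, 20, ["x", "y"])  # both writes now promoted
```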

&lt;h4 id=&quot;and-it-works-in-real-life&quot;&gt;…and it works in real life.&lt;/h4&gt;

&lt;p&gt;We’ve built a database based on LevelDB that implements NBTA with all
of the above optimizations except metadata pruning (&lt;a href=&quot;../post_data/2013-05-28/ta-rc-pseudocode.png&quot;&gt;related
pseudocode here&lt;/a&gt;). Under
the Yahoo! Cloud Serving Benchmark, NBTA transactions of 8 operations
each achieve within 33% (all writes) to 4.8% (all reads) of the peak
throughput of eventually consistent (i.e., “fuck-it”) operation (with
3.8–48% higher latency). Our implementation scales linearly, to over
250,000 operations per second for transactions of length 8 consisting
of 50% reads and 50% writes on a deployment of 50 EC2 instances.&lt;/p&gt;

&lt;p&gt;In our experience, NBTA performs substantially better than lock-based
operation because there is no blocking involved. The two primary
sources of overhead are metadata (expected to be small for real-world
transactions like the Facebook friend-pointer and secondary-indexing
examples above) and moving writes from &lt;code&gt;pending&lt;/code&gt; to &lt;code&gt;good&lt;/code&gt; (if, as in our
implementation, writes to &lt;code&gt;pending&lt;/code&gt; are persistent, this results in
two durable server-side writes for every client-initiated
write). Given these results, we’re excited to start applying NBTA to
other data stores (and secondary indexing).&lt;/p&gt;

&lt;h4 id=&quot;so-what-just-happened&quot;&gt;So what just happened?&lt;/h4&gt;

&lt;p&gt;If you’re a distributed systems or database weenie like me, you may be
curious how NBTA relates to well-known problems like two-phase commit.&lt;/p&gt;

&lt;p&gt;The NBTA algorithm is a variant of uniform reliable broadcast with
additional metadata to address the case where some servers have
delivered writes but others have not yet, providing safety (e.g., &lt;a href=&quot;http://www.newbooks-services.de/MediaFiles/Texts/7/9783642152597_Excerpt_001.pdf&quot;&gt;see
Algorithm
3.4&lt;/a&gt;). Formally,
NBTA as presented here does not guarantee termination: servers may not
realize that a write in &lt;code&gt;pending&lt;/code&gt; will never become
stable. Recognizing a “dead write” in &lt;code&gt;pending&lt;/code&gt; requires failure
detection and, in practice, writes can be removed from &lt;code&gt;pending&lt;/code&gt; once
sibling servers have been marked as dead, the server detects that a
client died mid-write, the write (under last-writer-wins semantics) is
overwritten by a higher timestamped write in &lt;code&gt;good&lt;/code&gt;, or, more
pragmatically, after a timeout.&lt;/p&gt;

&lt;p&gt;As presented here, servers don’t &lt;code&gt;abort&lt;/code&gt; updates, but this isn’t
fundamental. Instead of placing items in &lt;code&gt;pending&lt;/code&gt;, servers can
instead reject updates, so any updates that were placed into &lt;code&gt;pending&lt;/code&gt;
on other servers will never become stable. NBTA is weaker than
traditional &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.5491&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;non-blocking atomic commitment
protocols&lt;/a&gt;
because it allows non-termination for individual transactional updates
(that is, garbage collecting &lt;code&gt;pending&lt;/code&gt; may take a while). The trick is
that, in practice, as long as independent transactional updates can be
executed concurrently (as is the case with last-writer-wins and as is
the case for all &lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;Highly Available
Transaction&lt;/a&gt;
semantics), a stalled transactional update won’t affect other
updates. In contrast, traditional techniques like two-phase commit
with two-phase locking will require stalling in the presence of
coordinator failure.&lt;/p&gt;

&lt;p&gt;There are several ideas in the database literature that are similar to
NBTA. The optimization for reducing message round trips is similar to
the optimizations employed by &lt;a href=&quot;http://research.microsoft.com/pubs/64636/tr-2003-96.pdf&quot;&gt;Paxos
Commit&lt;/a&gt;,
while the use of additional metadata to guard against concurrent
updates may remind you of &lt;a href=&quot;http://www.cs.cornell.edu/courses/cs4411/2009sp/blink.pdf&quot;&gt;B-link
trees&lt;/a&gt; or
other lockless data structures. And, of course, &lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter5.pdf&quot;&gt;multi-version
concurrency
control&lt;/a&gt;
and &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.552&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;timestamp-based concurrency
control&lt;/a&gt;
have a long history in database systems. The key in NBTA is to achieve
transactional atomicity while avoiding a centralized timestamp
authority or concurrency control mechanism.&lt;/p&gt;

&lt;p&gt;All this said, I haven’t seen a distributed transactional atomicity
algorithm like NBTA before; if you have, please do &lt;a href=&quot;http://www.bailis.org/contact.html&quot;&gt;let me
know&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This post demonstrated how to achieve atomic multi-key updates across
arbitrary data partitions without using locks or losing the ability to
provide a safe response despite arbitrary failures of readers,
writers, and (depending on the configuration) servers. The key idea
was to establish an invariant that all writes have to be present on
the appropriate servers before showing them to readers. The challenge
was in solving a race condition when revealing writes on different
servers—trivial for locks but harder for a highly available
system. And it works in practice: rather well, and much better than
similar lock-based techniques! If you’ve made it this far, you’ve
probably followed along, but I look forward to following up with a
post on how to perform consistent secondary indexing via similar
techniques—a potential killer application for NBTA, particularly
given that it’s &lt;a href=&quot;https://cs.brown.edu/courses/cs227/archives/2012/papers/weaker/cidr07p15.pdf&quot;&gt;often considered impossible in scalable
systems&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As always, feedback is welcomed and encouraged. If you’re interested
in these algorithms in your system, let’s talk.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to &lt;a href=&quot;https://twitter.com/palvaro&quot;&gt;Peter Alvaro&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/neil_conway&quot;&gt;Neil
 Conway&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/apanda&quot;&gt;Aurojit
 Panda&lt;/a&gt;, and &lt;a href=&quot;http://shivaram.info/&quot;&gt;Shivaram
 Venkataraman&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com/rxin&quot;&gt;Reynold
 Xin&lt;/a&gt; for early feedback on this post. This
 research is joint work with Aaron Davidson, &lt;a href=&quot;http://www.cs.usyd.edu.au/~fekete&quot;&gt;Alan
 Fekete&lt;/a&gt;, &lt;a href=&quot;http://www.cs.berkeley.edu/~alig/&quot;&gt;Ali
 Ghodsi&lt;/a&gt;, &lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot;&gt;Joe
 Hellerstein&lt;/a&gt;, and &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/&quot;&gt;Ion
 Stoica&lt;/a&gt; at UC Berkeley and the
 University of Sydney.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/div&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;atomicity-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#atomicity-note&quot;&gt;[1]&lt;/a&gt;  Note that
this “atomicity” is not the same as
&lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;linearizability&lt;/a&gt;, the
data consistency property addressed in Gilbert and Lynch’s &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;proof of
the CAP
Theorem&lt;/a&gt;
and often &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/interprocess.pdf&quot;&gt;referred
to&lt;/a&gt;
as “atomic” consistency. Linearizability concerns ordering operations
with respect to real time and is a single-object guarantee. The
“atomicity” here stems from a database context (namely, the &lt;a href=&quot;http://en.wikipedia.org/wiki/Atomicity_(database_systems)&quot;&gt;“A” in
“ACID”&lt;/a&gt;)
and concerns performing and observing operations over multiple
objects. To avoid further confusion, we’ll call this “atomicity”
“transactional atomicity.”&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;lock-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#lock-note&quot;&gt;[2]&lt;/a&gt;&amp;nbsp; More
specifically, there are a bunch of ways things can get weird. If a
client dies while holding locks, then the servers should eventually
revoke the locks. This often requires some form of failure detection
or timeout, which leads to awkward scenarios over asynchronous
networks, coupled with effective unavailability prior to lock
revocation. In a linearizable system, as in the example, we&#39;ve already
given up on availability, so this isn&#39;t necessarily horrible—but
it&#39;s a shame (read: it&#39;s slow) to block readers during updates and
vice-versa. If we&#39;re going for a highly available (F=N-1
fault-tolerant) setup (as we will&amp;nbsp;&lt;a href=&quot;#because-optimizations-are-awesome&quot;&gt;later on&lt;/a&gt;), locks are a
non-starter; locks are fundamentally at odds with providing available
operation on all replicas during partitions. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;ha-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#ha-note&quot;&gt;[3]&lt;/a&gt;&amp;nbsp; Hold up, cowboy!
What does this replication mean for availability?  As I&#39;ll
discuss&amp;nbsp;&lt;a href=&quot;#so-what-just-happened&quot;&gt;soon&lt;/a&gt;, we haven&#39;t
talked about when the effects of transactions will become visible in
the event of replica failures (i.e., when people will read my
writes). Readers will &lt;em&gt;always&lt;/em&gt; be able to read transactionally atomic
sets of data items from non-failing replicas; however, depending on
the desired availability, reads may not be the most &quot;up to date&quot; set
that is available on some servers. One way to look at this trade-off
is as follows:&lt;/span&gt; &lt;ol class=&quot;footnote&quot;&gt; &lt;li&gt;You can achieve
linearizability and transactional atomicity, whereby everyone sees all
writes after they complete, but writes may take an indefinite amount
of time to complete (&quot;CP&quot;)&lt;/li&gt;&lt;br /&gt; &lt;li&gt;You can achieve
read-your-writes and transactional atomicity, whereby you can see your
writes after they complete, but you&#39;ll have to remain &quot;sticky&quot; and
continue to contact the same (logical) set of servers during execution
(your &quot;sticky&quot; neighboring clients will also see your writes;
&quot;Sticky-CP&quot;)&lt;/li&gt;&lt;br /&gt; &lt;li&gt;You can achieve transactional atomicity and
be able to contact any server, but writes won&#39;t become visible until
all servers you might read from have received the transactional writes
(&quot;AP&quot;; at the risk of sub-footnoting myself, I&#39;ll note that there are
cool and useful connections to different kinds of &lt;a href=&quot;http://pine.cs.yale.edu/pinewiki/FailureDetectors&quot;&gt;failure
detectors&lt;/a&gt;
here).&lt;/li&gt; &lt;/ol&gt; &lt;span class=&quot;footnote&quot;&gt; Transactionally atomic
&lt;a href=&quot;http://www.bailis.org/blog/safety-and-liveness-eventual-consistency-is-not-safe/&quot;&gt;safety
properties&lt;/a&gt;
are guaranteed in all three scenarios, but the safety guarantees on
&lt;em&gt;recency&lt;/em&gt; offered by each vary. The main ideas presented here apply to
all three cases but were developed in the context of HA
systems.&lt;/span&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Communication Costs in Real-world Networks</title>
   <link href="http://bailis.org/blog//communication-costs-in-real-world-networks/"/>
   <updated>2013-05-17T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//communication-costs-in-real-world-networks</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: Network latencies in the wild can be expensive, especially at
 the tail and across datacenters: how bad are they, and what can we do
about them? Make sure to explore &lt;a href=&quot;#explore&quot;&gt;the demo&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Network latency makes distributed programming hard. Even when a
distributed system is fault-free, any communication between servers
affects performance. While the theoretical lower-bound on
communication delay is the speed of light—not horrible, at least
within a single datacenter—latencies are rarely this fast. I’ve been
working on and benchmarking &lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;communication-avoiding
databases&lt;/a&gt;
and wanted to isolate and quantify the behavior of real-world networks
both within and across datacenters. This post contains an
interactive &lt;a href=&quot;#explore&quot;&gt;demo&lt;/a&gt; of what we found, some
high-level &lt;a href=&quot;#high-level-takeaways&quot;&gt;trends&lt;/a&gt;, and some &lt;a href=&quot;#implications-for-distributed-systems-designers&quot;&gt;implications&lt;/a&gt;
for distributed systems designs.&lt;/p&gt;

&lt;p&gt;I wasn’t aware of any datasets describing network behavior both within
and across datacenters, so we launched m1.small Amazon EC2 instances
in each of the eight geo-distributed “Regions,” across the three
us-east “Availability Zones” (three co-located datacenters in
Virginia), and within one datacenter (us-east-b). We measured RTTs
between hosts for a week at a granularity of one ping per
second.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#methodology-note&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;div id=&quot;explore&quot; style=&quot;margin-top:1em;&quot;&gt;&lt;i&gt;I&#39;ve made the raw data &lt;a href=&quot;https://github.com/pbailis/aws-ping-traces&quot;&gt;available on
Github&lt;/a&gt; but, as an excuse to play with &lt;a href=&quot;http://d3js.org/&quot;&gt;D3.js&lt;/a&gt;, I built this interactive
visualization; select different percentiles and drag and zoom into the
graph:&lt;/i&gt;&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#nc-note&quot;&gt;2&lt;/a&gt;,&lt;a class=&quot;no-decorate&quot; href=&quot;#render-note&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;

&lt;iframe style=&quot;width:800px; height:800px; frame:0px;&quot; frameborder=&quot;0&quot; src=&quot;http://bailis.org/blog/post_data/2013-05-17/latencies.html&quot;&gt;I suggest you enable iframes.&lt;/iframe&gt;
&lt;/div&gt;

&lt;h4 id=&quot;high-level-takeaways&quot;&gt;High-level Takeaways&lt;/h4&gt;

&lt;p&gt;Aside from the absolute numbers and the raw data, I think that there
     are a few interesting takeaways. If you’re a networking guru,
     these may be obvious, but I found the magnitude of these trends
     surprising. &lt;em&gt;(N.B. These aren’t necessarily Amazon-specific, and
     this is hardly an indictment of AWS.)&lt;/em&gt;&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#latency-lit-note&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Latency » Speed of Light&lt;/strong&gt; The minimum RTT between any two nodes
  was 227µs, almost two orders of magnitude higher than the
  theoretical minimum. Across continents, latencies were also
  higher than the speed of light requires: a Dublin to Sydney round
  trip could take around 115 milliseconds at light speed but requires
  around 350ms on average. Instead, routers, network topologies, virtualization,
  and the end-host software stack all get in the way.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Average « Tail&lt;/strong&gt; Within us-east-b, ping times averaged around
  400µs; this is close to Jeff Dean’s figure from his &lt;a href=&quot;http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html&quot;&gt;Numbers
  Everyone Should
  Know&lt;/a&gt;. However,
  at the tail, latencies get much worse: at the 99.9th percentile,
  latency (again, within a single datacenter) rose to between 11.6
  and 21ms. At the 99.999th percentile, latency increased to
  between 84 and 151ms—a 160 to 350x increase over the average!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Cross-Datacenter Communication is Expensive&lt;/strong&gt; On average,
  communicating across availability zones was 2–7x slower than
  communicating within an availability zone; communicating across
  geographic regions was 44–720x slower. Notably, latencies for
  cross-geographic regions performed relatively better at the tail:
  at the 99.999th percentile, cross-region RTTs were only 1.4–45x
  slower than us-east RTTs. I suspect this is because transit
  delays on the wire are fixed, while routing and software-related
  delays are more likely to vary. However, the network distance
  between AZs &lt;em&gt;also&lt;/em&gt; varied: us-east-b and us-east-c had a minimum
  RTT of 693µs but us-east-c to us-east-d had a minimum RTT of
  1.31ms (and, on average, a difference of almost 3.5x); not all
  local DC communication links are equal.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
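&lt;p&gt;To get a feel for how heavily a few stragglers can skew the tail, here&#39;s a toy percentile calculation over synthetic RTTs (synthetic data only, not the actual traces, which are in the GitHub repository above):&lt;/p&gt;

```python
# Toy tail-latency illustration with synthetic RTT samples (microseconds).
import random

def percentile(samples, p):
    """Nearest-rank percentile; p is in [0, 100]."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(len(s) * p / 100.0))
    return s[idx]

random.seed(0)
# Mostly ~400us RTTs, plus a small fraction of multi-millisecond
# stragglers (retransmits, queueing, VM scheduling, and so on).
rtts_us = [random.gauss(400, 50) for _ in range(100_000)]
rtts_us += [random.uniform(5_000, 20_000) for _ in range(200)]

mean = sum(rtts_us) / len(rtts_us)
p999 = percentile(rtts_us, 99.9)
print(f"mean={mean:.0f}us  p99.9={p999:.0f}us  ratio={p999 / mean:.1f}x")
```

&lt;p&gt;Even though the stragglers are only 0.2% of the samples, they dominate the 99.9th percentile while barely moving the mean.&lt;/p&gt;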

&lt;h4 id=&quot;implications-for-distributed-systems-designers&quot;&gt;Implications for Distributed Systems Designers&lt;/h4&gt;

&lt;p&gt;Aside from any particular statistical behavior or correlations, this
data highlights the importance of reasoning about latency in
distributed system design. While many &lt;a href=&quot;http://www.eecs.harvard.edu/~waldo/Readings/waldo-94.pdf&quot;&gt;five-star
wizards&lt;/a&gt; of
distributed computing &lt;a href=&quot;https://blogs.oracle.com/jag/resource/Fallacies.html&quot;&gt;have long warned
us&lt;/a&gt; of the
pitfalls of network latency, there are at least two additional
challenges today: almost every new system is distributed and many
systems are operating at larger scale than ever before. The former
means that more distributed systems developers need
&lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;communication-avoiding
techniques&lt;/a&gt;,
while the latter means that &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=2408794&quot;&gt;the tail will continue to
grow&lt;/a&gt;. Even if we solve the
LAN latency problem, the lower bound on communication cost is still
much higher than that of local data access, and multi-datacenter
system deployments are increasingly common. While we can reduce some
inefficiencies today, there are fundamental barriers to improvement,
like the speed of light; I believe the solution to avoiding latency
penalties will come from better software, algorithms, and programming
techniques instead of better network hardware. &lt;a href=&quot;http://www.bloom-lang.net/&quot;&gt;Better
languages&lt;/a&gt;,
&lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;semantics&lt;/a&gt;,
and
&lt;a href=&quot;http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf&quot;&gt;libraries&lt;/a&gt;
are a start.&lt;/p&gt;

&lt;hr /&gt;

&lt;div id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/div&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;methodology-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#methodology-note&quot;&gt;[1]&lt;/a&gt; There’s a
non-negligible chance that this post generates debate with respect to
this methodology. My primary purpose for this experiment was to
demonstrate the considerable gap between LAN and WAN latencies, which
are easily captured by the data (if this is your cup of tea, let’s
talk!). However, it’s possible that EC2 virtualization and the choice
of m1.small instances led to higher latencies due to factors like
multi-tenancy and VM migration. There’s also no doubt that larger
packet sizes would change these trends; indeed, in recent database
benchmarking, we’ve observed several additional effects related to
local processing and EC2 NIC behavior under heavy traffic. Please feel
free to leave a comment or get in contact, especially if you have
suggestions for improvement or have any data to share; I’ll gladly
link to it and use it if possible.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;nc-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#nc-note&quot;&gt;[2]&lt;/a&gt; If you like this
stuff, there’s some really cool research that studies the
metric spaces that arise from network topologies; one of my favorite
papers is &lt;a href=&quot;http://www.eecs.harvard.edu/~syrah/nc/wild07.pdf&quot;&gt;“Network Coordinates in the
Wild”&lt;/a&gt; by Ledlie et
al. in NSDI 2007, which applies network coordinates to the (real-world, “production”) Azureus
file sharing network.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;render-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#render-note&quot;&gt;[3]&lt;/a&gt; My apologies
for overlaying the cross-AZ and the us-east results on top of the
cross-region data. I looked into pinning each of these two clusters in
designated locations but was only able to pin one at a time and
eventually gave up, settling for the effect that auto-adjusts the label
size.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;latency-lit-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#latency-lit-note&quot;&gt;[4]&lt;/a&gt; I don’t
study networks (rather, I spend my time building systems on top of
them), but there’s a lot of ongoing work on alleviating these
problems. For a short position paper regarding what &lt;em&gt;should&lt;/em&gt; be
possible and what we may need to fix, check out &lt;a href=&quot;http://www.scs.stanford.edu/~rumble/papers/latency_hotos11.pdf&quot;&gt;“It’s Time For Low
Latency”&lt;/a&gt;
by Rumble et al. in HotOS 2011.&lt;/span&gt;&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>HAT, not CAP: Introducing Highly Available Transactions</title>
   <link href="http://bailis.org/blog//hat-not-cap-introducing-highly-available-transactions/"/>
   <updated>2013-02-05T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//hat-not-cap-introducing-highly-available-transactions</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: &lt;a href=&quot;http://arxiv.org/pdf/1302.0309.pdf&quot;&gt;Highly Available
 Transactions&lt;/a&gt; show it’s
 possible to achieve many of the transactional guarantees of today’s
 databases without sacrificing high availability and low latency.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;cap-and-acid&quot;&gt;CAP and ACID&lt;/h4&gt;

&lt;p&gt;Distributed systems designers face hard trade-offs between factors
like latency, availability, and consistency. Perhaps most famously,
the &lt;a href=&quot;http://en.wikipedia.org/wiki/CAP_theorem&quot;&gt;CAP Theorem&lt;/a&gt; dictates
that it is impossible to achieve “consistency” while remaining
available in the presence of network and system partitions.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cap-note&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; Further, even without
partitions, there is &lt;a href=&quot;http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html&quot;&gt;a trade-off between response time and
consistency&lt;/a&gt;. These
fundamental limitations mean distributed databases can’t have it all,
and the limitations aren’t simply theoretical: across datacenters, the
penalties for strong consistency are on the order of &lt;a href=&quot;http://highscalability.com/numbers-everyone-should-know&quot;&gt;hundreds of
milliseconds&lt;/a&gt;
(compared to single-digit latencies for weak consistency) and, in
general, unavailability takes the form of a
&lt;a href=&quot;http://en.wikipedia.org/wiki/HTTP_404&quot;&gt;404&lt;/a&gt; or &lt;a href=&quot;http://www.whatisfailwhale.info/&quot;&gt;Fail
Whale&lt;/a&gt; on a website. Over twelve
years after Eric Brewer &lt;a href=&quot;http://www.eecs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf&quot;&gt;first stated the CAP
Theorem&lt;/a&gt;
(and after &lt;a href=&quot;http://www.rfc-editor.org/rfc/rfc677.txt&quot;&gt;decades of building distributed database
systems&lt;/a&gt;), data store
designers have taken CAP to heart, some choosing consistency and
others choosing availability and low latency.&lt;/p&gt;

&lt;p&gt;While the CAP Theorem is fairly well understood, the relationship
between CAP and &lt;a href=&quot;http://en.wikipedia.org/wiki/ACID&quot;&gt;ACID transactions&lt;/a&gt;
is not. If we consider the current lack of highly available systems
providing arbitrary multi-object operations with ACID-like semantics,
it appears that CAP and transactions are incompatible. This is partly
due to the historical design of distributed database systems, which
typically chose consistency over high availability. Standard database
techniques like &lt;a href=&quot;http://en.wikipedia.org/wiki/Two-phase_locking&quot;&gt;two-phase
locking&lt;/a&gt; and
&lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter4.pdf&quot;&gt;multi-version concurrency
control&lt;/a&gt;
do not typically perform well in the event of partial failure, and the
master-based (i.e., master-per-shard) and overlapping quorum-based
techniques often adopted by &lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter8.pdf&quot;&gt;many distributed database
designs&lt;/a&gt;
are similarly unavailable if users are partitioned from the anointed
primary copies.&lt;/p&gt;

&lt;h4 id=&quot;hats-for-everyone&quot;&gt;HATs for Everyone&lt;/h4&gt;

&lt;p&gt;&lt;a href=&quot;http://arxiv.org/pdf/1302.0309.pdf&quot;&gt;In recent research at UC Berkeley&lt;/a&gt;,
we show that high availability and transactions are &lt;em&gt;not&lt;/em&gt; mutually
exclusive: it is possible to match the semantics provided by many of
today’s “ACID” and “NewSQL” databases without sacrificing high
availability. While these Highly Available Transactions (HATs) do not
provide serializability—which cannot be achieved with high
availability under arbitrary read/write transactions—&lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;as I blogged about last
week&lt;/a&gt;, many ACID databases
provide only a weaker form of isolation anyway. The problem is that these
databases do not &lt;em&gt;implement&lt;/em&gt; their guarantees using highly available
algorithms. However, as our recent results demonstrate, we &lt;em&gt;can&lt;/em&gt;
implement these guarantees and achieve other useful properties without
giving up high availability or having to incur cross-replica (or, in a
georeplicated scenario, cross-datacenter) latencies.&lt;/p&gt;

&lt;p&gt;At a high level, HATs provide several guarantees that can be achieved
with high availability&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#availability-note&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; for arbitrary read/write
transactions across a given set of data items, irrespective of data
layout:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Transactional atomicity across arbitrary data items (e.g., see all
or none of a transaction’s updates, or “A” in “ACID”), regardless of
how many shards a transaction accesses and without using a master.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;ANSI-compliant Read Committed and Repeatable Read isolation levels&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#ansi-note&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;
(“I” in “ACID” matching many existing databases).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Session guarantees including read-your-writes, monotonic reads
(i.e., time doesn’t go backwards), and causality within and across
transactions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Eventual consistency, meaning that, if writes to a data item stop,
all transaction reads will eventually return the last written value.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
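To give a flavor of guarantee 3, here is a minimal, illustrative sketch (not the algorithms from the paper) of two session guarantees that remain achievable with high availability: read-your-writes and monotonic reads. The client remembers the highest version it has observed per key and skips replicas that are staler than that; all class and method names are invented for illustration.

```python
# Toy versioned replicas plus a session-aware client. The session guarantee
# comes entirely from client-side bookkeeping, with no master or coordination.

class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def write(self, key, version, value):
        cur_version, _ = self.store.get(key, (0, None))
        if version > cur_version:        # keep only the newest version
            self.store[key] = (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))

class SessionClient:
    def __init__(self, replicas):
        self.replicas = replicas
        self.seen = {}                    # key -> highest version observed
        self.clock = 0                    # per-client version counter

    def write(self, key, value):
        self.clock += 1
        self.seen[key] = self.clock
        # Send to a single replica; propagation to the rest may lag.
        self.replicas[-1].write(key, self.clock, value)

    def read(self, key):
        # Accept the first replica at least as fresh as this session has
        # already seen; a real system might instead block or answer from a
        # locally cached copy rather than fall through to a stale value.
        best_version, best_value = 0, None
        for r in self.replicas:
            version, value = r.read(key)
            if version >= self.seen.get(key, 0):
                self.seen[key] = version
                return value
            if version > best_version:
                best_version, best_value = version, value
        return best_value                 # degraded: freshest reachable copy
```

For example, after `c.write("x", "v1")`, `c.read("x")` returns `"v1"` even when the first replica contacted has not yet received the write, because the session's version bookkeeping forces the read past the stale copy.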

&lt;p&gt;We believe that this is the strongest set of guarantees that has been
provided with high availability, and many of the algorithms—like those for the
atomicity and isolation guarantees—are brand new, largely because
they don’t use masters or other coordination on transactions’ fast
paths. &lt;a href=&quot;http://arxiv.org/abs/1302.0309&quot;&gt;The brief report we just
released&lt;/a&gt; runs slightly over five
pages and includes proof-of-concept algorithms for each guarantee.&lt;/p&gt;

&lt;h4 id=&quot;trade-offs&quot;&gt;Trade-offs&lt;/h4&gt;

&lt;p&gt;Of course, there are several guarantees that HATs cannot provide. Not
even the best of marketing teams can produce a real database that
“beats CAP”; HATs cannot make guarantees on data recency during
partitions, although, in the absence of partitions, data &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;may not be
very stale&lt;/a&gt;. HATs cannot be “100%
ACID compliant” as they cannot guarantee serializability, yet they
meet the default and sometimes maximum guarantees of many “ACID”
databases. HATs cannot guarantee global integrity constraints
(e.g., uniqueness constraints across data items) but can perform local
checking of predicates (e.g., per-record integrity maintenance like
null value checks). In the report, we classify many of these anomalies
in terms of previously documented isolation levels.&lt;/p&gt;

&lt;p&gt;Are these guarantees worthwhile? If users need high availability or
low latency, HATs provide a set of semantics that is stronger than any
existing highly available data store. If users need strong consistency
guarantees, they will need to accept the possibility of unavailability
and expect to pay at least one round trip time for each of their
operations. As an example, people often ask me about &lt;a href=&quot;http://www.wired.com/wiredenterprise/2012/11/google-spanner-time/&quot;&gt;Spanner, from
Google&lt;/a&gt;. Spanner
provides strong consistency and typically low latency read-only
transactions. Users that are partitioned from the majority of Spanner
nodes will experience unavailability and read-write transactions will
incur WAN latencies due to Spanner’s two-phase locking
mechanism. Spanner’s authors don’t hide these facts—for example,
&lt;a href=&quot;http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf&quot;&gt;look at Table 6 on page 12 of the
paper&lt;/a&gt;:
read/write transactions are between 8.3 and 11.9 times slower than
read-only transactions. For Google, which has optimized its WAN
networks, atomic clocks, and infrastructure engineering, and whose
workload (also in Table 6) consists of over 98% read-only
transactions, Spanner makes sense. When high availability and
guaranteed low latency matter, even Google might choose a different
architecture.&lt;/p&gt;

&lt;h4 id=&quot;coming-soon&quot;&gt;Coming soon&lt;/h4&gt;

&lt;p&gt;Our work on HATs at Berkeley is just beginning. We’re benchmarking a
HAT prototype and are tuning our algorithms for performance and
scalability. Once the algorithms are better explored, I would
personally like to help integrate HATs into existing data stores, much
as we recently did with our &lt;a href=&quot;http://www.bailis.org/blog/using-pbs-in-cassandra-1.2.0/&quot;&gt;PBS work in
Cassandra&lt;/a&gt;. It’d be
interesting to port an application running on Oracle Database to a
NoSQL store and provide the same semantic guarantees with
substantially improved performance, availability, and cost
effectiveness. We’re also working on additional theoretical results to
further explain HATs in the context of CAP. I plan to share these
results as we develop them further.&lt;/p&gt;

&lt;p&gt;In the meantime, we’d welcome feedback on our work so far and are
curious where HATs make sense in your stack. If you’re an application
developer who wishes she had transactional atomicity or weak
isolation, a distributed database developer interested in HATs, or you
just think HATs are cool, let us know. We’re always looking for
anecdotes, workloads, and good conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is Part Two of a two part series on Transactions and
Availability.&lt;br /&gt; &lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;Part One: When is ACID
ACID?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This research is joint work with &lt;a href=&quot;http://www.cs.usyd.edu.au/~fekete&quot;&gt;Alan
 Fekete&lt;/a&gt;, &lt;a href=&quot;http://www.cs.berkeley.edu/~alig/&quot;&gt;Ali
 Ghodsi&lt;/a&gt;, &lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot;&gt;Joe
 Hellerstein&lt;/a&gt;, and &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/&quot;&gt;Ion
 Stoica&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;span id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;cap-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cap-note&quot;&gt;[1]&lt;/a&gt; &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;As formally
proven by Gilbert and
Lynch&lt;/a&gt;,
the CAP Theorem states that
&lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;linearizability&lt;/a&gt; and
high availability are incompatible. Linearizability is often called
“atomicity,” yet “atomicity” means something different in database
parlance, namely, we see all or none of a transaction’s updates. For
clarity, I’ll call “ACID atomicity” “transactional atomicity.”&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;availability-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#availability-note&quot;&gt;[2]&lt;/a&gt;&amp;nbsp;As we
discuss in Section 2 of the paper, we have to be careful how to define
&quot;high availability&quot;: a system that always aborts all transactions is,
in a sense, &quot;available,&quot; if not very useful. In short, we say that a
system provides high availability if every transaction that can
contact at least one server for each data item in the transaction
eventually commits (or alternatively, aborts itself due to an internal
integrity constraint violation). The system is not allowed to
indefinitely abort transactions for the purposes of maintaining
availability.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;ansi-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#ansi-note&quot;&gt;[3]&lt;/a&gt;&amp;nbsp;If you&#39;re a
database nut, you may object that the ANSI SQL definitions are
&lt;a href=&quot;http://ftp.research.microsoft.com/pub/tr/tr-95-51.pdf&quot;&gt;notoriously
underspecified&lt;/a&gt;. However,
rest assured that HAT &quot;Read Committed&quot; matches all of the definitions
we&#39;ve found in the literature, including those by &lt;a href=&quot;http://ftp.research.microsoft.com/pub/tr/tr-95-51.pdf&quot;&gt;Berenson et
al. (SIGMOD 1995)&lt;/a&gt; and
&lt;a href=&quot;http://www.pmg.lcs.mit.edu/~adya/pubs/phd.pdf&quot;&gt;Adya (MIT Ph.D. thesis, ICDE
2000)&lt;/a&gt;. HAT &quot;Repeatable
Read&quot;—and &quot;Repeatable Read&quot; interpretations in general—is more
complicated; HAT &quot;Repeatable Read&quot; does match the ANSI spec, and we
provide a detailed discussion in Section 4 of the paper.&lt;/span&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>When is "ACID" ACID? Rarely.</title>
   <link href="http://bailis.org/blog//when-is-acid-acid-rarely/"/>
   <updated>2013-01-22T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//when-is-acid-acid-rarely</id>
   <content type="html">&lt;style&gt;

table {
border: 1px solid black;
border-spacing:0px;
width: 100%;
}

td.serializable {
background-color: #EEE;
}

td {
padding: 4px;
text-align: center;
border-bottom: 1px solid black;
padding-right:10px;
}

.dbname {
text-align: left;
padding-right: 24px;
}

a.tablelink:link {text-decoration: none; color: black; }
a.tablelink:hover {text-decoration: none; color: #666;}

#legendbox {
font-style: italic;
text-align: left;
width: 420px;
}

#legendlabel {
font-weight: bold;
text-align: left;
}

&lt;/style&gt;

&lt;p&gt;&lt;em&gt;tl;dr: ACID and NewSQL databases rarely provide true ACID guarantees
 by default, if they are supported at all. See &lt;a href=&quot;#acidtable&quot;&gt;the
 table&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Many databases today differentiate themselves from their NoSQL
counterparts by claiming to support &lt;a class=&quot;no-decorate&quot; href=&quot;http://www.nuodb.com/explore/sql-cloud-database-product/&quot;&gt;“100%
ACID”&lt;/a&gt; transactions or by &lt;a class=&quot;no-decorate&quot; href=&quot;http://www.aerospike.com/performance/acid-compliance/&quot;&gt;“guaranteeing
strong consistency (ACID).”&lt;/a&gt; In reality, few of these
databases—including traditional “big iron” systems like
Oracle—provide formal ACID guarantees, &lt;a href=&quot;http://docs.oracle.com/cd/E11882_01/server.112/e10713/transact.htm#i1666&quot;&gt;even when they claim to do so&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The textbook definition of ACID Isolation is &lt;a class=&quot;no-decorate&quot; href=&quot;http://en.wikipedia.org/wiki/Serializability&quot;&gt;serializability&lt;/a&gt;
(e.g., &lt;a href=&quot;http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf&quot;&gt;Architecture
of a Database System&lt;/a&gt;, Section 6.2), which states that the outcome
of executing a set of transactions should be equivalent to some serial
execution of those transactions. This means that each transaction gets
to operate on the database as if it were running by itself, which &lt;a class=&quot;no-decorate&quot; href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter1.pdf&quot;&gt;ensures
database correctness, or consistency&lt;/a&gt;. A database with
serializability (“I” in ACID), provides arbitrary read/write
transactions and guarantees consistency (“C” in ACID), or correctness,
of the database. Without serializability, ACID, particularly
consistency, is generally&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#arbitrary-note&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; not guaranteed.&lt;/p&gt;
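The textbook definition above has a standard mechanical test: build a precedence (conflict) graph with an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj on the same item; the schedule is conflict-serializable iff that graph is acyclic. The sketch below models a schedule as a list of (transaction, operation, item) tuples, with operations in {"r", "w"}.

```python
# Conflict-serializability check via precedence-graph cycle detection.

def conflicts(op1, op2):
    # r/w, w/r, and w/w pairs conflict; r/r pairs do not.
    return op1 == "w" or op2 == "w"

def is_conflict_serializable(schedule):
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and conflicts(op1, op2):
                edges.add((t1, t2))       # t1's op precedes a conflicting op of t2
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    visiting, done = set(), set()
    def has_cycle(node):                  # depth-first cycle detection
        if node in visiting:
            return True
        if node in done:
            return False
        visiting.add(node)
        if any(has_cycle(n) for n in graph.get(node, ())):
            return True
        visiting.discard(node)
        done.add(node)
        return False
    return not any(has_cycle(n) for n in list(graph))

# T2 overwrites x between T1's read and T1's write: no serial order of
# T1 and T2 produces the same conflicts, so the check reports False.
print(is_conflict_serializable(
    [("T1", "r", "x"), ("T2", "w", "x"), ("T1", "w", "x")]))  # False
```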

&lt;p&gt;Nevertheless, most publicly available databases (often claiming to
provide “ACID” transactions) do not provide serializability. I’ve
compiled the isolation guarantees provided by 18 popular databases
below (sources hyperlinked). Only three of 18 databases provide
serializability by default, and only nine provide serializability as an
option at all (shaded):&lt;/p&gt;

&lt;center&gt;
&lt;table id=&quot;acidtable&quot;&gt;
&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;b&gt;Database&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;Default Isolation&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;Maximum Isolation&lt;/b&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://docs.actian.com/ingres/10s/database-administrator-guide/2349-isolation-levels&quot;&gt;Actian Ingres 10.0/10S&lt;/a&gt;&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.aerospike.com/performance/acid-compliance/&quot;&gt;Aerospike&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.akiban.com/ak-docs/admin/persistit/Transactions.html&quot;&gt;Akiban Persistit&lt;/a&gt;&lt;/td&gt;&lt;td&gt;SI&lt;/td&gt;&lt;td&gt;SI&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.clustrix.com/Portals/146389/docs/Clustrix_System_Administrators_Guide_v4.1.pdf&quot;&gt;Clustrix CLX 4100&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RR&lt;/td&gt;&lt;td&gt;?&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://media.gpadmin.me/wp-content/uploads/2012/11/GPDBAGuide.pdf&quot;&gt;Greenplum 4.1&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=%2Fcom.ibm.db2z10.doc.perf%2Fsrc%2Ftpc%2Fdb2z_chooseisolationoption.htm&quot;&gt;IBM DB2 10 for z/OS&lt;/a&gt;&lt;/td&gt;&lt;td&gt;CS&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://publib.boulder.ibm.com/infocenter/idshelp/v115/index.jsp?topic=%2Fcom.ibm.sqls.doc%2Fids_sqs_1161.htm&quot;&gt;IBM Informix 11.50&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Depends&lt;/td&gt;&lt;td&gt;RR&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://dev.mysql.com/doc/refman/5.6/en/set-transaction.html&quot;&gt;MySQL 5.6&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RR&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://developers.memsql.com/docs/1b/isolationlevel.html&quot;&gt;MemSQL 1b&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://msdn.microsoft.com/en-us/library/ms173763.aspx&quot;&gt;MS SQL Server 2012&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.nuodb.com/nuodb-online-documentation/references/r_Lang/r_Transactions.html&quot;&gt;NuoDB&lt;/a&gt;&lt;/td&gt;&lt;td&gt;CR&lt;/td&gt;&lt;td&gt;CR&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://docs.oracle.com/cd/B28359_01/server.111/b28318/consist.htm#autoId8&quot;&gt;Oracle 11g&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td&gt;SI&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://docs.oracle.com/cd/E17277_02/html/TransactionGettingStarted/isolation.html&quot;&gt;Oracle Berkeley DB&lt;/a&gt;&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://docs.oracle.com/cd/E17277_02/html/TransactionGettingStarted/isolation.html&quot;&gt;Oracle Berkeley DB JE&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RR&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.postgresql.org/docs/9.2/static/transaction-iso.html&quot;&gt;Postgres 9.2.2&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://help.sap.com/hana/html/sql_set_transaction.html&quot;&gt;SAP HANA&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td&gt;SI&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.scaledb.com/pdfs/ScaleDB_Cluster_Manual.pdf&quot;&gt;ScaleDB 1.02&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;https://voltdb.com/&quot;&gt;VoltDB&lt;/a&gt;&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td id=&quot;legendlabel&quot;&gt;&lt;span id=&quot;legendlabel&quot;&gt;Legend&lt;/span&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; id=&quot;legendbox&quot;&gt; RC: read committed, RR: repeatable read, S: serializability,&lt;br /&gt;SI: snapshot isolation, CS: cursor stability, CR: consistent read&lt;/td&gt;
&lt;/tr&gt;

&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Instead of providing serializability, many of these databases provide one
of several weaker variants,&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#weak-note&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; often even when marketing material and
documentation claim otherwise.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#oracle-note&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; There is no &lt;em&gt;fundamental&lt;/em&gt; reason why a
database shouldn’t &lt;em&gt;support&lt;/em&gt; serializability—&lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx&quot;&gt;we have the
algorithms&lt;/a&gt;,
and we’ve made great strides in improving ACID scalability.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#research-note&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; So why not
provide serializability by default, or, at the very least, offer it
as an option? One key factor is performance:
serializable isolation can limit concurrency; traditional techniques
such as two-phase locking are expensive compared to, say, &lt;a class=&quot;no-decorate&quot; href=&quot;http://diaswww.epfl.ch/courses/adms07/papers/GrayLocks.pdf&quot;&gt;taking
short read locks on data items&lt;/a&gt;. Additionally, it is &lt;a href=&quot;http://www.cs.cornell.edu/courses/CS614/2004sp/papers/DGS85.pdf&quot;&gt;impossible to
simultaneously achieve high availability and
serializability&lt;/a&gt;
(though most of these database implementations are not highly
available anyway, even when providing weaker models). A third reason
is that transactions may be less likely to deadlock or abort due to
conflicts under weaker isolation. However, these benefits aren’t free:
the consistency anomalies that arise from the weak levels shown above
are &lt;a href=&quot;http://www.cse.iitb.ac.in/dbms/Data/Courses/CS632/2009/Papers/p492-fekete.pdf&quot;&gt;well-understood&lt;/a&gt;
and &lt;a href=&quot;http://www.vldb.org/pvldb/2/vldb09-185.pdf&quot;&gt;quantifiable&lt;/a&gt;.&lt;/p&gt;
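One of those well-understood anomalies is the classic lost update. Here is a deterministic toy simulation (not tied to any particular database) contrasting a read-committed-style interleaving, in which both transactions read the same initial value, with a serial execution.

```python
# Lost update: two deposits read the same balance before either writes,
# so the second write silently discards the first.

def interleaved_increments(balance, amounts):
    # Weak-isolation-style interleaving: every txn reads before any writes.
    snapshots = [balance for _ in amounts]          # all txns read the same value
    for snap, amt in zip(snapshots, amounts):
        balance = snap + amt                        # last writer wins
    return balance

def serial_increments(balance, amounts):
    for amt in amounts:
        balance = balance + amt                     # each txn sees the prior write
    return balance

print(interleaved_increments(100, [10, 10]))  # 110: one deposit is lost
print(serial_increments(100, [10, 10]))       # 120: the intended result
```

Serializable execution (or an explicit lock, as in SELECT FOR UPDATE) forces the second transaction to observe the first one's write, producing 120 rather than 110.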

&lt;p&gt;Where’s the silver lining? We &lt;em&gt;can&lt;/em&gt; get real ACID in some of our
databases (if not by default). And, despite the fact that many other
“ACID” databases don’t provide ACID properties—at least according to
decades of research and development and formally proven guarantees
regarding database correctness (although &lt;a href=&quot;https://twitter.com/CurtMonash/status/292120597947895808&quot;&gt;perhaps marketing has
rewritten the
books&lt;/a&gt;)—we
can still &lt;a href=&quot;http://www.oracle.com/us/corporate/customers/customersearch/sabre-holdings-1-gg-ss-1849966.html&quot;&gt;reserve
travel tickets&lt;/a&gt;, &lt;a href=&quot;http://www.oracle.com/us/corporate/customers/customersearch/bank-of-baroda-1-db-ss-1875825.html&quot;&gt;use
our bank accounts&lt;/a&gt;, and &lt;a href=&quot;http://www.oracle.com/us/corporate/press/1871463&quot;&gt;fight
crime&lt;/a&gt;. How? One possibility is that anomalies are rare and the
performance benefits of weak isolation outweigh the cost of
inconsistencies. Another possibility is that applications are
performing their own concurrency control external to the database;
database programmers can use commands like &lt;a href=&quot;http://dev.mysql.com/doc/refman/5.5/en/innodb-locking-reads.html&quot;&gt;SELECT FOR
UPDATE&lt;/a&gt;,
&lt;a href=&quot;http://dev.mysql.com/doc/refman/5.6/en/lock-tables.html&quot;&gt;manual LOCK
TABLE&lt;/a&gt;, and
&lt;a href=&quot;http://www.postgresql.org/docs/8.1/static/ddl-constraints.html&quot;&gt;UNIQUE
constraints&lt;/a&gt;
to manually perform their own synchronization. The answer is likely a
mix of each, but, stepping back, these strategies should remind you of
what’s often done today in NoSQL-style data infrastructure: &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;“good
enough” consistency&lt;/a&gt; and some
hand-rolled, application-specific concurrency control. Perhaps there’s a better
question: when is “ACID” NoSQL?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is Part One of a two part series on Transactions and
Consistency.&lt;br /&gt; &lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions&quot;&gt;Part Two: recent research on Highly Available
Transactions
(HATs).&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to &lt;a href=&quot;http://www.neilconway.org/&quot;&gt;Neil Conway&lt;/a&gt;, &lt;a href=&quot;http://www.cs.berkeley.edu/~alig/&quot;&gt;Ali
 Ghodsi&lt;/a&gt;, and &lt;a href=&quot;http://www.cs.usyd.edu.au/~fekete&quot;&gt;Alan
 Fekete&lt;/a&gt; for early feedback on this
 post.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;span id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/span&gt;&lt;/p&gt;

&lt;p&gt; &lt;span class=&quot;footnote&quot; id=&quot;arbitrary-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#arbitrary-note&quot;&gt;[1]&lt;/a&gt; There’s a considerable
amount of research focusing on how to provide ACID consistency without
serializability. As an example, we can restrict the types of
operations that transactions can perform, as in
&lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.3821&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;escrow&lt;/a&gt;
and
&lt;a href=&quot;http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1701989&quot;&gt;read-only&lt;/a&gt;
transactions and with &lt;a href=&quot;http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper35.pdf&quot;&gt;monotonic
logic&lt;/a&gt;. We
can also consider hypothetical databases that introduce dummy
transactions to fill in anomalous behavior in the serial schedule,
which would be silly but technically serializable. The systems in
question don’t (usually) provide these sorts of “special-case”
ACID-compliant transactions as features.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;weak-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#weak-note&quot;&gt;[2]&lt;/a&gt; There are a
bunch of different weak isolation models to consider, but their
definitions often vary depending on where you look. In this table,
when necessary, I’ve mapped the stated guarantees back to a known
model (e.g.,
&lt;a href=&quot;http://www.aerospike.com/performance/acid-compliance/&quot;&gt;Aerospike&lt;/a&gt;); in
any event, only the databases marked as such provide
serializability. The best vendor documentation will tell you exactly
what is implemented, even if the description doesn’t match the
name (see &lt;a href=&quot;#oracle-note&quot;&gt;Footnote 3&lt;/a&gt;). If you like database theory,
the best description of these levels I’ve seen, describing both
multi-version and lock-based databases, is &lt;a href=&quot;http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TR-786.pdf&quot;&gt;Atul Adya’s MIT
Ph.D. thesis from 1999&lt;/a&gt;.&lt;/span&gt; &lt;/p&gt;

&lt;p&gt; &lt;span class=&quot;footnote&quot; id=&quot;oracle-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#oracle-note&quot;&gt;[3]&lt;/a&gt; As a detailed
example of what can happen, consider Oracle 11g. (Admittedly, I’m
picking on Oracle, due mostly to the wealth of available information.)
11g’s strongest isolation level is called “serializable,” while &lt;a href=&quot;http://docs.oracle.com/cd/B28359_01/server.111/b28318/consist.htm#BABIJEJI&quot;&gt;its
description&lt;/a&gt;
matches &lt;a href=&quot;http://en.wikipedia.org/wiki/Snapshot_isolation&quot;&gt;snapshot
isolation&lt;/a&gt;. This
behavior is well-documented in both the &lt;a href=&quot;http://www.cse.iitb.ac.in/dbms/Data/Courses/CS632/2009/Papers/p492-fekete.pdf&quot;&gt;academic
literature&lt;/a&gt;
and &lt;a href=&quot;http://iggyfernandez.wordpress.com/2010/09/20/dba-101-what-does-serializable-really-mean/&quot;&gt;by
practitioners&lt;/a&gt;. For
more fun, try to figure out what can happen when you &lt;a href=&quot;http://docs.oracle.com/cd/E11882_01/server.112/e25494/ds_txnman010.htm&quot;&gt;execute
distributed
transactions&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;research-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#research-note&quot;&gt;[4]&lt;/a&gt; As an
example, check out &lt;a href=&quot;http://en.wikipedia.org/wiki/Michael_Stonebraker&quot;&gt;Michael
Stonebraker&lt;/a&gt; and
&lt;a href=&quot;http://cs.brown.edu/~pavlo/&quot;&gt;Andy Pavlo&lt;/a&gt;’s research on the 
&lt;a href=&quot;http://hstore.cs.brown.edu/&quot;&gt;HStore project&lt;/a&gt; (commercialized via VoltDB) or
&lt;a href=&quot;http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf&quot;&gt;Google’s
Spanner&lt;/a&gt;. Each
of these systems makes trade-offs (e.g., Spanner still uses two-phase
locking for read-write transactions, which is expensive over wide-area
networks, and doesn’t support transaction-level read-your-write semantics)
but is pushing the limits of true ACID scalability.&lt;/span&gt;&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Using PBS in Cassandra 1.2.0</title>
   <link href="http://bailis.org/blog//using-pbs-in-cassandra-1.2.0/"/>
   <updated>2013-01-14T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//using-pbs-in-cassandra-1.2.0</id>
   <content type="html">&lt;p&gt;With the help of the Cassandra community, we &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-4261&quot;&gt;recently
released&lt;/a&gt; PBS
consistency predictions as a feature in the official &lt;a href=&quot;https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-1.2.0&quot;&gt;Cassandra 1.2.0
stable
release&lt;/a&gt;. In
case you aren’t familiar, &lt;a href=&quot;http://pbs.cs.berkeley.edu#demo&quot;&gt;PBS (Probabilistically Bounded Staleness)
predictions&lt;/a&gt; help answer questions
like: how eventual is eventual consistency? how consistent is eventual
consistency? These predictions help you profile your existing
Cassandra cluster and determine which configuration of N, R, and W is
the best fit for your application, expressed quantitatively in terms
of latency, consistency, and durability (see &lt;a href=&quot;#pbsoutput&quot;&gt;output below&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;There are several resources for understanding the theory behind PBS,
including &lt;a href=&quot;http://vimeo.com/37758648&quot;&gt;talks&lt;/a&gt;, &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;a
demo&lt;/a&gt;,
&lt;a href=&quot;http://www.bailis.org/talks/twitter-pbs.pdf&quot;&gt;slides&lt;/a&gt;, and an
&lt;a href=&quot;http://www.bailis.org/papers/pbs-vldb2012.pdf&quot;&gt;academic paper&lt;/a&gt;. We’ve
used PBS to look at the effect of SSDs and disks, wide-area networks,
and compare different web services’ data store deployments. My goal in
this post is to show how to profile an existing cluster and briefly
explain what’s going on behind the scenes. If you prefer, you can
download a (mostly) &lt;a href=&quot;http://www.bailis.org/blog/post_data/2013-01-14/pbs-1.2.0-demo.sh&quot;&gt;fully automated demo script&lt;/a&gt; instead.&lt;/p&gt;

&lt;h2 id=&quot;step-one-get-a-cassandra-cluster&quot;&gt;Step One: Get a Cassandra cluster.&lt;/h2&gt;

&lt;p&gt;The PBS predictor provides custom consistency and latency predictions
based on observed latencies in deployed clusters. To gather data for
predictions, we need a cluster to profile. If you have a cluster
running 1.2.0, you can skip these instructions.&lt;/p&gt;

&lt;p&gt;The easiest way to spin up a cluster for testing is to use
&lt;a href=&quot;https://github.com/pcmanus/ccm&quot;&gt;&lt;code&gt;ccm&lt;/code&gt;&lt;/a&gt;. Let’s start a 5-node
Cassandra cluster running on localhost:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;git clone https://github.com/pcmanus/ccm.git
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;ccm &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; sudo ./setup.py install
ccm create pbstest -v 1.2.0
ccm populate -n 5
ccm start
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CASS_HOST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;127.0.0.1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If &lt;code&gt;ccm start&lt;/code&gt; fails, you might need to initialize more loopback
interfaces (e.g., &lt;code&gt;sudo ifconfig lo0 alias 127.0.0.2&lt;/code&gt;)—see the &lt;a href=&quot;http://www.bailis.org/blog/post_data/2013-01-14/pbs-1.2.0-demo.sh&quot;&gt;script&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;step-two-enable-pbs-metrics-on-a-cassandra-server&quot;&gt;Step Two: Enable PBS metrics on a Cassandra server.&lt;/h2&gt;

&lt;p&gt;The PBS predictor works by profiling message latencies that it sees in
a production cluster. You only need to enable logging on a single
node, and all reads and writes that the node performs will be used in
predictions.&lt;/p&gt;

&lt;p&gt;The prediction module logs latencies in a circular buffer with a FIFO
eviction policy (default: 20,000 reads and writes). By default, this
logging is turned off, saving about 300 KB of memory. To turn it on, use
a JMX tool to call the &lt;code&gt;org.apache.cassandra.service.PBSPredictor&lt;/code&gt;
MBean’s &lt;code&gt;enableConsistencyPredictionLogging&lt;/code&gt; method. You can use
&lt;code&gt;jconsole&lt;/code&gt;&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#jconsole-note&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; or use a command-line JMX interface
like &lt;a href=&quot;http://wiki.cyclopsgroup.org/jmxterm/download&quot;&gt;&lt;code&gt;jmxterm&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;wget http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&amp;quot;run -b org.apache.cassandra.service:type=PBSPredictor enableConsistencyPredictionLogging&amp;quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;|&lt;/span&gt; java -jar jmxterm-1.0-alpha-4-uber.jar -l &lt;span class=&quot;nv&quot;&gt;$CASS_HOST&lt;/span&gt;:7100&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
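
&lt;p&gt;For intuition, the logging structure described above behaves like a
bounded FIFO buffer. Here’s a sketch in Python (the real implementation
is Java inside Cassandra’s &lt;code&gt;PBSPredictor&lt;/code&gt;; the names below
are mine):&lt;/p&gt;

```python
from collections import deque

# Sketch of a circular latency log with FIFO eviction: it keeps the most
# recent 20,000 operation latencies and silently drops the oldest entry
# once the buffer is full.
latency_log = deque(maxlen=20000)

def record(latency_ms):
    latency_log.append(latency_ms)  # oldest entry is evicted when full
```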

&lt;h2 id=&quot;step-three-run-a-workload&quot;&gt;Step Three: Run a Workload&lt;/h2&gt;

&lt;p&gt;The PBS predictor is entirely passive: it profiles the reads and
writes that are already occurring in the cluster. This means that
predictions don’t interfere with live requests but also means that we
need a workload running in order to get results.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#prediction-note&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;We can use the Cassandra stress tool; below, we execute 10,000 write
and then 10,000 read requests with a replication factor of three.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; ~/.ccm/repository/1.2.0/
chmod +x tools/bin/cassandra-stress
tools/bin/cassandra-stress -d &lt;span class=&quot;nv&quot;&gt;$CASS_HOST&lt;/span&gt; -l &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt; -n &lt;span class=&quot;m&quot;&gt;10000&lt;/span&gt; -o insert
tools/bin/cassandra-stress -d &lt;span class=&quot;nv&quot;&gt;$CASS_HOST&lt;/span&gt; -l &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt; -n &lt;span class=&quot;m&quot;&gt;10000&lt;/span&gt; -o &lt;span class=&quot;nb&quot;&gt;read&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id=&quot;step-four-run-predictions&quot;&gt;Step Four: Run predictions.&lt;/h2&gt;

&lt;p&gt;We can now connect to the node performing the profiling and have it
perform some Monte Carlo analysis for us. The consistency prediction
is triggered via JMX, but this time using the &lt;code&gt;nodetool&lt;/code&gt;
administration interface packaged with Cassandra:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;bin/nodetool -h &lt;span class=&quot;nv&quot;&gt;$CASS_HOST&lt;/span&gt; -p &lt;span class=&quot;m&quot;&gt;7100&lt;/span&gt; predictconsistency &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;100&lt;/span&gt; 1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here’s some sample output from a run on one of our clusters. You can
vary the replication factor, the amount of time you’d like to consider
after writes, and even multi-versioned staleness. Remember that, aside
from taking up some CPU on the predicting node, this profiling doesn’t
affect query performance:&lt;/p&gt;

&lt;div class=&quot;boundedbox20&quot; id=&quot;pbsoutput&quot;&gt;&lt;pre&gt;&lt;code&gt;Performing consistency prediction
100ms after a given write, with maximum version staleness of k=1
N=3, R=1, W=1
Probability of consistent reads: 0.678900
Average read latency: 5.377900ms (99.900th %ile 40ms)
Average write latency: 36.971298ms (99.900th %ile 294ms)

N=3, R=1, W=2
Probability of consistent reads: 0.791600
Average read latency: 5.372500ms (99.900th %ile 39ms)
Average write latency: 303.630890ms (99.900th %ile 357ms)

N=3, R=1, W=3
Probability of consistent reads: 1.000000
Average read latency: 5.426600ms (99.900th %ile 42ms)
Average write latency: 1382.650879ms (99.900th %ile 629ms)

N=3, R=2, W=1
Probability of consistent reads: 0.915800
Average read latency: 11.091000ms (99.900th %ile 348ms)
Average write latency: 42.663101ms (99.900th %ile 284ms)

N=3, R=2, W=2
Probability of consistent reads: 1.000000
Average read latency: 10.606800ms (99.900th %ile 263ms)
Average write latency: 310.117615ms (99.900th %ile 335ms)

N=3, R=3, W=1
Probability of consistent reads: 1.000000
Average read latency: 52.657501ms (99.900th %ile 565ms)
Average write latency: 39.949799ms (99.900th %ile 237ms)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id=&quot;conclusions-and-caveats&quot;&gt;Conclusions and Caveats&lt;/h2&gt;

&lt;p&gt;Once configured, the PBS predictions are both easy and fast to
run. The great thing about predictions is that they can be run
entirely off of the fast path; our PBS code module performs simple
message profiling (timestamp logging), then, when prompted, performs
forward prediction of how the system might behave in different
scenarios in the background. This is a fundamental algorithmic
property of the prediction problem, and, provided all nodes in the
system attach the required timestamps to messages, only one node has
to log data and perform predictions.&lt;/p&gt;
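
&lt;p&gt;As a rough illustration of the prediction step (this is a toy model
I’m writing for exposition, not the actual WARS model from our paper,
and &lt;code&gt;predict_consistency&lt;/code&gt; is a made-up name):&lt;/p&gt;

```python
import random

# Toy Monte Carlo consistency prediction: resample logged write latencies
# to model when the write reached each of the N replicas, then check
# whether a read of R replicas issued t_ms after the write would see it.
def predict_consistency(write_lats, n, r, t_ms, trials=10000):
    consistent = 0
    for _ in range(trials):
        # one sampled write-arrival latency per replica (with replacement)
        arrivals = random.choices(write_lats, k=n)
        # the read contacts r randomly chosen replicas
        contacted = random.sample(arrivals, r)
        # the read is fresh if any contacted replica got the write in time
        if any(t_ms >= a for a in contacted):
            consistent += 1
    return consistent / trials
```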

&lt;p&gt;Before I end, there are a few caveats to the current
implementation. (Warning: this is a bit technical.) First, we only
simulate non-local operations. In Cassandra, a node can act as a
coordinator and as a replica for a given operation. We only collect
data for operations for which the predicting node was a coordinator,
not a replica. This means that, for example, if the predicting node
serves all reads locally, we won’t have enough data for accurate
predictions. We made this choice because we’d otherwise have to
model coordinator and data accesses, which gets tricky in a running
cluster. Second, we don’t consider failures or hinted handoff (we do,
however, capture slow-node behavior). Third, we don’t differentiate between
column families or different data items. This (like the rest) was an
engineering decision that I’m sure we could change in future releases.&lt;/p&gt;

&lt;p&gt;Despite these limitations, I think the current functionality is useful
for getting a sense of how clusters are behaving and the potential
impact of replication parameters. Moreover, I’m confident that we can
fix the above issues if there’s enough interest. If you’re interested
in using, further developing, or learning more about this
functionality, please let me know and &lt;a href=&quot;http://www.bailis.org/pubs.html#pbs-talks&quot;&gt;we can
talk&lt;/a&gt;. We built this
implementation because we care about real-world research impact; let
us know what you think.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to &lt;a href=&quot;https://github.com/shivaram/&quot;&gt;Shivaram Venkataraman&lt;/a&gt;, who
 co-authored our patch, and the Cassandra community, particularly
 &lt;a href=&quot;https://twitter.com/spyced&quot;&gt;Jonathan Ellis&lt;/a&gt;, for being so
 accommodating.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/div&gt;

&lt;p&gt;
&lt;span class=&quot;footnote&quot; id=&quot;prediction-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#prediction-note&quot;&gt;[1]&lt;/a&gt; You &lt;em&gt;can&lt;/em&gt; run predictions without workloads, just not
within Cassandra. Take a look at &lt;a href=&quot;http://www.bailis.org/papers/pbs-vldb2012.pdf&quot;&gt;our
paper&lt;/a&gt; or some old
&lt;a href=&quot;https://github.com/pbailis/cassandra-pbs/blob/trunk/pbs/analyze_pbs.py&quot;&gt;Python
code&lt;/a&gt;.
&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;jconsole-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#jconsole-note&quot;&gt;[2]&lt;/a&gt; This is ugly, so I put the instructions down
here. Run &lt;code&gt;jconsole&lt;/code&gt; (if you used CCM, your 127.0.0.1 node will likely
have the lowest PID), click &lt;code&gt;MBeans&lt;/code&gt;, then
&lt;code&gt;org.apache.cassandra.service&lt;/code&gt; (bottom of the menu), &lt;code&gt;PBSPredictor&lt;/code&gt;,
&lt;code&gt;Operations&lt;/code&gt;, &lt;code&gt;enableConsistencyPredictionLogging&lt;/code&gt;, then click the
&lt;code&gt;enableConsistencyPredictionLogging&lt;/code&gt; button (screenshot
&lt;a href=&quot;http://www.bailis.org/blog/post_data/2013-01-14/enable-pbs-jmx.png&quot;&gt;here&lt;/a&gt;).
&lt;/span&gt;
&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Doing Redundant Work to Speed Up Distributed Queries</title>
   <link href="http://bailis.org/blog//doing-redundant-work-to-speed-up-distributed-queries/"/>
   <updated>2012-09-20T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//doing-redundant-work-to-speed-up-distributed-queries</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: In distributed data stores, redundant operations can
 dramatically drop tail latency at the expense of increased system
 load; different Dynamo-style stores handle this trade-off
 differently, and there’s room for improvement.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 10/2013: Cassandra has since added support for &lt;a href=&quot;http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2&quot;&gt;“speculative
  retry”&lt;/a&gt;–effectively,
  Dean’s suggestions applied to Dynamo reads, as described below.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 9/2014: Akka 2.3.5 &lt;a href=&quot;https://twitter.com/jboner/status/499857543704096768&quot;&gt;introduced support&lt;/a&gt; for this kind of speculative retry.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At scale, tail latencies matter. When serving high volumes of traffic,
even a minuscule fraction of requests corresponds to a large number of
operations.  Latency &lt;a href=&quot;http://perspectives.mvdirona.com/2009/10/31/TheCostOfLatency.aspx&quot;&gt;
has a huge impact on service quality&lt;/a&gt;, and looking at the &lt;em&gt;average&lt;/em&gt;
service latency alone is often insufficient. Instead, folks running
high-performance systems at places like Amazon and Google look to the
tail when measuring their performance.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#tailnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; High variance often hides in distribution
tails: in a &lt;a href=&quot;http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf&quot;&gt;talk
at Berkeley last spring&lt;/a&gt;, Jeff Dean reported a 95th-percentile
latency of 24ms and a 99.9th percentile latency of 994ms in Google’s
BigTable service—a 42x difference!&lt;/p&gt;

&lt;p&gt;In distributed systems, there’s a subtle and somewhat underappreciated
strategy for reducing tail latencies: doing redundant work. If you
send the same request to multiple servers, (all else equal) you’re
going to get an answer back faster than waiting for a single
server. Waiting for, say, one of three servers to reply is often
faster than waiting for one of one to reply. The basic cause is
variance in modern service components: requests take different amounts
of time in the network and on different servers at different times.&lt;a class=&quot;no-decorate&quot; href=&quot;#poweroftwonote&quot;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; In Dean’s
experiments, BigTable’s 99.9th percentile latency dropped to 50ms when
he sent out a second, redundant request if the initial request hadn’t
come back in 10ms—a 40x improvement. While there’s a cost associated
with redundant work—increased service load—the load increase may
be modest. In the example I’ve mentioned, Dean recorded only a 5%
total increase in number of requests.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#deannote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
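
&lt;p&gt;The effect is easy to reproduce with a few lines of simulation. The
lognormal distribution here is my own stand-in for real latency
measurements, and &lt;code&gt;tail_latency&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```python
import random

# Toy model of redundant requests: per-request latency is drawn from a
# heavy-tailed lognormal distribution (an assumption, not real data).
# Sending the same request to k servers and taking the fastest reply
# replaces one draw from the tail with the minimum of k draws.
def percentile(xs, p):
    xs = sorted(xs)
    return xs[int(p * (len(xs) - 1))]

def tail_latency(k, p=0.999, trials=100000, seed=42):
    rng = random.Random(seed)
    samples = [min(rng.lognormvariate(1.5, 1.0) for _ in range(k))
               for _ in range(trials)]
    return percentile(samples, p)
```

&lt;p&gt;Under these (made-up) parameters, waiting for the fastest of three
replies cuts the 99.9th percentile severalfold relative to a single
request.&lt;/p&gt;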

&lt;p&gt;Learning about Google’s systems is instructional, but we can also
observe the trade-off between tail latency and load in several
publicly-available distributed data stores patterned on Amazon’s
influential Dynamo data store. In the &lt;a href=&quot;http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&quot;&gt;original
paper&lt;/a&gt;, Dynamo sends a client’s read and write requests to all
replicas for a given key. For writes, the system needs to update all
replicas anyway. For reads, requests are idempotent, so the system
doesn’t necessarily &lt;em&gt;need&lt;/em&gt; to contact all replicas—should it?
Sending read requests to all replicas results in a linear increase in
load compared to sending to the minimum required number of
replicas.&lt;sup&gt;&lt;a href=&quot;#consistencynote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; For
read-dominated workloads (like many internet applications), this
optimization has a cost. When is it worthwhile?&lt;/p&gt;

&lt;p&gt;Open-source Dynamo-style stores have different answers. Apache
Cassandra originally sent reads to all replicas, but
&lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-930&quot;&gt;CASSANDRA-930&lt;/a&gt;
and
&lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-982&quot;&gt;CASSANDRA-982&lt;/a&gt;
changed this: one commenter &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-982?focusedCommentId=12973721&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12973721&quot;&gt;argued
that&lt;/a&gt;
“in IO overloaded situations” it was better to send read requests only
to the minimum number of replicas. By default, Cassandra now sends
reads to the minimum number of replicas 90% of the time and to all
replicas 10% of the time, primarily for consistency purposes.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cassandrainternalsnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;
(Surprisingly, the relevant JIRA issues don’t even mention the latency
impact.)  LinkedIn’s Voldemort also uses a
&lt;a href=&quot;https://github.com/voldemort/voldemort/blob/master/src/java/voldemort/store/routed/PipelineRoutedStore.java#L186&quot;&gt;send-to-minimum&lt;/a&gt;
strategy (and has evidently done so &lt;a href=&quot;https://github.com/voldemort/voldemort/blob/fbd0f95d62ac2c5e97e5a4df5a732e9342d60da1/src/java/voldemort/store/routed/RoutedStore.java#L230&quot;&gt;since it was
open-sourced&lt;/a&gt;). In
contrast, Basho Riak chooses the “true” Dynamo-style &lt;a href=&quot;https://github.com/basho/riak_kv/blob/42eb6951b369e3fd9a42f7f54fb7618a40f1a9fb/src/riak_kv_get_fsm.erl#L153&quot;&gt;send-to-all&lt;/a&gt;
read policy.&lt;/p&gt;

&lt;p&gt;Who’s right? What do these choices mean for a real NoSQL deployment?
We can do a back-of-the-envelope analysis pretty easily. For &lt;a href=&quot;http://www.bailis.org/papers/pbs-vldb2012.pdf&quot;&gt;one of
our recent papers&lt;/a&gt; on
latency-consistency trade-offs in Dynamo style systems, we obtained
latency data from Yammer’s Riak clusters. If we run some simple Monte
Carlo analysis (script available
&lt;a href=&quot;https://github.com/pbailis/bailis.org-blog/blob/master/post_data/2012-09-20/dynamo-montecarlo.py&quot;&gt;here&lt;/a&gt;),
we see that—perhaps unsurprisingly—redundant work can have a big
effect on latencies. For example, at the 99.9th percentile, sending a
single read request to two servers instead of one is 17x faster than
sending to one—maybe worth the 2x load increase. Sending reads to
three servers and waiting for one is 30x faster. Pretty good!&lt;/p&gt;

&lt;center&gt;
&lt;table cellpadding=&quot;6&quot;&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td colspan=&quot;6&quot; align=&quot;center&quot;&gt;Requests sent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td rowspan=&quot;7&quot; align=&quot;center&quot; style=&quot;padding-right:10px;&quot;&gt;Responses&lt;br /&gt;waited for&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;1&lt;/b&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;2&lt;/b&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;3&lt;/b&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;4&lt;/b&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;5&lt;/b&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;1&lt;/b&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;170.0&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;10.7&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;5.6&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;4.8&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;4.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;2&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;200.6&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;33.9&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;6.5&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;5.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;3&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;218.2&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;50.0&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;7.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;4&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;231.1&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;59.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;5&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;242.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div style=&quot;margin-top: 5px;&quot;&gt;&lt;b&gt;99.9th percentile read latencies (in ms) for the Yammer Dynamo-style latency model.&lt;/b&gt;&lt;/div&gt;
&lt;/center&gt;

&lt;p&gt;The numbers above assume you send both requests at the same time, but
this need not be the case. For &lt;a href=&quot;https://github.com/pbailis/bailis.org-blog/blob/master/post_data/2012-09-20/dynamo-multirequest-montecarlo.py&quot;&gt;example&lt;/a&gt;, sending a second request if
the first hasn’t come back within 8ms results in a modest 4.2%
increase in requests sent and a 99.9th percentile read latency of
11.0ms. This is due to the long tail of the latency distributions we
see in Yammer’s clusters—we only have to speed up a small fraction
of queries to improve the overall performance.&lt;/p&gt;
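
&lt;p&gt;Here’s a sketch of that delayed-second-request policy (a toy model
under an assumed latency distribution, not the linked script;
&lt;code&gt;hedged&lt;/code&gt; is my own name):&lt;/p&gt;

```python
import random

# Toy model of a hedged second request: fire a backup only if the first
# reply has not arrived within hedge_ms. The effective latency is then
# min(first, hedge_ms + second); the latency distribution is whatever
# sampler the caller supplies, not real cluster data.
def hedged(draw, hedge_ms, trials=100000, seed=7):
    rng = random.Random(seed)
    lats, extra = [], 0
    for _ in range(trials):
        first = draw(rng)
        if first > hedge_ms:
            extra += 1  # a redundant request was sent
            lats.append(min(first, hedge_ms + draw(rng)))
        else:
            lats.append(first)
    lats.sort()
    p999 = lats[int(0.999 * (trials - 1))]
    return p999, extra / trials
```

&lt;p&gt;With a heavy-tailed sampler, only a small fraction of requests
trigger the backup, yet the 99.9th percentile drops substantially,
matching the pattern described above.&lt;/p&gt;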

&lt;p&gt;To preempt any dissatisfaction, I’ll admit that this analysis is
simplistic. First, I’m not considering the increased load on each
server due to sending multiple requests. The increased load may in
turn increase latencies, which would decrease the benefits we see
here. This effect depends on the system and workload. Second, I’m
assuming that each request’s latency is independently and identically
distributed. This means that each server behaves the same (according
to the Yammer latency distribution we have). This models a system of
equally loaded, equally powerful servers, but this too may be
different in practice.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cassandrasnitchnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; Third, with a different
latency distribution, the numbers will change. Real-world benchmarking
is the best source of truth, but this analysis is a starting point,
and you can easily play with different distributions of your own
either with the provided script or in your browser using &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;an older demo&lt;/a&gt; I built.&lt;/p&gt;

&lt;p&gt;Ultimately, the balance between redundant work and tail latency
depends on the application. However, in latency-sensitive environments
(particularly when there are serial dependencies between requests)
this redundant work has a massive impact. And we’ve only begun: we
don’t need to send to &lt;em&gt;all&lt;/em&gt; replicas to see benefits—even one
extra request can help—while delay and cancellation mechanisms like the ones
that Jeff Dean hints at can further reduce load penalties. There’s a
large amount of hard work and research to be done designing,
implementing, and battle-testing these strategies, but I suspect that
these kinds of techniques will have a substantial impact on future
large-scale distributed data systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to &lt;a href=&quot;https://github.com/shivaram/&quot;&gt;Shivaram Venkataraman&lt;/a&gt;, &lt;a href=&quot;http://www.cs.berkeley.edu/~alig/&quot;&gt;Ali
 Ghodsi&lt;/a&gt;, &lt;a href=&quot;http://www.eecs.berkeley.edu/~keo/&quot;&gt;Kay
 Ousterhout&lt;/a&gt;, &lt;a href=&quot;http://www.pwendell.com/&quot;&gt;Patrick
 Wendell&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/seancribbs&quot;&gt;Sean
 Cribbs&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com/argv0&quot;&gt;Andy
 Gross&lt;/a&gt; for assistance and feedback that
 contributed to this post.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;tailnote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#tailnote&quot;&gt;[1]&lt;/a&gt; There are &lt;a href=&quot;http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it&quot;&gt;many
studies&lt;/a&gt; of the importance of latency for different services. For
example, an &lt;a href=&quot;http://www.scribd.com/doc/4970486/Make-Data-Useful-by-Greg-Linden-Amazoncom&quot;&gt;often-cited
statistic&lt;/a&gt; is that an additional 100ms of latency cost Amazon 1% of
sales. In the systems community, David Andersen attributes this
sensitivity to tail latencies to both the original Dynamo paper and
Werner Vogels&#39;s subsequent evangelizing (buried in the comments &lt;a href=&quot;https://plus.google.com/115237092509505721130/posts/cf4sbedNd2W&quot;&gt;here&lt;/a&gt;).&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;poweroftwonote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#poweroftwonote&quot;&gt;[2]&lt;/a&gt; There&#39;s a large body of &lt;a href=&quot;http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf&quot;&gt;theoretical
research&lt;/a&gt; on &quot;the power of two choices&quot; that&#39;s related to this
phenomenon: if you select the less loaded of two randomly chosen
servers instead of randomly picking one, you can exponentially improve
a cluster&#39;s load balance. Theoreticians might quibble with this
analogy: after all, here, we&#39;re usually still sending requests to both
of the servers, and the original power of two work focuses on only
sending requests to the lighter-loaded server. However, this research
is still interesting to consider as a precursor to many of the
techniques here and as the start of a more rigorous understanding of
these trade-offs.&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;deannote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#deannote&quot;&gt;[3]&lt;/a&gt; Dean also describes the effect of different
delay and cancellation mechanisms. Cancellation seems tricky to get
right depending on the application. Intercepting and canceling a
lightweight read request is harder (i.e., needs to be faster) than,
say, canceling a slower, more complex query. &lt;a href=&quot;https://www.usenix.org/conference/hotcloud12/why-let-resources-idle-aggressive-cloning-jobs-dolly&quot;&gt;Recent
work&lt;/a&gt; from some of my colleagues at UC Berkeley demonstrates
alternative algorithms for limiting the overhead of redundant work in
general-purpose cluster computing frameworks like Hadoop.&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;consistencynote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#consistencynote&quot;&gt;[4]&lt;/a&gt; Determining the minimum number of
replicas to read from depends on the desired consistency of the
operation. This is a complicated subject and is related to (but too
involved for) the main discussion here. In short, if we denote the
number of replicas as &lt;em&gt;N&lt;/em&gt;, the number of replicas to block for
during reads as &lt;em&gt;R&lt;/em&gt; and equivalently for writes as &lt;em&gt;W&lt;/em&gt;,
&lt;em&gt;R&lt;/em&gt;+&lt;em&gt;W&lt;/em&gt; &amp;gt; &lt;em&gt;N&lt;/em&gt; means you&#39;ll read your writes (each
key acts as a &lt;a href=&quot;http://stackoverflow.com/a/8872960&quot;&gt;regular
register&lt;/a&gt;), while anything less gives you weak guarantees. For more
info, check out &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;some
research we recently did&lt;/a&gt; on these latency-consistency
trade-offs.&lt;/div&gt;
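
&lt;p&gt;The quorum-intersection rule in the footnote above is easy to
sanity-check by brute force (illustrative code, not from the post):&lt;/p&gt;

```python
from itertools import combinations

# Brute-force check of the R+W quorum rule: with N replicas, every
# R-sized read quorum intersects every W-sized write quorum exactly
# when R + W exceeds N.
def quorums_intersect(n, r, w):
    replicas = range(n)
    return all(set(rq).intersection(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))
```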

&lt;div class=&quot;footnote&quot; id=&quot;cassandrainternalsnote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cassandrainternalsnote&quot;&gt;[5]&lt;/a&gt; For readers
into hardcore Cassandra internals: currently, Cassandra fetches the
data from one server and requests &quot;digests,&quot; or value hashes, from the
remaining (&lt;em&gt;R-1&lt;/em&gt;) servers. In the case that the digests don&#39;t
match, Cassandra will perform another round of requests for the actual
values. Now, if a mechanism called &lt;a href=&quot;http://wiki.apache.org/cassandra/ReadRepair&quot;&gt;&lt;em&gt;read
repair&lt;/em&gt;&lt;/a&gt; is enabled, then Cassandra will randomly (as a
configurable parameter) send digest requests to all replicas (as
opposed to just the &lt;em&gt;R-1&lt;/em&gt;). In the original patch, the
&lt;em&gt;default&lt;/em&gt; probability of sending to all &lt;a href=&quot;https://github.com/apache/cassandra/blob/e5477338458c3a0229d4fbe659231002ac154583/src/java/org/apache/cassandra/config/CFMetaData.java#L52&quot;&gt;was
100%&lt;/a&gt; (all the time); &lt;a href=&quot;https://github.com/apache/cassandra/blob/a500e2835748f19d0c11bc3dfcecc71c50d9cf7e/src/java/org/apache/cassandra/config/CFMetaData.java#L67&quot;&gt;it
is now 10%&lt;/a&gt; (due&amp;#8212;somewhat cryptically&amp;#8212;to &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-3169&quot;&gt;CASSANDRA-3169&lt;/a&gt;). (Sources:
original patch &lt;a href=&quot;https://github.com/apache/cassandra/commit/e5477338458c3a0229d4fbe659231002ac154583#L5L394&quot;&gt;here&lt;/a&gt;;
latest code &lt;a href=&quot;https://github.com/apache/cassandra/blob/b38ca2879cf1cbf5de17e1912772b6588eaa7de6/src/java/org/apache/cassandra/service/StorageProxy.java#L859&quot;&gt;here&lt;/a&gt;;
filtering of endpoints done &lt;a href=&quot;https://github.com/apache/cassandra/blob/3a2faf9424769cfee5fdad25f4513611820ca980/src/java/org/apache/cassandra/service/ReadCallback.java#L97&quot;&gt;here&lt;/a&gt;)
This digest-based scheme is a substantial deviation from the Dynamo
design and exposes a number of other interesting trade-offs that
probably deserve further examination.&lt;/div&gt;
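The read path this footnote describes can be sketched as follows (an illustrative toy model only; the helper names and the max-timestamp reconciliation are my assumptions, not Cassandra's actual internals):

```python
import hashlib
import random

def digest(value):
    """Value hash standing in for Cassandra's digest (hypothetical)."""
    return hashlib.md5(value.encode()).hexdigest()

def digest_read(replicas, r, read_repair_chance=0.1):
    """Fetch data from one replica and digests from the other R-1.
    If any digest disagrees, do a second round for the full values."""
    data_replica, rest = replicas[0], replicas[1:r]
    value = data_replica["value"]
    mismatch = any(digest(rep["value"]) != digest(value) for rep in rest)
    if mismatch:
        # Second round: reconcile full values (here: keep the max timestamp).
        value = max(replicas[:r], key=lambda rep: rep["ts"])["value"]
    if read_repair_chance > random.random():
        pass  # placeholder: would send digest requests to all N replicas
    return value
```

Usage: `digest_read([{"value": "old", "ts": 1}, {"value": "new", "ts": 2}, {"value": "new", "ts": 2}], 3)` detects the digest mismatch and falls back to the full-value round.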

&lt;div class=&quot;footnote&quot; id=&quot;cassandrasnitchnote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cassandrasnitchnote&quot;&gt;[6]&lt;/a&gt; For example, Cassandra&#39;s
&quot;endpoint snitches&quot; keep track of which nodes are &quot;closest&quot; according
to several different, configurable dimensions, including &lt;a href=&quot;https://github.com/apache/cassandra/blob/a500e2835748f19d0c11bc3dfcecc71c50d9cf7e/src/java/org/apache/cassandra/locator/DynamicEndpointSnitch.java&quot;&gt;historical
latencies&lt;/a&gt;. Depending on the configuration, if a given node is slow
or overloaded, Cassandra may choose not to read from it. I haven&#39;t
seen a performance analysis of this strategy, but, at first glance, it
seems reasonable.&lt;/div&gt;

</content>
 </entry>
 
 <entry>
   <title>Safety and Liveness: Eventual Consistency Is Not Safe</title>
   <link href="http://bailis.org/blog//safety-and-liveness-eventual-consistency-is-not-safe/"/>
   <updated>2012-03-27T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//safety-and-liveness-eventual-consistency-is-not-safe</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: Eventual consistency is a liveness property—not a safety property—and is trivially satisfiable by itself. Liveness and safety properties should be taken together.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Safety and liveness are two important kinds of properties provided by &lt;a href=&quot;http://pi1.informatik.uni-mannheim.de/filepool/teaching/dependablesystems-2007/PDS_20070306.pdf&quot;&gt;all distributed systems&lt;/a&gt;. Informally, safety guarantees promise that nothing bad happens, while liveness guarantees promise that something good eventually happens. Every distributed system makes some form of safety and liveness guarantees, and some are stronger than others. For example, &lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;atomic consistency&lt;/a&gt; guarantees that operations will appear to happen instantaneously across the system (safety) but operations won’t always succeed in the presence of network partitions (liveness, in the form of availability).&lt;/p&gt;

&lt;p&gt;Many of today’s distributed systems promise &lt;a href=&quot;http://en.wikipedia.org/wiki/Eventual_consistency&quot;&gt;eventual consistency&lt;/a&gt;: after some period of time, all participants in the system agree on the same value. This is a useful property: good things will eventually happen without the need for intervention, even in the presence of partitions. However, under our definitions of safety and liveness, eventual consistency only provides liveness guarantees, not safety: Which value is eventually chosen? What values may be returned before participants “eventually” agree?&lt;/p&gt;

&lt;p&gt;As &lt;a href=&quot;http://www.cs.utexas.edu/users/princem/papers/cac-tr.pdf&quot;&gt;recent work from UT Austin&lt;/a&gt; points out, it’s easy to satisfy liveness without being useful. If all replicas always return the initial state, the system is eventually consistent. If all replicas return the value 42 in response to every request (even if you didn’t write the value of 42), the system is eventually consistent. If replicas accept only every thousandth write, the system is eventually consistent. These guarantees are plainly not what we want, but they satisfy our definition of eventual consistency. Moreover, as the authors explain, accepting more read/write combinations doesn’t necessarily translate to &lt;em&gt;stronger consistency&lt;/em&gt;. We’d like some notion of &lt;em&gt;convergence&lt;/em&gt; that captures both agreement on a common shared state and the exchange of writes.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#strengthnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
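To make the point concrete, here is a deliberately useless toy store (mine, not from the UT Austin paper) that nonetheless satisfies the letter of eventual consistency:

```python
# A deliberately useless key-value "store" that is trivially eventually
# consistent: every replica ignores writes and always returns 42, so all
# replicas agree at all times -- yet the store tells you nothing useful.
class FortyTwoStore:
    def write(self, key, value):
        pass            # drop every write on the floor

    def read(self, key):
        return 42       # replicas trivially "converge" on 42

s = FortyTwoStore()
s.write("x", "important data")
assert s.read("x") == 42   # agreement achieved; safety absent
```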

&lt;p&gt;Today’s eventually consistent systems do provide some form of safety properties, even if they don’t say so explicitly. For instance, in Riak, Cassandra, and DynamoDB, timestamp ordering is often used to decide which version of a data item to keep. Moreover, these data stores won’t return any values you haven’t written to them, and replicas will converge to the last written value for each key. In short, many “eventually consistent” stores really offer something like “eventually last-writer-wins, and read-the-last-observed-value in the meantime” consistency. This is both more descriptive and more useful than a vanilla “eventual consistency” guarantee.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#vendornote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
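The "eventually last-writer-wins" behavior can be sketched in a few lines (an illustrative model, not any particular store's implementation):

```python
# Sketch of last-writer-wins convergence: each replica holds a
# (timestamp, value) pair, and merging keeps the highest timestamp.
def lww_merge(a, b):
    """Return the winning (timestamp, value) pair."""
    return max(a, b)   # tuples compare by timestamp first

replica1 = (10, "hello")
replica2 = (12, "world")
# Merge order doesn't matter: both replicas converge to the latest write.
assert lww_merge(replica1, replica2) == (12, "world")
assert lww_merge(replica2, replica1) == (12, "world")
```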

&lt;p&gt;It’s worth noting that safety without convergence also leads to problems. Read-your-writes, PRAM/monotonic writes, and causal consistency guarantees are trivially achievable using only local storage and no communication: simply keep a local copy of every key you update, and serve every read from that copy. This is not a convergent implementation. However, it satisfies &lt;a href=&quot;http://www.allthingsdistributed.com/2008/12/eventually_consistent.html&quot;&gt;each of these consistency models&lt;/a&gt; because they make safety but not liveness guarantees. If we were to add in our liveness requirement of convergence, our implementation would have to propagate writes between replicas.&lt;/p&gt;
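The trivial, non-convergent implementation looks like this (a toy sketch; names are mine):

```python
# Sketch of the trivial local-only implementation: a client that serves
# reads from its own local copy satisfies read-your-writes (and monotonic
# reads/writes) without ever communicating with other replicas -- which is
# exactly why safety properties alone are not enough.
class LocalOnlyClient:
    def __init__(self):
        self.local = {}

    def write(self, key, value):
        self.local[key] = value   # never propagated to any other replica

    def read(self, key):
        return self.local.get(key)

c = LocalOnlyClient()
c.write("x", 1)
assert c.read("x") == 1   # read-your-writes holds; convergence does not
```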

&lt;p&gt;Next time someone tells you their system is “eventually consistent,” ask them two questions: What versions of a data item can be returned at any time? What version will the system eventually choose to return? And remember: consider safety and liveness properties together. Otherwise, you probably have a trivially satisfiable requirement.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was influenced in large part by discussions with &lt;a href=&quot;http://www.sics.se/~ali/&quot;&gt;Ali Ghodsi&lt;/a&gt;, &lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot;&gt;Joe Hellerstein&lt;/a&gt;, and &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/&quot;&gt;Ion Stoica&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;strengthnote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#strengthnote&quot;&gt;[1]&lt;/a&gt; Eventual convergence is likely the strongest convergence property we can guarantee given unbounded partition durations. Any system guaranteeing non-trivial convergence within a fixed amount of time would violate its liveness guarantees if partitioned for a longer period of time.&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;vendornote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#vendornote&quot;&gt;[2]&lt;/a&gt; In their technical documentation, vendors are usually forthcoming about these details, though they can be difficult to pick out. However, in promotional material and especially when making superficial comparisons, these distinctions are often omitted or glossed over.&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>A Running List: Writing, Speaking, and Research Advice</title>
   <link href="http://bailis.org/blog//a-running-list-writing-talking-and-research-advice/"/>
   <updated>2012-03-17T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//a-running-list-writing-talking-and-research-advice</id>
   <content type="html">&lt;p&gt;This is a growing list of other people’s advice that I’ve found
useful, posted here mostly for my own reference. (As if there
weren’t enough of these already.)&lt;/p&gt;

&lt;h2 id=&quot;on-writing&quot;&gt;On Writing&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/The_Elements_of_Style&quot;&gt;The Elements of Style&lt;/a&gt; by Strunk and White&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; A classic on good prose and English writing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.eecs.harvard.edu/margo/writing.html&quot;&gt;Pet Peeves for Writing&lt;/a&gt; by Margo Seltzer&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Read it. It’s short. Don’t do these things.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.cs.ucla.edu/~kohler/latex.html&quot;&gt;LaTeX Usage Notes&lt;/a&gt; by Eddie Kohler&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; A (beautifully typeset) document on proper LaTeX formatting, typography, and writing tips.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://matt.might.net/articles/shell-scripts-for-passive-voice-weasel-words-duplicates/&quot;&gt;Shell Scripts for Editing&lt;/a&gt; by Matt Might&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Automatically highlight passive voice, weasel words, and lexical illusions using these scripts. (&lt;a href=&quot;https://github.com/devd/Academic-Writing-Check&quot;&gt;Also good&lt;/a&gt;) Thanks to Colin Scott and Shaddi Hasan.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;on-speaking&quot;&gt;On Speaking&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://zachholman.com/posts/slide-design-for-developers/&quot;&gt;Slide Design for Developers&lt;/a&gt; by Zach Holman&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Practical tips for making your slides better: use color and huge fonts, and treat your slides as a prop, not a crutch.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;&lt;a href=&quot;http://www.wired.com/wired/archive/11.09/ppt2.html&quot;&gt;The Cognitive Style of PowerPoint&lt;/a&gt; by Edward Tufte&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Tufte provides many examples of how not to make slides, and his perspective on the “projector operating system” is valuable.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;on-research&quot;&gt;On Research&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;p&gt;&lt;a href=&quot;http://www.cs.virginia.edu/~robins/YouAndYourResearch.html&quot;&gt;You and Your Research&lt;/a&gt; by Richard Hamming&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; How to ask the right questions and scope your research for maximum impact by a guy who did both.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.lhup.edu/~DSIMANEK/cargocul.htm&quot;&gt;Cargo Cult Science&lt;/a&gt; by Richard Feynman&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Don’t let PR, hype, or zeitgeist interfere with real science.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://cs.berkeley.edu/~pattrsn/talks/BadCareer.pdf&quot;&gt;How to Have a Bad Career in Research/Academia&lt;/a&gt; by Dave Patterson&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Good advice on pitfalls of graduate school/academia and how to avoid them.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://matt-welsh.blogspot.com/2011/11/software-is-not-science.html&quot;&gt;Software is not science&lt;/a&gt; by Matt Welsh&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Systems research is about principles, not artifacts.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.1614&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Database Metatheory: Asking the Big Queries&lt;/a&gt;
by Christos Papadimitriou&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; An awesome reflection on theory versus practice by one of the great CS theoreticians.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>What's Wrong with Amazon's DynamoDB Pricing?</title>
   <link href="http://bailis.org/blog//whats-wrong-with-amazons-dynamodb-pricing/"/>
   <updated>2012-03-04T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//whats-wrong-with-amazons-dynamodb-pricing</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: There’s no good reason why strong consistency should cost double what eventual consistency costs. Strong consistency in DynamoDB shouldn’t cost Amazon anywhere near double and wouldn’t cost you double if you ran your own data store. While the benefit to your application may not be worth this high price, hiding consistency statistics encourages you to overpay.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Amazon recently released an open beta for &lt;a href=&quot;http://aws.amazon.com/dynamodb/&quot;&gt;DynamoDB&lt;/a&gt;, a hosted, “fully managed NoSQL database service.”  DynamoDB automatically handles database scaling, configuration, and replica maintenance under a pay-for-throughput model. Interestingly, DynamoDB costs &lt;a href=&quot;http://aws.amazon.com/dynamodb/pricing/&quot;&gt;twice as much for consistent reads&lt;/a&gt; compared to &lt;a href=&quot;http://en.wikipedia.org/wiki/Eventual_consistency&quot;&gt;eventually consistent&lt;/a&gt; reads. This means that if you want to be guaranteed to read the data you last wrote, you need to pay double what you could be paying otherwise.&lt;/p&gt;

&lt;p&gt;Now, I can’t come up with any good &lt;em&gt;technical&lt;/em&gt; explanation for why consistent reads cost 2x (and this is what I’m &lt;a href=&quot;http://bailis.org/research.html&quot;&gt;handsomely compensated&lt;/a&gt; to do all day). As far as I can tell, this is purely a &lt;em&gt;business&lt;/em&gt; decision that’s &lt;strong&gt;bad for users&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The cost of strong consistency to Amazon is low, if not zero. To you? 2x.&lt;/li&gt;
  &lt;li&gt;If you were to run your own distributed database, you wouldn’t incur this cost (although you’d have to factor in hardware and ops costs).&lt;/li&gt;
  &lt;li&gt;Offering a “consistent write” option instead would save you money and latency.&lt;/li&gt;
  &lt;li&gt;If Amazon provided SLAs so users knew how well eventual consistency worked, users could make more informed decisions about their app requirements and DynamoDB. However, Amazon probably wouldn’t be able to charge so much for strong consistency. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’d love if someone (read: from Amazon) would tell me why I’m wrong.&lt;/p&gt;

&lt;h2 id=&quot;the-cost-of-consistency-for-amazon-is-low-if-not-zero-to-you-2x&quot;&gt;The cost of consistency for Amazon is low, if not zero. To you? 2x.&lt;/h2&gt;

&lt;p&gt;Regardless of the replication model Amazon chose, I don’t think it’s possible that it’s costing them 2x to perform consistent reads compared to eventually consistent reads:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamo-style replication.&lt;/strong&gt; If Amazon decided to stick to their original &lt;a href=&quot;http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&quot;&gt;Dynamo&lt;/a&gt; architecture, then all reads and writes are sent to all replicas in the system. When you perform an eventually consistent read, it means that DynamoDB gives you an answer when the first replica (or first few replicas) reply. This cuts down on latency: after all, waiting for the first of three replicas to reply is faster than waiting for a single replica to respond! However, all of the replicas will eventually respond to your request, whether you wait for them or not.&lt;/p&gt;

&lt;p&gt;This means that eventually consistent reads in DynamoDB would use the same number of messages, amount of bandwidth, and processing power as strongly consistent reads.  It shouldn’t cost &lt;em&gt;Amazon&lt;/em&gt; any more to perform consistent reads, but it costs &lt;em&gt;you&lt;/em&gt; double.&lt;/p&gt;
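A sketch of this "send to all N, block for the first R" read path (hypothetical helpers, not DynamoDB's or Dynamo's API). Note that the same N requests go out either way; the consistency level only changes how many responses we wait for:

```python
import concurrent.futures

def quorum_read(replica_fns, r):
    """Send a read to every replica; return once the first r respond.
    replica_fns is a list of N zero-argument callables, one per replica."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn) for fn in replica_fns]  # N requests sent
        responses = []
        for f in concurrent.futures.as_completed(futures):
            responses.append(f.result())
            if len(responses) >= r:   # block for only R of the N replies
                break
        return responses
```

With `r=1` (eventually consistent) and `r=2` of three (strongly consistent under R+W>N with W=2), the message count is identical; only the wait differs.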

&lt;p&gt;If you ran your own Dynamo-style NoSQL database, like &lt;a href=&quot;http://basho.com/&quot;&gt;Riak&lt;/a&gt;, &lt;a href=&quot;http://cassandra.apache.org/&quot;&gt;Cassandra&lt;/a&gt;, or &lt;a href=&quot;http://project-voldemort.com/&quot;&gt;Voldemort&lt;/a&gt;, you wouldn’t have this artificial cost increase when choosing between eventual and strongly consistent reads.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Possible explanations:&lt;/em&gt; Maybe Amazon didn’t adopt the Dynamo design’s “send-to-all” semantics. This would save bandwidth but cause a &lt;em&gt;significant&lt;/em&gt; latency hit. However, Amazon might save in terms of messages and I/O without compromising latency if they chose not to send read requests to remote sites (e.g., across availability zones) for eventually consistent reads. Another possibility is that, because eventual consistency is faster, transient effects like queuing are reduced, which helps with their back-end provisioning. Regardless, none of these possibilities necessitate a 2x price increase. Note that the Dynamo model covers most variants of quorum replication plus anti-entropy. &lt;strong&gt;edit: As I mentioned, it’s possible that weak consistency saves an extra I/O per read. However, I’m still unconvinced that this leads to a 2x TCO increase.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cost to Amazon: probably zero&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Master-slave replication.&lt;/strong&gt; If Amazon has chosen a master-slave replication model in DynamoDB, eventually consistent reads could go to a slave, reducing load on the master. However, this would mean Amazon is performing some kind of &lt;em&gt;monetary-cost-based&lt;/em&gt; load-balancing, which seems strange and hard to enforce. Even if this is the case, is doubling the price really the right setting for proper load-balancing at all times and for all data? Is the 2x cost bump really necessary? I’m not convinced.&lt;/p&gt;

&lt;p&gt;If you ran your own NoSQL store that used master-slave replication, like &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt; or &lt;a href=&quot;http://www.mongodb.org/&quot;&gt;MongoDB&lt;/a&gt;, you wouldn’t be faced with this 2x cost increase for strong consistency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cost to Amazon: increased master load. 2x load? Probably not, given proper load balancing.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;read-vs-write-cost&quot;&gt;Read vs. Write Cost&lt;/h2&gt;

&lt;p&gt;Amazon decided to place this extra charge on the read path. Instead, they could have offered a “consistent write” option, where all subsequent reads would return the data you wrote. This would slow down consistent writes (and speed up reads). I’d wager that the vast majority of DynamoDB operations are reads, so this “consistent write” option would decrease Amazon’s revenue compared to the current “consistent read” option. So, compared to a consistent write option, you’re currently getting charged more, and the majority of your DynamoDB operations are slower.&lt;/p&gt;

&lt;h2 id=&quot;fud-helps-sell-warranties&quot;&gt;FUD Helps Sell Warranties&lt;/h2&gt;

&lt;p&gt;Amazon is vague about eventual consistency in DynamoDB. Amazon &lt;a href=&quot;http://aws.amazon.com/dynamodb/faqs/#What_is_the_consistency_model_of_Amazon_DynamoDB&quot;&gt;says that&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Consistency across all copies of data is usually reached within a second. Repeating a[n eventually consistent] read after a short time should return the updated data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What does “usually” mean? What’s “a short time”? These are imprecise metrics, and they certainly aren’t guarantees. This may be intentional.&lt;/p&gt;

&lt;p&gt;Best Buy tries to sell you an in-house extended warranty when you buy a new gadget. Most of the time, you don’t buy the warranty because you make a judgment that the chance of failure isn’t worth the price of the warranty. With DynamoDB, you have no idea what the likelihood of inconsistency is, except that “a second” is “usually” long enough. What if you don’t want to wait?  What about the best case? Or the worst? Amazon &lt;em&gt;could&lt;/em&gt; release a distribution for these delays so you could make an informed decision, but they don’t. Why not?&lt;/p&gt;

&lt;p&gt;It’s not for technical reasons. Amazon has to know this distribution. I’ve spent the last six months thinking about &lt;a href=&quot;http://bailis.org/projects/pbs/#demo&quot;&gt;how to provide SLAs for consistency&lt;/a&gt;. Our new &lt;a href=&quot;http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-4.pdf&quot;&gt;techniques&lt;/a&gt; show that you can make predictions with arbitrary precision simply by measuring message delays. Even without our work, I’d bet all my chips on the fact that Amazon has tested the hell out of their service and know exactly how &lt;em&gt;eventual&lt;/em&gt; eventual consistency is in DynamoDB. After all, if the window of inconsistency was prohibitively long, it’s unlikely that Amazon would offer eventual consistency as an option in the first place.&lt;/p&gt;

&lt;h2 id=&quot;with-more-information-most-customers-probably-wouldnt-pay-2x&quot;&gt;With more information, most customers probably wouldn’t pay 2x&lt;/h2&gt;

&lt;p&gt;So why doesn’t Amazon release this data to customers? There are several business reasons that disincentivize them from doing so, but the basic idea is that &lt;em&gt;it’s not clear that strong consistency delivers a 2x value to the customer in most cases&lt;/em&gt;. However, without the data, customers can’t make this call for themselves without a lot of benchmarking. And profiling doesn’t buy you any guarantees like an SLA.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;p&gt;If users knew what “usually” really meant, they could make intelligent decisions about pricing. How many people would pay double for strong consistency if they’d only have to wait a few tens of milliseconds on &lt;em&gt;average&lt;/em&gt; for a consistent read or, conversely, knew that data would be at most a few tens of milliseconds stale? What if 99.9% of reads were fresh within 100ms? 200ms? 500ms? 1000ms? Without the distribution, it’s hard to judge whether eventual consistency works for a given app.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Related: if your probability of consistent reads is sufficiently high (say, 99.999%) after normal client-side round-trip times (which are often long), do you really care about strong consistency? If you did, you could also intentionally block your readers until they would read consistent data with a high enough probability.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you have “&lt;a href=&quot;http://www.allthingsdistributed.com/2008/12/eventually_consistent.html&quot;&gt;stickiness&lt;/a&gt;” and your requests are always sent to a given DynamoDB instance (which &lt;em&gt;you&lt;/em&gt; can’t enforce but is a likely design choice!), then you may &lt;em&gt;never&lt;/em&gt; read inconsistent data, even under eventual consistency. Even if this happens only some of the time, you’ll see less stale data. This is due to what are known as &lt;a href=&quot;http://www.cs.utexas.edu/~dahlin/Classes/GradOS/papers/SessionGuaranteesPDIS.pdf&quot;&gt;session guarantees&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;It’s not specified how much slower a consistent read is compared to an eventually consistent read, so it’s possible that the wait time until (a high probability of) consistency plus the latency of an eventually consistent read is actually &lt;em&gt;lower&lt;/em&gt; than the latency of a strongly consistent read.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you store timestamps with your data, you can guarantee you don’t read old data (another session guarantee) and, if you’re dissatisfied, you can repeat your eventually consistent read. You can calculate the expected cost of reading consistent data across multiple eventually consistent reads—if you have this data.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
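The expected cost in the last bullet is easy to work out (all prices and probabilities here are hypothetical):

```python
# Back-of-the-envelope for retrying eventually consistent reads: if a read
# is fresh with probability p, retries form a geometric distribution, so
# the expected number of reads until a fresh one is 1/p.
def expected_cost(p_fresh, eventual_price, consistent_price):
    """Return (expected spend on retried eventual reads, consistent price)."""
    expected_reads = 1.0 / p_fresh
    return expected_reads * eventual_price, consistent_price

# Example: 90% of eventually consistent reads are already fresh, and a
# consistent read costs 2x. Retrying is cheaper in expectation:
eventual, consistent = expected_cost(0.9, 1.0, 2.0)
assert consistent > eventual   # about 1.11 units vs. 2 units
```

The break-even point is p = eventual_price / consistent_price, which is exactly the kind of call you cannot make without knowing the distribution.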

&lt;p&gt;I don’t think many people would pay 2x for their reads if they were provided with this data or some consistency SLA.  Maybe some total &lt;em&gt;rockstars&lt;/em&gt; are okay with this vagueness, but it sure seems hard to design a reliable service without any consistency guarantees from your backend. By only giving users an extremely conservative upper bound on the &lt;em&gt;eventuality&lt;/em&gt; of reads (say, if your clients switch between data centers between reads and writes &lt;em&gt;and&lt;/em&gt; DynamoDB experiences extremely high message delays), Amazon may be scaring the average (prudent) user into paying more than they probably should. &lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Strong consistency in DynamoDB probably doesn’t cost Amazon much (if anything) but it costs you twice as much as eventual consistency. It’s highly likely that Amazon has quantitative metrics for eventual consistency in DynamoDB, but keeping those numbers private makes you more likely to pay extra for guaranteed consistency. Without those numbers, it’s much harder for you to reason about your application’s behavior under average-case eventual consistency.&lt;/p&gt;

&lt;p&gt;Sometimes you absolutely need strong consistency. Pay in those cases. However, especially for web data, eventual is often good enough. The current problem with DynamoDB is that, because customers don’t have access to quantitative metrics about their window of inconsistency, it’s easy for Amazon to set prices irrationally high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer: I (perhaps obviously) have no privileged knowledge of Amazon’s AWS architecture (or any other parts of Amazon, for that matter). &lt;a href=&quot;http://bailis.org/research.html&quot;&gt;I just happen to spend a lot of time thinking about and working on cool distributed systems problems&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Huchra's Seven Characteristics of a Successful Scientist</title>
   <link href="http://bailis.org/blog//huchra-s-seven-characteristics-of-a-successful-scientist/"/>
   <updated>2012-02-10T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//huchra-s-seven-characteristics-of-a-successful-scientist</id>
   <content type="html">In my freshman fall at Harvard, I was fortunate to enroll in a seminar on &lt;a href=&quot;http://en.wikipedia.org/wiki/Cosmology&quot;&gt;cosmology&lt;/a&gt; with &lt;a href=&quot;http://en.wikipedia.org/wiki/John_Huchra&quot;&gt;John Huchra&lt;/a&gt;. In case you&#39;re not an astrophysics buff, among other things, Huchra was one of the first astronomers to experimentally observe&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Great_Wall_(astronomy)&quot;&gt;large-scale structures in the universe&lt;/a&gt;&amp;nbsp;using wide-ranging galaxy surveys and, until shortly before his &lt;a href=&quot;http://www.nytimes.com/2010/10/14/us/14huchra.html&quot;&gt;untimely death&lt;/a&gt;, he served as the president of the &lt;a href=&quot;http://en.wikipedia.org/wiki/American_Astronomical_Society&quot;&gt;American Astronomical Society&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;While my formal study of astronomy ended after Huchra&#39;s seminar (although I do lug my telescope out when visiting my parents in Nebraska), Huchra left a strong impression on me. There are certain individuals who exemplify my ideal of a great scientist, who are not only successful but demonstrate a sense of real&amp;nbsp;&lt;i&gt;purpose&lt;/i&gt; in their research and enthusiasm for their subject. Huchra&#39;s weekly seminar and our often extended discussions afterward were probably my first extended contact with a real-life, practicing professional scientist. 
Huchra&#39;s excitement about his field, his ability to communicate heady concepts such as galaxy formation and cosmic microwave background radiation to a group of university freshmen, and his passion for science heavily influenced my own career path, even if I didn&#39;t realize it at the time.&lt;br /&gt;&lt;br /&gt;Some time ago, I found a highly accessible (draft) essay Huchra wrote titled &lt;a href=&quot;https://www.cfa.harvard.edu/~dfabricant/huchra/mapmaker.pdf&quot;&gt;&quot;Mapmaker, Mapmaker Make Me a Map&quot;&lt;/a&gt;, describing his career path and research. Huchra&#39;s writing captures his personality well and is worth a read in itself, but I like the essay also for his formulation of the characteristics of a &quot;successful scientist&quot;:&lt;br /&gt;&lt;blockquote class=&quot;tr_bq&quot;&gt;…an individual&#39;s success in the [science] game could be predicted by their characteristics in a seven vector space. Each vector measured a critical personal characteristic or set of characteristics such as intelligence, taste and luck, and the ability to tell one&#39;s story (public relations)…&lt;/blockquote&gt;&lt;blockquote class=&quot;tr_bq&quot;&gt;Looking back on this I&#39;ve come to realize that being nearly a unit vector in any one of the important characteristics pretty much guarantee&#39;s[sic] you a tenured job, two are good for membership in the National Academy, and three put you in contention for the Nobel Prize.&lt;/blockquote&gt;I haven&#39;t seen Huchra&#39;s seven characteristics elsewhere, but I find them particularly interesting:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Raw Intelligence&lt;/li&gt;&lt;li&gt;Knowledge&lt;/li&gt;&lt;li&gt;Public Relations&lt;/li&gt;&lt;li&gt;Creativity&lt;/li&gt;&lt;li&gt;Taste&lt;/li&gt;&lt;li&gt;Effectiveness&lt;/li&gt;&lt;li&gt;Competence&lt;/li&gt;&lt;/ul&gt;Huchra also adds that &quot;many people would want to add &#39;luck&#39; to the list, but our learned conclusion was that luck is a product of at least 
three of the above vectors and not an attribute in and of itself.&quot;&amp;nbsp;&amp;nbsp;Moreover, &quot;some vectors are worth more than others, [sic] for example [sic] taste and creativity are probably more important than knowledge.&quot;&lt;br /&gt;&lt;br /&gt;Now, as a greenhorn in Computer Science, I&#39;ll not speculate like Huchra does in his essay as to who in my own field constitutes a &quot;unit vector&quot; for each characteristic, nor will I attempt to publicly analyze my own strengths and weaknesses. However, I think this is a thought-provoking taxonomy that deserves reflection. After all, being the smartest person in the room can get you pretty far, but, as with most life pursuits, (perhaps thankfully for most of us) there&#39;s often a lot more to success than raw intelligence by itself.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;A version of Huchra&#39;s essay also appears as a chapter of&amp;nbsp;&lt;u&gt;Our Universe: The Thrill of Extragalactic Exploration as Told by Leading Experts&lt;/u&gt;, edited by Alan Stern (Cambridge University Press, 2001).&lt;/i&gt;
</content>
 </entry>
 
 <entry>
   <title>CCC Post: Why am I in graduate school?</title>
   <link href="http://bailis.org/blog//ccc-post-why-am-i-in-graduate-school/"/>
   <updated>2011-12-19T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//ccc-post-why-am-i-in-graduate-school</id>
   <content type="html">&lt;div&gt;&lt;i&gt;For the 2011-12 school year, I&#39;m serving as a student blogger for the Computing Community Consortium&#39;s &quot;A Day in the Life of a Computer Science Graduate Student&quot; blog. &amp;nbsp;In the interest of archiving my posts, I&#39;m cross-posting them here.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://cra-ccc.org/lifeofacsgs/why-am-i-in-graduate-school&quot;&gt;Link to original post&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Why am I in graduate school? &amp;nbsp;Different students have different answers to this question--passion, research, lifestyle, startup incubation, you name it--and others may not have an answer. &amp;nbsp;In the final lecture of CS262, our entry-level graduate systems seminar, our professor, Eric Brewer, emphasized the importance of having a purpose in one&#39;s graduate education. &amp;nbsp;I agree, and I also think it&#39;s important for someone considering graduate studies to think about why they&#39;d like to spend five or more years on research.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Graduate school may be the logical &quot;next step&quot; for you, but, if you&#39;re considering a Ph.D. in Computer Science, you likely have other options. &amp;nbsp;Someone aptly described the financial opportunity cost of five years of graduate school to me as roughly costing a house. &amp;nbsp;From a strictly financial perspective, a grad student makes $30,000 or less per year (not including summer internships and external employment), while the going rate for engineers in the Bay Area is easily $80-100,000+ per year. &amp;nbsp;Over five years, that&#39;s at least $250,000, and that&#39;s not including salary increases, benefits, stock appreciation, and bonuses. &amp;nbsp;At the least, that number alone should make you think twice about grad school. 
&amp;nbsp;Of course, there are salary benefits from getting a Ph.D., but it&#39;s doubtful whether they&#39;re financially worthwhile (see the link to Matt&#39;s blog at the end of this post).&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I thought I&#39;d share three of my personal reasons here. &amp;nbsp;Maybe after five years I&#39;ll be jaded, but I hope to retain my optimism and will do my best to make it last.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;I get to work on (almost) whatever I want.&lt;/b&gt; As a grad student, you can work on whatever you can convince your advisor to let you work on. If a project has research potential and intellectual merit, you can probably convince someone to let you work on it. If you&#39;re able to choose your advisor, your options are as broad as your department. Moreover, while your advisor typically funds you, external fellowships and funding give you additional freedom. &amp;nbsp;This freedom is a double-edged sword: if you don&#39;t like what you&#39;re doing, it means you (likely) chose poorly! &amp;nbsp;I consider myself lucky because there are a ton of exciting projects at Berkeley and because I received external funding. &amp;nbsp;In fact, I changed my initial plans to defer enrollment for a year (and, in doing so, turned down enticing compensation from an industry position) to keep my external funding and the freedom it provides me.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;The people are great.&lt;/b&gt; Being around passionate people is invigorating. &amp;nbsp;Learning from and debating with other people in my field who are thinking about problems I also find stimulating is irreplaceable. &amp;nbsp;I think you can find this kind of concentrated community in many areas of society, but, at its core, I see science as a collective effort towards greater human knowledge that (at its best) lends itself to interaction and collaboration. 
&amp;nbsp;In CS in particular, researchers have chosen to sacrifice monetary gain (especially at the top of the field, often working as hard as or harder than they would in industry) to be in academia. &amp;nbsp;Maybe this means we&#39;re all a little crazy, but my colleagues give me a lot of energy.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;I get to (try to) change the world.&lt;/b&gt; This is quite optimistic, but, in the very &lt;a href=&quot;http://en.wikipedia.org/wiki/Turing_Award#Recipients&quot;&gt;long&lt;/a&gt; &lt;a href=&quot;http://en.wikipedia.org/wiki/List_of_Nobel_laureates#Laureates&quot;&gt;tail&lt;/a&gt; of the distribution of researchers, you&#39;ll find people who changed the way we view and interact with the world. I don&#39;t anticipate being one of these select few, but I want to try my best to make a difference. Academic CS is in an odd position where there&#39;s an industry making money on the field as well, often coming up with better ideas than academia does. &amp;nbsp;However, our goals are different: in academia, we don&#39;t need to make a profit, meet a bottom line, or ship production code. &amp;nbsp;I think this affords us the opportunity for longer-term research and impact. As academics, we don&#39;t always get this right, but I think the struggle is worthwhile. &amp;nbsp;I do believe that one can make meaningful contributions and have real-world impact in industry--it&#39;s probably easier to do so there as well! &amp;nbsp;In fact, having good entrepreneurship skills is quite similar to having good research taste and execution ability. 
&amp;nbsp;However, even if I don&#39;t change the world directly, I have faith that my (would-be) future students will!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I&#39;m sure these reasons will evolve as I continue in graduate school, but, to me, having reasons and understanding why I&#39;m here is fundamental to getting what I want out of the experience.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Matt Welsh has a particularly good overview of some additional reasons why or why not to go to graduate school on his &lt;a href=&quot;http://matt-welsh.blogspot.com/2010/09/so-you-want-to-go-to-grad-school.html&quot;&gt;blog&lt;/a&gt;.&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Tasting My Proverbial Academic Foot</title>
   <link href="http://bailis.org/blog//tasting-my-proverbial-academic-foot/"/>
   <updated>2011-12-01T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//tasting-my-proverbial-academic-foot</id>
   <content type="html">&lt;b&gt;tl;dr: Researchers I really respect read a rambling, ranting, accidentally-public review of their paper I wrote for class and then blogged about it.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Apparently private-but-really-public memos are&amp;nbsp;&lt;a href=&quot;http://eng.yammer.com/blog/2011/11/30/scala-at-yammer.html&quot;&gt;all the rage&lt;/a&gt;&amp;nbsp;this week, so I shouldn&#39;t have been surprised when&amp;nbsp;I received an interesting&amp;nbsp;&lt;a href=&quot;https://twitter.com/#!/michaelfreedman/status/142128313454952448&quot;&gt;tweet&lt;/a&gt;&amp;nbsp;from&amp;nbsp;&lt;a href=&quot;http://www.cs.princeton.edu/~mfreed/&quot;&gt;Mike Freedman&lt;/a&gt;&amp;nbsp;last night. &amp;nbsp;Mike informed me that&amp;nbsp;&lt;a href=&quot;http://www.cs.cmu.edu/~dga/&quot;&gt;Dave Andersen&lt;/a&gt;&amp;nbsp;had posted a &lt;a href=&quot;http://dga.livejournal.com/45496.html&quot;&gt;reply&lt;/a&gt;&amp;nbsp;to a paper review I&#39;d recently posted on a private blog as homework for &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/&quot;&gt;Ion Stoica&lt;/a&gt;&#39;s &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/classes/cs294/11/&quot;&gt;cloud computing seminar&lt;/a&gt;&amp;nbsp;I&#39;m enrolled in at Berkeley. &amp;nbsp;This was surprising in part because I wasn&#39;t aware of any public links to my personal review blog (having disabled Google indexing and public references) except for the link apparently buried on the course site. Apparently, my roommate mentioned our class to Mike, his old advisor, and I&#39;d recently de-anonymized myself when tweaking my Blogger profile for this blog. However, the tweet was even more surprising because Dave and Mike were co-authors on the&amp;nbsp;&lt;a href=&quot;http://www-cgi.cs.cmu.edu/~dga/papers/cops-sosp2011.pdf&quot;&gt;paper&lt;/a&gt;&amp;nbsp;that Dave blogged about, which I had reviewed unnecessarily harshly. 
&amp;nbsp;My review of this paper (describing a system called COPS that provides a variant of causal consistency called causal+ consistency) was definitely not intended for public consumption--and certainly not by the authors, who I genuinely greatly respect.&lt;br /&gt;&lt;br /&gt;To comment directly on Dave&#39;s post, the substantive critical portions of my review stemmed from what I incorrectly perceived as a claimed major contribution of the paper: the development of causal+ consistency, which has existed in various flavors in related work but wasn&#39;t previously formally defined. &amp;nbsp;As Dave clarifies in his post, defining causal+ is a&amp;nbsp;contribution, but the paper&#39;s main contribution is the design of a causal+ system, including technical details like garbage collection and making it run fast. &amp;nbsp;This is an interesting distinction that I&#39;m slowly learning about between the systems community, with which I am more familiar, and the database community. &amp;nbsp;To paraphrase&amp;nbsp;&lt;a href=&quot;http://www.cs.berkeley.edu/~brewer/&quot;&gt;Eric Brewer&lt;/a&gt;&amp;nbsp;in&amp;nbsp;&lt;a href=&quot;http://www.cs.berkeley.edu/~brewer/cs262/&quot;&gt;CS262 lecture&lt;/a&gt;, traditionally, the database community focuses on providing useful invariants, then building systems that obey those invariants. &amp;nbsp;On the other hand, the OS/systems community builds from the ground up, providing useful abstraction and modularity along the way. &amp;nbsp;Brewer and&amp;nbsp;&lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot;&gt;Joe Hellerstein&lt;/a&gt;&amp;nbsp;make a rough&amp;nbsp;&lt;a href=&quot;http://www.cs.berkeley.edu/~brewer/cs262/systemr.html&quot;&gt;categorization&lt;/a&gt;&amp;nbsp;between OS/systems research (&quot;elegance of system&quot;) and database research (&quot;elegance of semantics&quot;) styles along these lines. 
&amp;nbsp;COPS is a huge win in terms of system elegance compared to prior work--it screams, it&#39;s scalable, and it&#39;s practical--even if the semantics which derailed my brief review have been previously explored (if not formally stated). &amp;nbsp;As I determine a field for my dissertation, it&#39;s important that I continue aligning and calibrating my rapidly-evolving tastes with the values of the research community I will join.&lt;br /&gt;&lt;br /&gt;Dave quotes me mercifully in his post, but I had some unnecessarily strong words in my &quot;(surprisingly angry)&quot; review that I wouldn&#39;t stand by publicly or even privately. &amp;nbsp;In my (would-be) private class reviews and discussions--often written quickly, in the eleventh hour--I&#39;m typically pretty harsh and at times deliberately inflammatory. This helps me get feedback from my advisors and peers about the uncertainties and opinions I have about research in addition to encouraging debate--in private. I&#39;m an admirer of the philosophy that one should put a stake firmly in the ground and see what&#39;s there, but be open and ready to move the stake just as easily. This is one of the great parts of academia: ideas are currency, yet judgments are rarely irreversible, so making judgments about ideas and rethinking them is both fun &lt;i&gt;and&lt;/i&gt;&amp;nbsp;encouraged! &amp;nbsp;In the context of a write-once-read-never paper review between me and my instructor, I lean towards making over-the-top statements and often voice my doubts about our field and what to expect from my own research. Additionally, in full disclosure, the fact that&amp;nbsp;I&#39;m working on similar data consistency issues to their paper probably didn&#39;t help this particular review (although, to be fair, we&#39;re looking at how far we can push existing models--but more on that soon). 
&amp;nbsp;I&#39;m not the most socially adept person I know, but I definitely wouldn&#39;t have knowingly publicly published this review. &amp;nbsp;You can imagine my horror&amp;nbsp;when I learned the COPS team had read it.&lt;br /&gt;&lt;br /&gt;Public content on the internet is serious business. &amp;nbsp;There are too many examples of public content being posted that really shouldn&#39;t be posted (even if it&#39;s a conscious decision: in CS,&amp;nbsp;&lt;a href=&quot;http://infoweekly.blogspot.com/2009/09/loser-awards.html&quot;&gt;Mihai Patrascu&#39;s job offer debacle&lt;/a&gt; comes shudderingly to mind). &amp;nbsp;In the internet era, the &quot;newspaper rule&quot; of not doing anything you wouldn&#39;t want on the front page of the New York Times applies to all content you post; a hyperlink can spread virally. &amp;nbsp;Moreover, the internet (more or less) lasts forever, so there are no easy &quot;undo&quot; or &quot;delete&quot; operations (unless we eventually have some combination of ubiquitous&amp;nbsp;&lt;a href=&quot;http://www.cs.cornell.edu/~djwill/pubs/nexus-sosp.pdf&quot;&gt;attestations/TPMs&lt;/a&gt;, &lt;a href=&quot;http://people.csail.mit.edu/nickolai/papers/zeldovich-dstar.pdf&quot;&gt;DIFC&lt;/a&gt;, and &lt;a href=&quot;http://www.stanford.edu/class/cs240/readings/89-leases.pdf&quot;&gt;leases&lt;/a&gt;).&amp;nbsp;&amp;nbsp;I personally make a concerted effort to control what content I release publicly, and this error is not one I&#39;ll forget. I&#39;m well aware that&amp;nbsp;&lt;a href=&quot;http://www.schneier.com/essay-056.html&quot;&gt;security through obscurity isn&#39;t security&lt;/a&gt;, but this experience truly brings the lesson home.&lt;br /&gt;&lt;br /&gt;Ironically, this semester&#39;s classes ended yesterday, and I was planning to shut down the blog shortly thereafter. 
&amp;nbsp;After this incident,&amp;nbsp;I considered keeping the post publicly readable, but Dave has picked out the salient points and re-posted what little is worth reposting (in addition to answering my questions!). &amp;nbsp;I&#39;m a big proponent of publicly baring weaknesses in one&#39;s own research and have recently thought hard about posting the blind paper reviews that my work receives and independently providing lists of deficiencies for each publication I write. &amp;nbsp;However, because this review wasn&#39;t intended to be public (and, of course, for the sake of my academic career), I&#39;ve decided to take it down. (Dave has graciously agreed to remove the copy of my post that he uploaded after I removed my blog.)&lt;br /&gt;&lt;br /&gt;After a panicked late-night phone call with my advisor, a powwow with two wonderful, anonymous late-night inhabitants of Soda Hall, and some sleep, I still feel embarrassed, but I think I&#39;ve learned a few good lessons here. &amp;nbsp;I trust that Dave, Mike, and the COPS team won&#39;t hold this against me in five years if I make a trip around the academic job circuit. &amp;nbsp;And, secretly, I look forward to the day when naïve first-year Ph.D. students write ridiculous commentaries on my work, especially if they&#39;re as bombastic as I was in my review.
</content>
 </entry>
 
 <entry>
   <title>NoseSQL and SenseDB: New Paradigms for Crowdsourced Databases</title>
   <link href="http://bailis.org/blog//nosesql-and-sensedb-new-paradigms-for-crowdsourced-databases/"/>
   <updated>2011-11-11T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//nosesql-and-sensedb-new-paradigms-for-crowdsourced-databases</id>
   <content type="html">Introductory note: Given the high &lt;a href=&quot;http://business.financialpost.com/2011/11/09/the-last-frontier-for-computers-scent/&quot;&gt;risk&lt;/a&gt; of being &lt;a href=&quot;https://twitter.com/#!/marcua/status/134837175580758017&quot;&gt;scooped&lt;/a&gt;, I&#39;ve decided to unveil my vision for the future of human computer interaction and crowdsourced databases. &amp;nbsp;In light of the impending explosion of research in this area, I have deviated from my plans to submit to a proper database conference and have instead chosen to publicly lay claim to these ideas in this post. &amp;nbsp;Because you&#39;ll wonder, I am serious about these ideas and believe there are interesting problems in this space. &amp;nbsp;I&#39;d even entertain proposals for collaboration.&lt;br /&gt;&lt;br /&gt;(Edit: To be clear, I&#39;m kidding about getting scooped. &amp;nbsp;This isn&#39;t my day-to-day&amp;nbsp;&lt;a href=&quot;http://cs.berkeley.edu/~pbailis/research.html&quot;&gt;research&lt;/a&gt;, but I do like the idea.)&lt;br /&gt;&lt;br /&gt;Current crowdsourced databases are incapable of answering three major classes of queries. &amp;nbsp;Database systems such as&amp;nbsp;&lt;a href=&quot;http://www.eecs.berkeley.edu/~rxin/papers/crowddb_sigmod2011.pdf&quot;&gt;CrowdDB&lt;/a&gt; and &lt;a href=&quot;http://db.csail.mit.edu/pubs/mturk.pdf&quot;&gt;Qurk&lt;/a&gt;&amp;nbsp;leverage human-powered computation to answer queries that computers cannot typically answer, such as performing complex image classification or processing uncertain or underspecified queries. &amp;nbsp;These systems are generic but have thus far focused on processing known&amp;nbsp;&lt;i&gt;information&lt;/i&gt;&amp;nbsp;about entities in the outside world. &amp;nbsp;However, to the best of my knowledge, crowdsourced databases have overlooked a large part of the human experience: our &lt;i&gt;senses&lt;/i&gt;. 
&amp;nbsp;In the remainder of this post, I will outline crowdsourcing extensions that represent an improvement over existing databases: the ability to query over scents, tastes, and tactile sensations.&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.nist.gov/public_affairs/techbeat/tb2008_1028.htm#nose&quot;&gt;Olfaction&lt;/a&gt;, &lt;a href=&quot;http://www.digikey.com/us/en/techzone/sensors/resources/articles/The-Five-Senses-of-SensorsTaste.html&quot;&gt;taste&lt;/a&gt;, and &lt;a href=&quot;http://persci.mit.edu/pub_pdfs/retrographic_sensor.pdf&quot;&gt;touch/texture &lt;/a&gt;sensors are immature and are relatively specialized. Computers cannot reliably answer a wide range of pressing questions about raw sensory input and our interactions with the physical world. &amp;nbsp;Electronic sensors can detect specialized inputs, such as chemical presence (e.g., &lt;a href=&quot;http://www.aps.org/about/governance/task-force/counter-terrorism/coffey.cfm&quot;&gt;explosives&lt;/a&gt;) and some flavors (e.g., selected features of&amp;nbsp;&lt;a href=&quot;http://www.sciencedirect.com/science/article/pii/S0925400500003786&quot;&gt;wine&lt;/a&gt;) but, to the best of my knowledge, are not generally applicable or widely available.&lt;br /&gt;&lt;br /&gt;Online databases can answer questions about particular sensing domains such as &lt;a href=&quot;http://beeradvocate.com/&quot;&gt;beer&lt;/a&gt;&amp;nbsp;and &lt;a href=&quot;http://www.dishola.com/&quot;&gt;food&lt;/a&gt;&amp;nbsp;tasting and &lt;a href=&quot;http://en.wikipedia.org/wiki/Pandora_Radio&quot;&gt;musical preferences&lt;/a&gt;. &amp;nbsp;These databases contain knowledge of high-level, narrowly-constrained semantic interpretations of the raw sensory data. 
&amp;nbsp;A beer rating is a condensation of multiple factors, many of which are reflections on the beer&#39;s taste, nose, and mouthfeel--but the raw taste, scent, and mouthfeel data is not available.&lt;br /&gt;&lt;br /&gt;Operating on raw data allows greater query expressivity and insight than operating on&amp;nbsp;&lt;a href=&quot;http://digitalmedia.oreilly.com/2006/08/17/inside-pandora-web-radio.html&quot;&gt;a set of features describing the data&lt;/a&gt;. We can view preferences regarding senses as functions over the set of raw stimuli in the world. &amp;nbsp;Without sensory data, it is difficult to infer connections between sensations, such as why we like the taste of&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Peanut_butter,_banana_and_bacon_sandwich&quot;&gt;peanut butter, banana, and bacon in sandwiches&lt;/a&gt;, the smell of&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Good_%26_Plenty&quot;&gt;cucumber and Good &amp;amp; Plentys&lt;/a&gt;, and the seemingly culturally universal combination of heat and steam in a&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Sauna&quot;&gt;sweat lodge&lt;/a&gt;. &amp;nbsp;We cannot easily make connections between even somewhat similar domains. &amp;nbsp;For example, answering questions about beer and recipe pairings requires either additional cross-domain knowledge (a database of explicit beer and recipe pairings)&amp;nbsp;&lt;i&gt;or&lt;/i&gt;&amp;nbsp;lower-level sensory data (what flavors are in each beer?) paired with filters on this data. &amp;nbsp;These solutions appear similar; however, the latter scales to more domains without requiring additional external expert input.&lt;br /&gt;&lt;br /&gt;While computers are deficient at answering sense-based queries, thankfully (and by definition), most humans come complete with detectors for all five of our senses. 
&amp;nbsp;Employing humans to power&amp;nbsp;&lt;i&gt;general-purpose&lt;/i&gt; sensory databases is a natural extension of crowdsourcing technology. &amp;nbsp;Compared to a specialized mechanical solution such as a chemical-specific detector, a human crowd is more general and likely less expensive than highly-specialized equipment when answering a wide range of queries. &amp;nbsp;Similarly, humans can be used for both lower-level sensory analysis and broader semantic-level comparisons than narrowly scoped online information aggregation sites. &amp;nbsp;Accordingly, I propose the development of a crowdsourced sense-oriented database,&amp;nbsp;&lt;b&gt;SenseDB&lt;/b&gt;. &amp;nbsp;This database no doubt needs a query language for user-defined functions, which will consist of embedded DSLs for scent, taste, and touch queries, or&amp;nbsp;&lt;b&gt;NoseSQL&lt;/b&gt;,&amp;nbsp;&lt;b&gt;FlavorSQL&lt;/b&gt;, and&amp;nbsp;&lt;b&gt;FeelSQL&lt;/b&gt;, respectively.&lt;br /&gt;&lt;br /&gt;Harnessing the power of human-powered sense-based query processing leads to several research questions:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Raw versus semantically-rich data.&lt;/b&gt;&amp;nbsp;To what extent does encoding raw sensory data aid in query processing? &amp;nbsp;Does querying a (logical) database of taste, scent, and touch details provide higher accuracy, speed, or throughput than simply presenting the question to a crowd from a high semantic perspective? &amp;nbsp;Can we better re-use raw data between queries? Do semantically rich queries impact the bias of the results?&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Encoding&lt;/b&gt;.&amp;nbsp;How do we encode the sensory details required to answer a query? 
Which aspects of the sensory experience are required to answer a query?&amp;nbsp;The degree of specificity in formulating a particular query limits the applicability of the results for future queries and analysis.&amp;nbsp;There are &lt;a href=&quot;http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_ics_browse.htm?ICS1=67&amp;amp;ICS2=240&quot;&gt;many published ISO standards&lt;/a&gt; governing sensory analysis (including which &lt;a href=&quot;http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?ics1=67&amp;amp;ics2=240&amp;amp;ics3=&amp;amp;csnumber=38031&quot;&gt;tasting glasses&lt;/a&gt; to use with olive oil), but applying these standards to a general-purpose crowdsourced query processing system remains an open problem.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;b&gt;Transmission.&amp;nbsp;&lt;/b&gt;Sights and sounds can be easily recorded and transmitted for processing, but we lack mechanisms for reliably communicating touch, taste, and smell stimuli. &amp;nbsp;One option is to use a crowd that is physically co-located with the set of objects to be queried, but this does not scale in the size of the set of objects or in the number of queries.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;b&gt;Non-human processing.&lt;/b&gt;&amp;nbsp;Are humans the most efficient computation engine for sensory queries? &amp;nbsp;Can we humanely use canines or other macrobiotic organisms to process these queries instead? &amp;nbsp;How does the throughput of a non-human compute engine compare to a human compute engine? &amp;nbsp;What about queries per second, queries per dollar, or total cost of ownership? 
Both &lt;a href=&quot;http://www.deanfriedman.com/zine-state/zine_stateoftheuniverse-02.html&quot;&gt;rats&lt;/a&gt; and &lt;a href=&quot;http://maic.jmu.edu/journal/7.3/focus/townsend2/townsend2.htm&quot;&gt;pigs&lt;/a&gt; have been successfully employed in demining scenarios; however, the generality of these mechanisms is unclear.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;These challenges are only a subset of the problems inherent in developing a sense-oriented query database. &amp;nbsp;However, given the apparent advantages of SenseDB, I believe the database community will rise to the occasion and take crowdsourcing to the next level, providing valuable insights into the human condition along the way.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;I would like to thank Joe Hellerstein, Mike Franklin, the BOOM team, and those explicitly not mentioned here for their feedback on these ideas.&lt;/i&gt;&lt;/div&gt;
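&lt;div&gt;&lt;br /&gt;&lt;i&gt;Addendum: for the curious, here is a purely illustrative sketch of what a NoseSQL query against SenseDB might look like. &amp;nbsp;The syntax, predicates, and schema below are imagined for the sake of example--none of this is implemented:&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;-- Hypothetical NoseSQL: find beers whose nose the crowd judges
-- to pair with a given recipe. SMELLS LIKE and PAIRS WITH are
-- imagined crowd-powered predicates; beers and recipes are an
-- imagined schema.
SELECT b.name
FROM beers b
WHERE CROWDSENSE(b.scent) SMELLS LIKE &#39;citrus&#39;
  AND CROWDSENSE(b.scent) PAIRS WITH
      (SELECT scent FROM recipes WHERE name = &#39;ceviche&#39;)
ORDER BY CROWD_CONFIDENCE DESC
LIMIT 10;&lt;/pre&gt;&lt;/div&gt;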
</content>
 </entry>
 
 
</feed>
