<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Peter Bailis: Highly Available, Seldom Consistent</title>
 <link href="http://bailis.org/blog/feed/" rel="self"/>
 <link href="http://bailis.org/blog"/>
 <updated>2016-05-01T10:02:22-07:00</updated>
 <id>http://bailis.org/blog/</id>
 <author>
   <name>Peter Bailis</name>
   <email>pbailis@cs.berkeley.edu</email>
 </author>

 
 <entry>
   <title>How To Make Fossils Productive Again</title>
   <link href="http://bailis.org/blog//how-to-make-fossils-productive-again/"/>
   <updated>2016-04-30T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//how-to-make-fossils-productive-again</id>
   <content type="html">&lt;p&gt;At NorCal Database Day 2016, I served on a panel titled &lt;a href=&quot;https://sites.google.com/site/norcaldbday2016/home/presentation-abstracts&quot;&gt;40+ Years of Database Research: Do We Have an Identity Crisis?&lt;/a&gt; What follows is a loose transcript of my talk, which I enjoyed writing and delivering and which I hope you enjoy reading.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;The title of my short talk today is “How Can We Make Fossils Productive Again?” and the central question of this panel is “Does the Database Community Have an Identity Crisis?”&lt;/p&gt;

&lt;p&gt;I say: “Sure! We’ve got a crisis. We’ve got an identity crisis. We’ve got a mid-life crisis. Our community is in all sorts of crises.”&lt;/p&gt;

&lt;p&gt;At the heart of our crises is the fact that this is literally the golden age for data. Things have never been better for data. Everyone wants a piece of data. Data is all over Silicon Valley, and Big Data, and Machine Learning, and all of these awesome tools, and yet much (or even maybe most) of the greatest, cutting-edge data-related research is not happening in the core database community. Much of the highest-impact data-related research appears in venues like OSDI and SOSP or NIPS and ICML. It’s like we’ve somehow lost control and are suddenly no longer the kings and queens of data. We’ve missed out on producing much of the seminal (and earliest) research in fields including Big Data, MapReduce and Hadoop, NoSQL, Cloud Computing, Graphs, Deep Neural Networks, and “Big ML.”&lt;/p&gt;

&lt;p&gt;(It’s not like these results are just coming out of the machine learning or statistics communities; they’re coming out of adjacent communities such as “systems.”)&lt;/p&gt;

&lt;p&gt;So, what’s up? Why &lt;em&gt;are&lt;/em&gt; we in crisis?&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Let’s consider a brief history of database research:&lt;/p&gt;

&lt;p&gt;Back in the 1970s, we were the cool kids! Codd came out with &lt;a href=&quot;https://scholar.google.com/scholar?cluster=1624408330930846885&quot;&gt;his relational model&lt;/a&gt;, and the earliest database researchers got super excited and said: “we’re going to run behind this crazy idea.” Back in the 1970s, we were the revolutionaries! Remember that the relational model wasn’t always &lt;em&gt;the&lt;/em&gt; way to do data management. &lt;a href=&quot;http://www.redbook.io/ch2-importantdbms.html&quot;&gt;System R&lt;/a&gt; was a batshit project where someone said “we’re going to take this &lt;em&gt;theory&lt;/em&gt; paper and throw 25+ people behind it and build an actual system,” and along the way at least two Turing awards and a number of groundbreaking concepts came out of it. Today, it’s easy to take this crazy bet for granted, but remember: we went all in on a really cutting-edge, guerrilla research agenda, and it paid off.&lt;/p&gt;

&lt;p&gt;Today, it’s 46 years later, and we have four Turing awards, many hundreds of billions of dollars of revenue, and tons of wonderful research to show for our big bet. However, despite these many successes, as this panel’s abstract suggests, we’re also at risk of ossification, at risk of becoming reactionary &lt;em&gt;fossils&lt;/em&gt;. The community narrative often runs along the lines of: “What is this ‘new’ stuff? What is this ‘NoSQL/Big Data/etc.’? This doesn’t look like a relational database or like conventional database research! This doesn’t fit my idea of a ‘database’!” In fact, many amazing ideas explored in database research over the last fifteen years were largely ignored because they were of the form: “here’s a great idea from [some interesting field like signal processing], but I have to &lt;em&gt;cram&lt;/em&gt; it inside of SQL!” Or, alternatively, we say: “Bah! We did this in the 1980s. Strong reject”; and then, just a few years later: “Oh wow, industry cares about this problem &lt;em&gt;and&lt;/em&gt; it turns out we overlooked a critical difference from the past. Let’s start publishing!”&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;So, as the panel abstract suggests, we may in fact be at risk of becoming reactionary fossils…&lt;/p&gt;

&lt;p&gt;…which brings us to the title of my talk: how do we make fossils productive again?&lt;/p&gt;

&lt;p&gt;The answer is simple: set them on fire and use them as fuel.&lt;/p&gt;

&lt;p&gt;We take the great ideas this community has developed and leverage them to push onward, upward, forward.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;How exactly are we going to use these fossils as fuel?&lt;/p&gt;

&lt;p&gt;I believe our role as a community is to discover and define the future standards for how data-intensive platforms and tools should look and operate.&lt;/p&gt;

&lt;p&gt;No other community has greater claim to data. We own the core idioms for reasoning about and dealing with data. We invented (or were instrumental in developing) critical concepts, including declarativity, query processing, materialized views, transaction processing, data storage, consistency, and scalability.&lt;/p&gt;

&lt;p&gt;As a result, we deserve and owe it to ourselves to bring these concepts to new domains, applications, and platforms. It’s time to reclaim our roots as systems revolutionaries, powered by the principles underlying our historical successes.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;I’m not going to tell you what to work on (and, in fact, I already spoke earlier today about what I think you should work on next: &lt;a href=&quot;http://arxiv.org/pdf/1603.00567.pdf&quot;&gt;analytic monitoring&lt;/a&gt;). Instead, as a young, irreverent professor (or, rather, professor-to-be-in-32-hours), I’ll offer four pieces of advice:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Kill the reference architecture and rethink our conception of “database.”&lt;/strong&gt; The article titled &lt;a href=&quot;https://scholar.google.com/scholar?cluster=11466590537214723805&quot;&gt;“Architecture of a Database System”&lt;/a&gt; should be considered harmful. If a system doesn’t have a buffer pool, it can still be a database, and, in fact, I’d prefer not to read any more papers on “databases” that have buffer pools. Instead, I’d prefer you shock me with your radical, new (and useful!) conception of a data management platform.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Solve new, emerging, real problems outside traditional relational database systems.&lt;/strong&gt; We’re in Northern California. This is NorCal DB Day. As far as I’m concerned, this is the center of the universe for data. If you go talk to real users or go and talk to people outside CS working at universities, it’s astounding to see the number of data-related tools that have yet to be written that don’t look anything like existing database engines. [This advice also holds for just about anyone with an Internet connection.]&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Use data-intensive tools, both the tools that you’re building and the tools that others have built.&lt;/strong&gt; Most data-intensive tools are &lt;em&gt;awful&lt;/em&gt; to use. NoSQL and Hadoop partly came out of the fact that setting up data infrastructure sucks. It’s often impossible to load data if you don’t know your schema, which sucks. Data warehouses also cost a lot of money, which sucks. Treat these things that suck as inspiration. The world isn’t just some giant tire fire; rather, the world is waiting to be filled with beautiful data management tools that you can design, build, and work to understand in your research.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Do bold, weird, and hard projects and actually follow through.&lt;/strong&gt; Just publishing a paper isn’t enough. Today, our prerogative as researchers goes beyond simply publishing one paper, or some CIDR vision paper that remains a vision unrealized. As a community, we have an opportunity to build real projects and products—especially via open source—that take on lives of their own beyond the basics of research and can also inform our research agenda going forward. Let’s recognize that opportunity and encourage the ambition it requires.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;p&gt;There are a number of recent projects that illustrate this advice. Three examples:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://people.ucsc.edu/~palvaro/&quot;&gt;Peter Alvaro&lt;/a&gt;’s work on &lt;a href=&quot;http://people.ucsc.edu/~palvaro/molly.pdf&quot;&gt;Molly and Lineage-Driven Fault Injection&lt;/a&gt; takes classic techniques from data provenance and applies them to understanding and debugging distributed systems and distributed systems failures. This research isn’t just a SIGMOD paper; Peter gave a &lt;a href=&quot;http://www.infoq.com/presentations/failure-test-research-netflix&quot;&gt;keynote in London&lt;/a&gt; about a month ago with his industrial collaborator describing how they productionized this work at Netflix, where it finds real bugs. Theory + practice, applied.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://cs.stanford.edu/people/chrismre/&quot;&gt;Chris Ré&lt;/a&gt;’s work on &lt;a href=&quot;http://deepdive.stanford.edu&quot;&gt;DeepDive&lt;/a&gt; illustrates how to build dark data extraction engines that match or exceed human quality. Like MacroBase, DeepDive is a new kind of data processing system that’s providing value by leveraging core concepts from databases such as probabilistic inference and weakly-consistent query processing. And it works on real problems, like combating human trafficking.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A recent project I wish the database community had done is &lt;a href=&quot;http://download.tensorflow.org/paper/whitepaper2015.pdf&quot;&gt;TensorFlow&lt;/a&gt; at Google, a distributed, scale-out Torch/Theano/Caffe-like model training framework. This is work we should have done in our community. [&lt;a href=&quot;http://www.cs.umd.edu/~thodrek/&quot;&gt;Theo&lt;/a&gt; interrupts and says Chris’s group is solving this problem. Since Chris is a &lt;a href=&quot;https://www.macfound.org/fellows/943/&quot;&gt;genius&lt;/a&gt;, I believe Theo.]&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;hr /&gt;

&lt;p&gt;Before I end, does anyone know who this person is?&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;http://www.bailis.org/blog/post_data/2016-04-30/le_corbusier.jpg&quot; width=&quot;350&quot; alt=&quot;Le Corbusier&quot; /&gt;&lt;/p&gt;

&lt;p&gt;[No one knows.]&lt;/p&gt;

&lt;p&gt;This is one of my role models. This is Le Corbusier, a great architect who pioneered many of the ideas and designs in modern architecture that we know and love today. &lt;/p&gt;

&lt;p&gt;In the early 1920s, Le Corbusier and friends faced a similar problem to the one the database community faces today. Le Corbusier and friends felt that the practice of architecture had ossified and had become stifled by tradition, that the field was essentially dead.&lt;/p&gt;

&lt;p&gt;In response, Le Corbusier wrote a &lt;a href=&quot;https://en.wikipedia.org/wiki/Toward_an_Architecture&quot;&gt;beautiful book&lt;/a&gt; that translates to &lt;em&gt;Towards A New Architecture&lt;/em&gt;. Le Corbusier wrote, in effect, that “yes, the Parthenon is perhaps the most beautiful instance, the perfect example of a particular standard of architecture. The Parthenon may have achieved the platonic ideal of the standard of architecture we’ve previously established. But there are &lt;em&gt;many possible standards&lt;/em&gt; to acknowledge, each dependent on need and use. Standards are established &lt;em&gt;by experiment&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;To reiterate, and to quote directly, Le Corbusier showed us that “a standard is definitely established by experiment.” Le Corbusier went on to revolutionize our conceptions of architecture and design. Le Corbusier wasn’t always right, but he catalyzed the field.&lt;/p&gt;

&lt;p&gt;Today, relational database engines and the status quo in data management reflect an impressive standard. But there are many possible standards, suitable for different needs and uses. Today, it’s our time to experiment, to establish new standards.&lt;/p&gt;

&lt;p&gt;My name is Peter Bailis, and I’m putting my shoulder to the wheel.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>You Can Do Research Too</title>
   <link href="http://bailis.org/blog//you-can-do-research-too/"/>
   <updated>2016-04-24T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//you-can-do-research-too</id>
   <content type="html">&lt;p&gt;I was recently discussing gatekeeping and the process of getting started in CS research with a close friend. I feel compelled to offer a note.&lt;/p&gt;

&lt;p&gt;As a practicing academic researcher, I’m personally thrilled by the degree of excitement regarding CS research today in the broader technical community. Reading papers and doing research have always been favorite activities for me, and it’s tremendously heartening to see organizations like Papers We Love and its many members sharing their excitement as well. Research is a very human way to engage with our curiosity, and curiosity deserves cultivation, celebration, and sharing.&lt;/p&gt;

&lt;p&gt;To someone interested in learning more about research, I’d offer the following words of encouragement and advice:&lt;/p&gt;

&lt;h4 id=&quot;no-one-is-born-a-researcher&quot;&gt;No one is born a researcher&lt;/h4&gt;

&lt;p&gt;There is no prescribed way that a researcher has to look, act, or be. One of my closest colleagues started off doing technical support during the first dot-com boom with only an undergraduate degree in literature and no background in Computer Science. Today, my colleague is a tenure-track professor doing work I deeply respect and admire. Two other colleagues who are faculty at top-tier departments started their careers after emigrating as refugees, and each did their undergraduate work at non-traditional institutions. Another colleague recently started a Ph.D. after spending several years working closely with researchers while in industry, without an undergraduate degree. The CS academy is highly homogeneous, with a long way to go. But if you look closely enough, you may be surprised to find someone who looks and feels more like you than you might otherwise think exists.&lt;/p&gt;

&lt;h4 id=&quot;pedigree-and-privilege&quot;&gt;Pedigree and privilege&lt;/h4&gt;

&lt;p&gt;Granted, pedigree and privilege make many things much easier. But pedigree and privilege are not strictly necessary to do great research, and they are certainly not by themselves sufficient to do great research. Among other things, great research comes from curiosity, creativity, hard work, determination, some amount of brilliance, and many failures.&lt;/p&gt;

&lt;h4 id=&quot;some-concrete-suggestions&quot;&gt;Some concrete suggestions&lt;/h4&gt;

&lt;ol&gt;
  &lt;li&gt;Read papers. &lt;/li&gt;
  &lt;li&gt;Discuss them with friends, or strangers.&lt;/li&gt;
  &lt;li&gt;Go to your local Papers We Love meetup, or start your own chapter.&lt;/li&gt;
  &lt;li&gt;Try implementing what you read. Open sourcing and sharing your code is even better. There’s a good chance your code is better written than the code used for the paper.&lt;/li&gt;
  &lt;li&gt;If you get confused, check Wikipedia. Almost everyone reads Wikipedia, including professors. In addition, consider checking out a book from your local library. Textbooks are great guides to new and unfamiliar areas.&lt;/li&gt;
  &lt;li&gt;Ask yourself: how would I improve this research if I did it myself?&lt;/li&gt;
  &lt;li&gt;Ask yourself: how could I produce a follow-on project?&lt;/li&gt;
  &lt;li&gt;Blog about what you’ve learned and about questions like these.&lt;/li&gt;
  &lt;li&gt;Spend some time trying to improve and follow on to your favorite paper(s). Write some code. Run an experiment. Or try to find a big-O improvement.&lt;/li&gt;
  &lt;li&gt;Email authors if you have questions (or want to express your excitement about their work). If they happen not to respond, don’t worry; research makes for busy schedules, and I’d bet that your polite enthusiasm may have made their day a little bit brighter.&lt;/li&gt;
  &lt;li&gt;Attend research seminars. Your local university should have at least one public talk series; if not, watch talks online.&lt;/li&gt;
  &lt;li&gt;If you’re in school, make sure to attend office hours. Ask questions. Enroll in a graduate-level seminar, ideally one focused on reading and discussing papers.&lt;/li&gt;
  &lt;li&gt;If you’re not in school, still ask questions. Seek out a community that can help you find answers. If you don’t know of a community, I suggest reaching out on social media like Twitter.&lt;/li&gt;
  &lt;li&gt;Don’t get discouraged.&lt;/li&gt;
  &lt;li&gt;Trust in your curiosity.&lt;/li&gt;
  &lt;li&gt;Be respectful, but be bold.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4 id=&quot;some-final-thoughts&quot;&gt;Some final thoughts&lt;/h4&gt;

&lt;p&gt;Make no mistake: getting started in research is hard. The above steps aren’t enough. Getting started in research requires perseverance and will over time require many people to make investments in you and your success. But you can be the first person to make an investment—by learning, educating yourself, doing hard things, and beginning to develop your skills. You have much more agency and power than you may believe.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Lean Research</title>
   <link href="http://bailis.org/blog//lean-research/"/>
   <updated>2016-02-20T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//lean-research</id>
   <content type="html">&lt;p&gt;Recently, my favorite questions for myself and my students have been: what hypothesis are you currently testing, what is your goal in testing it, are you testing it as efficiently as possible?&lt;/p&gt;

&lt;p&gt;A research project ought to consist of multiple hypotheses, at multiple scales. You want at least one vaguely defined hypothesis regarding the space you’re working in; this could take years to test and could become a dissertation. You also want a working hypothesis that you can test in a few weeks or less. In-between, you probably want some intermediate hypotheses that are testable in a few months to a year. Generally, nearer-term hypotheses ought to be more concrete, while longer-term hypotheses ought to be more loosely held. Seek both.&lt;/p&gt;

&lt;p&gt;These hypotheses matter because, in systems research, great work is often achieved via rapid iteration, by repeated formulation and testing of smaller hypotheses in service of a set of larger goals. Failing fast allows you to hone your understanding of a problem and continually evolve your set of facts and beliefs about it. You should expect your hypotheses to change over time while courageously pursuing and refining them at multiple levels.&lt;/p&gt;

&lt;p&gt;The alternative is to sink an enormous amount of time and energy into a project that may or may not go anywhere. Given that time is the most scarce, least fungible resource in research, it’s in your interest to fail fast, learning along the way.&lt;/p&gt;

&lt;p&gt;Complaints about a lack of “long-term” research often conflate short-term research taste with short-term execution. This is a mistake. Many projects never have a chance of real success because they don’t aim high enough. However, it’s possible to aim high while iterating quickly. Many great projects I admire were the result of a series of small steps, executed in service of a larger, worthwhile vision. It’s easy to focus only on big outcomes and forget this process.&lt;/p&gt;

&lt;p&gt;So, if your hypotheses are wrong, will you know as soon as possible? If you’re right, will the grand payoff be worthwhile? As an advisor, one of my goals is to encourage audacious projects while helping break them down into manageable, intermediate steps.&lt;/p&gt;

&lt;p&gt;Ultimately: &lt;em&gt;think big in the long term, and think small in the short term. Have a loosely held plan for what comes between the two.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Call this “lean research.”&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>I Loved Graduate School</title>
   <link href="http://bailis.org/blog//i-loved-graduate-school/"/>
   <updated>2016-01-01T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//i-loved-graduate-school</id>
   <content type="html">&lt;p&gt;It’s increasingly in vogue to complain about graduate school and the state of higher education. Now that I’ve officially graduated, I want to come out and say, simply: I loved graduate school. Pursuing a Ph.D. was one of the best decisions I’ve ever made. Here’s why:&lt;/p&gt;

&lt;h4 id=&quot;growth&quot;&gt;Growth&lt;/h4&gt;

&lt;p&gt;Getting a Ph.D. is an extreme growth experience. In four to six (or more) years, you are expected to become a world expert in a topic, produce original results, and communicate them to your field. Becoming an expert in anything is hard. But consider this: by the time you’re done with your dissertation, you’ll have answered questions no one ever answered (or possibly asked) before. Now, the questions may be narrowly scoped, and maybe only you or your dissertation committee care about the answers. But those possibilities don’t diminish the fact that you’ve made a difference, an original contribution.&lt;/p&gt;

&lt;p&gt;If you haven’t seen it, read Matt Might’s &lt;a href=&quot;http://matt.might.net/articles/phd-school-in-pictures/&quot;&gt;Illustrated Guide to a Ph.D.&lt;/a&gt; It’ll take you less than five minutes. Seriously, &lt;a href=&quot;http://matt.might.net/articles/phd-school-in-pictures/&quot;&gt;read it now&lt;/a&gt; if you haven’t already. &lt;/p&gt;

&lt;p&gt;I don’t know of any other opportunity where you, as a person of any age, whether 18 or 85, are given the freedom and the opportunity to so boldly venture into the unknown, for the sake of the unknown. To finish a Ph.D., you have to: &lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Understand the state of the art in your field (by learning the fundamentals, seminal results, and latest developments, and learning to read research papers).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Perform original research (by learning the methodology of your field, often by building your own instruments and/or learning to use existing ones, formulating new research questions, designing experiments, analyzing the results, and repeating).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Communicate the results to your field (by writing papers, or at least a dissertation, probably by giving several talks along the way, and defending the relevance of your contributions to your field and peers).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This all requires developing a highly diverse skill set in a very short time period. This is hard. It’s one of the hardest things I know to do. Getting a Ph.D. requires creativity, courage, a hefty dose of stubbornness, and a lot of hard work. People will help you, as they helped me, by providing wisdom and guidance and training, but it’s ultimately up to you to do the original research. Again, at the end of the day, you are responsible for answering a question (or several questions) that literally no one else has answered before.&lt;/p&gt;

&lt;p&gt;As a result, I personally found the Ph.D. process to be immensely rewarding. My favorite memories of graduate school are of staying up late, jittery and unwilling to sleep because I desperately wanted to know the answer to a question that came up earlier in the day. I wanted to learn, and it was up to me to find the answer. Without exaggeration, I found (and still find) the questioning and answering part of the research process – really, a core aspect of research – to be exhilarating.&lt;/p&gt;

&lt;p&gt;Beyond this core process, handling all of the other aspects of graduate school requires growth in different dimensions: writing, public speaking, strategy, management, and sometimes politics. I was constantly being stretched. After a big submission or presentation, I’d often look back and be surprised at how far I’d come in the preceding months. Looking back on several years, I’m proud of what I’ve learned and what I’ve worked to accomplish. No dissertation is perfect, but I think many feel similarly about what their dissertations represent in terms of their own personal growth.&lt;/p&gt;

&lt;h4 id=&quot;community&quot;&gt;Community&lt;/h4&gt;

&lt;p&gt;As I’ve hinted, along the way towards a Ph.D., you’re surrounded by other bright, courageous, curious individuals, including students, research staff, and faculty. Getting to nerd out all day on deeply technical ideas with other deeply technical people who share your intellectual passions is insanely fun. When I was writing my dissertation acknowledgements, the word that came to mind to describe my time with colleagues in graduate school was “joyful.” I met several of my best friends in graduate school. Maybe because no one else in my program was studying the specific topic that I chose to study, I rarely felt like I was in a competition with the other students. Departments and academic communities have personalities, but they also have people, many of whom are just as excitable as you and are excited about the same topics that you are.&lt;/p&gt;

&lt;h4 id=&quot;freedom&quot;&gt;Freedom&lt;/h4&gt;

&lt;p&gt;Another amazing aspect of graduate school is that, provided you can find someone (or some department) to support you, you can work on almost anything you want. I think it’s wonderful that people can do dissertations on any number of topics, however practical or theoretical or obscure. The latitude afforded by academic freedom is one of the most beautiful parts of our modern social contract. Funding can be a challenge, but, nevertheless, there are few other institutions that permit such freedom today.&lt;/p&gt;

&lt;p&gt;In computer science, we’re especially fortunate to have a huge number of interesting problems at the intersection of cutting-edge research and issues in current practice. There’s good reason why I thank 25 industry practitioners by name in my dissertation acknowledgments: they provided feedback, insight, and inspiration during my dissertation research. It was so much fun to work on research, present it to people who materially cared about the results, then improve the research as a result. At least in computer science, I think this process is much rarer than it needs to be.&lt;/p&gt;

&lt;h4 id=&quot;opportunity-and-opportunity-cost&quot;&gt;Opportunity and Opportunity Cost&lt;/h4&gt;

&lt;p&gt;Make no mistake: graduate school has a high opportunity cost. In computer science, you’re likely giving up over $100,000 per year to be in school. In expectation, in monetary terms, a Ph.D. in Computer Science is not worth your time.&lt;/p&gt;

&lt;p&gt;That said, with a Ph.D. in Computer Science, you can do many things:&lt;/p&gt;

&lt;p&gt;The academic job market is extremely challenging. However, the market in computer science is much better than in many other fields. As I was told early on in my career, you can’t bet on an academic position by any stretch, but you can try.&lt;/p&gt;

&lt;p&gt;You can do research in an industrial lab, such as Microsoft Research, VMware Research, Samsung Research, IBM Research, or HP Research.&lt;/p&gt;

&lt;p&gt;You can also start a company, like Andy Konwinski, Patrick Wendell, Reynold Xin, and Matei Zaharia (Databricks), Haoyuan Li (Tachyon Nexus), Ben Hindman (Mesosphere), Sean Kandel (Trifacta), Kuang Chen (Captricity), Joey Gonzalez (Dato), Adam Oliner (Kuro Labs, acquired by Salesforce), and many others did over the past few years (or, a little further back, the students who started Google, VMware, Tableau, and Nicira). If you work in a practical area, you can use your research as the basic technology behind a commercial venture. Moreover, some of your smart, creative colleagues might want to start something too.&lt;/p&gt;

&lt;p&gt;If the entrepreneurial life doesn’t call your name, there will almost certainly be a Google or Facebook that is hiring (along with a number of companies of intermediate sizes), especially if you’re good and you’ve kept up on practical skills. In fact, you might want to work for a larger, established company anyway; there are fun problems to solve there, too.&lt;/p&gt;

&lt;p&gt;More generally, a Ph.D. in Computer Science gives you a skill set that can serve you well in industry, wherever you end up. You may not be the best engineer to start, but your critical reasoning skills, communication skills, and discipline can be real assets. In addition, you might end up on a more specialized or technical team than you might have otherwise.&lt;/p&gt;

&lt;p&gt;Of course, maybe you’d prefer to start a restaurant or become an artisan shoemaker. There’s no one stopping you, just like there was no one stopping you before; you just happen to have a Ph.D. now, and, chances are, you learned a ton along the way.&lt;/p&gt;

&lt;p&gt;Maybe not all of the lessons you learned during your Ph.D. are immediately, explicitly marketable. But was that the point? Especially if you read this post before starting a Ph.D., I hope your answer is “no.”&lt;/p&gt;

&lt;h4 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;/h4&gt;

&lt;p&gt;I recognize that I’m fortunate. Extremely fortunate. As a young, white male without debt or dependents in one of the most rapidly growing fields in academia, I’m extraordinarily, unbelievably privileged. In addition, I didn’t come this far alone. I had a fantastic team of mentors and advisors beginning in high school and college who believed in me and taught me to believe in myself. More broadly, academia is not a perfect place. Like many other human institutions, academia has systemic flaws that demand greater attention and improvement.&lt;/p&gt;

&lt;p&gt;While acknowledging all of these things, my statement stands: I loved graduate school. N=1 says graduate school can be a challenging, rewarding, and fulfilling experience, and I believe many of my former graduate student friends and colleagues would agree. As a professor, I don’t expect all of my students to love graduate school. Instead, I hope to help cultivate their curiosities, to empower them to ask and answer their own questions. Maybe some of them will love it too.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>NSF Graduate Research Fellowship: N=1 Materials for Systems Research</title>
   <link href="http://bailis.org/blog//nsf-graduate-research-fellowship-materials-for-systems-research/"/>
   <updated>2015-09-03T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//nsf-graduate-research-fellowship-materials-for-systems-research</id>
   <content type="html">&lt;p&gt;The &lt;a href=&quot;http://www.nsfgrfp.org/&quot;&gt;National Science Foundation Graduate Research Fellowship&lt;/a&gt; (NSF GRFP, or “the NSF”) is one of several fellowships for graduate study in the sciences. If you’re a graduate student in STEM or are applying for STEM-related graduate school in the United States, it’s a great idea to apply for the NSF, which is due in &lt;em&gt;late October&lt;/em&gt; this year. In a nutshell, fellowships are useful because they give you flexibility and freedom by providing you with independing funding (and are also fairly prestigious).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://transientneha.blogspot.com/2007/07/how-to-apply-for-nsf-graduate-research.html&quot;&gt;Neha Narula&lt;/a&gt;, &lt;a href=&quot;http://jxyzabc.blogspot.com/2008/08/cs-grad-school-part-3-fellowships.html&quot;&gt;Jean Yang&lt;/a&gt;, and &lt;a href=&quot;http://www.pgbovine.net/fellowship-tips.htm&quot;&gt;Philip Guo&lt;/a&gt; have all already written wonderful guides for applicants in Computer Science. If you aren’t already in graduate school, getting your materials together is a good exercise that’ll make you think hard about why you want to go to graduate school and what you want to do. There’s a lot of overlap between typical fellowship application materials and graduate application materials, so, if you’re also applying to Ph.D. programs, doing fellowship applications saves you work in November and December.&lt;/p&gt;

&lt;p&gt;In addition to adding yet another pointer to the above guides, I wanted to provide my own N=1 example of a successful NSF application. While there are several &lt;a href=&quot;https://www.google.com/search?q=sample+nsf+grfp+research+statement&quot;&gt;good examples&lt;/a&gt; of successful NSF proposals online, I’m not aware of any in software systems or databases. I’ve been meaning to share these for a while but keep forgetting, so, finally, I’m posting my materials from 2011 below:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.bailis.org/blog/post_data/2015-09-03/pbailis-nsf-grfp-proposal.pdf&quot;&gt;Research Proposal&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.bailis.org/blog/post_data/2015-09-03/pbailis-nsf-grfp-personal.pdf&quot;&gt;Personal Statement&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.bailis.org/blog/post_data/2015-09-03/pbailis-nsf-grfp-research.pdf&quot;&gt;Research Experience&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most useful part of these materials is probably the research proposal. As an applicant in systems research, I remember having a difficult time thinking about how to scope my proposal. In retrospect, I think it’s reasonable to propose a project that, if executed, could lead to one or more solid papers in a top-tier venue like SIGMOD or SOSP, with a clear and well-defined idea, or “nugget” of novelty. As a starting point, I think a good heuristic for problem selection is to read the proceedings of the top-tier conferences in your area, figure out some interesting “hot” or open questions, and ask: how could I do better? While it’s intimidating to read the recent research literature and even &lt;em&gt;consider&lt;/em&gt; the possibility of improving it, be bold, and propose something!&lt;/p&gt;

&lt;p&gt;In my proposal, I focused on a problem that was both fascinating to me and very “hot” at the time: many-core operating system scalability. A number of papers proposing new many-core OS architectures were coming out at the time, but none of them were clear winners. When I wrote my proposal, I was in love with the somewhat vintage idea of capabilities, and I thought they would enable an interesting architecture. Were capabilities &lt;em&gt;actually&lt;/em&gt; a good idea for shared-nothing multi-core? Possibly, but a definitive answer would require… research! At the proposal stage, a research project should ask some non-obvious questions, with a clear strategy for evaluating them and a clear payoff for doing so. Although my graduate research ultimately went in a different direction, I would still love to see the questions in this proposal answered!&lt;/p&gt;

&lt;p&gt;In terms of evaluation, my application received relatively positive reviews, and I’m immensely grateful for the support that the NSF has given me during my graduate career. However, as a caveat, there were a few areas I could have improved on. From my reviews: “Some comments on interactions with other students are missing; also, there are no comments on what was involved in regard to the applicant’s TA fellowship. Also no comments are provided as to the impact of the research work proposed.”; “The applicant does not mention any outreach efforts. Involving more broader activities would help strengthen the application.”; “not much information is available to evaluate other broader impacts, including promoting diversity, integration education and research, etc.” Broader impacts criteria have changed since I applied, but, in the context of a research proposal in software systems, they’re challenging to convey and are therefore worth extra effort!&lt;/p&gt;

&lt;p&gt;So, best of luck, and, as Neha &lt;a href=&quot;http://transientneha.blogspot.com/2007/07/how-to-apply-for-nsf-graduate-research.html&quot;&gt;points out&lt;/a&gt;, not applying gives you a 0% chance!&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Worst-Case Distributed Systems Design</title>
   <link href="http://bailis.org/blog//worst-case-distributed-systems-design/"/>
   <updated>2015-02-03T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//worst-case-distributed-systems-design</id>
   <content type="html">&lt;p&gt;Designing distributed systems that handle worst-case scenarios gracefully can—perhaps surprisingly—improve average-case behavior as well.&lt;/p&gt;

&lt;p&gt;When designing &lt;a href=&quot;http://henryr.github.io/cap-faq/&quot;&gt;CAP “AP”&lt;/a&gt; (or &lt;a href=&quot;http://www.bailis.org/papers/ca-vldb2015.pdf&quot;&gt;coordination-free&lt;/a&gt;) systems, we typically analyze their behavior under worst-case environmental conditions. We often make &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.8727&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;statements of the form&lt;/a&gt; “in the event of arbitrary loss of network connectivity between servers, every request received by a non-failing server in the system will result in a response.” These claims pertain to behavior under worst-case scenarios (“arbitrary loss of network connectivity”), which is indeed useful (e.g., DBAs don’t get paged when a network partition strikes). However, taken at face value, this language can lead to confusion (e.g., “in my experience, &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2655736&quot;&gt;network failures don’t occur&lt;/a&gt;, so why bother?”). More importantly, this language tends to obscure a more serious benefit of this kind of distributed systems design.&lt;/p&gt;

&lt;p&gt;When we design coordination-free systems that don’t have to communicate in the worst case, we’re designing systems that don’t have to communicate in the average case, either. If my servers don’t have to communicate to guarantee a response in the event of network partitions, they &lt;em&gt;also&lt;/em&gt; don’t have to pay the cost of a round-trip within a datacenter or, even better, across geographically distant datacenters. I can add more servers and make use of them—without placing additional load on my existing cluster! I’m able to achieve effectively indefinite scale-out—simply because my servers never &lt;em&gt;have&lt;/em&gt; to communicate on the fast-path.&lt;/p&gt;

&lt;p&gt;This notion of worst-case-improves-average-case is particularly interesting because designing for the worst case doesn’t always work out so nicely. For example, when I bike to my lab, I put on a helmet to guard against collisions, knowing that my helmet will help in &lt;em&gt;some&lt;/em&gt; but not &lt;em&gt;all&lt;/em&gt; situations. But my helmet is no real match for a true worst case—say, a large meteor, or maybe just an eighteen-wheeler. To adequately defend myself against an eighteen-wheeler, I’d need more serious protection that’d undoubtedly encumber my bicycling. By handling the worst case, I lose in the average case. In fact, this pattern of worst-case-degrades-average-case is common, particularly in the real world: consider automotive design, building architecture, and processor design (e.g., for thermal, voltage, and process variations).&lt;sup id=&quot;fnref:design&quot;&gt;&lt;a href=&quot;#fn:design&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; Often, there’s a pragmatic trade-off between how much we’re willing to pay to handle extreme conditions and how much we’re willing to pay in terms of average-case performance.&lt;/p&gt;

&lt;p&gt;So why do distributed systems exhibit this pattern? One possibility is that networks are unpredictable and, in the field, pretty terrible to work with. Despite exciting advances in networking research, we still don’t have reliable SLAs from our networks. A line of research in distributed computing has asked what we &lt;em&gt;could&lt;/em&gt; do if we had better-behaved networks (e.g., with bounded delay)—but we (still) don’t yet.&lt;sup id=&quot;fnref:forward&quot;&gt;&lt;a href=&quot;#fn:forward&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; Given the inability to (easily and practically) distinguish between message delays, link failures, and (both permanent and transient) server failures, we do well by assuming the worst. Essentially, the defining feature of our distributed systems—the network—encourages and rewards us for minimizing our reliance on it.&lt;/p&gt;

&lt;p&gt;Over time, I’ve gained a greater appreciation for the subtle power of this worst-case thinking. It’s often instrumental in determining the &lt;em&gt;fundamental&lt;/em&gt; overheads of a given design rather than superficial (albeit important) differences in implementations or engineering quality.&lt;sup id=&quot;fnref:examples&quot;&gt;&lt;a href=&quot;#fn:examples&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; It’s a clean (and often elegant) way to reason about system behavior and is a useful tool for systems architects. We’d do well by paying more attention to this pattern while fixating less on failures. Although formulations such as &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;Abadi’s PACELC&lt;/a&gt; are a step in the right direction, &lt;a href=&quot;http://www.bailis.org/blog/bridging-the-gap-opportunities-in-coordination-avoiding-databases/&quot;&gt;the connections between latency, availability and scale-out performance&lt;/a&gt; deserve more attention.&lt;/p&gt;

&lt;h4 id=&quot;asides&quot;&gt;Asides&lt;/h4&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:design&quot;&gt;
      &lt;p&gt;There’s an interesting related challenge (design theme/meme?) regarding how to design systems that behave sanely in the worst case but take advantage of the gap between average-case and worst-case conditions (when it exists). I used to hang out with computer architects, and there’s some cool work on this topic in their community (e.g., &lt;a href=&quot;http://web.eecs.umich.edu/~tnm/trev_test/papersPDF/2005.01.Opportunities%20and%20challenges%20for%20better%20than%20worst-case%20design.pdf&quot;&gt;“Better than Worst-Case design”&lt;/a&gt;). One of my favorite papers is called &lt;a href=&quot;http://www.cs.utah.edu/~rajeev/cs7810/papers/diva99.pdf&quot;&gt;DIVA&lt;/a&gt; and employs a dirt-simple, in-order co-processor to verify the computations of a less reliable primary processor, which employs all sorts of microarchitectural bells and whistles to speed up the primary computation. DIVA only “pays” for errors (worst-case, due to, say, noise, process variation, bugs in the processor design) when they occur (and are subsequently caught by the checker, which is much easier to verify and build correctly). (Fun fact, from Footnote 1: “This [single-author paper] was performed while the author was (un)employed as an independent consultant in late May and June of 1999.”) Some of my favorite algorithms (selfishly, e.g.,  &lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;RAMP&lt;/a&gt;) have this flavor, where we’re only penalized when a bad thing occurs (in the case of RAMP-Fast, readers only pay—using a second RTT with servers to repair missing values—when there is actually a race). &lt;a href=&quot;#fnref:design&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:forward&quot;&gt;
      &lt;p&gt;Lest I come off as grumpy: I am actually very excited about this possibility, but it’s a challenging chicken-and-egg problem. Which comes first: better hardware, or applications that could make use of the hardware if the hardware existed? I think we can go either way, but I’ll note that (in the latter scenario), there are troves of brilliant ideas in the distributed computing and database literature that are lying in wait for better hardware. As one of my favorite examples, Barbara Liskov’s PODC 1991 keynote on &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=112601&quot;&gt;“Practical Uses of Synchronized Clocks in Distributed Systems”&lt;/a&gt; has a bunch of great ideas (and associated papers). Now that Barbara’s former student has managed to convince Google to add atomic clocks to their datacenters, these algorithms can be efficiently implemented at scale. I think academics have a serious role to play in this conversation, and this type of co-design is on my research agenda. But there’s probably lower-hanging fruit for practitioners today. &lt;a href=&quot;#fnref:forward&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:examples&quot;&gt;
      &lt;p&gt;Three scenarios where this kind of analysis has useful implications:&lt;/p&gt;

      &lt;p&gt;a.) Whenever someone claims their coordinated (e.g., serializable) system achieves unilateral performance or availability with a coordination-free design, we know they’re either misguided, are &lt;a href=&quot;http://www.bailis.org/blog/without-conflicts-serializability-is-free/&quot;&gt;not exercising an interesting workload&lt;/a&gt;, and/or are simply being &lt;a href=&quot;http://smalldatum.blogspot.com/2014/06/benchmarketing.html&quot;&gt;misleading&lt;/a&gt;. In these cases, we know that, regardless of how the system is implemented, because they’re choosing to implement a fundamentally “hard” problem (e.g., serializability, atomic commitment, or maybe just a consensus register) that requires coordination to prevent some “bad” execution, there’s going to be a cost associated. This is effectively the converse of our thesis: if you have to pay in the worst case, you (may) have to pay in the average case. Moreover (and perhaps most importantly), we don’t have to point fingers and argue about someone’s code or implementation—in many cases, there is literally &lt;em&gt;no way&lt;/em&gt; to implement a better system, and we can prove that there’s no use in trying to do so. (Of course, there’s plenty of fun [and money] in building fast databases, but the point is that we can separate out what’s fundamentally new—which also happens to be a key criterion for good research—and what’s simply built better.)&lt;/p&gt;

      &lt;p&gt;b.)  “Scale out” versus “scale up” systems design is &lt;a href=&quot;http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html&quot;&gt;a hot topic&lt;/a&gt;—should we build (often more sophisticated) single-node systems or build simpler systems that are geared toward horizontal scalability? These two are often opposed, and proponents of each strategy will argue that the other is misguided for some particular workload or system implementation. Yet, couched in the terms of worst-case systems design, advances in single-node performance simply serve to increase the capacity of each node within a cluster. Making each node in a coordination-free system two times faster via “scale up” techniques makes the whole cluster two times faster. While the conversation is typically “scale out” &lt;em&gt;versus&lt;/em&gt; “scale up,” if we’re coordination-free, we get to choose “scale out” &lt;em&gt;while&lt;/em&gt; “scaling up.” This may seem obvious, but consider the fact that speeding up single-node performance does &lt;em&gt;not&lt;/em&gt; always, for example, improve the throughput of distributed transaction processing if (ha!) you are bottlenecked on the network.&lt;/p&gt;

      &lt;p&gt;c.) As hardware changes, the constants associated with communication costs may change—thus affecting the relative benefit of our worst-case designs—but the fundamental gains likely won’t. For example, for decades we’ve been running shared-everything architectures on single-node servers. However, there’s mounting evidence that a shared-nothing approach is desirable in a multi- and (especially) many-core setting. Andy Pavlo and friends recently did &lt;a href=&quot;http://www.cs.cmu.edu/~pavlo/static/papers/p209-yu.pdf&quot;&gt;a study of OLTP scalability to 1000 cores&lt;/a&gt; (in simulation, with some folks at MIT). They found that—regardless of implementation—serializability tanks in performance (due to a variety of factors, depending on the algorithm, but—I’d argue—ultimately due to the fundamental costs of providing serializability). In the distributed setting, today’s datacenter networks make serializable transaction performance an easy target to beat—yet, even as network performance improves and performance bottlenecks lessen in severity, we’ll still see benefits. (An argument I like to make is that, despite DRAM accesses being ridiculously fast—at least relative to remote memory accesses or RPCs—we still spend a lot of time devising cache-friendly algorithms—albeit not for every application, but for many that matter.) &lt;a href=&quot;#fnref:examples&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>When Does Consistency Require Coordination?</title>
   <link href="http://bailis.org/blog//when-does-consistency-require-coordination/"/>
   <updated>2014-11-12T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//when-does-consistency-require-coordination</id>
   <content type="html">&lt;p&gt;My coauthors and I recently &lt;a href=&quot;http://www.vldb.org/pvldb/vol8/p185-bailis.pdf&quot;&gt;published a paper&lt;/a&gt; (that will appear at &lt;a href=&quot;http://www.vldb.org/2015/&quot;&gt;VLDB 2015&lt;/a&gt;) answering one of my longest standing research questions: when does consistency require coordination? It’s well known that many “strong” properties like “ACID” &lt;a href=&quot;http://www.bailis.org/blog/linearizability-versus-serializability/&quot;&gt;serializability and linearizability&lt;/a&gt; are &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2462076&quot;&gt;not achievable&lt;/a&gt; without coordination, or synchronous communication between concurrent operations. But why is it that we can still implement &lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/crdt-tr2011.pdf&quot;&gt;reliable distributed counters&lt;/a&gt; and &lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/dynamo-sosp2007.pdf&quot;&gt;shopping carts&lt;/a&gt; that don’t lose writes, build indexes that are &lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;“consistent” with base data&lt;/a&gt;, and ensure useful properties like &lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/baseball-cacm2013.pdf&quot;&gt;read-your-writes&lt;/a&gt;—all without coordination? Why do some operations require coordination while others don’t, and what’s the fundamental difference at play?&lt;/p&gt;

&lt;p&gt;In the paper, we present a property, called &lt;strong&gt;invariant confluence&lt;/strong&gt; (I-confluence), that precisely answers this question. I-confluence is necessary and sufficient for safe, coordination-free, available, and convergent execution (think &lt;a href=&quot;http://henryr.github.io/cap-faq/&quot;&gt;CAP “AP”&lt;/a&gt;, without breaking your app). That is, if I-confluence holds, there exists a coordination-free execution strategy that preserves these properties. If it does not hold, no such strategy exists, so don’t waste your time looking for a coordination-free algorithm or concurrency control mechanism that’ll work in all cases.&lt;/p&gt;

&lt;p&gt;The intuition behind I-confluence is pretty simple. We capture “consistency” (or &lt;a href=&quot;http://www.bailis.org/blog/safety-and-liveness-eventual-consistency-is-not-safe/&quot;&gt;safety&lt;/a&gt;) using &lt;em&gt;invariants&lt;/em&gt;, or declarative correctness criteria about database states (e.g., no user has a negative account balance). Then, given a user’s transactions and a &lt;em&gt;merge&lt;/em&gt; procedure used to reconcile divergent states (e.g., last-write-wins, set union, or a convergent datatype), we can check for I-confluence. Simply put, under I-confluence: &lt;strong&gt;all local commit decisions must be globally invariant-preserving&lt;/strong&gt;. If I commit an operation on my local copy of database state, I must make sure that no other concurrent operation would invalidate my commit decision upon merge. If no such state exists, I don’t have to coordinate to commit. That’s it.&lt;/p&gt;
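&lt;p&gt;To make that check concrete, here is a toy sketch (mine, not from the paper; the function names and the integer-counter model are illustrative assumptions) of testing invariant confluence for a non-negative balance invariant under an additive merge:&lt;/p&gt;

```python
# Toy sketch of an invariant-confluence check for a single counter.
# Assumptions (illustrative, not from the paper's formalism): states are
# integers, transactions are deltas, and divergent replicas reconcile
# with an additive merge against a common base state.

def invariant(state):
    """Example invariant: no negative account balance."""
    return state >= 0

def merge(s1, s2, base):
    """Additive merge: apply both replicas' net changes to the common base."""
    return base + (s1 - base) + (s2 - base)

def locally_safe(base, delta):
    """A replica commits only if its local post-state satisfies the invariant."""
    return invariant(base + delta)

def i_confluent(base, deltas):
    """Every pair of locally committable operations must remain
    invariant-preserving after merge."""
    for d1 in deltas:
        for d2 in deltas:
            if locally_safe(base, d1) and locally_safe(base, d2):
                if not invariant(merge(base + d1, base + d2, base)):
                    return False
    return True

# Deposits alone pass the check under a non-negativity invariant...
assert i_confluent(base=10, deltas=[+5, +3])
# ...but concurrent withdrawals do not: each is locally safe,
# yet the merged state can go negative.
assert not i_confluent(base=10, deltas=[-7, -8])
```

&lt;p&gt;In this sketch, withdrawals fail the check (so they require coordination), while deposits pass it, matching the intuition above.&lt;/p&gt;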

&lt;p&gt;“That’s it? That sounds too simple to be true!” you protest.&lt;/p&gt;

&lt;p&gt;It’s true, &lt;em&gt;formalizing&lt;/em&gt; this property requires some care,&lt;sup id=&quot;fnref:formalism&quot;&gt;&lt;a href=&quot;#fn:formalism&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; but the basic idea &lt;em&gt;is&lt;/em&gt; pretty simple.&lt;sup id=&quot;fnref:rewriting&quot;&gt;&lt;a href=&quot;#fn:rewriting&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; The key is to specify correctness in terms of invariants rather than reads and writes.&lt;sup id=&quot;fnref:abstraction&quot;&gt;&lt;a href=&quot;#fn:abstraction&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; Without knowledge of what “correctness” means to your app (e.g., the invariants used in I-confluence), the &lt;a href=&quot;http://www.eecs.harvard.edu/~htk/publication/1979-sigmod-kung-papadimitriou.pdf&quot;&gt;best you can do&lt;/a&gt; to preserve correctness under a read/write model is serializability.&lt;sup id=&quot;fnref:otherproperties&quot;&gt;&lt;a href=&quot;#fn:otherproperties&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; (Hopefully your database &lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;offers it&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;“So what?” you ask.&lt;/p&gt;

&lt;p&gt;We applied I-confluence to a range of integrity constraints found in database systems, including foreign key, uniqueness, and row-level check constraints as well as some invariants over abstract datatypes.&lt;sup id=&quot;fnref:analysis&quot;&gt;&lt;a href=&quot;#fn:analysis&quot; class=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; The results show that, while coordination can’t always be avoided, in many cases, it &lt;em&gt;can&lt;/em&gt; (even when serializability would indicate otherwise). By applying these results to the industry standard TPC-C Benchmark, &lt;strong&gt;we achieve a 25-fold improvement over the prior best result&lt;/strong&gt; (12.7M New-Order transactions per second on 200 servers—I’ll write about the details in another blog post, but, for now, the paper has details). We’ve also been looking at open source applications and applying I-confluence analysis (the subject of another paper and post) and have seen similar wins. Our experience indicates I-confluence works.&lt;/p&gt;

&lt;p&gt;Consistency doesn’t always require coordination. In fact, in practice, it seldom does. I’m excited to finally have &lt;a href=&quot;http://www.vldb.org/pvldb/vol8/p185-bailis.pdf&quot;&gt;a piece of work&lt;/a&gt; that exactly characterizes this trade-off, backed by some serious experimental data that shows why it matters.&lt;/p&gt;

&lt;h4 id=&quot;notes&quot;&gt;Notes&lt;/h4&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:formalism&quot;&gt;
      &lt;p&gt;The paper provides actual formalism, which, if you’re a footnote reader, you should consider checking out. In a little more detail, the set of states reachable via invariant-preserving executions must be closed under merge. We also formalize coordination-freedom, availability, and convergence more carefully. If you absolutely love formalism, you can also check out the &lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;extended version&lt;/a&gt; of the paper on the arXiv. &lt;a href=&quot;#fnref:formalism&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:rewriting&quot;&gt;
      &lt;p&gt;When I was doing this work, I was convinced someone else had come up with a property like this already. There’s a huge amount of work on semantics-based concurrency control (mostly from the 1980s) and even more on concurrent program analysis. But, for a bunch of reasons, we actually found the basic property we were seeking in the literature on rule-based rewriting systems rather than database concurrency control or classic program analysis. In 2007, some very smart folks working on rule-based rewriting wanted a more expressive way to talk about “correct” rewrites without resorting to a formulation based on determinism, or confluence (see below), so &lt;a href=&quot;http://ww2.cs.mu.oz.au/~pjs/papers/iclp07.pdf&quot;&gt;they came up with the idea of using invariants&lt;/a&gt; to talk about the validity of intermediate and final steps in the rewriting process. Our use of I-confluence is, if you squint, similar in nature to what they proposed (e.g., treat transactions as rules, and merge as “meet”), and so we adopted the name. I’d love to spend more time digging into that literature and taking greater advantage of rule-based rewriting techniques in I-confluence analysis.&lt;/p&gt;

      &lt;p&gt;To go a little deeper on the rest of the literature: we have a pretty extensive related work section in the paper, but there were two main problems I ran into in the literature. First, much of the work on semantics-based concurrency control in databases was helpful but wasn’t directly focused on &lt;em&gt;necessary&lt;/em&gt; conditions for coordination-free execution. Second, in the program analysis literature, many techniques assume linearizable update to shared state. For example, if you declare pre- and post-conditions (which we don’t), you can often determine the safety of executing each of these operations in a linearizable manner. Not only is this linearizability already a non-starter in a coordination-free environment, but &lt;em&gt;also&lt;/em&gt; if you’re in a shared-memory setting, any writes to the same item conflict and can cause terrible problems in many analysis techniques. In contrast, in a distributed setting (or even single-node environment with the ability to rewrite/curry writes to local program variables, then play with them later), writes to the same data item can be reconciled via potentially complex logic (e.g., merge). I’d be shocked if you couldn’t find some encoding of merge and the general I-confluence rules in these program analysis techniques, but given that the rewriting literature made this basically “free”, we stuck with that. &lt;a href=&quot;#fnref:rewriting&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:abstraction&quot;&gt;
      &lt;p&gt;Another (more subtle) benefit along these lines is that I-confluence allows us to reason about arbitrary, abstract implementations and deployments of database systems rather than any particular system implementation or deployment. We effectively “lift” the partitioning arguments typified by Gilbert and Lynch’s &lt;a href=&quot;http://broadsoft.googlecode.com/svn-history/r208/docs/km/Architecture/Brewer_s_conjecture_and_the_feasibility_of_consistent__available__partition-tolerant_web_services.pdf&quot;&gt;proof of the CAP Theorem&lt;/a&gt; to the level of arbitrary program logic (rather than read/write logic executed in parallel on different, adversarially partitioned servers). If that’s inscrutable, compare the paper’s I-confluence proofs to those in the &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;HAT paper&lt;/a&gt; or those used by Gilbert and Lynch. This higher-level abstraction is, in our experience, more intuitive than reasoning about “partition” behavior—and also applies to single-node systems as well. &lt;em&gt;Basically, instead of having to cook up a scenario where servers become partitioned, forcing “inconsistency,” we can rely on a basic property of application logic instead.&lt;/em&gt; &lt;a href=&quot;#fnref:abstraction&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:otherproperties&quot;&gt;
      &lt;p&gt;Those of you who live and breathe this stuff (like me) will probably be asking: “what about commutativity?” or “what about monotonicity?”&lt;/p&gt;

      &lt;p&gt;Commutativity, or, in effect, order insensitivity of operations, is useful and is usually sufficient for safe, concurrent execution. But it’s not always necessary for correctness: for example, reads don’t commute with writes, but, if all we care about is that no one writes the value 0xDEADBEEF, who cares about reads? We provide some more examples in the paper, and so do &lt;a href=&quot;http://web.mit.edu/amdragon/www/pubs/commutativity-sosp13.pdf&quot;&gt;Clements et al.&lt;/a&gt; Bonus: commutativity &lt;a href=&quot;http://www.bailis.org/blog/data-integrity-and-problems-of-scope/&quot;&gt;isn’t always safe&lt;/a&gt;. For example, if we use a commutative counter to ensure we don’t “lose” the effects of increment or decrement operations, there’s no guarantee that our counter will obey the invariant that the counter is non-negative. (You might argue that this wasn’t really a “commutative” operation, at which point we could have a fun conversation.)&lt;/p&gt;
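&lt;p&gt;A minimal sketch of the counter example above (hypothetical code, not from any real system): the operations commute, so every ordering converges to the same final state, yet that deterministic state still violates the non-negativity invariant:&lt;/p&gt;

```python
# Toy illustration that commutativity does not imply invariant safety.
# Two replicas each apply a decrement to a shared counter; increments and
# decrements commute (order-insensitive), so no update is "lost" --
# yet the non-negativity invariant is still violated.
from functools import reduce
import itertools

def apply_ops(initial, ops):
    """Apply a sequence of increment/decrement deltas to a counter."""
    return reduce(lambda state, op: state + op, ops, initial)

initial = 1
ops = [-1, -1]  # two concurrent, individually valid decrements

# Commutative: every ordering of the operations yields the same final state.
results = {apply_ops(initial, perm) for perm in itertools.permutations(ops)}
assert len(results) == 1

# ...but that single converged state violates "counter >= 0".
final = results.pop()
assert final == -1
```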

      &lt;p&gt;Monotonicity, or, in effect, “grow-only” program logic, is also useful and is necessary and sufficient to guarantee &lt;em&gt;confluence&lt;/em&gt;, or determinism, of outcomes despite nondeterministic ordering of program inputs. This underlies languages like &lt;a href=&quot;http://cidrdb.org/cidr2011/Papers/CIDR11_Paper35.pdf&quot;&gt;Bloom&lt;/a&gt;, &lt;a href=&quot;http://db.cs.berkeley.edu/papers/socc12-blooml.pdf&quot;&gt;Bloom^L&lt;/a&gt;, and &lt;a href=&quot;https://www.cs.indiana.edu/~rrnewton/papers/2013-FHPC_LVars.pdf&quot;&gt;LVars&lt;/a&gt;. Determinism is useful, but it’s not necessary for all forms of correctness. As a classic example, consensus protocols allow &lt;em&gt;any&lt;/em&gt; proposed value to be accepted insofar as all processes agree on that value—the outcome is, moreover, nondeterministic due to network delays in many common protocols like Paxos. That is, Paxos is not confluent (nor monotonic), but it’s still useful. In the coordination-free domain, the read operation in our commutativity example above wouldn’t be monotonic (we &lt;em&gt;could&lt;/em&gt; do a blocking threshold read, which might work in some cases, but then we’ll have to block). Moreover, if we only &lt;em&gt;decremented&lt;/em&gt; a shared counter, the decrement operations would be monotonic, but, if we had the same non-negative invariant, we could end up with a deterministic (negative) but incorrect state. So, in short, confluence and invariant preservation are both useful properties, but we care about the latter in this paper.&lt;/p&gt;

      &lt;p&gt;The silver lining here is that—despite false positives for coordination (and, depending on your definitions, false negatives)—both commutativity and monotonicity are in some cases easier to analyze. Commutativity, monotonicity, and I-confluence are all undecidable properties. But the first two depend only on the program logic, while I-confluence requires users to &lt;em&gt;also&lt;/em&gt; specify invariants. (However, note that these invariants are really only one-per-database—or one conjunction per database—rather than one per operation, as in Hoare-style program analyses.) The open source applications we’ve been studying actually declare these surprisingly often, but there’s still a trade-off between user participation and precision in determining coordination requirements. &lt;a href=&quot;#fnref:otherproperties&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:analysis&quot;&gt;
      &lt;p&gt;If you’re interested in &lt;em&gt;how&lt;/em&gt; we actually apply the test, there are details in the paper. In brief, our current approach is to pre-classify invariant-operation pairs (by manual proof—check out the &lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;arXiv&lt;/a&gt; for a full walk-through; these aren’t too hard) and then use simple static analysis to check pairs in code (I hesitate to really call this “program analysis”).&lt;/p&gt;

      &lt;p&gt;My current goal is to use I-confluence as a basis for principled systems optimizations like &lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;RAMP&lt;/a&gt;, the TPC-C results, and some of the open source work I’ve been doing. That said, I see a huge amount of low-hanging fruit in improving and further automating this analysis. &lt;a href=&quot;#fnref:analysis&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Data Integrity and Problems of Scope</title>
   <link href="http://bailis.org/blog//data-integrity-and-problems-of-scope/"/>
   <updated>2014-10-20T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//data-integrity-and-problems-of-scope</id>
   <content type="html">&lt;p&gt;Mutable state in distributed systems can cause all sorts of headaches, including data loss, corruption, and unavailability. Fortunately, there are a range of techniques—including exploiting commutativity and immutability—that can help reduce the incidence of these events without requiring much overhead. However, these techniques are only useful when applied correctly. When applied &lt;em&gt;incorrectly&lt;/em&gt;, applications are still subject to data loss and corruption. In my experience, (the unfortunately common) incorrect application of these techniques is often due to problems of &lt;em&gt;scope&lt;/em&gt;. What do I mean by scope? Let’s look at two examples:&lt;/p&gt;

&lt;h4 id=&quot;commutativity-gone-wrong&quot;&gt;1.) Commutativity Gone Wrong&lt;/h4&gt;

&lt;p&gt;For our purposes, commutativity informally means that the result of executing two operations is independent of the order in which the operations were executed. Consider a counter: if I increment by one and you also increment by one, the end result (&lt;code&gt;counter += 2&lt;/code&gt;) is the same regardless of the order in which we execute our operations! We can exploit this reorderability to build more scalable counters.&lt;/p&gt;

&lt;p&gt;However, we’re not off the hook yet. While our &lt;em&gt;operations&lt;/em&gt; on the &lt;em&gt;individual&lt;/em&gt; counter were commutative, does this mean that our &lt;em&gt;programs&lt;/em&gt; that use this counter are “correct”? If we’re Twitter and we want to build retweet counters, then commutative counters are probably a fine choice. But what if the counter is storing a bank account balance?&lt;sup id=&quot;fnref:bank&quot;&gt;&lt;a href=&quot;#fn:bank&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; Or if it’s monitoring the amount of available inventory for a given item in a warehouse? While two individual decrement operations are commutative, two account withdrawal operations (or two item order placement operations) are not. By looking at individual data items instead of how the data is used by our programs, we risk data corruption. When analyzing commutativity, myopia is dangerous.&lt;/p&gt;
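&lt;p&gt;To make this concrete, here’s a minimal Python sketch of the failure mode above. The &lt;code&gt;decrement&lt;/code&gt; and &lt;code&gt;withdraw_allowed&lt;/code&gt; helpers are hypothetical, not from any real system: two replicas each validate a withdrawal against the same snapshot, then merge their perfectly commutative decrements into a balance that violates the non-negative invariant.&lt;/p&gt;

```python
import operator

def decrement(balance, amount):
    # Counter-level decrement: commutative, so either application
    # order yields the same final value.
    return balance - amount

def withdraw_allowed(balance, amount):
    # Program-level guard: a withdrawal is valid only if funds remain.
    # operator.ge(a, b) tests whether a is at least b.
    return operator.ge(balance, amount)

balance = 100

# Two replicas each validate a withdrawal against the same snapshot...
assert withdraw_allowed(balance, 70)  # replica A's local check passes
assert withdraw_allowed(balance, 60)  # replica B's local check passes

# ...then exchange their decrements. The decrements commute: both
# merge orders agree on the final value...
merged = decrement(decrement(balance, 70), 60)
assert merged == decrement(decrement(balance, 60), 70) == -30

# ...but the merged state violates the non-negative invariant.
assert not withdraw_allowed(merged, 0)
```

&lt;p&gt;The counter did exactly what it promised—order independence—but the &lt;em&gt;program&lt;/em&gt; around it still produced a corrupt state.&lt;/p&gt;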

&lt;h4 id=&quot;immutability-gone-wrong&quot;&gt;2.) Immutability Gone Wrong&lt;/h4&gt;

&lt;p&gt;As a second example, consider the recent trend of making all data immutable (e.g., the &lt;a href=&quot;http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html&quot;&gt;Lambda architecture&lt;/a&gt;). The idea here is that we can collect a grow-only set of facts and subsequently run queries over those facts (often via a mix of batch and streaming systems). There’s no way to corrupt individual data items as, once they’re created, they can’t change. If I can name a fact that’s been created, I can always get a “consistent” read of that item from the underlying store holding my facts. Sounds great, right?&lt;/p&gt;

&lt;p&gt;Again, it’s easy to miss the forest for the trees. Simply because individual writes are immutable doesn’t mean that program outcomes are somehow “automatically correct.” Specifically, the outputs of the functions we’re computing over these facts are—in many cases—sensitive to order and are “mutable” with respect to the underlying facts. Pretend we use immutable data to store a ledger of deposits and withdrawals. In our bank account example, simply making individual deposit and withdrawal entries immutable doesn’t solve our above problems: we can insert two “immutable” withdrawal requests and get a negative final balance!&lt;sup id=&quot;fnref:immutability&quot;&gt;&lt;a href=&quot;#fn:immutability&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
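&lt;p&gt;Here’s a small Python sketch of that ledger scenario (the &lt;code&gt;ledger&lt;/code&gt; representation and &lt;code&gt;balance&lt;/code&gt; function are illustrative, not any real system’s API): every fact is immutable once written, yet the derived balance still goes negative when two clients append withdrawals based on the same stale view.&lt;/p&gt;

```python
# A grow-only ledger of immutable facts: each entry is a frozen
# (kind, amount) tuple that is never updated in place.
ledger = [("deposit", 100)]

def balance(facts):
    # Derived view over the facts; its output is "mutable" with
    # respect to the underlying immutable set.
    total = 0
    for kind, amount in facts:
        total = total + amount if kind == "deposit" else total - amount
    return total

# Two clients each read balance() == 100, decide a 70-unit withdrawal
# is safe, and append an immutable fact recording it.
snapshot = balance(ledger)
assert snapshot == 100
ledger.append(("withdraw", 70))   # client A's fact
ledger.append(("withdraw", 70))   # client B's fact

# No individual fact was corrupted, yet the derived balance is negative.
assert balance(ledger) == -40
```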

&lt;h4 id=&quot;the-real-problem&quot;&gt;The Real Problem&lt;/h4&gt;

&lt;p&gt;The crux of the problem is that, if used to provide data integrity, these properties &lt;em&gt;must&lt;/em&gt; be applied at the appropriate scope.&lt;sup id=&quot;fnref:conway&quot;&gt;&lt;a href=&quot;#fn:conway&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; Simply making individual operations commutative or immutable doesn’t guarantee that their composition with other operations in your application &lt;em&gt;also&lt;/em&gt; will be, nor will this somehow automatically lead to correct behavior. Analyzing operations (or even groups of operations) in isolation—divorced from the &lt;em&gt;context&lt;/em&gt; in which they are being used—is often unsafe. If you’re building an application and want to use these properties to guarantee application correctness, you likely need to analyze your whole program behavior.&lt;/p&gt;

&lt;h4 id=&quot;to-be-fair&quot;&gt;To Be Fair…&lt;/h4&gt;

&lt;p&gt;To be fair, commutativity and immutability—even applied at too fine a scope—do have merits. They may lead to a lower likelihood of data loss and overwrites than simply using mutable state without coordination. Many of the issues involved in &lt;a href=&quot;http://technet.microsoft.com/en-us/library/aa213029(v=sql.80).aspx&quot;&gt;“Lost Updates”&lt;/a&gt; go away. It’s often easier to &lt;a href=&quot;http://www.eecs.berkeley.edu/~brewer/cs262b/update-conflicts.pdf&quot;&gt;detect integrity violations&lt;/a&gt; and issue compensating actions in a no-overwrite system. And remember, some programs are actually safe under arbitrary composition of commutative and/or immutable operations with program logic. Just not all programs.&lt;/p&gt;

&lt;p&gt;These are good design patterns that help simplify many of the challenges we started with. But naïvely applying design patterns isn’t a substitute for understanding your program behavior.&lt;/p&gt;

&lt;p&gt;In a sense, these problems of scope are a natural side-effect of good software engineering. As systems builders, we design and implement abstractions that can be re-used across multiple programs. We have to make decisions about what information we need from our applications and how to take advantage of that information.&lt;sup id=&quot;fnref:research&quot;&gt;&lt;a href=&quot;#fn:research&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; In many cases, it’s unrealistic for us to require our programmers to submit their entire program for commutativity analysis on every request (e.g., &lt;code&gt;increment(key: String, program: JARFILE)&lt;/code&gt;!). We have to make compromises, and exposing operation-commutative counters is a reasonable compromise—as long as users understand the implications and are wise enough to guard against any undesirable behavior.&lt;/p&gt;

&lt;p&gt;These problems of scope crop up in traditional database systems as well. Consider a serializable database—say, the best that money can(’t) buy, with no bugs, zero latency, infinite throughput, and guaranteed availability. If your bank application fails to wrap its multi-statement increment and decrement operations in a transaction, the database will still corrupt your data. Implicit in the serializable transaction API is the &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/gray/papers/thetransactionconcept.pdf&quot;&gt;assumption&lt;/a&gt; that your transactions will preserve data integrity on their own. If you scope your transactions incorrectly, your serializable database won’t help you.&lt;/p&gt;

&lt;h4 id=&quot;conclusions-and-some-references&quot;&gt;Conclusions and Some References&lt;/h4&gt;

&lt;p&gt;Where’s the silver lining? If you reason about your application’s correctness at the correct scope (often, the application level), you’ll be fine. There are a number of powerful ways to achieve coordination-free execution without compromising correctness—if used correctly. Figure out what “correctness” means to you (e.g., state an invariant), and analyze your applications and operations at the appropriate granularity.&lt;/p&gt;

&lt;p&gt;If you’re interested, many people are thinking about these issues. My colleagues and I &lt;a href=&quot;http://db.cs.berkeley.edu/papers/socc13-consistency.pdf&quot;&gt;wrote a short paper&lt;/a&gt; on reasoning about data integrity at different layers of abstraction, from read/write to object and whole program. Some smart people from MIT &lt;a href=&quot;http://am.csail.mit.edu/papers/commutativity:sosp13.pdf&quot;&gt;wrote a paper&lt;/a&gt; last year on how to reason about commutativity of interfaces, and my collaborators have &lt;a href=&quot;http://cidrdb.org/cidr2011/Papers/CIDR11_Paper35.pdf&quot;&gt;developed ways to guarantee determinism&lt;/a&gt; of outcomes via whole-program &lt;em&gt;monotonicity&lt;/em&gt; analysis in the Bloom language (see also &lt;a href=&quot;#fn:conway&quot;&gt;note three&lt;/a&gt;). Finally, &lt;a href=&quot;http://www.bailis.org/papers/ca-vldb2015.pdf&quot;&gt;some of my own recent work&lt;/a&gt; examines when &lt;em&gt;any&lt;/em&gt; coordination is strictly required to guarantee arbitrary application integrity (that is, when we can preserve application “consistency” without coordination and when we &lt;em&gt;must&lt;/em&gt; coordinate).&lt;/p&gt;

&lt;p&gt;With a better understanding of what is and is not possible—along with tools that embody this understanding and systems that exploit it—we can build much more robust and scalable systems without compromising safety or usability for end users.&lt;/p&gt;

&lt;h4 id=&quot;notes&quot;&gt;Notes&lt;/h4&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:bank&quot;&gt;
      &lt;p&gt;Yes, I used the bank example, and, indeed, &lt;a href=&quot;http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why-banks-are-base-not-acid-availability.html&quot;&gt;banks deal with this problem&lt;/a&gt; via compensating actions and other techniques. But the point remains: otherwise commutative decrements on bank account balances aren’t enough to prevent—on their own, without other program logic—data integrity errors like negative balances. &lt;a href=&quot;#fnref:bank&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:immutability&quot;&gt;
      &lt;p&gt;In short, I can’t safely make decisions based on the output of the functions I’m computing without making additional guarantees on the underlying set of facts. This is actually a pretty well studied (albeit challenging) problem in database systems. We’d call the “Lambda Architecture” an instance of “incremental materialized view maintenance” (or “continuous query processing”; via a combination of stream processing and batch processing). This has been, at various times, a hot topic in the research community and even in practice. For example, &lt;a href=&quot;http://cs.stanford.edu/people/widom/&quot;&gt;Jennifer Widom&lt;/a&gt;’s group at Stanford spent a considerable amount of time in the early 2000s &lt;a href=&quot;http://ilpubs.stanford.edu:8090/758/1/2003-67.pdf&quot;&gt;understanding the relationship&lt;/a&gt; between streams and relations. &lt;a href=&quot;http://web.cecs.pdx.edu/~maier/&quot;&gt;Dave Maier&lt;/a&gt; and collaborators developed &lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/punctuations-tkde2003.pdf&quot;&gt;punctuated stream semantics&lt;/a&gt; that’d allow you to “seal” time and actually make definitive statements about a streaming output that, by definition, won’t change in the future. And &lt;a href=&quot;http://www.cs.berkeley.edu/~franklin/&quot;&gt;Mike Franklin&lt;/a&gt; and friends built a &lt;a href=&quot;http://www.cisco.com/web/about/ac49/ac0/ac1/ac259/truviso.html&quot;&gt;company&lt;/a&gt; to commercialize earlier stream processing research from &lt;a href=&quot;http://db.lcs.mit.edu/madden/html/TCQcidr03.pdf&quot;&gt;Berkeley’s TelegraphCQ project&lt;/a&gt;. Not surprisingly, the semantics of continuous queries became a &lt;a href=&quot;http://www.eecs.berkeley.edu/~franklin/Papers/sigmod10krishnamurthy.pdf&quot;&gt;serious issue in practice&lt;/a&gt;. &lt;a href=&quot;#fnref:immutability&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:conway&quot;&gt;
      &lt;p&gt;&lt;a href=&quot;http://www.neilconway.org/&quot;&gt;Neil Conway&lt;/a&gt; and the &lt;a href=&quot;http://www.bloom-lang.net/&quot;&gt;Bloom&lt;/a&gt; team’s &lt;a href=&quot;http://db.cs.berkeley.edu/papers/socc12-blooml.pdf&quot;&gt;work on Bloom^L&lt;/a&gt; contains one of the first—and perhaps most lucid—discussions of what they call the “scope dilemma.” As part of the paper, they show that the composition of individual commutative replicated data types can yield non-commutative and, generally, incorrect behavior. Their later work (driven by &lt;a href=&quot;http://www.cs.berkeley.edu/~palvaro/&quot;&gt;Peter Alvaro&lt;/a&gt;) on &lt;a href=&quot;http://db.cs.berkeley.edu/papers/icde14-blazes.pdf&quot;&gt;Blazes&lt;/a&gt; examines problems of composition of monotonic and non-monotonic program modules across dataflows. &lt;a href=&quot;#fnref:conway&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:research&quot;&gt;
      &lt;p&gt;These same issues apply to research, too. For example, the distributed computing community has standard abstractions like registers and consensus objects in part because these abstractions allow researchers to talk about, compare, and work on a common set of concepts.&lt;/p&gt;

      &lt;p&gt;Moreover, research papers are not immune to making these sorts of errors. I actually started writing this post in response to a paper (on a particular distributed computing application) that we read in a campus reading group last week that abused a notion of “commutativity and associativity” and sparked a discussion along these lines. (I’m sure I’ve made these mistakes in some of my own papers.) To be clear, I’m not necessarily advocating for greater formalization of these concepts—you can formalize anything (and sometimes it’s even useful to do so)—but am instead advocating for greater care in thinking about when these concepts can be successfully applied. &lt;a href=&quot;#fnref:research&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Linearizability versus Serializability</title>
   <link href="http://bailis.org/blog//linearizability-versus-serializability/"/>
   <updated>2014-09-24T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//linearizability-versus-serializability</id>
   <content type="html">&lt;p&gt;Linearizability and serializability are both important properties about interleavings of operations in databases and distributed systems, and it’s easy to get them confused. This post gives a short, simple, and hopefully practical overview of the differences between the two.&lt;/p&gt;

&lt;h4 id=&quot;linearizability-single-operation-single-object-real-time-order&quot;&gt;Linearizability: single-operation, single-object, real-time order&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;&lt;a href=&quot;http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf&quot;&gt;Linearizability&lt;/a&gt; is a guarantee about single operations on single objects.&lt;/em&gt; It provides a real-time (i.e., wall-clock) guarantee on the behavior of a set of single operations (often reads and writes) on a single object (e.g., distributed register or data item).&lt;/p&gt;

&lt;p&gt;In plain English, under linearizability, writes should appear to be instantaneous. Imprecisely, once a write completes, all later reads (where “later” is defined by wall-clock start time) should return the value of that write or the value of a later write. Once a read returns a particular value, all later reads should return that value or the value of a later write.&lt;/p&gt;

&lt;p&gt;Linearizability for read and write operations is synonymous with the term “atomic consistency” and is the “C,” or “consistency,” in Gilbert and Lynch’s &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;proof of the CAP Theorem&lt;/a&gt;. We say linearizability is &lt;em&gt;composable&lt;/em&gt; (or “local”) because, if operations on each object in a system are linearizable, then all operations in the system are linearizable.&lt;/p&gt;

&lt;h4 id=&quot;serializability-multi-operation-multi-object-arbitrary-total-order&quot;&gt;Serializability: multi-operation, multi-object, arbitrary total order&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Serializability is a guarantee about transactions, or groups of one or more operations over one or more objects.&lt;/em&gt; It guarantees that the execution of a set of transactions (usually containing read and write operations) over multiple items is equivalent to &lt;em&gt;some&lt;/em&gt; serial execution (total ordering) of the transactions.&lt;/p&gt;

&lt;p&gt;Serializability is the traditional “I,” or isolation, in &lt;a href=&quot;http://sites.fas.harvard.edu/~cs265/papers/haerder-1983.pdf&quot;&gt;ACID&lt;/a&gt;. If users’ transactions each preserve application correctness (“C,” or consistency, in ACID), a serializable execution also preserves correctness. Therefore, serializability is a mechanism for guaranteeing database correctness.&lt;sup id=&quot;fnref:mechanism&quot;&gt;&lt;a href=&quot;#fn:mechanism&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Unlike linearizability, serializability does not—by itself—impose any real-time constraints on the ordering of transactions. Serializability is also not composable. Serializability does not imply any kind of deterministic order—it simply requires that &lt;em&gt;some&lt;/em&gt; equivalent serial execution exists.&lt;/p&gt;

&lt;h4 id=&quot;strict-serializability-why-dont-we-have-both&quot;&gt;Strict Serializability: Why don’t we have both?&lt;/h4&gt;

&lt;p&gt;Combining serializability and linearizability yields &lt;em&gt;strict serializability&lt;/em&gt;: transaction behavior is equivalent to some serial execution, and the serial order corresponds to real time. For example, say I begin and commit transaction T1, which writes to item &lt;em&gt;x&lt;/em&gt;, and you later begin and commit transaction T2, which reads from &lt;em&gt;x&lt;/em&gt;. A database providing strict serializability for these transactions will place T1 before T2 in the serial ordering, and T2 will read T1’s write. A database providing serializability (but not strict serializability) could order T2 before T1.&lt;sup id=&quot;fnref:implementation&quot;&gt;&lt;a href=&quot;#fn:implementation&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
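&lt;p&gt;A tiny Python sketch (purely illustrative—it just enumerates serial orders by hand) makes the T1/T2 example concrete: serializability alone admits either total order, so T2 may legally miss T1’s write, while strict serializability pins the serial order to real time.&lt;/p&gt;

```python
from itertools import permutations

def run_serial(order):
    # Execute the transactions one after another in the given order.
    # T1 writes x = 1; T2 reads x. Initially, x = 0.
    x = 0
    observed = None
    for txn in order:
        if txn == "T1":
            x = 1              # T1: write(x, 1)
        else:
            observed = x       # T2: read(x)
    return observed

# Serializability: ANY total order is acceptable, so T2 may
# observe either the old value (0) or T1's write (1).
serializable_reads = {run_serial(o) for o in permutations(["T1", "T2"])}
assert serializable_reads == {0, 1}

# Strict serializability: T1 committed before T2 began in real time,
# so the only admissible order is T1 then T2, and T2 must see the write.
assert run_serial(("T1", "T2")) == 1
```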

&lt;p&gt;As &lt;a href=&quot;http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf&quot;&gt;Herlihy and Wing&lt;/a&gt; note, “linearizability can be viewed as a special case of strict serializability where transactions are restricted to consist of a single operation applied to a single object.”&lt;/p&gt;

&lt;h4 id=&quot;coordination-costs-and-real-world-deployments&quot;&gt;Coordination costs and real-world deployments&lt;/h4&gt;

&lt;p&gt;Neither linearizability nor serializability is achievable without coordination. That is, we can’t provide either guarantee with availability (i.e., CAP “AP”) under an asynchronous network.&lt;sup id=&quot;fnref:hardness&quot;&gt;&lt;a href=&quot;#fn:hardness&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;In practice, your database is &lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;unlikely to provide serializability&lt;/a&gt;, and your multi-core processor is &lt;a href=&quot;http://preshing.com/20120930/weak-vs-strong-memory-models/&quot;&gt;unlikely to provide linearizability&lt;/a&gt;—at least by default. As the above theory hints, achieving these properties requires a lot of expensive coordination. So, instead, real systems often use cheaper-to-implement and often &lt;a href=&quot;http://www.bailis.org/blog/understanding-weak-isolation-is-a-serious-problem/&quot;&gt;harder-to-understand&lt;/a&gt; models. This trade-off between efficiency and programmability represents a fascinating and challenging design space.&lt;/p&gt;

&lt;h4 id=&quot;a-note-on-terminology-and-more-reading&quot;&gt;A note on terminology, and more reading&lt;/h4&gt;

&lt;p&gt;One of the reasons these definitions are so confusing is that linearizability hails from the distributed systems and concurrent programming communities, and serializability comes from the database community. Today, almost everyone uses &lt;em&gt;both&lt;/em&gt; distributed systems and databases, which often leads to overloaded terminology (e.g., “consistency,” “atomicity”).&lt;/p&gt;

&lt;p&gt;There are many more precise treatments of these concepts. I like &lt;a href=&quot;http://link.springer.com/book/10.1007%2F978-3-642-15260-3&quot;&gt;this book&lt;/a&gt;, but there is plenty of free, concise, and (often) accurate material on the internet, such as &lt;a href=&quot;https://www.cs.rochester.edu/~scott/458/notes/04-concurrent_data_structures&quot;&gt;these notes&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;notes&quot;&gt;Notes&lt;/h4&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:mechanism&quot;&gt;
      &lt;p&gt;But it’s not the only mechanism!&lt;/p&gt;

      &lt;p&gt;Granted, serializability is (more or less) the most &lt;em&gt;general&lt;/em&gt; means of maintaining database correctness. In what’s becoming one of my favorite “underground” (i.e., relatively poorly-cited) references, &lt;a href=&quot;http://en.wikipedia.org/wiki/H._T._Kung&quot;&gt;H.T. Kung&lt;/a&gt; and &lt;a href=&quot;http://en.wikipedia.org/wiki/Christos_Papadimitriou&quot;&gt;Christos Papadimitriou&lt;/a&gt; dropped a paper in SIGMOD 1979 on “&lt;a href=&quot;http://www.eecs.harvard.edu/~htk/publication/1979-sigmod-kung-papadimitriou.pdf&quot;&gt;An Optimality Theory of Concurrency Control for Databases&lt;/a&gt;.” In it, they essentially show that, if all you have are transactions’ syntactic modifications to database state (e.g., read and write) and &lt;em&gt;no&lt;/em&gt; information about application logic, serializability is, in some sense, “optimal”: in effect, a schedule that is not serializable might modify the database state in a way that produces inconsistency for some (arbitrary) notion of correctness that is not known to the database.&lt;/p&gt;

      &lt;p&gt;However, if you &lt;em&gt;do&lt;/em&gt; know more about your user’s notions of correctness (say, you &lt;em&gt;are&lt;/em&gt; the user!), you can often do a lot more in terms of concurrency control and can circumvent many of the fundamental overheads imposed by serializability. Recognizing when you don’t need serializability (and subsequently exploiting this fact) is the best way I know to “beat CAP.” &lt;a href=&quot;#fnref:mechanism&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:implementation&quot;&gt;
      &lt;p&gt;Note that some implementations of serializability (such as two-phase locking with long write locks and long read locks) actually provide strict serializability. As &lt;a href=&quot;http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf&quot;&gt;Herlihy and Wing&lt;/a&gt; point out, other implementations (such as some MVCC implementations) may not.&lt;/p&gt;

      &lt;p&gt;So, why didn’t the early papers that defined serializability call attention to this real-time ordering? In some sense, real time doesn’t really matter: all serializable schedules are equivalent in terms of their power to preserve database correctness! However, there are some weird edge cases: for example, returning NULL in response to every read-only transaction is serializable (provided we start with an empty database) but rather unhelpful.&lt;/p&gt;

      &lt;p&gt;One tantalizingly plausible theory for this omission is that, back in the 1970s when serializability theory was being invented, everyone was running on single-site systems anyway, so linearizability essentially “came for free.” However, I believe this theory is unlikely: for example, database pioneer &lt;a href=&quot;http://en.wikipedia.org/wiki/Phil_Bernstein&quot;&gt;Phil Bernstein&lt;/a&gt; was already looking at distributed transaction execution in his SDD-1 system &lt;a href=&quot;http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA131789&quot;&gt;as early as 1977&lt;/a&gt; (and there are &lt;a href=&quot;http://ieeexplore.ieee.org/xpl/login.jsp?tp=&amp;amp;arnumber=4567884&quot;&gt;older references&lt;/a&gt; yet). Even in this early work, Bernstein (and company) are careful to stress that “there may in fact be &lt;em&gt;several&lt;/em&gt; such equivalent serial orderings” [emphasis theirs]. To further put this theory to rest, Papadimitriou makes clear in his seminal &lt;a href=&quot;https://www.cs.purdue.edu/homes/bb/cs542-06Spr-bb/SCDU-Papa-79.pdf&quot;&gt;1979 JACM&lt;/a&gt; article that he’s familiar with problems inherent in a distributed setting. (If you ever want to be blown away by the literature, look at how much of the foundational work on concurrency control was done by the early 1980s.) &lt;a href=&quot;#fnref:implementation&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:hardness&quot;&gt;
      &lt;p&gt;For distributed systems nerds: linearizability for reads and writes is, in a formal sense, “easier” to achieve than serializability. This is probably deserving of another post (encouragement appreciated!), but here’s some intuition: terminating atomic register read/write operations &lt;a href=&quot;http://www.cse.huji.ac.il/course/2004/dist/p124-attiya.pdf&quot;&gt;are achievable&lt;/a&gt; in a fail-stop model. Yet atomic commitment—which is needed to execute multi-site serializable transactions (think: AC is to 2PC as consensus is to Paxos)—is not: the &lt;a href=&quot;http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/past/03F/notes/fischer.pdf&quot;&gt;FLP result&lt;/a&gt; says consensus is unachievable in a fail-stop model (hence &lt;em&gt;with One Faulty Process&lt;/em&gt;), and (non-blocking) atomic commitment is &lt;a href=&quot;http://link.springer.com/chapter/10.1007/BFb0022140&quot;&gt;“harder” than consensus&lt;/a&gt; (&lt;a href=&quot;http://infoscience.epfl.ch/record/83471/files/1596162953p115-delporte.pdf&quot;&gt;see also&lt;/a&gt;). Also, keep in mind that linearizability for read-modify-write &lt;a href=&quot;http://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf&quot;&gt;is harder than&lt;/a&gt; linearizable read/write. (linearizable read/write &amp;lt; consensus &amp;lt; atomic commitment) &lt;a href=&quot;#fnref:hardness&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>MSR Silicon Valley Systems Projects I Have Loved</title>
   <link href="http://bailis.org/blog//msr-silicon-valley-systems-projects-i-have-loved/"/>
   <updated>2014-09-19T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//msr-silicon-valley-systems-projects-i-have-loved</id>
   <content type="html">&lt;p&gt;Microsoft &lt;a href=&quot;http://www.zdnet.com/microsoft-to-close-microsoft-research-lab-in-silicon-valley-7000033838/&quot;&gt;confirmed yesterday&lt;/a&gt; that it’s shuttering its &lt;a href=&quot;http://research.microsoft.com/en-us/labs/siliconvalley/&quot;&gt;Silicon Valley lab&lt;/a&gt;, home to 75 brilliant Computer Science researchers including 2013 Turing Award winner &lt;a href=&quot;http://en.wikipedia.org/wiki/Leslie_Lamport&quot;&gt;Leslie Lamport&lt;/a&gt;. Others have &lt;a href=&quot;https://plus.google.com/115237092509505721130/posts/H8yGEkmeQc6&quot;&gt;more&lt;/a&gt; and  &lt;a href=&quot;http://mybiasedcoin.blogspot.com/2014/09/on-academia-vs-industry-msr-svc-closing.html&quot;&gt;wiser&lt;/a&gt; things to say about this decision. However, I want to highlight some of the fantastic work that’s come out of MSR Silicon Valley in the recent past in my research area of databases and distributed systems:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/en-us/projects/dryad/&quot;&gt;Dryad&lt;/a&gt; and &lt;a href=&quot;http://research.microsoft.com/en-us/projects/dryadlinq/&quot;&gt;DryadLINQ&lt;/a&gt;&lt;/strong&gt; were hugely influential systems that challenged the then-popular notions that high-performance distributed dataflow had to be &lt;em&gt;i.)&lt;/em&gt; limited to simple map and reduce tasks and &lt;em&gt;ii.)&lt;/em&gt; hard to program. Ideas such as Dryad’s general-purpose DAG execution model, flexible and lightweight data transfers, and lineage-based recovery model can be found in almost every later distributed dataflow system, from Microsoft SCOPE to Apache Tez and Spark. DryadLINQ provided language-integrated access to the Dryad engine, setting the stage for abstractions like RDDs and the return of automatic and online query optimizers. Both are now &lt;a href=&quot;https://github.com/MicrosoftResearch/Dryad&quot;&gt;open source&lt;/a&gt; on GitHub.&lt;/p&gt;

    &lt;p&gt;Papers: &lt;em&gt;&lt;a href=&quot;http://research.microsoft.com/pubs/63785/eurosys07.pdf&quot;&gt;Dryad&lt;/a&gt; @ EuroSys 2007; &lt;a href=&quot;http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf&quot;&gt;DryadLINQ&lt;/a&gt; @ OSDI 2008; &lt;a href=&quot;http://research.microsoft.com/pubs/185714/Optimus.pdf&quot;&gt;Optimus&lt;/a&gt; @ EuroSys 2013&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/pubs/201100/naiad_sosp2013.pdf&quot;&gt;Naiad&lt;/a&gt;&lt;/strong&gt; is a more recent system for streaming, cyclic, distributed dataflow. Like Dryad, think MapReduce but for arbitrary task graphs, combining low-latency incremental processing with large-scale batch operation. At Naiad’s core is a new dataflow abstraction called &lt;a href=&quot;http://bigdataatsvc.wordpress.com/2013/09/18/an-introduction-to-timely-dataflow/&quot;&gt;timely dataflow&lt;/a&gt; that allows cyclic computations to safely proceed in parallel. The team’s also been working on some exciting extensions such as &lt;a href=&quot;http://bigdataatsvc.wordpress.com/2014/05/08/graphlinq-a-graph-library-for-naiad/&quot;&gt;graph processing&lt;/a&gt;. This research won Best Paper at SOSP 2013, one of the highest honors in the systems community, and is &lt;a href=&quot;http://microsoftresearch.github.io/Naiad/&quot;&gt;open source&lt;/a&gt; on GitHub. The team’s earlier work on &lt;a href=&quot;http://bigdataatsvc.wordpress.com/2013/04/18/an-introduction-to-differential-dataflow/&quot;&gt;differential dataflow&lt;/a&gt; illustrated the potential of this efficient fixpoint processing.&lt;/p&gt;

    &lt;p&gt;Papers: &lt;em&gt;&lt;a href=&quot;http://research.microsoft.com/pubs/201100/naiad_sosp2013.pdf&quot;&gt;Naiad&lt;/a&gt; @ SOSP 2013; &lt;a href=&quot;http://research.microsoft.com/pubs/176693/differentialdataflow.pdf&quot;&gt;Differential Dataflow&lt;/a&gt; @ CIDR 2013&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/pubs/157204/corfumain-final.pdf&quot;&gt;CORFU&lt;/a&gt;&lt;/strong&gt; (Clusters of Raw Flash Units) is a system that exposes a cluster of servers loaded with flash drives as a high-throughput shared log abstraction. This is, in itself, a challenging distributed systems problem that the team solved elegantly via a mix of fast sequencing and clever protocol design. However, CORFU’s power is perhaps better demonstrated by Tango, a system the researchers built on top. Tango demonstrates how to build fault-tolerant, high-performance distributed data structures such as trees, maps, and serializable transactions by making efficient use of the shared log abstraction. This architecture is not only creative, but it’s a great use of modern hardware with excellent empirical results to boot.&lt;/p&gt;

    &lt;p&gt;Papers: &lt;em&gt;&lt;a href=&quot;http://research.microsoft.com/pubs/157204/corfumain-final.pdf&quot;&gt;CORFU&lt;/a&gt; @ NSDI 2012; &lt;a href=&quot;http://research.microsoft.com/pubs/199947/Tango.pdf&quot;&gt;Tango&lt;/a&gt; @ SOSP 2013&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;MSR Silicon Valley has been on the bleeding edge of distributed data serving. Doug Terry (of &lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/bayou-sosp1995.pdf&quot;&gt;Bayou&lt;/a&gt; fame) and others have built a system called &lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/en-us/projects/capcloud/&quot;&gt;Pileus&lt;/a&gt;&lt;/strong&gt; that allows fine-grained control and SLAs in geo-replicated storage systems. Marcos Aguilera and others have similarly been working on fast, wide-area transaction execution, from snapshot isolation in &lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/en-us/people/aguilera/walter-sosp2011.pdf&quot;&gt;Walter&lt;/a&gt;&lt;/strong&gt; to serializability (leveraging one of my favorite ideas in concurrency control: transaction chopping) via &lt;strong&gt;&lt;a href=&quot;http://research.microsoft.com/en-us/people/aguilera/transaction-chains-sosp2013.pdf&quot;&gt;transaction chains&lt;/a&gt;&lt;/strong&gt;. Both of these lines of work are highly relevant to the ongoing redesign of large-scale cloud databases; it’s great to see services like Microsoft Azure DocumentDB &lt;a href=&quot;http://azure.microsoft.com/en-us/documentation/articles/documentdb-consistency-levels/&quot;&gt;adopt ideas like Pileus’s tunable consistency&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Papers: &lt;a href=&quot;http://research.microsoft.com/pubs/201390/PileusSOSP2013.pdf&quot;&gt;Pileus&lt;/a&gt; @ SOSP 2013; &lt;a href=&quot;http://www.cs.cornell.edu/courses/cs6452/2012sp/papers/psi-sosp11.pdf&quot;&gt;Walter&lt;/a&gt; @ SOSP 2011; &lt;a href=&quot;http://research.microsoft.com/en-us/people/aguilera/transaction-chains-sosp2013.pdf&quot;&gt;Lynx&lt;/a&gt; @ SOSP 2013&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s too much to list. But here’s a few more anyway:  &lt;a href=&quot;https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gunda.pdf&quot;&gt;Nectar&lt;/a&gt; (OSDI 2010) automates the caching and reuse of intermediate results in data-parallel compute systems. &lt;a href=&quot;http://research.microsoft.com/pubs/201110/sosp13-dandelion-final.pdf&quot;&gt;Dandelion&lt;/a&gt; (SOSP 2013) provides a language-integrated automated runtime for running applications on both data-parallel compute clusters and GPUs. &lt;a href=&quot;http://www.sigops.org/sosp/sosp09/papers/isard-sosp09.pdf&quot;&gt;Quincy&lt;/a&gt; (SOSP 2009) pioneered the study of fair cluster scheduling algorithms. Dahlia Malkhi has done and continues to do amazing work on distributed algorithms (e.g., &lt;a href=&quot;http://arxiv.org/pdf/1402.2701v1.pdf&quot;&gt;PODC 2014&lt;/a&gt;) in addition to working on systems projects such as CORFU. And, of course, Leslie Lamport’s recent work on TLA+ continues to make waves—for example, &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/tla/formal-methods-amazon.pdf&quot;&gt;at Amazon&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
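&lt;p&gt;As a flavor of the Tango approach described above, here’s a toy sketch of a map materialized by replaying a shared log. The class and method names are hypothetical; this illustrates the architecture, not the real CORFU/Tango client API.&lt;/p&gt;

```python
# A toy model of Tango's core idea: a map whose writes are appends to a single
# shared, totally ordered log, and whose reads replay the log tail.
# All names here are hypothetical, not the real CORFU/Tango client API.

class SharedLog:
    """Stands in for a CORFU cluster: one append-only, totally ordered log."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)          # position assigned in total order
        return len(self.entries) - 1

    def read_from(self, pos):
        return self.entries[pos:]

class LogBackedMap:
    """A Tango-style object: state is materialized by replaying the log."""
    def __init__(self, log):
        self.log = log
        self.state = {}
        self.applied = 0                    # log position we have replayed to

    def put(self, key, value):
        self.log.append((key, value))       # the write IS the log append

    def get(self, key):
        # "Sync" before reading: apply every entry appended since last replay.
        for k, v in self.log.read_from(self.applied):
            self.state[k] = v
        self.applied = len(self.log.entries)
        return self.state.get(key)

log = SharedLog()
m1, m2 = LogBackedMap(log), LogBackedMap(log)   # two clients sharing one log
m1.put("x", 1)
m2.put("x", 2)
print(m1.get("x"), m2.get("x"))                 # both see the final value: 2 2
```

&lt;p&gt;Because every write is a log append, any data structure built this way inherits the log’s total order, which is what makes Tango-style serializable transactions over the log possible.&lt;/p&gt;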

&lt;p&gt;You’ve probably heard of many of the brilliant folks behind this work before: Doug Terry, Dahlia Malkhi, Martin Abadi, Michael Isard, Derek Murray, Frank McSherry, Marcos Aguilera, Yuan Yu, and the list goes on. And, as I’ve said, there have been many others and many other exciting projects (and entire groups outside distributed systems) at MSR Silicon Valley.&lt;/p&gt;

&lt;p&gt;Fortunately, MSR still has other branches—for example, many of the researchers studying core database issues are in Redmond. However, the above projects help illustrate why MSR Silicon Valley was such a research powerhouse and a welcome industrial neighbor to the west.&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Understanding Weak Isolation Is a Serious Problem</title>
   <link href="http://bailis.org/blog//understanding-weak-isolation-is-a-serious-problem/"/>
   <updated>2014-09-16T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//understanding-weak-isolation-is-a-serious-problem</id>
   <content type="html">&lt;p&gt;Modern transactional databases &lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;overwhelmingly don’t operate&lt;/a&gt; under textbook “ACID” isolation, or serializability. Instead, these databases—like Oracle 11g and SAP HANA—offer weaker guarantees, like Read Committed isolation or, if you’re lucky, &lt;a href=&quot;http://en.wikipedia.org/wiki/Snapshot_isolation&quot;&gt;Snapshot Isolation&lt;/a&gt;. There’s a good reason for this phenomenon: weak isolation is faster—often much faster—and incurs fewer aborts than serializability. Unfortunately, the exact behavior of these different isolation levels is difficult to understand and is highly technical. One of 2008 Turing Award winner Barbara Liskov’s Ph.D. students &lt;a href=&quot;http://pmg.csail.mit.edu/papers/adya-phd.pdf&quot;&gt;wrote an entire dissertation on the topic&lt;/a&gt;, and, even then, the definitions we have still aren’t perfect and can vary between databases.&lt;/p&gt;

&lt;h4 id=&quot;the-core-problem-a-monstrous-abstraction-in-most-every-database&quot;&gt;The core problem: a monstrous abstraction in (most) every database&lt;/h4&gt;

&lt;p&gt;Despite the ubiquity of weak isolation, I haven’t found a database architect, researcher, or user who’s been able to offer an explanation of when, and, probably more importantly, &lt;em&gt;why&lt;/em&gt; isolation models such as Read Committed are sufficient for correct execution. It’s reasonably well known that these weak isolation models represent “ACID in practice,” but I don’t think we have any real understanding of how so many applications are seemingly (!?) okay running under them. (If you haven’t seen these models before, they’re a little weird. For example, Read Committed isolation generally prevents users from reading uncommitted or non-final writes but allows a number of bad things to happen, like lost updates during concurrent read-modify-write operations. Why is this apparently okay for many applications?) In the research community and in our classrooms, we’ve historically contented ourselves with studying serializability and, to a certain extent, Snapshot Isolation. Understanding weak isolation as deployed in real-world databases is a wide open problem with serious consequences.&lt;sup id=&quot;fnref:vldb&quot;&gt;&lt;a href=&quot;#fn:vldb&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
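&lt;p&gt;To make that read-modify-write race concrete, here’s a minimal, deterministic simulation. Plain Python variables stand in for database state; no real database or isolation level is involved, only the interleaving that Read Committed fails to prevent.&lt;/p&gt;

```python
# A lost update, simulated deterministically. Two "transactions" each perform
# a read-modify-write deposit on the same account. Read Committed only forbids
# reading uncommitted data; nothing stops both from reading the same committed
# value before either writes.

balance = 100

# Both transactions read the committed balance (100) before either commits.
t1_read = balance
t2_read = balance

balance = t1_read + 20   # T1 commits its deposit: 120
balance = t2_read + 50   # T2 commits, overwriting T1: 150

print(balance)  # 150: T1's deposit is lost (a serializable execution gives 170)
```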

&lt;p&gt;To put this problem in perspective, there’s a flood of &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2462076&quot;&gt;interesting new research&lt;/a&gt; that attempts to better understand programming models like eventual consistency. And, as you’re probably aware, there’s an ongoing and often lively debate between &lt;a href=&quot;http://cacm.acm.org/blogs/blog-cacm/99512-why-enterprises-are-uninterested-in-nosql/fulltext&quot;&gt;transactional adherents&lt;/a&gt; and more recent “NoSQL” upstarts about related issues of usability, data corruption, and performance. But, in contrast, many of these transactional adherents and the research community as a whole have effectively ignored weak isolation—even in a single server setting and despite the fact that literally millions of businesses today depend on weak isolation and that many of these isolation levels have been around for almost three decades.&lt;sup id=&quot;fnref:distributed&quot;&gt;&lt;a href=&quot;#fn:distributed&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h4 id=&quot;why-might-weak-isolation-actually-work-and-how-did-we-get-into-this-mess&quot;&gt;Why might weak isolation actually work, and how did we get into this mess?&lt;/h4&gt;

&lt;p&gt;I can offer a few guesses as to why weak isolation works:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Non-serializable isolation “anomalies” are really just different kinds of race conditions. To have a race, you need concurrency. For low-traffic and low-contention applications, it’s possible that anomalies don’t arise (e.g., the read-modify-write race from above might not affect applications because there simply aren’t concurrent transactions).&lt;/li&gt;
  &lt;li&gt;When anomalies do arise, it’s possible that the read-write anomalies don’t translate into application data corruption. Just because a read/write race occurs doesn’t mean all outcomes are necessarily wrong (e.g., two writers might perform the exact same read-modify-write operations with the same outcome regardless of order).&lt;/li&gt;
  &lt;li&gt;It’s possible data is actually (occasionally) corrupted, and apps just don’t care. Or, when they do, they manually issue the customer an IOU and/or send them an overdraft notice.&lt;/li&gt;
&lt;/ol&gt;
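&lt;p&gt;The second guess is the easiest to illustrate: some write-write races are benign because the racing updates are idempotent or commutative, so every interleaving yields the same state. A toy sketch:&lt;/p&gt;

```python
# Guess 2 above, made concrete: a write-write race whose outcome is the same
# under every interleaving. Two sessions both mark an order "shipped"; the
# update is idempotent, so the race never corrupts application state.
# (An in-memory sketch, not tied to any particular database.)

order = {"status": "pending"}

def mark_shipped(row):
    row["status"] = "shipped"   # same write regardless of what was read

mark_shipped(order)     # session A commits
mark_shipped(order)     # session B commits, racing with A
print(order["status"])  # "shipped" under either commit order
```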

&lt;p&gt;However, these are still conjectures. With a few exceptions,&lt;sup id=&quot;fnref:addrefs&quot;&gt;&lt;a href=&quot;#fn:addrefs&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; we’re in uncharted territory.&lt;sup id=&quot;fnref:open&quot;&gt;&lt;a href=&quot;#fn:open&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Why haven’t we already solved this problem? One possibility: as a community tradition, database architects have prided themselves on top-down design of systems that fulfill beautiful interfaces. For example, the relational database revolution in the 1970s was fomented by the introduction of an amazing abstraction: relational algebra. Serializability is a similarly powerful abstraction that vastly simplifies end-user programming. In contrast, weak isolation is grotesque. Its development has been overwhelmingly &lt;em&gt;mechanism-driven&lt;/em&gt; rather than &lt;em&gt;policy-&lt;/em&gt; or &lt;em&gt;application-driven&lt;/em&gt;. Do you know how Jim Gray et al. invented Read Committed back in 1975? Realizing that serializability via two-phase locking was expensive, the System R gang &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/gray/papers/Granularity%20of%20Locks%20and%20Degrees%20of%20Consistency%20RJ%201654.pdf&quot;&gt;changed their long read locks to short read locks&lt;/a&gt;. No joke. (Aside: look at that typesetting!) What about Snapshot Isolation? In the early 1990s, companies like Interbase and Microsoft started shipping databases with a fancy new multi-version concurrency control mechanism. It took &lt;a href=&quot;http://arxiv.org/pdf/cs/0701157.pdf&quot;&gt;another paper&lt;/a&gt; by Gray and friends to define what these systems had actually implemented.&lt;/p&gt;

&lt;h4 id=&quot;towards-a-better-understanding-and-better-system-design&quot;&gt;Towards a better understanding and better system design&lt;/h4&gt;

&lt;p&gt;Understanding weak isolation is not “just” an academic problem—it’s a problem database users face every day. If a user wants to make responsible use of weak isolation, she first needs to learn the particulars of each isolation model (which receive only partial treatment even in the best textbooks). Second, she has to manually translate the set of low-level read/write anomalies that define each model to her application logic. &lt;a href=&quot;http://db.cs.berkeley.edu/papers/socc13-consistency.pdf&quot;&gt;This is hard.&lt;/a&gt; With effort and a lot of education, it &lt;em&gt;can&lt;/em&gt; be done. But remember, if our user’s database doesn’t support serializability and she cares about correctness, it &lt;em&gt;must&lt;/em&gt; be done.  Is this good design? Is this the best we can do? Exposing models that benefit the systems builder rather than the end user is, in my opinion, antithetical to the database tradition and to empathetic, user-centric design.&lt;/p&gt;

&lt;p&gt;I think there are at least three fronts for making progress on this problem:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;As a community, I think it’s time to pay attention to and give existing weak isolation models the deep treatment they deserve. We’ve a few hints already. For example, my &lt;a href=&quot;http://www.bailis.org/pubs.html#coord-avoid&quot;&gt;dissertation work&lt;/a&gt; (and plenty of excellent &lt;a href=&quot;http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA200979&quot;&gt;related work&lt;/a&gt;) helps identify and apply conditions under which we can race and preserve correctness (and concurrent execution). But there’s much work to be done in specifically examining &lt;em&gt;existing&lt;/em&gt; and &lt;em&gt;widely-deployed&lt;/em&gt; models. There are numerous Ph.D. dissertations to be written in this space and, more importantly, serious potential for impact on practice.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As we develop new systems, we can avoid making the same mistakes as our architectural ancestors by focusing on applications, not mechanisms. I’d personally welcome a moratorium on writing papers on new isolation, data consistency, and read/write transaction models unless there’s a clear and specific set of motivating, &lt;em&gt;application-driven&lt;/em&gt; use cases.&lt;sup id=&quot;fnref:ra&quot;&gt;&lt;a href=&quot;#fn:ra&quot; class=&quot;footnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s real promise in moving beyond the read-write interface of traditional isolation models entirely and building concurrency control systems that operate at a higher level of abstraction (holy grail: the application level). This helps both systems and users. Systems can exploit greater concurrency and therefore provide higher performance and availability. Users don’t have to think about data races. This is a topic for another post, but work on &lt;a href=&quot;http://db.cs.berkeley.edu/papers/cidr11-bloom.pdf&quot;&gt;Bloom/CALM&lt;/a&gt;, &lt;a href=&quot;http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf&quot;&gt;CRDTs&lt;/a&gt;, and &lt;a href=&quot;http://www.bailis.org/pubs.html#coord-avoid&quot;&gt;I-confluent coordination avoidance&lt;/a&gt; hints at what’s achievable by reasoning about applications, not read/write histories.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
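&lt;p&gt;As a taste of that third direction, here’s a state-based G-counter, one of the simplest CRDTs. It’s a textbook sketch (not code from any of the linked papers): replicas increment independently and converge by merging, with no coordination at all.&lt;/p&gt;

```python
# A state-based G-counter CRDT: each replica increments only its own slot,
# and merge takes the element-wise max. Replicas converge to the same value
# in any merge order, with no coordination.

class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.id] += 1           # only touch our own slot

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # two increments at replica A
b.increment()                  # one concurrent increment at replica B
a.merge(b); b.merge(a)         # anti-entropy in either direction
print(a.value(), b.value())    # both replicas read 3
```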

&lt;p&gt;We can do better, and there’s a ton of interesting research and design to be done. Go!&lt;/p&gt;

&lt;h4 id=&quot;notes&quot;&gt;Notes&lt;/h4&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:vldb&quot;&gt;
      &lt;p&gt;There’s actually &lt;em&gt;some&lt;/em&gt; research (see below) on this topic, but I don’t think we have a satisfying answer yet—and, from the work I’ve seen, we definitely don’t have an answer to explain the prevalence of these models &lt;em&gt;in practice&lt;/em&gt;. I don’t think that there’s a clear, unambiguous one, either: given no knowledge about your program semantics, any model other than serializable isolation can corrupt data. However, as an example of a paper I’d like to read, a &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;PBS-style&lt;/a&gt; white-box probabilistic analysis of Postgres, MySQL, or another RDBMS could be enlightening.&lt;/p&gt;

      &lt;p&gt;We had a fun discussion about this topic in the &lt;a href=&quot;http://www.vldb.org/2014/program/lib/FullProgram.html#D3F1530T1700R4&quot;&gt;session on transaction processing&lt;/a&gt; at VLDB this year. Again, no one was able to come up with a great answer to explain the prevalence of these models. However, there was a considerable amount of excitement in the room (perhaps also surprising given the number of papers on serializability and Snapshot Isolation)—much more than I’ve previously seen in the database community. &lt;a href=&quot;#fnref:vldb&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:distributed&quot;&gt;
      &lt;p&gt;To be clear, I think we should work on &lt;em&gt;both&lt;/em&gt; distributed data consistency and weak isolation. In fact, the solutions may be similar. For example, Read Committed is, according to the usual definitions, not much stronger than what most eventually consistent stores provide (at the minimum, there are few—if any—guarantees about the orderings of concurrent transaction execution). So, many approaches to coordination-free or coordination-avoiding distributed implementations may indeed apply to weak isolation, &lt;a href=&quot;http://www.bailis.org/pubs.html#coord-avoid&quot;&gt;even on a single server&lt;/a&gt;. The main difference I see (and what I’m agitating for in this post) is that, while there’s a considerable amount of interest in examining often &lt;em&gt;even more esoteric&lt;/em&gt; read/write data consistency models in the distributed setting, there’s a stunning lack of work on what many more people today are already running. It’s possible we can kill two birds with one stone by, say, moving beyond the read-write abstraction, which causes all sorts of problems once we drop serializability.&lt;/p&gt;

      &lt;p&gt;Also, it’s pretty funny to discuss non-serializable “weak” distributed data consistency models such as eventual consistency as if their inherent usability challenges are foreign to traditional data management systems (that, as I’ve discussed, often aren’t serializable either). &lt;a href=&quot;#fnref:distributed&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:addrefs&quot;&gt;
      &lt;p&gt;A few examples:&lt;/p&gt;

      &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;http://www.vldb.org/pvldb/2/vldb09-185.pdf&quot;&gt;“Quantifying Isolation Anomalies”&lt;/a&gt; by Fekete, Goldrei, and Asenjo, &lt;em&gt;VLDB 2009&lt;/em&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;http://arxiv.org/pdf/0909.1788.pdf&quot;&gt;“Building on Quicksand”&lt;/a&gt; by Helland and Campbell, &lt;em&gt;CIDR 2009&lt;/em&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;http://db.cs.berkeley.edu/cs286/papers/isolation-icde2000.pdf&quot;&gt;“Semantic conditions for correctness at different isolation levels”&lt;/a&gt; by Bernstein and Lewis, &lt;em&gt;ICDE 2000&lt;/em&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;http://www.vldb.org/pvldb/vol7/p181-bailis.pdf&quot;&gt;“Highly Available Transactions: Virtues and Limitations”&lt;/a&gt; by Bailis, Davidson, Fekete, Franklin, Ghodsi, Hellerstein, and Stoica, &lt;em&gt;VLDB 2014&lt;/em&gt;&lt;/li&gt;
      &lt;/ul&gt;
      &lt;p&gt;&lt;a href=&quot;#fnref:addrefs&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:open&quot;&gt;
      &lt;p&gt;Read: a potential high impact research goldmine.&lt;/p&gt;

      &lt;p&gt;Related note: I’m a big fan of papers on new techniques (e.g., program analysis and synthesis, or theory) for increasing the scalability or concurrency of applications. However, I wish we saw more discussion of the &lt;em&gt;applications&lt;/em&gt; themselves. Techniques are valuable in their own right, but the idea of analysis-as-design-tool can be powerful—this &lt;a href=&quot;http://www.pdos.lcs.mit.edu/papers/commutativity:sosp13.pdf&quot;&gt;paper from MIT&lt;/a&gt; does a great job in this regard. For example, was a previously expensive syscall or transaction non-scalable because it was simply implemented in a silly way? Did the new tool you’re describing &lt;em&gt;also&lt;/em&gt; teach you something about the actual intent of the application that might have been obfuscated by its implementation? Did it surprise you? I’m fascinated by techniques for determining &lt;em&gt;if&lt;/em&gt; weak isolation is safe but, even more so, &lt;em&gt;when&lt;/em&gt; (in terms of applications) and &lt;em&gt;why&lt;/em&gt; (in terms of programmer intent) it’s safe (or not). As a shameless plug, we’ve some work in the pipeline looking at a slew of open-source applications in this vein. Open source is great for this stuff—thank goodness for GitHub! &lt;a href=&quot;#fnref:open&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:ra&quot;&gt;
      &lt;p&gt;As a self-serving example, after our &lt;a href=&quot;http://www.vldb.org/pvldb/vol7/p181-bailis.pdf&quot;&gt;HAT research&lt;/a&gt; (on understanding which isolation levels are achievable without coordination/”AP”), we realized there was no existing isolation model that’d efficiently serve the indexing, materialized view, and multi-put requirements in Google’s Megastore, Facebook Tao, LinkedIn Espresso, Yahoo! PNUTS, and a number of open-source databases (you’d have to use Repeatable Read or Snapshot Isolation, both of which require coordination/”CP”). So we devised a new model, called &lt;a href=&quot;https://amplab.cs.berkeley.edu/wp-content/uploads/2014/04/ramp-sigmod2014.pdf&quot;&gt;Read Atomic (RA) isolation&lt;/a&gt; that directly addresses these use cases and is &lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;achievable via high-performance, coordination-free “AP” algorithms&lt;/a&gt;. Does RA further complicate this polluted space of isolation levels? You bet. But, when I talk to an end user, I can tell them “RAMP transactions ensure you don’t have dangling pointers in your global secondary index entries and also preserve foreign key relationships” rather than tell them “RAMP transactions prevent G0 (Write Cycles), G1a (Aborted Reads), G1b (Intermediate Reads), and G1c (Circular Information Flow), PMP (Predicate-Many-Preceders), and Fractured Reads but not G2 (Anti-dependency cycles), G-single, G-SIa (Interference) or G-SIb (Missed Effects).” The second explanation is meaningful (and rightly belongs in the paper!), but the former is immediately use-case driven.&lt;/p&gt;

      &lt;p&gt;(A side benefit of that HAT work along these lines: an existing application running on a coordinated implementation of a HAT isolation model could hypothetically run faster in a distributed/coordination-free manner.) &lt;a href=&quot;#fnref:ra&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Bridging the Gap: Opportunities in Coordination-Avoiding Databases</title>
   <link href="http://bailis.org/blog//bridging-the-gap-opportunities-in-coordination-avoiding-databases/"/>
   <updated>2014-04-22T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//bridging-the-gap-opportunities-in-coordination-avoiding-databases</id>
   <content type="html">&lt;style&gt;

#refs { font-size: 90%; line-height: 1.4; }
#refs p { margin-top: 6pt; margin-bottom: 6pt; }

&lt;/style&gt;

&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt; I recently co-organized the
  &lt;a href=&quot;http://eventos.fct.unl.pt/papec/pages/program&quot;&gt;Principles and Practice of Eventual Consistency&lt;/a&gt;
  workshop at EuroSys, where I also gave a talk on lessons from some
  of our &lt;a href=&quot;http://bailis.org/pubs.html&quot;&gt;recent research&lt;/a&gt;. This post
  contains a summary (in the form of my talk proposal). This is joint work
  with &lt;a href=&quot;http://sydney.edu.au/engineering/it/~fekete/&quot;&gt;Alan Fekete&lt;/a&gt;,
  &lt;a href=&quot;http://www.cs.berkeley.edu/~alig/&quot;&gt;Ali Ghodsi&lt;/a&gt;,
  &lt;a href=&quot;http://www.cs.berkeley.edu/~franklin/&quot;&gt;Mike Franklin&lt;/a&gt;,
  &lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot;&gt;Joe Hellerstein&lt;/a&gt;, and
  &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/&quot;&gt;Ion Stoica&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My slides&lt;/strong&gt; are &lt;a href=&quot;https://speakerdeck.com/pbailis/bridging-the-gap-opportunities-in-coordination-avoiding-database-systems&quot;&gt;available on Speaker Deck&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; &lt;em&gt;Weakly consistent systems surface a controversial tension
between, on the one hand, availability, latency, and performance, and,
on the other, programmability. We propose the concept of coordination
avoidance as a unifying, underlying principle behind the former and
discuss lessons from our recent experiences mitigating the latter.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;trouble-in-paradise&quot;&gt;Trouble in Paradise&lt;/h4&gt;

&lt;p&gt;Faced with the task of operating “always on” services and lacking
sufficient guidance from the literature regarding alternative
distributed designs and algorithms, many Internet service architects and
engineers throughout the 2000s discarded traditional database semantics
and transactional models in favor of weaker and less principled models:
eventual consistency, few if any multi-object, multi-operation (i.e.,
transactional) guarantees, and ad-hoc application-specific
compensation—collectively, much of “NoSQL” &lt;a href=&quot;#1&quot;&gt;[1]&lt;/a&gt;. From a research
perspective, this space of weaker models has proven to be a fertile
area: new (or re-discovered), often esoteric, and almost always nuanced
semantics present many opportunities for new systems design, optimizations,
and formal study.&lt;/p&gt;

&lt;p&gt;Unfortunately, these weaker models come with serious usability
disadvantages: programmability suffers. Understanding the implications
of non-serializable isolation models for end-user applications is
difficult.  Programmers have little practical guidance as to how to choose
an appropriate model for their applications, and understanding the
differences between models effectively requires graduate-level
training in distributed systems and/or database theory &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt;.
Some members within the internet services industry that birthed the
resurgence of interest in these semantics have begun a backlash
against them: one recent and prominent industrial account
unequivocally claims that “designing applications to cope with
concurrency anomalies in their data is…ultimately not worth the
performance gains” &lt;a href=&quot;#15&quot;&gt;[15]&lt;/a&gt;. Statements like these (which, in our
experience, enjoy some popularity among practitioners and considerable
acceptance in the database community &lt;a href=&quot;#17&quot;&gt;[17]&lt;/a&gt;) suggest that, as a
research community centered on weak consistency, we are possibly failing
to communicate and demonstrate the benefits achievable with these
semantics, have underestimated the burden placed on programmers, or a
combination of both.&lt;/p&gt;

&lt;h4 id=&quot;coordination-free-execution-a-unifying-principle-behind-ap-benefits&quot;&gt;Coordination-Free Execution: A Unifying Principle Behind “AP” Benefits&lt;/h4&gt;

&lt;p&gt;In tribute to the CAP Theorem &lt;a href=&quot;#11&quot;&gt;[11]&lt;/a&gt; that widely popularized
these trade-offs, much of the dialogue around weakly consistent models
concerns the availability of operations under failures. Availability
is an important property, but, in our opinion, a sole focus on
availability undervalues the benefits of weak semantics. Daniel Abadi
has successfully argued that, while “availability” is primarily relevant in
the presence of failures, weakly consistent (“AP”) systems can also
offer low latency &lt;a href=&quot;#1&quot;&gt;[1]&lt;/a&gt;. Any replica can respond to any request,
alleviating the need for many communication
delays—&lt;a href=&quot;http://www.bailis.org/blog/communication-costs-in-real-world-networks/&quot;&gt;in our experience&lt;/a&gt;,
average LAN latencies can be up to 720x lower than WAN latencies.&lt;/p&gt;

&lt;p&gt;We would take Abadi’s position even further: weak consistency also
allows aggressive scale-out, even at the level of a single data
item—more servers can be added without communication between
them. Modern, strongly consistent “NewSQL” systems can indeed provide
horizontal scale-out using shared-nothing database replication
techniques popularized in the 1980s &lt;a href=&quot;#16&quot;&gt;[16]&lt;/a&gt;. However, especially
for worst-case accesses, these systems are far from “as scalable as
NoSQL” &lt;a href=&quot;#15&quot;&gt;[15]&lt;/a&gt; systems offering weak isolation. In recent
research, we have examined the throughput penalties associated with
these “strong” models: in modern LAN and WAN networks, distributed
serializable transactions face worst-case throughput limits of 1200
and 12 read-write transactions per item per second, independent of
implementation strategy &lt;a href=&quot;#6&quot;&gt;[6]&lt;/a&gt;. Recent systems [&lt;a href=&quot;#10&quot;&gt;10&lt;/a&gt;,
&lt;a href=&quot;#15&quot;&gt;15&lt;/a&gt;] are no exception: operations over disjoint data items can
proceed concurrently, increasing throughput, but conflicting
operations over non-disjoint data items are limited by network
latency. In contrast, appropriate implementations of weak consistency
face no such overheads.&lt;/p&gt;
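&lt;p&gt;Those worst-case figures follow from simple arithmetic: each conflicting serializable transaction on a given item must wait at least one network round-trip, capping per-item throughput at 1/RTT. The round-trip times below are representative assumptions, not measurements from the paper:&lt;/p&gt;

```python
# Per-item throughput bound for conflicting serializable transactions:
# at most one conflicting commit per network round trip, i.e., 1/RTT.
# The RTT values are representative assumptions, not measured numbers.

def max_tx_per_item_per_sec(rtt_seconds):
    return 1.0 / rtt_seconds

lan_rtt = 0.83e-3   # ~0.83 ms round trip within a datacenter
wan_rtt = 83e-3     # ~83 ms round trip between distant datacenters

print(round(max_tx_per_item_per_sec(lan_rtt)))  # about 1200
print(round(max_tx_per_item_per_sec(wan_rtt)))  # 12
```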

&lt;p&gt;The three properties above—availability, low latency, and scale-out—are
consequences of a more fundamental principle underlying weakly
consistent systems: a lack of synchronous communication, or
&lt;em&gt;coordination&lt;/em&gt;, between concurrent operations. If operations can execute
coordination-free, they can run concurrently, on any available
resources, without communicating with or otherwise stalling concurrent
operations. The cost of coordination is easily and simultaneously cast
in the form of (minority) unavailability, latency (minimum 1 RTT), and
throughput (maximum 1/RTT). Moreover, and more
importantly, the concept of coordination-free execution is portable to
a range of system architectures: whereas traditional formulations of
availability are inherently tied to physical replication,
coordination-freedom is a property of the execution strategy and is
independent of physical deployment or topology. For example, a system
providing clients with snapshot reads can effectively act as a
coordination-free “replicated” system even if implemented by a set of
linearizable multi-versioned masters &lt;a href=&quot;#7&quot;&gt;[7]&lt;/a&gt;. Judicious use of strong
semantics in correct application execution equates to
&lt;em&gt;coordination-avoidance&lt;/em&gt; &lt;a href=&quot;#6&quot;&gt;[6]&lt;/a&gt;: the use of as little coordination
as possible.&lt;/p&gt;

&lt;h4 id=&quot;bridging-the-gap-experiences-applying-coordination-avoidance&quot;&gt;Bridging the Gap: Experiences Applying Coordination Avoidance&lt;/h4&gt;

&lt;p&gt;As F1’s authors highlight above, the decision to consider
coordination-avoiding algorithms or not requires a cost-benefit
judgment &lt;a href=&quot;#8&quot;&gt;[8]&lt;/a&gt;: will performance, availability, or latency
benefits outweigh the cost of ascertaining whether weak models are
sufficient?  In a sober assessment, many applications will likely be
able to (over-)pay for “strong consistency”: single-site operations
are inexpensive &lt;a href=&quot;#16&quot;&gt;[16]&lt;/a&gt;, while improvements in datacenter networks
&lt;a href=&quot;#19&quot;&gt;[19]&lt;/a&gt; lower the cost for non-geo-replicated systems. Yet, a
large class of applications—for example, non-partitionable
applications &lt;a href=&quot;#9&quot;&gt;[9]&lt;/a&gt;, applications with high mutation rates (i.e.,
write contention) &lt;a href=&quot;#18&quot;&gt;[18]&lt;/a&gt;, and geo-replicated applications
&lt;a href=&quot;#20&quot;&gt;[20]&lt;/a&gt;—will continue to be sensitive to extraneous coordination
costs and will likely necessitate further study.&lt;/p&gt;

&lt;p&gt;Identifying and serving this latter class of applications is paramount
to ensuring the future adoption of coordination-avoiding algorithms.
While this represents a difficult task, we offer three examples from our
recent research:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;High-value, well-specified applications are ripe for
optimization.&lt;/strong&gt; We have found success in coordination-avoiding
optimization of high-value workloads. As an example, we recently
combined our recent work on
&lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;RAMP transactions&lt;/a&gt;
with our recent results on
&lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;necessary and sufficient conditions for coordination-free execution&lt;/a&gt;
to attain a 25x improvement in throughput on the traditionally
challenging TPC-C OLTP benchmark [&lt;a href=&quot;#6&quot;&gt;6&lt;/a&gt;, &lt;a href=&quot;#18&quot;&gt;18&lt;/a&gt;]. As the
industry- and academic-standard benchmark for transactional
performance, TPC-C is accompanied by a rigorous specification for
compliance, which acted as a correctness specification in our
coordination analysis and was instrumental in guiding our implementation
strategy. Beyond TPC-C, we have encountered few database workloads and
integrity constraints that require coordination for &lt;em&gt;all&lt;/em&gt; queries:
rather, many queries are amenable to coordination-free execution,
while a handful (like TPC-C New-Order ID assignment) require
coordination for correctness. The resulting
challenge is two-fold: first, identify the operations that &lt;em&gt;do&lt;/em&gt;
require coordination (ideally few or none), and second, determine an
appropriate coordination-free execution plan for those that do not.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Applications on ACID databases (surprisingly often) do not use
ACID transactions.&lt;/strong&gt; Traditional databases also have an equivalent of
weak consistency: although they were not explicitly developed for
distributed environments, databases provide a range of “weak
isolation” models, such as Read Committed and Repeatable Read. Many
“ACID” databases today adopt these weak isolation guarantees by
default
(&lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;only 3 of 18 in a survey we recently performed provided serializability by default&lt;/a&gt;)
and sometimes as the strongest level supported (e.g., Oracle 11G)
&lt;a href=&quot;#5&quot;&gt;[5]&lt;/a&gt;. Existing applications deployed on these systems
necessarily either tolerate or otherwise compensate for these weak
semantics, hinting at opportunities for
optimization. Moreover, while many of these weak semantics are not
typically &lt;em&gt;implemented&lt;/em&gt; in a coordination-free manner (e.g., relying
on locking or validation protocols)—a remnant of their single-node
origins—they are often, in fact, implementable without resorting to
coordination. We recently classified commonly-deployed isolation
models as achievable via
&lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;&lt;em&gt;Highly Available Transactions&lt;/em&gt;&lt;/a&gt;
or not &lt;a href=&quot;#5&quot;&gt;[5]&lt;/a&gt;: existing applications deployed on “HAT” isolation
models are excellent candidates for study in coordination-free
environments.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;ACID databases are not built using ACID transactions.&lt;/strong&gt;
High-performance database internals are maintained using specialized,
highly optimized algorithms that are carefully designed to maximize safe
concurrency &lt;a href=&quot;#12&quot;&gt;[12]&lt;/a&gt; (e.g., in the case of indexing, exotic data
structures like B-link trees &lt;a href=&quot;#13&quot;&gt;[13]&lt;/a&gt;). The database designer does
not use serializable access to the database data structures for at least
two reasons. First, doing so would be prohibitively expensive, and,
second, the expert designer does not need to: she has a well-defined
specification (e.g., secondary index lookup behavior) that she can use
to ensure correctness (without end-user intervention).&lt;/p&gt;

    &lt;p&gt;In a distributed database system, coordination-avoiding techniques
are similarly applicable to internal data structure implementation. As
experts in both databases and fast distributed algorithms, we can
ensure that the anomalies of our weakly consistent but fast algorithms
do not interfere with application-level correctness; we can
encapsulate the side-effects of weak semantics behind well-defined
(and existing) interfaces. Our recent work on
&lt;a href=&quot;http://www.bailis.org/blog/scalable-atomic-visibility-with-ramp-transactions/&quot;&gt;Read Atomic Multi-Partition (RAMP) transactions&lt;/a&gt;
was developed in this context and, as motivating use cases, focuses on
foreign key constraint maintenance, distributed secondary indexing,
and materialized view maintenance &lt;a href=&quot;#7&quot;&gt;[7]&lt;/a&gt;. Our focus on coordination
avoidance yielded algorithms that consistently outperformed
alternatives, especially those based on synchronous coordination such as
locking.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, we have found success in &lt;em&gt;i.)&lt;/em&gt; focusing on existing
applications and &lt;em&gt;ii.)&lt;/em&gt; incorporating existing specifications to
enable coordination-avoiding execution. Towards these goals, our
ongoing research directly incorporates application-level invariants
(derived from real-world applications and the SQL language) for
analysis under a necessary and sufficient property for
coordination-free execution &lt;a href=&quot;#6&quot;&gt;[6]&lt;/a&gt;. The use of application-level
invariants is key to safely maximizing concurrency without requiring
programmer expertise in weak isolation models. By focusing on
&lt;em&gt;existing&lt;/em&gt; invariants and specifications as above, we can provide
tangible improvements without necessarily affecting end-user experience.&lt;/p&gt;

&lt;h4 id=&quot;a-coordinated-future&quot;&gt;A Coordinated Future&lt;/h4&gt;

&lt;p&gt;The continued success of weakly consistent systems requires a
demonstration of and focus on utility. Delivering an understanding of
exactly &lt;em&gt;why&lt;/em&gt; these weakly consistent semantics can provide (in a
fundamental sense) greater availability, lower latency, and higher
throughput is paramount; our proposed focus on &lt;em&gt;coordination
avoidance&lt;/em&gt; is our attempt at providing a unified answer. By applying
the lens of coordination avoidance to a range of existing,
well-defined and ideally high-value domains, we have the opportunity
to demonstrate exactly &lt;em&gt;when&lt;/em&gt; weak consistency is adequate and,
equally importantly, when it is not. Without a full specification,
language techniques [&lt;a href=&quot;#3&quot;&gt;3&lt;/a&gt;, &lt;a href=&quot;#4&quot;&gt;4&lt;/a&gt;] and library-based optimizations
&lt;a href=&quot;#14&quot;&gt;[14]&lt;/a&gt; are helpful to programmers. However, with a full
specification and increased knowledge of application semantics
[&lt;a href=&quot;#6&quot;&gt;6&lt;/a&gt;, &lt;a href=&quot;#7&quot;&gt;7&lt;/a&gt;], we can fully realize the benefits of coordination
avoidance while further mitigating programmer burden. While
coordination cannot always be avoided, we are bullish on a continued
ability to effectively manage it.&lt;/p&gt;

&lt;h4 id=&quot;references&quot;&gt;References&lt;/h4&gt;

&lt;div id=&quot;refs&quot;&gt;

  &lt;p&gt;&lt;a name=&quot;1&quot;&gt;&lt;/a&gt;[1]
D. J. Abadi. &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;Consistency tradeoffs in modern distributed database system design: CAP is only part of the story&lt;/a&gt;. &lt;em&gt;IEEE Computer&lt;/em&gt;, 45(2):37–42, 2012.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;2&quot;&gt;&lt;/a&gt;[2] P. Alvaro, P. Bailis, N. Conway, and J. M. Hellerstein. &lt;a href=&quot;http://www.bailis.org/papers/consistency-socc2013.pdf&quot;&gt;Consistency
without borders&lt;/a&gt;. In &lt;em&gt;SoCC 2013&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;3&quot;&gt;&lt;/a&gt;[3] P. Alvaro, N. Conway, J. M. Hellerstein, and D. Maier. Blazes:
&lt;a href=&quot;http://www.cs.berkeley.edu/~palvaro/ICDE14_conf_full_205.pdf&quot;&gt;Coordination analysis for distributed programs&lt;/a&gt;. In &lt;em&gt;ICDE
2014&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;4&quot;&gt;&lt;/a&gt;[4] P. Alvaro, N. Conway, J. M. Hellerstein, and W. Marczak. &lt;a href=&quot;http://db.cs.berkeley.edu/papers/cidr11-bloom.pdf&quot;&gt;Consistency
analysis in Bloom: a CALM and collected
approach&lt;/a&gt;. In &lt;em&gt;CIDR 2011&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;5&quot;&gt;&lt;/a&gt;[5] P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and
I. Stoica. &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;Highly Available Transactions: Virtues and Limitations&lt;/a&gt;. In
&lt;em&gt;VLDB 2014&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;6&quot;&gt;&lt;/a&gt;[6] P. Bailis, A. Fekete, M. J. Franklin, A. Ghodsi, J. M. Hellerstein, and
I. Stoica. &lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;Coordination-Avoiding Database Systems&lt;/a&gt;. &lt;em&gt;arXiv:1402.2237&lt;/em&gt;,
2014.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;7&quot;&gt;&lt;/a&gt;[7] P. Bailis, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica.
&lt;a href=&quot;http://www.bailis.org/papers/ramp-sigmod2014.pdf&quot;&gt;Scalable Atomic Visibility with RAMP Transactions&lt;/a&gt;. In &lt;em&gt;SIGMOD
2014&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;8&quot;&gt;&lt;/a&gt;[8] P. Bailis and A. Ghodsi. &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2462076&quot;&gt;Eventual Consistency
Today: Limitations, extensions, and beyond&lt;/a&gt;. &lt;em&gt;ACM Queue&lt;/em&gt;, 11(3), 2013.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;9&quot;&gt;&lt;/a&gt;[9] N. Bronson et al. &lt;a href=&quot;http://www.eecs.harvard.edu/cs261/papers/bronson-2013.pdf&quot;&gt;Tao: Facebook’s distributed data store for
the social graph&lt;/a&gt;. In &lt;em&gt;USENIX ATC 2013&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;10&quot;&gt;&lt;/a&gt;[10] J. C. Corbett et al. &lt;a href=&quot;http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf&quot;&gt;Spanner: Google’s globally-distributed database&lt;/a&gt;. In
&lt;em&gt;OSDI 2012&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;11&quot;&gt;&lt;/a&gt;[11] S. Gilbert and N. Lynch. &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;Brewer’s conjecture and the feasibility of
consistent, available, partition-tolerant web services&lt;/a&gt;. &lt;em&gt;ACM SIGACT News&lt;/em&gt;, 33(2):51–59,
2002.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;12&quot;&gt;&lt;/a&gt;[12] J. Gray and A. Reuter. Transaction Processing:
Concepts and Techniques. Morgan Kaufmann, 1993.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;13&quot;&gt;&lt;/a&gt;[13] P. L. Lehman and S. B. Yao. &lt;a href=&quot;http://www.cs.cmu.edu/~dga/15-712/F07/papers/Lehman81.pdf&quot;&gt;Efficient locking for concurrent operations
on B-trees&lt;/a&gt;. &lt;em&gt;ACM TODS&lt;/em&gt;, 6(4):650–670, 1981.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;14&quot;&gt;&lt;/a&gt;[14] M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. &lt;a href=&quot;http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf&quot;&gt;A
comprehensive study of convergent and commutative replicated data types&lt;/a&gt;.
Technical Report 7506, INRIA, 2011.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;15&quot;&gt;&lt;/a&gt;[15] J. Shute et al. &lt;a href=&quot;http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41344.pdf&quot;&gt;F1: A distributed SQL database that scales&lt;/a&gt;.
In &lt;em&gt;VLDB 2013&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;16&quot;&gt;&lt;/a&gt;[16]
M. Stonebraker. &lt;a href=&quot;http://pdf.aminer.org/000/255/770/the_case_for_shared_nothing.pdf&quot;&gt;The case for shared nothing&lt;/a&gt;.
&lt;em&gt;IEEE Database Engineering Bulletin&lt;/em&gt;, 9(1):4–9, 1986.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;17&quot;&gt;&lt;/a&gt;[17]
M. Stonebraker. &lt;a href=&quot;http://cacm.acm.org/blogs/blog-cacm/99512-why-enterprises-are-uninterested-in-nosql/fulltext&quot;&gt;Why enterprises are uninterested in NoSQL&lt;/a&gt;. &lt;em&gt;ACM
Queue Blog&lt;/em&gt;, September 2010.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;18&quot;&gt;&lt;/a&gt;[18] TPC Council. &lt;a href=&quot;http://www.tpc.org/tpcc/spec/tpcc_current.pdf&quot;&gt;TPC-C Benchmark Revision 5.11&lt;/a&gt;. 2010.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;19&quot;&gt;&lt;/a&gt;[19] Y. Xu, Z. Musgrave, B. Noble, and M. Bailey. &lt;a href=&quot;https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf&quot;&gt;Bobtail: avoiding long
tails in the cloud&lt;/a&gt;. In &lt;em&gt;NSDI 2013&lt;/em&gt;.&lt;/p&gt;

  &lt;p&gt;&lt;a name=&quot;20&quot;&gt;&lt;/a&gt;[20] M. Zawirski, A. Bieniusa, V. Balegas, S. Duarte, C. Baquero, M. Shapiro,
and N. Preguiça. &lt;a href=&quot;http://arxiv.org/pdf/1310.3107.pdf&quot;&gt;SwiftCloud: Fault-tolerant geo-replication
integrated all the way to the client machine&lt;/a&gt;.
&lt;em&gt;arXiv:1310.3107&lt;/em&gt;, 2013.&lt;/p&gt;

&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Without Conflicts, Serializability Is Free</title>
   <link href="http://bailis.org/blog//without-conflicts-serializability-is-free/"/>
   <updated>2014-04-14T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//without-conflicts-serializability-is-free</id>
   <content type="html">&lt;p&gt;Common pitches for modern, &lt;a href=&quot;http://en.wikipedia.org/wiki/Serializability&quot;&gt;serializable&lt;/a&gt; databases include claims that they are “as scalable as NoSQL,” they “combine the speed and scale advantages of NoSQL systems with ACID guarantees,” or they demonstrate that “the scalability, fault-tolerance, and performance of NoSQL databases are still achievable with [serializable] transactions.” These claims are somewhat misleading, and here’s why:&lt;/p&gt;

&lt;p&gt;Any two operations on the same data—at least one of which is a write—&lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter2.pdf&quot;&gt;can compromise serializability&lt;/a&gt;, or the illusion of a sequential execution, if executed concurrently. So, when executing transactions that read and write to the same data, a database will &lt;em&gt;have to&lt;/em&gt; stall some of the transactions in order to preserve serializability. Adding more servers won’t necessarily improve throughput: if a workload bottlenecks on read/write synchronization, adding physical resources like extra servers won’t help.&lt;/p&gt;

&lt;p&gt;In contrast, a NoSQL system like Riak or Cassandra offering &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=2462076&quot;&gt;“weak” consistency&lt;/a&gt; can avoid these synchronization bottlenecks. Additional servers can process additional requests in parallel, without communicating. This provides availability, low latency, and scalability—even for single-record accesses—allowing literally unbounded throughput. Of course, there’s no free lunch: these scalable systems provide weaker guarantees that can—&lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;but do not always&lt;/a&gt;—compromise application-level consistency.&lt;/p&gt;

&lt;p&gt;However, for operations over disjoint data—that is, for transactions without read-write conflicts—serializable databases &lt;em&gt;can&lt;/em&gt; perform as well as weakly consistent systems. Under these workloads, there’s no need for synchronization between operations, which can safely execute concurrently. This is why I say that &lt;em&gt;without conflicts, serializability is free&lt;/em&gt;.&lt;/p&gt;
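&lt;p&gt;A toy illustration of this point (hypothetical code, not any real database’s scheduler): a conflict check over transaction read/write sets shows which pairs can safely run concurrently and which must be synchronized:&lt;/p&gt;

```python
# Toy conflict detection over transaction read/write sets. Two transactions
# conflict if they touch a common item and at least one of them writes it.

def conflicts(txn_a, txn_b):
    reads_a, writes_a = txn_a
    reads_b, writes_b = txn_b
    # write-write, read-write, or write-read overlap on the same items
    ww = writes_a.intersection(writes_b)
    rw = reads_a.intersection(writes_b)
    wr = writes_a.intersection(reads_b)
    return bool(ww or rw or wr)

t1 = ({"x"}, {"y"})   # reads x, writes y
t2 = ({"z"}, {"w"})   # disjoint data: free to run concurrently with t1
t3 = ({"y"}, {"x"})   # reads y, which t1 writes: must synchronize

print(conflicts(t1, t2))  # False: serializability is "free" here
print(conflicts(t1, t3))  # True:  the database must stall one of them
```

&lt;p&gt;When every pair of in-flight transactions looks like &lt;code&gt;t1&lt;/code&gt; and &lt;code&gt;t2&lt;/code&gt;, a serializable database can execute them in parallel on separate servers without stalling.&lt;/p&gt;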

&lt;p&gt;Will exploiting this disjoint data parallelism result in a quantum leap in distributed database design? Mike Stonebraker &lt;a href=&quot;http://pdf.aminer.org/000/255/770/the_case_for_shared_nothing.pdf&quot;&gt;would probably say “no”&lt;/a&gt;. Database system designs have optimized for data-parallel access patterns for decades: “shared nothing” serializable databases provide excellent programmability and perform well—just not for all workloads.&lt;/p&gt;

&lt;p&gt;Anyone providing strong semantics and claiming absolute performance, latency, or availability parity with “NoSQL” is either confused about database isolation, isn’t running workloads with conflicts, or is just trying to sell you a database. In practice, your mileage may vary: understand your read-write conflict patterns, and plan accordingly.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Scalable Atomic Visibility with RAMP Transactions</title>
   <link href="http://bailis.org/blog//scalable-atomic-visibility-with-ramp-transactions/"/>
   <updated>2014-04-07T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//scalable-atomic-visibility-with-ramp-transactions</id>
   <content type="html">&lt;p&gt;We recently wrote a paper that will appear at
&lt;a href=&quot;http://www.sigmod2014.org/&quot;&gt;SIGMOD&lt;/a&gt; called
&lt;a href=&quot;http://www.bailis.org/papers/ramp-sigmod2014.pdf&quot;&gt;Scalable Atomic Visibility with RAMP Transactions&lt;/a&gt;. This
post introduces RAMP Transactions, explains why you should care,
and briefly describes how they work.&lt;/p&gt;

&lt;h4 id=&quot;executive-summary&quot;&gt;Executive Summary&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;What they are:&lt;/strong&gt; We’ve developed three new algorithms—called Read
  Atomic Multi-Partition (RAMP) Transactions—for ensuring &lt;em&gt;atomic
  visibility&lt;/em&gt; in partitioned (sharded) databases: either all of a
  transaction’s updates are observed, or none are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why they’re useful:&lt;/strong&gt; In addition to general-purpose multi-key updates, atomic
  visibility is required for correctly maintaining foreign key
  constraints, secondary indexes, and materialized views.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why they’re needed:&lt;/strong&gt; Existing protocols like locking couple mutual
  exclusion and atomic visibility. Mutual exclusion in a distributed
  environment can lead to serious performance degradation and availability
  problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How they work:&lt;/strong&gt; RAMP transactions allow readers and writers to
  proceed concurrently. Operations race, but readers autonomously
  detect the races and repair any non-atomic reads. The write protocol
  ensures readers never stall waiting for writes to arrive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why they scale:&lt;/strong&gt; Clients can’t cause other clients to stall (via
  &lt;em&gt;synchronization independence&lt;/em&gt;) and clients only have to contact the
  servers responsible for items in their transactions (via &lt;em&gt;partition
  independence&lt;/em&gt;). As a consequence, there’s no mutual exclusion or
  synchronous coordination across servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The end result:&lt;/strong&gt; RAMP transactions outperform existing approaches
  across a variety of workloads, and, &lt;em&gt;for a workload of 95% reads,
  RAMP transactions scale to over 7 million ops/second on 100 servers
  at less than 5% overhead.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where the overhead goes:&lt;/strong&gt; Writes take 2 RTTs and attach either a constant (in
  RAMP-Small and RAMP-Hybrid algorithms) or linear (in RAMP-Fast)
  amount of metadata to each write, while reads take 1-2 RTTs
  depending on the algorithm.&lt;/p&gt;

&lt;h4 id=&quot;background&quot;&gt;Background&lt;/h4&gt;

&lt;p&gt;Together with colleagues at Berkeley and the University of Sydney,
I’ve spent &lt;a href=&quot;http://www.bailis.org/pubs.html&quot;&gt;the last several years&lt;/a&gt;
exploring the limits of coordination-free,
&lt;a href=&quot;http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed&quot;&gt;“AP”&lt;/a&gt;
execution in distributed
databases. &lt;a href=&quot;http://www.bailis.org/pubs.html#coord-avoid&quot;&gt;Coordination-free execution&lt;/a&gt;
is powerful: it guarantees a response from any non-failing servers,
provides low latency, and allows scale out, even at the granularity of
a single database record. Of course, coordination-free execution
permits races between concurrent operations, so the question is: what
useful guarantees are achievable under coordination-free execution?&lt;/p&gt;

&lt;p&gt;We &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;recently&lt;/a&gt; spent
time studying traditional database isolation levels from the
perspective of coordination-free execution. In this work, we realized
that a classic property from database systems was achievable without
coordination—even though almost all databases implement it
using expensive mechanisms!  That is, we realized that we could
provide the atomic visibility property—either all updates should be
visible to another transaction or none are—without resorting to
typical strategies like locking. We introduced an early version of our
algorithms in a paragraph of our
&lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;VLDB 2014 paper&lt;/a&gt; on
&lt;a href=&quot;../hat-not-cap-introducing-highly-available-transactions/&quot;&gt;Highly Available Transactions&lt;/a&gt;
but felt there was more work to be done. The
&lt;a href=&quot;https://news.ycombinator.com/item?id=5781040&quot;&gt;positive reaction&lt;/a&gt; to
&lt;a href=&quot;../non-blocking-transactional-atomicity/&quot;&gt;my post&lt;/a&gt; on our prior
algorithm further inspired us to write a full paper.&lt;/p&gt;

&lt;h4 id=&quot;why-atomic-visibility&quot;&gt;Why Atomic Visibility?&lt;/h4&gt;

&lt;p&gt;It turns out that many use cases require atomic visibility: either all
or none of a transaction’s updates should be visible to other
transactions. For example, if I update a table in my database, any
secondary indexes associated with that table should also be
updated. In the paper, we outline three use cases in detail:
&lt;a href=&quot;http://en.wikipedia.org/wiki/Foreign_key&quot;&gt;foreign key constraints&lt;/a&gt;,
&lt;a href=&quot;http://en.wikipedia.org/wiki/Database_index&quot;&gt;secondary indexing&lt;/a&gt;, and
&lt;a href=&quot;http://en.wikipedia.org/wiki/Materialized_view&quot;&gt;materialized view maintenance&lt;/a&gt;. These
use cases crop up in a surprising number of real-world
applications. For example, the authors of
&lt;a href=&quot;https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson&quot;&gt;Facebook’s Tao&lt;/a&gt;,
&lt;a href=&quot;http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p1135-qiao.pdf&quot;&gt;LinkedIn Espresso&lt;/a&gt;,
and
&lt;a href=&quot;http://www.mpi-sws.org/~druschel/courses/ds/papers/cooper-pnuts.pdf&quot;&gt;Yahoo! PNUTS&lt;/a&gt;
each describe how users can observe inconsistent data due to fast but
non-atomic updates to multiple entities and secondary indexes in their
systems. These systems have chosen scalability over correctness. We
wanted to know: can we provide these use cases with both scalability &lt;em&gt;and&lt;/em&gt;
correctness?&lt;/p&gt;

&lt;p&gt;If you pick up a database textbook today and try to figure out how to
implement atomic visibility, you’re likely to end up with a solution
like two-phase locking or some version of optimistic concurrency
control. This works great on a single machine, but, as soon as your
operations span multiple servers, you’re likely to run into problems:
if an operation holds a lock while waiting on an RPC, any concurrent
lock requests are going to have to wait. Throughput effectively drops
to 1/(Round Trip Time): a few thousand requests per second on a modern
cloud computing network like EC2. To allow partitioned (sharded)
operations to perform well, we’ll have to avoid any kind of blocking,
or synchronous coordination.&lt;/p&gt;

&lt;p&gt;Our thesis in this work is that &lt;em&gt;traditional approaches like locking
 (often unnecessarily) couple atomic visibility and mutual
 exclusion. We can offer the benefits of the former without the
 penalties of the latter.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;ramp-internals-in-brief&quot;&gt;RAMP Internals, In Brief&lt;/h4&gt;

&lt;p&gt;In RAMP transactions, we allow reads and writes to proceed
concurrently over the same data. This provides excellent scalability,
but it introduces a race condition: what if a transaction only
observes a subset of another, concurrent transaction’s writes? We address this
race condition via two mechanisms (for now, assume read-only and
write-only transactions):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Use metadata to detect the race:&lt;/strong&gt; Clients attach metadata to
each write that allows other clients to detect partial reads from
in-flight writes. For example, in the RAMP-Fast algorithm, writers
attach a &lt;a href=&quot;https://news.ycombinator.com/item?id=5781309&quot;&gt;unique&lt;/a&gt;,
per-transaction timestamp and a list of items written in the
transaction. Clients can combine all of the metadata from versions
they have &lt;em&gt;actually&lt;/em&gt; read to determine the highest timestamp that
they &lt;em&gt;should have&lt;/em&gt; read for each item. If any item they read from
servers has a lower timestamp than calculated, it means that the
write was in-flight at the time of the read. The client can
subsequently fetch the right version(s) from the server(s) in a
second (parallel) set of requests and return the resulting set of
versions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Prevent readers from waiting:&lt;/strong&gt; To prevent readers from ever
having to wait for the “right version” to arrive at a partition,
writes proceed in two rounds. The first round places each write on
its respective partition but doesn’t yet make it visible to
readers (i.e., &lt;em&gt;prepares&lt;/em&gt; the write). The second round makes each
write visible to readers (i.e., &lt;em&gt;commits&lt;/em&gt; the write). This means
that, if a reader observes a write from a transaction, all of the
other writes the reader might request are guaranteed to be present
on their respective servers. The reader simply has to identify the
right version (using metadata above) and the server can return it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
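&lt;p&gt;The two mechanisms above can be sketched in a few dozen lines. The following is a simplified, single-process model in the spirit of RAMP-Fast (an illustration only, not the paper’s implementation: it takes globally unique timestamps as given, ignores failures, and folds the partitions into local objects):&lt;/p&gt;

```python
# Simplified sketch of RAMP-Fast-style reads and two-round writes.

class Partition:
    def __init__(self):
        self.versions = {}      # (item, ts) maps to (value, ts, item_set)
        self.last_commit = {}   # item maps to its highest committed ts

    def prepare(self, item, value, ts, item_set):
        # Round 1 of a write: place the version, still invisible to readers.
        self.versions[(item, ts)] = (value, ts, frozenset(item_set))

    def commit(self, item, ts):
        # Round 2 of a write: make the version visible to readers.
        self.last_commit[item] = max(self.last_commit.get(item, 0), ts)

    def get(self, item, required_ts=None):
        ts = required_ts if required_ts is not None else self.last_commit.get(item, 0)
        return self.versions.get((item, ts), (None, 0, frozenset()))

def write_all(partitions, updates, ts):
    item_set = set(updates)
    for item, value in updates.items():   # round 1: prepare everywhere
        partitions[item].prepare(item, value, ts, item_set)
    for item in updates:                  # round 2: commit everywhere
        partitions[item].commit(item, ts)

def ramp_fast_read(partitions, items):
    # First round: read each item's latest committed version.
    result = {item: partitions[item].get(item) for item in items}
    # Use the attached metadata to compute the highest timestamp we
    # *should have* observed for each item in our read set.
    required = {}
    for _value, ts, item_set in result.values():
        for sibling in item_set:
            if sibling in items and ts > required.get(sibling, 0):
                required[sibling] = ts
    # Second round: fetch any (prepared) version the first round missed.
    for item, ts in required.items():
        if ts > result[item][1]:
            result[item] = partitions[item].get(item, required_ts=ts)
    return {item: v[0] for item, v in result.items()}

partitions = {"x": Partition(), "y": Partition()}
write_all(partitions, {"x": "a1", "y": "b1"}, ts=1)
print(ramp_fast_read(partitions, ["x", "y"]))  # {'x': 'a1', 'y': 'b1'}
```

&lt;p&gt;To see the repair in action, prepare a write of &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; at a higher timestamp but commit it only on &lt;code&gt;x&lt;/code&gt;: a reader of both items notices from &lt;code&gt;x&lt;/code&gt;’s metadata that it should also see the newer &lt;code&gt;y&lt;/code&gt;, and fetches the prepared version in its second round rather than blocking.&lt;/p&gt;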

&lt;p&gt;The three RAMP algorithms we develop in the paper offer a trade-off
between metadata size and round trip times (RTTs) required for
reads. RAMP-Fast requires 1 RTT in the race-free case and 2 RTTs in
the event of a race but stores the list of keys written in the
transaction as metadata. RAMP-Small always requires 2 RTTs for reads
but only uses a single timestamp as metadata. RAMP-Hybrid uses a Bloom
filter to compress the RAMP-Fast metadata and requires between 1 and 2
RTTs per read (determined by the Bloom filter false positive
rate). Writes always require 2 RTTs.&lt;/p&gt;

&lt;p&gt;As we show in the paper, RAMP-Fast and RAMP-Hybrid perform especially
well for read-dominated workloads; there’s no overhead on reads that
don’t race, and, for reads that &lt;em&gt;do&lt;/em&gt; race writes, reads take an extra
RTT. The worst-case throughput penalty is about 50% due to extra RTTs
(either by using RAMP-Small or executing write-heavy workloads), and
we observed a linear relationship between write proportion and
throughput penalty. None of the existing algorithms we benchmarked
(and describe in the paper) performed as well.&lt;/p&gt;

&lt;h4 id=&quot;scalability-for-reals&quot;&gt;Scalability, For Reals&lt;/h4&gt;

&lt;p&gt;The term “scalability” is badly abused these days. (It’s almost as
meaningless as the
&lt;a href=&quot;../when-is-acid-acid-rarely/&quot;&gt;use of the term “ACID” in modern databases&lt;/a&gt;.) When
devising the RAMP transaction protocols, we wanted a more rigorous set
of criteria to capture our goals for “scalability” in a partitioned
database. We decided on the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Synchronization independence&lt;/strong&gt; prevents one client’s operations
from causing another’s to stall or fail. This means locks are out
of the question.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Partition independence&lt;/strong&gt; means that each client only has to
contact partitions responsible for items it wants to access.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can (and, in the paper, do) prove that RAMP transactions satisfy
these two criteria. Moreover, when we actually implemented and
experimentally compared our algorithms to existing algorithms that
failed one or both of these properties, the effects were measurable:
algorithms that lacked partition independence required additional
communication (in pathological cases, a lot more), while algorithms
that lacked synchronization independence simply didn’t work well under
high read/write contention. You can see the gory details in Section 5
and read more about these properties in Section 3
&lt;a href=&quot;http://www.bailis.org/papers/ramp-sigmod2014.pdf&quot;&gt;of the paper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note that, by providing the above criteria, we’re disallowing some
useful semantics. For example, synchronization independence
effectively prevents writers from stalling other writers. So, if you
might have update conflicts that you need to resolve up-front, you
shouldn’t use these algorithms and you should expect to incur
throughput penalties due to coordination.&lt;/p&gt;

&lt;h4 id=&quot;for-distributed-systems-nerds&quot;&gt;For Distributed Systems Nerds&lt;/h4&gt;

&lt;p&gt;If you’ve made it this far, you’re probably a distributed systems nerd
(welcome!). As a distributed systems researcher, there are two aspects
of our approach that I think are particularly interesting:&lt;/p&gt;

&lt;p&gt;First, we use an
&lt;a href=&quot;http://www.cs.jhu.edu/~yairamir/cs437/week8/sld011.htm&quot;&gt;atomic commitment protocol (ACP)&lt;/a&gt;
for writes. However, every ACP can
&lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter7.pdf&quot;&gt;(provably) block&lt;/a&gt;
during failures. (Remember, AC is
&lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27.6456&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;harder than consensus&lt;/a&gt;!)
Our observation is that ACPs aren’t harmful on their own; ACPs get a
bad rap because they’re usually used &lt;em&gt;along with&lt;/em&gt; blocking
concurrency control primitives like locking! By employing an ACP with
non-blocking concurrency control and semantics, we allow individual
ACP rounds to stall without blocking non-failed clients. We talk about
(and evaluate the overhead of) unblocking blocked ACP rounds in the paper.&lt;/p&gt;

&lt;p&gt;Second, in the paper, we use “AP” (i.e., coordination-free) algorithms in a
“CP” environment. That is, we can execute RAMP transactions in an
available active-active (multi-master) environment, but we instead
chose to implement them in a single-master-per-partition
system. Master-based systems can also benefit from coordination-free
algorithms; it’s as if each concurrent operation gets to execute on
&lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;its own (logical) replica&lt;/a&gt;, plus
we can make strong guarantees on data recency. For example, after a 
write completes, all later transactions are guaranteed 
&lt;a href=&quot;http://stackoverflow.com/questions/8871633/whats-the-difference-between-safe-regular-and-atomic-registers&quot;&gt;to observe its effects&lt;/a&gt;. This
probably deserves another post, but I see the application of
coordination-free algorithms to otherwise linearizable systems as an exciting
area for further research.&lt;/p&gt;

&lt;h4 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h4&gt;

&lt;p&gt;RAMP Transactions provide a scalable alternative to locking and other
coordination-intensive solutions to atomic visibility. We’ve found
them very useful (e.g.,
&lt;a href=&quot;http://arxiv.org/pdf/1402.2237.pdf&quot;&gt;our recent 25x improvement over the prior best on the TPC-C New-Order benchmark&lt;/a&gt;
was implemented using a variant of RAMP-Fast), and there’s a
principled reason for their performance: synchronization and partition
independence. Please check out our
&lt;a href=&quot;http://www.bailis.org/papers/ramp-sigmod2014.pdf&quot;&gt;SIGMOD paper&lt;/a&gt; for
more details, and many thanks for the encouraging feedback so
far. Happy scaling!&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Causality Is Expensive (and What To Do About It)</title>
   <link href="http://bailis.org/blog//causality-is-expensive-and-what-to-do-about-it/"/>
   <updated>2014-02-05T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//causality-is-expensive-and-what-to-do-about-it</id>
   <content type="html">&lt;p&gt;In this post, I briefly motivate the use of causality in distributed
systems, discuss (likely) fundamental lower bounds on metadata
overheads required to capture it, and discuss four strategies for
circumventing these overheads.&lt;/p&gt;

&lt;h3 id=&quot;why-care-about-causality&quot;&gt;Why care about causality?&lt;/h3&gt;

&lt;p&gt;In 1978, Leslie Lamport introduced the important concept of &lt;a href=&quot;http://www.stanford.edu/class/cs240/readings/lamport.pdf&quot;&gt;partial
ordering in distributed
systems&lt;/a&gt;:
given a partial view over global system state, how can we safely say
whether a particular event “happens before” another? Instead of
relying on a total order (e.g., using synchronized clocks) to order
events, Lamport’s proposed &lt;a href=&quot;http://en.wikipedia.org/wiki/Happened-before&quot;&gt;“happens-before”
relation&lt;/a&gt; captures
dependencies between events as a &lt;a href=&quot;http://book.mixu.net/distsys/time.html&quot;&gt;partial
order&lt;/a&gt;: “happens-before”
reflects the order of events within each process as well as the order
of events across processes, captured via message channels. This
formulation conveniently means that reasoning about “happens-before”
does not require synchronous coordination between processes and also
captures the possibility that two events may be completely independent
of one another (i.e., are concurrent; just like &lt;a href=&quot;http://en.wikipedia.org/wiki/Light_cone&quot;&gt;light
cones&lt;/a&gt; in the real
world). Accordingly, “happens-before” is a powerful concept and forms
the basis of &lt;em&gt;causality&lt;/em&gt; in distributed systems, which is used in many
contexts:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Distributed snapshot algorithms (e.g., &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.63.4399&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;consistent
  cuts&lt;/a&gt;)
  and global predicate detection algorithms typically leverage causal
  ordering for efficient execution (e.g., enabling consistent snapshots
  without forcing processes to pause). This is particularly useful in
  debugging.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Version_vector&quot;&gt;Version vectors&lt;/a&gt;
  are used in databases like Dynamo, Riak, and Voldemort to track
  concurrent updates to data and &lt;a href=&quot;http://zoo.cs.yale.edu/classes/cs422/2013/bib/terry95managing.pdf&quot;&gt;manage update
  conflicts&lt;/a&gt;
  without fast-path synchronization between replicas.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Causal_consistency&quot;&gt;Causal
  consistency&lt;/a&gt; and
  &lt;a href=&quot;http://www.info.ucl.ac.be/courses/SINF2345/2010-2011/slides/5b-causal-broadcast-hand.pdf&quot;&gt;causal
  broadcast&lt;/a&gt;
  provide databases and messaging systems with ordering guarantees
  that respect Lamport’s “happens-before” relation. This means, for
  example, that replies on Twitter won’t be seen without their parent
  Tweets. These two use cases in particular have recently seen a
  resurgence of interest in the research community.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a theoretical construct and, increasingly, in real-world
distributed systems, causality is important. I’ll defer a full
description and discussion of causality to the expansive literature
(&lt;a href=&quot;http://aqualab.cs.northwestern.edu/component/attachments/download/302&quot;&gt;here’s&lt;/a&gt;
a survey, and &lt;a href=&quot;http://www.vs.inf.ethz.ch/publ/papers/holygrail.pdf&quot;&gt;here’s my
favorite&lt;/a&gt;—check
out that subtitle!). Instead, I want to ask a specific question: what
does causality cost?&lt;/p&gt;

&lt;h3 id=&quot;causality-is-expensive&quot;&gt;Causality is expensive&lt;/h3&gt;

&lt;p&gt;To use causal ordering, we need some way to &lt;em&gt;capture&lt;/em&gt; it via a data
structure or other piece of information. There are a variety of
techniques for doing so in the literature that you may have heard of,
like &lt;a href=&quot;http://en.wikipedia.org/wiki/Vector_clock&quot;&gt;vector clocks&lt;/a&gt; (note
that the related &lt;a href=&quot;http://en.wikipedia.org/wiki/Lamport_timestamps&quot;&gt;Lamport
clocks&lt;/a&gt; don’t allow
us to distinguish between “concurrent” and “earlier” events). If
you’re familiar with vector clocks, you’ll know that each process in
the system requires a position in the data structure; this means that,
with N processes, each vector clock takes up O(N) space.&lt;/p&gt;
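&lt;p&gt;To make that O(N) cost concrete, here is a minimal vector clock sketch in Python (illustrative only; the names and API are mine, not from any particular system):&lt;/p&gt;

```python
# Minimal vector clock: one integer slot per process, so O(N) space,
# and the full vector rides on every message.
class VectorClock:
    def __init__(self, num_processes, pid):
        self.clock = [0] * num_processes   # one entry per process
        self.pid = pid

    def local_event(self):
        self.clock[self.pid] += 1

    def send(self):
        # sending is an event; the whole O(N) vector is the metadata
        self.local_event()
        return list(self.clock)

    def receive(self, other):
        # merge element-wise, then count the receive as a local event
        self.clock = [max(a, b) for a, b in zip(self.clock, other)]
        self.local_event()

def happens_before(a, b):
    # a happens-before b iff b dominates a element-wise (and a differs from b)
    return a != b and all(y >= x for x, y in zip(a, b))

def concurrent(a, b):
    # neither dominates: the events are causally independent
    return not happens_before(a, b) and not happens_before(b, a)
```

&lt;p&gt;Unlike a scalar Lamport clock, comparing two vectors tells us whether one event happened before the other or whether they were concurrent; the price is the per-message vector of size N.&lt;/p&gt;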

&lt;p&gt;This leads to a tough question: how much space is &lt;em&gt;required&lt;/em&gt; in order
to capture causality? This is a difficult question to answer, but it’s
fascinating to think about and has serious implications for our above
use cases. Fortunately, Bernadette Charron-Bost thought seriously
about this problem, and, in 1991, published &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=117606&quot;&gt;a surprising
result&lt;/a&gt;; the actual paper is
fairly hairy, but Schwarz and Mattern &lt;a href=&quot;http://www.vs.inf.ethz.ch/publ/papers/holygrail.pdf&quot;&gt;summarize
well&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Is there a way to find a “better” timestamping algorithm based on
smaller time vectors which truly characterizes causality? As it
seems, the answer is negative. Charron-Bost showed…that causality
can be characterized only by vector timestamps of size N.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wow. Charron-Bost’s result seems to imply that we can’t use less than
O(N) metadata! For small numbers of processes, this isn’t so bad, but,
if we scale to hundreds or thousands of nodes, &lt;em&gt;each&lt;/em&gt; message (or, in
a database, operation) is going to require a lot of metadata. Schwarz
and Mattern (do you recognize Mattern from earlier?) continue:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It is not immediately evident that — for a more sophisticated type
 of vector order than &amp;lt; — a smaller vector could not suffice to
 characterize causality, although the result of Charron-Bost seems to
 indicate that this is rather unlikely…A definite theorem about the
 size of vector clocks would require some statement about the minimum
 amount of information that has to be contained in timestamps in
 order to define a partial order of dimension N on them. Finding
 such an information theoretical proof is still an open problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, we don’t have a definitive proof, but, in all likelihood, we’re
not going to do better. Moreover, in the now 23 years following this
result, we haven’t seen anyone do better.&lt;/p&gt;

&lt;p&gt;Intuitively, I think of the lower bound as follows: if I’m a process,
and I want to perform an event, I need some way to distinguish my new
event from all of the prior events that I’ve performed. This hints
that I’ll need some sort of unique marker for my event—as in a
vector clock, I can use a local timestamp that I increment on every
event (which requires O(log(events)) space). Now, if &lt;em&gt;every&lt;/em&gt; other
process simultaneously wants to perform a new event, then we’ll
collectively need N timestamps. We can’t coalesce these timestamps,
since they’re due to unique events, so this puts us at (at least) O(N)
metadata! Some &lt;a href=&quot;http://research.microsoft.com/pubs/201602/replDataTypesPOPL13-complete.pdf&quot;&gt;recent
results&lt;/a&gt;
from Microsoft Research and the CRDT team show similar bounds for
vector-based data structures.&lt;/p&gt;

&lt;h3 id=&quot;and-what-to-do-about-it&quot;&gt;…and what to do about it&lt;/h3&gt;

&lt;p&gt;There are many optimizations that reduce the overhead of causal
tracking in the best case, but these &lt;a href=&quot;http://en.wikipedia.org/wiki/Murphy&#39;s_law&quot;&gt;worst-case
overheads&lt;/a&gt; are too costly
for many modern services running at scale. (Perhaps surprisingly, many
modern implementations are even more expensive, with worst-case
metadata overheads that are linear in the number of events or the
number of keys in a database.) If you’re interested, we wrote a paper
a while ago about &lt;a href=&quot;http://www.bailis.org/papers/explicit-socc2012.pdf&quot;&gt;how bad this overhead can
become&lt;/a&gt;
(&lt;a href=&quot;http://vimeo.com/51578973&quot;&gt;voiceover&lt;/a&gt;) for causally consistent
databases backing modern internet services.&lt;/p&gt;

&lt;p&gt;Can we do anything to avoid these overheads?&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Restrict the set of participants:&lt;/em&gt; To reduce the O(N) factor, we
  can reduce N, or the number of processes across which we track
  causal information. For example, if we’re building a distributed
  database, we can simply track causality across replicas of each
  data item instead of causality across all servers. This sacrifices
  causal guarantees across data items but allows us to detect update
  conflicts for a single data item and is exactly the strategy adopted
  by &lt;a href=&quot;http://en.wikipedia.org/wiki/Version_vector&quot;&gt;version
  vectors&lt;/a&gt;. In most
  systems, the number of replicas for an item is much smaller than the
  number of servers in the system (e.g., 3 vs. 100), so this is a
  substantial reduction &lt;em&gt;in practice&lt;/em&gt;. (Carlos Baquero has a &lt;a href=&quot;http://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/&quot;&gt;good
  post&lt;/a&gt;
  on this distinction.)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Explicitly specify relevant relationships:&lt;/em&gt; The above discussion
  assumes that all events matter equally; in practice, this isn’t
  necessarily the case. On Twitter, when a user posts a reply to a
  Tweet, the causal relationship between the reply and the parent
  Tweet is—from a UX perspective—more important than the
  relationship between all of the Tweets the user read at login and
  her new reply. Effectively, if traditional forms of causality (i.e.,
  &lt;em&gt;potential causality&lt;/em&gt;) treat all &lt;em&gt;possible&lt;/em&gt; (transitive) influences
  equally, what if we could &lt;em&gt;explicitly&lt;/em&gt; specify which partial orders
  matter? In our Twitter example, tracking this &lt;em&gt;explicit causality&lt;/em&gt;
  would only require a metadata overhead of O(1) for the “reply-to”
  relationship. The trade-off is that (like foreign key dependencies
  in database systems), the user now has to specify her causal
  dependencies manually at write time; our
  &lt;a href=&quot;http://www.bailis.org/papers/explicit-socc2012.pdf&quot;&gt;paper&lt;/a&gt; I
  mentioned earlier describes this strategy in greater detail.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Reduce availability:&lt;/em&gt; The problem with reducing the set of
  participants or using explicit causality is that we will necessarily
  throw away some causal dependencies. The upshot is that we were able
  to reduce metadata while preserving availability. An alternative
  strategy is to attempt to compress causality by restricting
  availability: if we bound the number of processes that can
  simultaneously perform operations to a constant factor K, we only
  need K entries in our vector at any given time (i.e., to perform an
  operation, a process must “reserve” a spot in the vector, then
  “catch up” to the current vector position in the causal
  history—by, say, processing the events created and received by the
  prior occupant of the position).  Under this strategy, metadata size
  determines maximum concurrency; in the limit, with K=1, we have a
  total order on events (close—if not identical
  to—&lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;linearizability&lt;/a&gt;). With
  this strategy, we’ve reduced metadata overhead by sacrificing availability and
  forcing some processes to effectively “share” causal dependencies.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Drop happens-before entirely:&lt;/em&gt; If we don’t want to suffer
  metadata overheads, require programmer intervention, or sacrifice
  availability, we can always use a weaker partial order (i.e., weaker
  but still available model). For example, if, in a database, we
  simply want each user to read her writes, we don’t (necessarily)
  need any metadata and can simply use &lt;a href=&quot;http://www.bailis.org/blog/stickiness-and-client-server-session-guarantees/&quot;&gt;sticky routing
  policies&lt;/a&gt;. Vanilla
  &lt;a href=&quot;http://www.bailis.org/blog/safety-and-liveness-eventual-consistency-is-not-safe/&quot;&gt;eventual
  consistency&lt;/a&gt;
  is even cheaper. Of course, this
  &lt;a href=&quot;http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks&quot;&gt;strategy&lt;/a&gt;
  can &lt;a href=&quot;http://aphyr.com/posts/294-call-me-maybe-cassandra/&quot;&gt;clearly
  compromise&lt;/a&gt;
  application consistency because we lose the ability to distinguish
  between concurrent writes and overwrites to the same item, but, on
  the plus side, it doesn’t get much cheaper!&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
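&lt;p&gt;As a concrete illustration of the second strategy, here is a toy replica that enforces only an explicit “reply-to” dependency, carrying O(1) metadata per write (a sketch; the data model and names are mine):&lt;/p&gt;

```python
# Explicit causality: each write carries at most one parent id
# (O(1) metadata) instead of an O(N) vector; the replica delays a
# write only until its declared parent is visible.
class Replica:
    def __init__(self):
        self.visible = {}   # id -> payload, shown to readers
        self.pending = []   # writes whose parent hasn't arrived yet

    def apply(self, write_id, payload, reply_to=None):
        self.pending.append((write_id, payload, reply_to))
        self._drain()

    def _drain(self):
        # reveal any pending write whose explicit dependency is satisfied
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                write_id, payload, parent = entry
                if parent is None or parent in self.visible:
                    self.visible[write_id] = payload
                    self.pending.remove(entry)
                    progress = True
```

&lt;p&gt;A reply that arrives before its parent Tweet stays pending until the parent is applied; every other ordering relationship is deliberately ignored, which is precisely the trade-off described above.&lt;/p&gt;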

&lt;p&gt;It’s also important to remember that, regardless of the model we
  choose, if we want true “availability”, we necessarily &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;lose the
  ability to make many useful
  guarantees&lt;/a&gt;, like
  preventing concurrent updates. There’s no free lunch, but, given
  that not all “weak” models are created equal (at least in terms of
  metadata cost), sometimes it makes sense to drop full causal
  ordering across all events and all processes and settle for
  enforcing a less costly alternative.&lt;/p&gt;

&lt;h3 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h3&gt;

&lt;p&gt;Causality is an immensely powerful concept in distributed systems, but
it’s unlikely that we’ll discover a more compact, sub-linear
representation that is sufficient to characterize it. I have no doubt
that causality will remain important for debugging and reasoning about
global states of distributed computations and am excited by the recent
work in causally consistent distributed systems (full disclosure: I
spent &lt;a href=&quot;http://www.bailis.org/papers/bolton-sigmod2013.pdf&quot;&gt;some time on
this&lt;/a&gt; earlier in
my Ph.D.). As researchers, it’s our job to push the envelope, and
understanding the compromises required in light of the (likely)
fundamental trade-offs I’ve described is a worthwhile
exercise. However, given the worst-case overheads of causality
tracking—at least in real-world deployments—and lack of a more
compact counterexample, I’m more bullish on the four alternatives I’ve
outlined.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Stickiness and Client-Server Session Guarantees</title>
   <link href="http://bailis.org/blog//stickiness-and-client-server-session-guarantees/"/>
   <updated>2014-01-13T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//stickiness-and-client-server-session-guarantees</id>
   <content type="html">&lt;h4 id=&quot;session-guarantees&quot;&gt;Session Guarantees&lt;/h4&gt;

&lt;p&gt;One of the most common consistency requirements I encounter in modern
web services is a guarantee called “read your writes” (RYW): each
user’s reads should reflect that user’s prior writes. This means that,
for example, once I successfully post a Tweet, I’ll be able to read it
after a page refresh. Without RYW, I have no idea whether my update
succeeded or was lost, and I might end up posting &lt;em&gt;again&lt;/em&gt;, resulting
in a second update.&lt;/p&gt;

&lt;p&gt;RYW is part of a larger set of &lt;a href=&quot;http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/SessionGuaranteesBayou.pdf&quot;&gt;“session
guarantees”&lt;/a&gt;
developed in the 1990s (and &lt;a href=&quot;http://queue.acm.org/detail.cfm?id=1466448&quot;&gt;popularized by Werner Vogels&lt;/a&gt;; &lt;a href=&quot;http://pages.cs.wisc.edu/~cs739-1/papers/consistencybaseball.pdf&quot;&gt;also useful&lt;/a&gt;). These session guarantees are useful for at least two
reasons. First, they capture intuitive requirements for end-user
behavior: RYW and other guarantees, like “monotonic reads” (roughly
requiring that time doesn’t appear to go backwards) are easy to
understand and, as we saw above, often desirable for human-facing services. Second,
session guarantees are cheap: while stronger models like
&lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;linearizability&lt;/a&gt; (“C”
in &lt;a href=&quot;http://henryr.github.io/cap-faq/&quot;&gt;CAP&lt;/a&gt;) provide every session
guarantee and then some, they are notoriously expensive—usually
requiring unavailability during partial failure and &lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;increased
latency&lt;/a&gt;. What
early systems like
&lt;a href=&quot;http://zoo.cs.yale.edu/classes/cs422/2013/bib/terry95managing.pdf&quot;&gt;Bayou&lt;/a&gt;
discovered is that there are implementation techniques for achieving
session guarantees without paying these costs.&lt;/p&gt;

&lt;h4 id=&quot;session-guarantees-and-availability&quot;&gt;Session Guarantees and Availability&lt;/h4&gt;

&lt;p&gt;Interestingly (and as the subject of this post), most—but not
all—session guarantees are achievable with CAP-style
availability. &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;Gilbert and Lynch’s proof of the CAP
Theorem&lt;/a&gt;
defines availability by requiring that every non-failing server
guarantees a response to each request despite arbitrary network
partitions between the servers. So, if we want to build an available
system providing the monotonic reads session guarantee, we can ensure
that read operations only return writes when the writes are present on
all servers. This ensures that, regardless of which server a client
connects to, it won’t be forced to read older data and “go back in time.”&lt;/p&gt;
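&lt;p&gt;That read path can be sketched in a few lines (illustrative; versions are modeled as integers, and the replica interface is mine):&lt;/p&gt;

```python
# Available monotonic reads: a read returns only the newest write
# present on ALL replicas, so no server the client contacts later can
# force it to "go back in time."
class Replica:
    def __init__(self):
        self.writes = {}   # key -> list of (version, value) pairs

    def write(self, key, version, value):
        self.writes.setdefault(key, []).append((version, value))

    def newest(self, key):
        return max(self.writes.get(key, [(0, None)]))

def monotonic_read(replicas, key):
    # the safe answer is the minimum, across replicas, of each
    # replica's newest version: every server has at least this write
    return min(r.newest(key) for r in replicas)
```

&lt;p&gt;The cost is visibility: a write that never reaches some non-failed replica never becomes readable anywhere.&lt;/p&gt;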

&lt;p&gt;Via a classic partitioning argument, we can see that RYW is not
achievable under the stringent CAP availability model. We can partition a
client C away from all but one server S and require C to perform a
write. If our implementation is available, S should eventually
acknowledge the write as successful. If C reads from S, it’ll achieve
RYW. But, what if we partition C away from S and allow it to only
communicate with server T? If we require C to perform a read, T will
have to respond, and C will not read its prior write. This demonstrates
that it’s not possible to guarantee RYW for arbitrary read/write
operations in an available manner.&lt;/p&gt;
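&lt;p&gt;The partitioning argument can be replayed as a toy simulation (a sketch with my own names, not a real protocol):&lt;/p&gt;

```python
# Toy replay of the RYW partitioning argument: client C writes at
# server S, is then partitioned away from S, and reads from server T,
# which never received the write.
S = {}   # server S's copy of the data
T = {}   # server T's copy; the partition blocks replication from S

def write(server, key, value):
    server[key] = value
    return "ack"             # an available server must acknowledge

def read(server, key):
    return server.get(key)   # an available server must respond

# C writes x=1 at S (acknowledged), then is rerouted to T, where the
# read misses C's own write.
```

&lt;p&gt;Any implementation in which T must answer the read without hearing from S exhibits exactly this violation.&lt;/p&gt;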

&lt;p&gt;In our recent work on &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;Highly Available
Transactions&lt;/a&gt;, we
performed a similar analysis for each of the session guarantees, and
found that all but RYW are achievable with availability (see Figure 2
and Table 3 on page 8; perhaps surprisingly, this also means that
causal consistency is not available in a client-server model). The
question is: if RYW isn’t available, why does it still seem to be
cheaper than, say, linearizability?&lt;/p&gt;

&lt;h4 id=&quot;sticky-availability-and-mechanisms&quot;&gt;Sticky Availability and Mechanisms&lt;/h4&gt;

&lt;p&gt;To understand why RYW is still “cheap” but not quite as cheap as other
session guarantees, we formalized a new model of availability. RYW is
indeed achievable (as Vogels points out) if clients stay connected
to—or are “sticky” with—a server (really, a complete copy of the
database). This requires a stronger assumption than CAP-style
availability, but it’s still much weaker than, say, requiring that
clients contact a majority of servers. In the &lt;a href=&quot;http://www.bailis.org/papers/hat-vldb2014.pdf&quot;&gt;HAT
paper&lt;/a&gt;, we formalize
this as “sticky availability” (page 4, Section 4.1).&lt;/p&gt;

&lt;p&gt;In practice, there are two primary ways of achieving sticky
availability. First (and easier) is to rethink the definition of a
“server”: if clients cache their writes (thereby acting as a “server”
in our above model), they’ll be able to read them in the future. By
keeping a local copy of data, the clients trivially maintain
stickiness. Several systems (including systems from both &lt;a href=&quot;http://www.bailis.org/papers/bolton-sigmod2013.pdf&quot;&gt;our
group&lt;/a&gt; and &lt;a href=&quot;http://arxiv.org/pdf/1310.3107.pdf&quot;&gt;Marc
Shapiro’s CRDT group&lt;/a&gt;) leverage
these techniques and can provide unparalleled low latency. The problem
here is that caches can grow large, and, in my experience, it’s
unclear how well caching works for general-purpose applications. Second,
clients can use sticky request routing to ensure their requests always
contact the same servers. In a single datacenter, this can be
difficult, requiring the storage tier’s request routers to know the
identity of the end-user making a request. This is feasible but
potentially requires tight coupling between application logic and the
database. In a multi-datacenter deployment, if each datacenter has a
linearizable cluster (e.g.,
&lt;a href=&quot;http://www-users.cselabs.umn.edu/classes/Fall-2012/csci8980-2/papers/cops.pdf&quot;&gt;COPS&lt;/a&gt;),
users can be assigned to a given region and their requests routed at
the edge—also doable, but with availability penalties.&lt;/p&gt;
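&lt;p&gt;The routing piece of the second mechanism can be as simple as hashing each user to a home server (a sketch with my own names; a real router must also handle failover, which is exactly where the availability penalty appears):&lt;/p&gt;

```python
import hashlib

# Sticky request routing: deterministically map each user to a home
# server so that all of the user's requests land on the same replica.
def home_server(user_id, servers):
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

&lt;p&gt;Because the mapping is deterministic, every tier that sees the user id routes to the same server, preserving stickiness without shared routing state.&lt;/p&gt;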

&lt;p&gt;In my experience, sticky routing is more common, with memcached acting
as a non-durable cache with likely (but not guaranteed)
stickiness. However, I’m not aware of any public accounts of actual
sticky available (but non-linearizable) architectures, hinting that
these approaches may either fall into the realm of what Jim Gray calls
“exotics” or, more optimistically, may simply be on the engineering horizon.&lt;/p&gt;

&lt;h4 id=&quot;why-sticky-availability-deserves-more-study&quot;&gt;Why Sticky Availability Deserves More Study&lt;/h4&gt;

&lt;p&gt;Stickiness only becomes evident in a client-server model. In
traditional models of distributed computing, stickiness is often
guaranteed by default. If we consider a set of communicating processes
(take &lt;a href=&quot;http://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/time-clocks.pdf&quot;&gt;Lamport’s paper on
causality&lt;/a&gt;
or even the CAP proof), each process (modulo limits on memory
capacity) can trivially observe its past actions. This model
effectively &lt;em&gt;presumes&lt;/em&gt; the existence of a cache in the form of process
memory. In a classic client-server model, the server is often tasked
with maintaining the client’s state. This is, in most cases,
fundamental to the utility of the system architecture. Yet, as we have
seen above, remembering the past in the client-server
model—especially without stateful clients—is non-trivial! The
difference between these models is subtle but makes a big difference
in practice, and I don’t think the implications have been sufficiently
explored.&lt;/p&gt;

&lt;p&gt;I also find it interesting that most implementations of the available
session guarantees (even those in Bayou) presume sticky
availability. While every session guarantee except RYW is achievable
with availability, they’re often implemented using sticky available
mechanisms! This means that these implementations are “less available”
than they could be, with implications for latency, fault tolerance,
and scalability. The trade-off is between what I’ll call “per-user
visibility” and availability. In our above, available implementation
of monotonic reads, a write might never become visible to any readers
if a non-failed server is permanently partitioned. In contrast, if we
have stickiness, the write can become visible to the writer (and,
possibly, other readers) without sacrificing liveness. I haven’t seen
a system evaluate this trade-off in depth, though many systems
(including the HAT research) have addressed the more general (and
classic) trade-off between global visibility and scalability.&lt;/p&gt;

&lt;p&gt;As a final note, sticky available guarantees still face many of the
same problems as traditional available systems when it comes to
&lt;a href=&quot;http://www.youtube.com/watch?v=_rAdJkAbGls&quot;&gt;maintaining application
correctness&lt;/a&gt;. Mutual
exclusion is unavailable, so how does RYW help applications? The
per-user visibility benefits are useful to end-users, but this is
often a matter of user experience rather than an issue of data
integrity. One benefit is that web services are frequently
single-writer-per-data-item, meaning conflicts are rare or impossible
as long as the single writer observes her updates. But, in general, I
haven’t encountered many constraints on &lt;em&gt;data&lt;/em&gt; that benefit from stickiness.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>On Consistency and Durability</title>
   <link href="http://bailis.org/blog//on-consistency-and-durability/"/>
   <updated>2013-12-10T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//on-consistency-and-durability</id>
   <content type="html">&lt;p&gt;In case you’ve missed it, there’s been a &lt;a href=&quot;https://news.ycombinator.com/item?id=6878005&quot;&gt;great
discussion&lt;/a&gt; about
consistency, availability, and durability on the &lt;a href=&quot;https://groups.google.com/forum/#!topic/redis-db/Oazt2k7Lzz4&quot;&gt;Redis mailing
list&lt;/a&gt;
and &lt;a href=&quot;https://twitter.com/kellabyte/status/410224523602960385&quot;&gt;Twitter&lt;/a&gt;
over the past few days. I wanted to weigh in and specifically address
&lt;a href=&quot;https://news.ycombinator.com/item?id=6880574&quot;&gt;antirez’s point&lt;/a&gt; that&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;While CAP and durability are orthogonal they are very related in actual systems….&lt;/p&gt;
&lt;/blockquote&gt;

&lt;hr /&gt;

&lt;p&gt;We can effectively cast all statements about availability and
consistency into the form:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If operations can contact AF of N correct replicas, the system provides a guaranteed response that is correct with respect to semantics S.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Availability is all about the precondition (AF of N): under what
conditions is a safe response guaranteed? Gilbert and Lynch’s &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;proof
of the CAP
theorem&lt;/a&gt;,
shows that when S means
&lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;linearizability&lt;/a&gt; and N
is greater than 1, AF cannot equal 1. In fact, most
&lt;a href=&quot;http://webee.technion.ac.il/uploads/file/publication/731.pdf&quot;&gt;implementations&lt;/a&gt;
of linearizability use a notion of majorities to pick AF = (N+1)/2.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Now, let’s consider statements about durability:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The effects of operations will survive DF fail-stop server failures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To survive DF failures, we need to contact DF+1 servers. Therefore, we
can provide availability and durability only when enough servers are
online and reachable.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;As stated, the two concepts are remarkably similar, but there’s an
important difference. For semantics like linearizability, AF is
typically a function of N and grows with replication factor. In
contrast, DF is typically constant and independent of replication
factor.&lt;/p&gt;

&lt;p&gt;This brings us to antirez’s point. When N=3 and we want writes to
survive one server failure (DF=1), durability requires contacting
DF+1=2 servers, and majority quorums also require AF=2; they’re the
same! When we want higher durability without having to contact all
servers, N=5 with DF=2 (contacting three servers) is a reasonable
choice, and, again, the durability requirement matches the majority
quorum size of AF=3. For large replication factors, say N=100, the
difference grows: contacting three servers still suffices for DF=2,
while majority quorums require AF=51. But, in practice, replication
factors are often small, so the preconditions for availability when
maintaining both durability and consistency are often equivalent.&lt;/p&gt;
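&lt;p&gt;The arithmetic is easy to write down (a sketch; AF uses the usual majority-quorum size):&lt;/p&gt;

```python
# Preconditions for a guaranteed response, as a function of the
# replication factor n and the number of failures to survive, df.
def majority_af(n):
    # servers a linearizable majority quorum must reach
    return n // 2 + 1

def durability_contacts(df):
    # servers that must acknowledge a write to survive df failures
    return df + 1
```

&lt;p&gt;For N=3 and DF=1 both come out to 2; at N=100, surviving two failures still needs only 3 acknowledgements while a majority quorum needs 51.&lt;/p&gt;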

&lt;p&gt;It’s worth noting that AF=1 and DF=1 &lt;em&gt;is&lt;/em&gt; an option, and it’s
&lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;fast&lt;/a&gt;,
but it will preclude durability in the event of server failures and
also disallows linearizable semantics in the event that you have
multiple active replicas (N &amp;gt; 2).&lt;/p&gt;

&lt;p&gt;The above analysis doesn’t take into account reads, which, in weakly
consistent systems, can contact any non-failing replica, but I think
this sheds some light on the discussion.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Non-blocking Transactional Atomicity</title>
   <link href="http://bailis.org/blog//non-blocking-transactional-atomicity/"/>
   <updated>2013-05-28T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//non-blocking-transactional-atomicity</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: You can perform non-blocking multi-object atomic reads and
 writes across arbitrary data partitions via some simple
 multi-versioning and by storing metadata regarding related items.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;N.B. This is a long post, but it’s comprehensive. Reading &lt;a href=&quot;#putting_it_all_together&quot;&gt;the first third&lt;/a&gt; will give you
  most of the understanding.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Edit 4/2014: We wrote a &lt;a href=&quot;http://bailis.org/papers/ramp-sigmod2014.pdf&quot;&gt;SIGMOD paper&lt;/a&gt; on these ideas! Check it
  out or read an &lt;a href=&quot;../scaling-atomic-visibility-with-ramp-transactions/&quot;&gt;updated post&lt;/a&gt; on the new algorithms.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Performing multi-object updates is a common but difficult problem in
real-world distributed systems. When updating two or more items at
once, it’s useful for other readers of those items to observe
atomicity: &lt;em&gt;either all of your updates are visible or none of them
are&lt;/em&gt;.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#atomicity-note&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;
This crops up in a bunch of contexts, from social network graphs
(e.g., &lt;a href=&quot;http://hive.asu.edu/sigmod12/index.php?option=com_community&amp;amp;view=courses&amp;amp;task=viewpresentation&amp;amp;groupid=644&amp;amp;Itemid=0&quot;&gt;Facebook’s Tao
system&lt;/a&gt;,
where bi-directional “friend” relationships are stored in two
uni-directional pointers) to distributed data structures like counters
(e.g., &lt;a href=&quot;http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011&quot;&gt;Twitter’s
Rainbird&lt;/a&gt;
hierarchical aggregator) and secondary indexes (a topic for a future
post). In conversations I’ve had regarding our work on &lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;Highly
Available
Transactions&lt;/a&gt;,
atomic multi-item update, or transactional atomicity, is often the
most-requested feature.&lt;/p&gt;

&lt;h4 id=&quot;existing-techniques-locks-entity-groups-and-fuck-it-mode&quot;&gt;Existing Techniques: Locks, Entity Groups, and “Fuck-it Mode”&lt;/h4&gt;

&lt;p&gt;The state of the art in transactional multi-object update typically
employs one of three strategies. &lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Use locks to update multiple items at once. Grab write locks
on update and read locks for reads and you’ll ensure transactional
atomicity. However, in a distributed environment, the possibility of
partial failure and network latency means locking can lead to a Bad
Time™.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#lock-note&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Co-locate distributed objects you’d like to update together. This
strategy (often called &lt;a href=&quot;http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf&quot;&gt;“entity
groups”&lt;/a&gt;)
makes transactional atomicity easy: locking on a single machine is
fast and not subject to the problems of distributed locking under
partial failure and network latency. Unfortunately, this solution
impacts data layout and distribution and does not work well for data
that is difficult to partition (social networks, anyone?).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Use “fuck-it mode,” whereby you simultaneously update all keys
without any concurrency control and hope readers observe transactional
atomicity. This final option is remarkably common: it scales well and
is applicable to any system, but it doesn’t provide any atomicity
guarantees until the system stabilizes (i.e., converges, or is
eventually consistent).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, I’ll provide a simple alternative (let’s call it
&lt;em&gt;Non-blocking Transactional Atomicity&lt;/em&gt;, or NBTA) that uses
multi-versioning and some extra metadata to ensure transactional
atomicity without the use of locks. Specifically, our solution does
not block readers or writers in the event of arbitrary process failure
and, as long as readers and writers can contact a server for each data
item they want to access, the system can guarantee transactional
atomicity for both reads and writes. At a high level, the key idea is to
avoid performing in-place updates and to use additional metadata in
place of synchronous coordination across replicas.&lt;/p&gt;

&lt;style&gt;
img { width: 80%; height:auto; display: block; margin-left: auto; margin-right: auto; }
.big { width: 35%; height:auto; }
&lt;/style&gt;

&lt;h1 id=&quot;nbta-by-example&quot;&gt;NBTA by Example&lt;/h1&gt;

&lt;p&gt;To illustrate the NBTA algorithm, consider the simple scenario where
there are two servers, one storing item &lt;code&gt;x&lt;/code&gt; and the other storing item
&lt;code&gt;y&lt;/code&gt;, both of which have value &lt;code&gt;0&lt;/code&gt;. Say we have two clients, one of
which wishes to write &lt;code&gt;x=1&lt;/code&gt;, &lt;code&gt;y=1&lt;/code&gt; and another that wants to read &lt;code&gt;x&lt;/code&gt;
and &lt;code&gt;y&lt;/code&gt; together (i.e., &lt;code&gt;x=y=0&lt;/code&gt; or &lt;code&gt;x=y=1&lt;/code&gt;). (We’ll discuss
replication later.)&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;big&quot; src=&quot;../post_data/2013-05-28/1-setup.png&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;good-pending-and-invariants&quot;&gt;&lt;em&gt;good&lt;/em&gt;, &lt;em&gt;pending&lt;/em&gt;, and Invariants&lt;/h4&gt;

&lt;p&gt;Let’s split each server’s storage into two parts: &lt;code&gt;good&lt;/code&gt; and
&lt;code&gt;pending&lt;/code&gt;. We will maintain the invariant that every write stored in
&lt;code&gt;good&lt;/code&gt; will have its transactional sibling writes (i.e., the other
writes originating from the transactionally atomic operation) present
on each of their respective servers, either in &lt;code&gt;good&lt;/code&gt; or
&lt;code&gt;pending&lt;/code&gt;. That is, if &lt;code&gt;x=1&lt;/code&gt; is in &lt;code&gt;good&lt;/code&gt; on the server for &lt;code&gt;x&lt;/code&gt;, then,
in the example above, &lt;code&gt;y=1&lt;/code&gt; will be guaranteed to be in &lt;code&gt;good&lt;/code&gt; or
&lt;code&gt;pending&lt;/code&gt; on the server for &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../post_data/2013-05-28/2-invariant.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To maintain the above invariant, servers first place writes into
&lt;code&gt;pending&lt;/code&gt;. Then, once servers learn (possibly asynchronously) that a
write’s transactional siblings are all in &lt;code&gt;pending&lt;/code&gt; (let’s call this
process “learning that a write is stable”), the servers individually
move their respective writes into &lt;code&gt;good&lt;/code&gt;. One simple strategy for
informing servers that a write is stable is to have the writing client
perform two rounds of communication: the first round places writes
into &lt;code&gt;pending&lt;/code&gt;, then, once all servers have acknowledged writes to
&lt;code&gt;pending&lt;/code&gt;, the client notifies each server that its write is
stable. (If you’re nervous, this isn’t two-phase commit; more on that
&lt;a href=&quot;#so-what-just-happened&quot;&gt;later&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../post_data/2013-05-28/3-rounds.png&quot; /&gt;&lt;/p&gt;
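&lt;p&gt;To make the two-round protocol concrete, here&#39;s a minimal, single-process Python sketch. Names like &lt;code&gt;put_pending&lt;/code&gt; and &lt;code&gt;mark_stable&lt;/code&gt; are hypothetical; this is illustrative pseudocode, not our implementation:&lt;/p&gt;

```python
# Illustrative two-round NBTA write (single-process sketch, hypothetical API).
# Each server keeps two stores: `pending` and `good`.
import uuid

class Server:
    def __init__(self):
        self.pending = {}  # key -> (timestamp, value, sibling keys)
        self.good = {}     # key -> (timestamp, value, sibling keys)

    def put_pending(self, key, ts, value, siblings):
        # Round 1: stage the write, then acknowledge to the client.
        self.pending[key] = (ts, value, siblings)
        return True

    def mark_stable(self, key, ts):
        # Round 2: every sibling is known to be pending somewhere,
        # so move this write from `pending` to `good`.
        if key in self.pending and self.pending[key][0] == ts:
            self.good[key] = self.pending.pop(key)

def atomic_write(servers, updates):
    """servers: key -> Server; updates: key -> value."""
    ts = uuid.uuid4().hex       # unique transaction ID ("timestamp")
    siblings = sorted(updates)  # keys written by this transaction
    for key, value in updates.items():  # round 1
        assert servers[key].put_pending(key, ts, value, siblings)
    for key in updates:                 # round 2
        servers[key].mark_stable(key, ts)
```

&lt;p&gt;Note that the &quot;timestamp&quot; above is only used for identity, so any unique ID suffices; ordered timestamps only matter once we add last-writer-wins semantics later on.&lt;/p&gt;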

&lt;h4 id=&quot;races-and-pointers&quot;&gt;Races and Pointers&lt;/h4&gt;

&lt;p&gt;We’re almost done. If readers read from &lt;code&gt;good&lt;/code&gt;, then they’re
guaranteed to be able to read transactional siblings from other
servers. However, there’s a race condition: what if one server has
placed its write in &lt;code&gt;good&lt;/code&gt; but another still has its transactional
sibling in &lt;code&gt;pending&lt;/code&gt;? We need a way to tell the second server to serve
its read from &lt;code&gt;pending&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../post_data/2013-05-28/4-race.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To handle this race condition, we attach additional information to
each write: a list of transactional siblings. At the start of a
multi-key update, clients generate a unique timestamp for all of their
writes (say, client ID plus local clock or random hash), which they
attach to each write, along with a list of the keys written to in the
transaction. Now, when a client reads from &lt;code&gt;good&lt;/code&gt;, it will have a list
of transactional siblings with the same timestamp. When the client
requests a read from one of those sibling items, the server can fetch
it from either &lt;code&gt;pending&lt;/code&gt; or &lt;code&gt;good&lt;/code&gt;. If a client doesn’t need to read a
specific item, the server can respond with the highest-timestamped
item from &lt;code&gt;good&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../post_data/2013-05-28/5-metadata.png&quot; /&gt;&lt;/p&gt;
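&lt;p&gt;A matching sketch of the read path (again with made-up names): the reader fetches the highest-timestamped write from &lt;code&gt;good&lt;/code&gt;, then uses the attached sibling list and timestamp to request the exact sibling versions, which servers may serve from either &lt;code&gt;good&lt;/code&gt; or &lt;code&gt;pending&lt;/code&gt;:&lt;/p&gt;

```python
# Illustrative NBTA read path (hypothetical names, not our implementation).
# Both stores are multi-versioned: key -> {timestamp: (value, sibling keys)}.

class Server:
    def __init__(self):
        self.pending = {}
        self.good = {}

    def read_latest(self, key):
        """Return (ts, value, siblings) for the highest-timestamped write in `good`."""
        versions = self.good.get(key, {})
        if not versions:
            return None
        ts = max(versions)
        value, siblings = versions[ts]
        return ts, value, siblings

    def read_at(self, key, ts):
        """Return the value written at `ts`, checking `good` and then `pending`."""
        for store in (self.good, self.pending):
            if ts in store.get(key, {}):
                return store[key][ts][0]
        return None  # cannot happen once the first read hit `good`

def atomic_read(servers, key):
    """Read `key` plus all of its transactional siblings, without blocking."""
    first = servers[key].read_latest(key)
    if first is None:
        return {}
    ts, value, siblings = first
    result = {key: value}
    for sib in siblings:
        if sib != key:
            result[sib] = servers[sib].read_at(sib, ts)
    return result
```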

&lt;h4 id=&quot;putting-it-all-together&quot;&gt;Putting it all together&lt;/h4&gt;

&lt;p&gt;We now have an algorithm that guarantees that all writes in a
multi-key update are accessible before revealing them to readers. If a
reader accesses a write, it is guaranteed to be able to access its
transactional siblings without blocking. This way, readers will never
stall waiting for a sibling that hasn’t arrived on its respective
server. To make sure readers can access siblings in both &lt;code&gt;good&lt;/code&gt; and
&lt;code&gt;pending&lt;/code&gt;, we attached additional metadata to each write that can be
used by servers in the event of skews in stable write detection across
servers. If readers or writers fail, there is no effect on other
readers or writers. Any partially written multi-key updates will never
become stable, and servers can optionally guarantee write stability by
performing &lt;code&gt;pending&lt;/code&gt; acknowledgments (i.e., performing the second
phase of the client write) for themselves.&lt;/p&gt;

&lt;h2 id=&quot;it-gets-better&quot;&gt;It Gets Better!&lt;/h2&gt;

&lt;h4 id=&quot;because-optimizations-are-awesome&quot;&gt;…because optimizations are awesome…&lt;/h4&gt;

&lt;p&gt;There are several optimizations and modifications we can make to the
NBTA protocol:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Size of&lt;/em&gt; &lt;code&gt;pending&lt;/code&gt; &lt;em&gt;and&lt;/em&gt; &lt;code&gt;good&lt;/code&gt;&lt;em&gt;:&lt;/em&gt; if users want “last writer wins”
semantics, there’s no need to store more than one write in
&lt;code&gt;good&lt;/code&gt;. However, if we do this, a write’s sibling may have been
overwritten. If we want to prevent readers from reading “forwards
in time” (e.g., read &lt;code&gt;x=0&lt;/code&gt; then &lt;code&gt;y=1&lt;/code&gt; then &lt;code&gt;x=1&lt;/code&gt;, which preserves
the property that once one write becomes visible, all of a
transaction’s writes become visible but does not guarantee a
consistent snapshot across items), then servers can retain items in
&lt;code&gt;good&lt;/code&gt; for a bounded amount of time (e.g., as long as a multi-item
read might take) and/or clients can retry reads in the presence of
overwrites.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Faster writes:&lt;/em&gt; As I alluded to above, it’s not necessary to have
the client perform the second round of communication (which requires
three message delays until visibility). Instead, servers can
directly contact one another once they’ve placed writes in
&lt;code&gt;pending&lt;/code&gt;, requiring only two message delays. Alternatively, clients
can issue the second round of communication asynchronously. However,
to ensure that clients read their writes in these scenarios, they
need to retain metadata until they have (asynchronously) detected
that each write is in &lt;code&gt;good&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Replication:&lt;/em&gt; So far, I’ve only discussed having a single server
for each data item. With “strong consistency” (i.e.,
linearizability) per server, the above algorithm works fine. With
asynchronous, or lazy, replication between servers (e.g., “eventual
consistency”), there are two options. If all clients contact
disjoint sets of servers (e.g., all clients in a datacenter contact
a full set of replicas), then clients only need to update their
local set of servers, and each set of servers can detect when writes
are stable within their groups. However, if clients can connect to
any server, then writes should only become stable once all
respective servers have placed their writes in &lt;code&gt;good&lt;/code&gt; or
&lt;code&gt;pending&lt;/code&gt;. This can take indefinitely long in the presence of
partial failure.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#ha-note&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Read/write transactions:&lt;/em&gt; I’ve discussed read-only and write-only
transactions here, but it’s easy to use these techniques for
general-purpose read/write transactions. The main problem when
aiming for models like ANSI Repeatable Read (i.e., snapshot reads)
is ensuring that reads come from a transactionally atomic set: this
can be done by pre-declaring all reads in the transaction and
fetching all items at the start of the transaction or via fancier
(and more expensive) metadata like vector clocks, which I won’t get
into here.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;Metadata sizes:&lt;/em&gt; The metadata required above is linear in the number
of keys written. This is modest in practice, but metadata can also
be dropped once all sibling writes are present in &lt;code&gt;good&lt;/code&gt; (i.e.,
there is no race condition for the transactional writes).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
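&lt;p&gt;As a rough sketch of the &quot;faster writes&quot; optimization above, servers can acknowledge &lt;code&gt;pending&lt;/code&gt; writes directly to each sibling&#39;s server and promote a write to &lt;code&gt;good&lt;/code&gt; once acks for every sibling have arrived. The sketch assumes reliable, in-order delivery; a real implementation would also re-check stability when a delayed write arrives:&lt;/p&gt;

```python
# Sketch of server-to-server stability detection (two message delays
# instead of three); hypothetical structure, not our implementation.

class Server:
    def __init__(self):
        self.peers = {}    # key -> Server responsible for that key
        self.pending = {}  # (key, ts) -> value
        self.good = {}     # key -> (ts, value)
        self.acks = {}     # ts -> set of sibling keys seen in `pending`

    def put_pending(self, key, ts, value, siblings):
        self.pending[(key, ts)] = value
        # Tell every sibling's server (including ourselves) that this
        # write has reached `pending`.
        for sib in siblings:
            self.peers[sib].receive_ack(ts, key, siblings)

    def receive_ack(self, ts, acked_key, siblings):
        seen = self.acks.setdefault(ts, set())
        seen.add(acked_key)
        if seen == set(siblings):
            # Every sibling is pending somewhere: promote our write(s).
            for (key, t), value in list(self.pending.items()):
                if t == ts:
                    self.good[key] = (t, value)
                    del self.pending[(key, t)]

x_srv, y_srv = Server(), Server()
peers = {"x": x_srv, "y": y_srv}
x_srv.peers = peers
y_srv.peers = peers
x_srv.put_pending("x", 1, 10, ["x", "y"])  # x staged; nothing stable yet
y_srv.put_pending("y", 1, 20, ["x", "y"])  # both writes now promoted
```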

&lt;h4 id=&quot;and-it-works-in-real-life&quot;&gt;…and it works in real life.&lt;/h4&gt;

&lt;p&gt;We’ve built a database based on LevelDB that implements NBTA with all
of the above optimizations except metadata pruning (&lt;a href=&quot;../post_data/2013-05-28/ta-rc-pseudocode.png&quot;&gt;related
pseudocode here&lt;/a&gt;). Under
the Yahoo! Cloud Serving Benchmark, NBTA transactions of 8 operations
each achieve within 33% (all writes) to 4.8% (all reads) of the peak
throughput of eventually consistent (i.e., “fuck-it”) operation (with
3.8–48% higher latency). Our implementation scales linearly, to over
250,000 operations per second for transactions of length 8 consisting
of 50% reads and 50% writes on a deployment of 50 EC2 instances.&lt;/p&gt;

&lt;p&gt;In our experience, NBTA performs substantially better than lock-based
operation because there is no blocking involved. The two primary
sources of overhead are metadata (expected to be small for real-world
transactions like the Facebook friend-pointer and secondary-indexing
examples above) and moving writes from &lt;code&gt;pending&lt;/code&gt; to &lt;code&gt;good&lt;/code&gt; (if, as in our
implementation, writes to &lt;code&gt;pending&lt;/code&gt; are persistent, this results in
two durable server-side writes for every client-initiated
write). Given these results, we’re excited to start applying NBTA to
other data stores (and secondary indexing).&lt;/p&gt;

&lt;h4 id=&quot;so-what-just-happened&quot;&gt;So what just happened?&lt;/h4&gt;

&lt;p&gt;If you’re a distributed systems or database weenie like me, you may be
curious how NBTA relates to well-known problems like two-phase commit.&lt;/p&gt;

&lt;p&gt;The NBTA algorithm is a variant of uniform reliable broadcast with
additional metadata to address the case where some servers have
delivered writes but others have not yet, providing safety (e.g., &lt;a href=&quot;http://www.newbooks-services.de/MediaFiles/Texts/7/9783642152597_Excerpt_001.pdf&quot;&gt;see
Algorithm
3.4&lt;/a&gt;). Formally,
NBTA as presented here does not guarantee termination: servers may not
realize that a write in &lt;code&gt;pending&lt;/code&gt; will never become
stable. Recognizing a “dead write” in &lt;code&gt;pending&lt;/code&gt; requires failure
detection and, in practice, writes can be removed from &lt;code&gt;pending&lt;/code&gt; once
sibling servers have been marked as dead, the server detects that a
client died mid-write, the write (under last-writer-wins semantics) is
overwritten by a higher timestamped write in &lt;code&gt;good&lt;/code&gt;, or, more
pragmatically, after a timeout.&lt;/p&gt;

&lt;p&gt;As presented here, servers don’t &lt;code&gt;abort&lt;/code&gt; updates, but this isn’t
fundamental. Instead of placing items in &lt;code&gt;pending&lt;/code&gt;, servers can
instead reject updates, so any updates that were placed into &lt;code&gt;pending&lt;/code&gt;
on other servers will never become stable. NBTA is weaker than
traditional &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.5491&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;non-blocking atomic commitment
protocols&lt;/a&gt;
because it allows non-termination for individual transactional updates
(that is, garbage collecting &lt;code&gt;pending&lt;/code&gt; may take a while). The trick is
that, in practice, as long as independent transactional updates can be
executed concurrently (as is the case with last-writer-wins and as is
the case for all &lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;Highly Available
Transaction&lt;/a&gt;
semantics), a stalled transactional update won’t affect other
updates. In contrast, traditional techniques like two-phase commit
with two-phase locking will require stalling in the presence of
coordinator failure.&lt;/p&gt;

&lt;p&gt;There are several ideas in the database literature that are similar to
NBTA. The optimization for reducing message round trips is similar to
the optimizations employed by &lt;a href=&quot;http://research.microsoft.com/pubs/64636/tr-2003-96.pdf&quot;&gt;Paxos
Commit&lt;/a&gt;,
while the use of additional metadata to guard against concurrent
updates may remind you of &lt;a href=&quot;http://www.cs.cornell.edu/courses/cs4411/2009sp/blink.pdf&quot;&gt;B-link
trees&lt;/a&gt; or
other lockless data structures. And, of course, &lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter5.pdf&quot;&gt;multi-version
concurrency
control&lt;/a&gt;
and &lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.142.552&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;timestamp-based concurrency
control&lt;/a&gt;
have a long history in database systems. The key in NBTA is to achieve
transactional atomicity while avoiding a centralized timestamp
authority or concurrency control mechanism.&lt;/p&gt;

&lt;p&gt;All this said, I haven’t seen a distributed transactional atomicity
algorithm like NBTA before; if you have, please do &lt;a href=&quot;http://www.bailis.org/contact.html&quot;&gt;let me
know&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This post demonstrated how to achieve atomic multi-key updates across
arbitrary data partitions without using locks or losing the ability to
provide a safe response despite arbitrary failures of readers,
writers, and (depending on the configuration) servers. The key idea
was to establish an invariant that all writes have to be present on
the appropriate servers before showing them to readers. The challenge
was in solving a race condition when revealing writes on different
servers—trivial for locks but harder for a highly available
system. And it works in practice: rather well, and much better than
similar lock-based techniques! If you’ve made it this far, you’ve
probably followed along, but I look forward to following up with a
post on how to perform consistent secondary indexing via similar
techniques—a potential killer application for NBTA, particularly
given that it’s &lt;a href=&quot;https://cs.brown.edu/courses/cs227/archives/2012/papers/weaker/cidr07p15.pdf&quot;&gt;often considered impossible in scalable
systems&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As always, feedback is welcomed and encouraged. If you’re interested
in these algorithms in your system, let’s talk.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to &lt;a href=&quot;https://twitter.com/palvaro&quot;&gt;Peter Alvaro&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/neil_conway&quot;&gt;Neil
 Conway&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/apanda&quot;&gt;Aurojit
 Panda&lt;/a&gt;, and &lt;a href=&quot;http://shivaram.info/&quot;&gt;Shivaram
 Venkataraman&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com/rxin&quot;&gt;Reynold
 Xin&lt;/a&gt; for early feedback on this post. This
 research is joint work with Aaron Davidson, &lt;a href=&quot;http://www.cs.usyd.edu.au/~fekete&quot;&gt;Alan
 Fekete&lt;/a&gt;, &lt;a href=&quot;http://www.cs.berkeley.edu/~alig/&quot;&gt;Ali
 Ghodsi&lt;/a&gt;, &lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot;&gt;Joe
 Hellerstein&lt;/a&gt;, and &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/&quot;&gt;Ion
 Stoica&lt;/a&gt; at UC Berkeley and the
 University of Sydney.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/div&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;atomicity-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#atomicity-note&quot;&gt;[1]&lt;/a&gt;  Note that
this “atomicity” is not the same as
&lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;linearizability&lt;/a&gt;, the
data consistency property addressed in Gilbert and Lynch’s &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;proof of
the CAP
Theorem&lt;/a&gt;
and often &lt;a href=&quot;http://research.microsoft.com/en-us/um/people/lamport/pubs/interprocess.pdf&quot;&gt;referred
to&lt;/a&gt;
as “atomic” consistency. Linearizability concerns ordering operations
with respect to real time and is a single-object guarantee. The
“atomicity” here stems from a database context (namely, the &lt;a href=&quot;http://en.wikipedia.org/wiki/Atomicity_(database_systems)&quot;&gt;“A” in
“ACID”&lt;/a&gt;)
and concerns performing and observing operations over multiple
objects. To avoid further confusion, we’ll call this “atomicity”
“transactional atomicity.”&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;lock-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#lock-note&quot;&gt;[2]&lt;/a&gt;&amp;nbsp; More
specifically, there are a bunch of ways things can get weird. If a
client dies while holding locks, then the servers should eventually
revoke the locks. This often requires some form of failure detection
or timeout, which leads to awkward scenarios over asynchronous
networks, coupled with effective unavailability prior to lock
revocation. In a linearizable system, as in the example, we&#39;ve already
given up on availability, so this isn&#39;t necessarily horrible—but
it&#39;s a shame (read: it&#39;s slow) to block readers during updates and
vice-versa. If we&#39;re going for a highly available (F=N-1
fault-tolerant) setup (as we will&amp;nbsp;&lt;a href=&quot;#because-optimizations-are-awesome&quot;&gt;later on&lt;/a&gt;), locks are a
non-starter; locks are fundamentally at odds with providing available
operation on all replicas during partitions. &lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;ha-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#ha-note&quot;&gt;[3]&lt;/a&gt;&amp;nbsp; Hold up, cowboy!
What does this replication mean for availability?  As I&#39;ll
discuss&amp;nbsp;&lt;a href=&quot;#so-what-just-happened&quot;&gt;soon&lt;/a&gt;, we haven&#39;t
talked about when the effects of transactions will become visible in
the event of replica failures (i.e., when people will read my
writes). Readers will &lt;em&gt;always&lt;/em&gt; be able to read transactionally atomic
sets of data items from non-failing replicas; however, depending on
the desired availability, reads may not be the most &quot;up to date&quot; set
that is available on some servers. One way to look at this trade-off
is as follows:&lt;/span&gt; &lt;ol class=&quot;footnote&quot;&gt; &lt;li&gt;You can achieve
linearizability and transactional atomicity, whereby everyone sees all
writes after they complete, but writes may take an indefinite amount
of time to complete (&quot;CP&quot;)&lt;/li&gt;&lt;br /&gt; &lt;li&gt;You can achieve
read-your-writes and transactional atomicity, whereby you can see your
writes after they complete, but you&#39;ll have to remain &quot;sticky&quot; and
continue to contact the same (logical) set of servers during execution
(your &quot;sticky&quot; neighboring clients will also see your writes;
&quot;Sticky-CP&quot;)&lt;/li&gt;&lt;br /&gt; &lt;li&gt;You can achieve transactional atomicity and
be able to contact any server, but writes won&#39;t become visible until
all servers you might read from have received the transactional writes
(&quot;AP&quot;; at the risk of sub-footnoting myself, I&#39;ll note that there are
cool and useful connections to different kinds of &lt;a href=&quot;http://pine.cs.yale.edu/pinewiki/FailureDetectors&quot;&gt;failure
detectors&lt;/a&gt;
here).&lt;/li&gt; &lt;/ol&gt; &lt;span class=&quot;footnote&quot;&gt; Transactionally atomic
&lt;a href=&quot;http://www.bailis.org/blog/safety-and-liveness-eventual-consistency-is-not-safe/&quot;&gt;safety
properties&lt;/a&gt;
are guaranteed in all three scenarios, but the safety guarantees on
&lt;em&gt;recency&lt;/em&gt; offered by each vary. The main ideas presented here apply to
all three cases but were developed in the context of HA
systems.&lt;/span&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Communication Costs in Real-world Networks</title>
   <link href="http://bailis.org/blog//communication-costs-in-real-world-networks/"/>
   <updated>2013-05-17T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//communication-costs-in-real-world-networks</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: Network latencies in the wild can be expensive, especially at
 the tail and across datacenters: how bad are they, and what can we do
about them? Make sure to explore &lt;a href=&quot;#explore&quot;&gt;the demo&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Network latency makes distributed programming hard. Even when a
distributed system is fault-free, any communication between servers
affects performance. While the theoretical lower-bound on
communication delay is the speed of light—not horrible, at least
within a single datacenter—latencies are rarely this fast. I’ve been
working on and benchmarking &lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;communication-avoiding
databases&lt;/a&gt;
and wanted to isolate and quantify the behavior of real-world networks
both within and across datacenters. This post contains an
interactive &lt;a href=&quot;#explore&quot;&gt;demo&lt;/a&gt; of what we found, some
high-level &lt;a href=&quot;#high-level-takeaways&quot;&gt;trends&lt;/a&gt;, and some &lt;a href=&quot;#implications-for-distributed-systems-designers&quot;&gt;implications&lt;/a&gt;
for distributed systems designs.&lt;/p&gt;

&lt;p&gt;I wasn’t aware of any datasets describing network behavior both within
and across datacenters, so we launched m1.small Amazon EC2 instances
in each of the eight geo-distributed “Regions,” across the three
us-east “Availability Zones” (three co-located datacenters in
Virginia), and within one datacenter (us-east-b). We measured RTTs
between hosts for a week at a granularity of one ping per
second.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#methodology-note&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;div id=&quot;explore&quot; style=&quot;margin-top:1em;&quot;&gt;&lt;i&gt;I&#39;ve made the raw data &lt;a href=&quot;https://github.com/pbailis/aws-ping-traces&quot;&gt;available on
Github&lt;/a&gt; but, as an excuse to play with &lt;a href=&quot;http://d3js.org/&quot;&gt;D3.js&lt;/a&gt;, I built this interactive
visualization; select different percentiles and drag and zoom into the
graph:&lt;/i&gt;&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#nc-note&quot;&gt;2&lt;/a&gt;,&lt;a class=&quot;no-decorate&quot; href=&quot;#render-note&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;

&lt;iframe style=&quot;width:800px; height:800px; frame:0px;&quot; frameborder=&quot;0&quot; src=&quot;http://bailis.org/blog/post_data/2013-05-17/latencies.html&quot;&gt;I suggest you enable iframes.&lt;/iframe&gt;
&lt;/div&gt;

&lt;h4 id=&quot;high-level-takeaways&quot;&gt;High-level Takeaways&lt;/h4&gt;

&lt;p&gt;Aside from the absolute numbers and the raw data, I think that there
     are a few interesting takeaways. If you’re a networking guru,
     these may be obvious, but I found the magnitude of these trends
     surprising. &lt;em&gt;(N.B. These aren’t necessarily Amazon-specific, and
     this is hardly an indictment of AWS.)&lt;/em&gt;&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#latency-lit-note&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Latency » Speed of Light&lt;/strong&gt; The minimum RTT between any two nodes
  was 227µs, almost two orders of magnitude higher than the
  theoretical minimum. Across continents, latencies were also
  higher than the speed of light requires: a Dublin to Sydney round
  trip could take around 115 milliseconds at light speed but requires
  around 350ms on average. Instead, routers, network topologies, virtualization,
  and the end-host software stack all get in the way.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Average « Tail&lt;/strong&gt; Within us-east-b, ping times averaged around
  400µs; this is close to Jeff Dean’s figure from his &lt;a href=&quot;http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html&quot;&gt;Numbers
  Everyone Should
  Know&lt;/a&gt;. However,
  at the tail, latencies get much worse: at the 99.9th percentile,
  latency (again, within a single datacenter) rose to between 11.6
  and 21ms. At the 99.999th percentile, latency increased to
  between 84 and 151ms—a 160 to 350x increase over the average!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Cross-Datacenter Communication is Expensive&lt;/strong&gt; On average,
  communicating across availability zones was 2–7x slower than
  communicating within an availability zone; communicating across
  geographic regions was 44–720x slower. Notably, latencies for
  cross-geographic regions performed relatively better at the tail:
  at the 99.999th percentile, cross-region RTTs were only 1.4–45x
  slower than us-east RTTs. I suspect this is because transit
  delays on the wire are fixed, while routing and software-related
  delays are more likely to vary. However, the network distance
  between AZs &lt;em&gt;also&lt;/em&gt; varied: us-east-b and us-east-c had a minimum
  RTT of 693µs but us-east-c to us-east-d had a minimum RTT of
  1.31ms (and, on average, a difference of almost 3.5x); not all
  local DC communication links are equal.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
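&lt;p&gt;To get a feel for how heavily a few stragglers can skew the tail, here&#39;s a toy percentile calculation over synthetic RTTs (synthetic data only, not the actual traces, which are in the GitHub repository above):&lt;/p&gt;

```python
# Toy tail-latency illustration with synthetic RTT samples (microseconds).
import random

def percentile(samples, p):
    """Nearest-rank percentile; p is in [0, 100]."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(len(s) * p / 100.0))
    return s[idx]

random.seed(0)
# Mostly ~400us RTTs, plus a small fraction of multi-millisecond
# stragglers (retransmits, queueing, VM scheduling, and so on).
rtts_us = [random.gauss(400, 50) for _ in range(100_000)]
rtts_us += [random.uniform(5_000, 20_000) for _ in range(200)]

mean = sum(rtts_us) / len(rtts_us)
p999 = percentile(rtts_us, 99.9)
print(f"mean={mean:.0f}us  p99.9={p999:.0f}us  ratio={p999 / mean:.1f}x")
```

&lt;p&gt;Even though the stragglers are only 0.2% of the samples, they dominate the 99.9th percentile while barely moving the mean.&lt;/p&gt;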

&lt;h4 id=&quot;implications-for-distributed-systems-designers&quot;&gt;Implications for Distributed Systems Designers&lt;/h4&gt;

&lt;p&gt;Aside from any particular statistical behavior or correlations, this
data highlights the importance of reasoning about latency in
distributed system design. While many &lt;a href=&quot;http://www.eecs.harvard.edu/~waldo/Readings/waldo-94.pdf&quot;&gt;five-star
wizards&lt;/a&gt; of
distributed computing &lt;a href=&quot;https://blogs.oracle.com/jag/resource/Fallacies.html&quot;&gt;have long warned
us&lt;/a&gt; of the
pitfalls of network latency, there are at least two additional
challenges today: almost every new system is distributed and many
systems are operating at larger scale than ever before. The former
means that more distributed systems developers need
&lt;a href=&quot;http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf&quot;&gt;communication-avoiding
techniques&lt;/a&gt;,
while the latter means that &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=2408794&quot;&gt;the tail will continue to
grow&lt;/a&gt;. Even if we solve the
LAN latency problem, the lower bound on communication cost is still
much higher than that of local data access, and multi-datacenter
system deployments are increasingly common. While we can reduce some
inefficiencies today, there are fundamental barriers to improvement,
like the speed of light; I believe the solution to avoiding latency
penalties will come from better software, algorithms, and programming
techniques instead of better network hardware. &lt;a href=&quot;http://www.bloom-lang.net/&quot;&gt;Better
languages&lt;/a&gt;,
&lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions/&quot;&gt;semantics&lt;/a&gt;,
and
&lt;a href=&quot;http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf&quot;&gt;libraries&lt;/a&gt;
are a start.&lt;/p&gt;

&lt;hr /&gt;

&lt;div id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/div&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;methodology-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#methodology-note&quot;&gt;[1]&lt;/a&gt; There’s a
non-negligible chance that this post generates debate with respect to
this methodology. My primary purpose for this experiment was to
demonstrate the considerable gap between LAN and WAN latencies, which
are easily captured by the data (if this is your cup of tea, let’s
talk!). However, it’s possible that EC2 virtualization and the choice
of m1.small instances led to higher latencies due to factors like
multi-tenancy and VM migration. There’s also no doubt that larger
packet sizes would change these trends; indeed, in recent database
benchmarking, we’ve observed several additional effects related to
local processing and EC2 NIC behavior under heavy traffic. Please feel
free to leave a comment or get in contact, especially if you have
suggestions for improvement or have any data to share; I’ll gladly
link to it and use it if possible.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;nc-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#nc-note&quot;&gt;[2]&lt;/a&gt; If you like this
stuff, there’s some really cool research that studies the
metric spaces that arise from network topologies; one of my favorite
papers is &lt;a href=&quot;http://www.eecs.harvard.edu/~syrah/nc/wild07.pdf&quot;&gt;“Network Coordinates in the
Wild”&lt;/a&gt; by Ledlie et
al. in NSDI 2007, which applies network coordinates to the (real-world, “production”) Azureus
file sharing network.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;render-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#render-note&quot;&gt;[3]&lt;/a&gt; My apologies
for overlaying the cross-AZ and the us-east results on top of the
cross-region data. I looked into pinning each of these two clusters in
designated locations but was only able to pin one at a time and
eventually gave up, settling for the effect that auto-adjusts the label
size.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;latency-lit-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#latency-lit-note&quot;&gt;[4]&lt;/a&gt; I don’t
study networks (rather, I spend my time building systems on top of
them), but there’s a lot of ongoing work on alleviating these
problems. For a short position paper regarding what &lt;em&gt;should&lt;/em&gt; be
possible and what we may need to fix, check out &lt;a href=&quot;http://www.scs.stanford.edu/~rumble/papers/latency_hotos11.pdf&quot;&gt;“It’s Time For Low
Latency”&lt;/a&gt;
by Rumble et al. in HotOS 2011.&lt;/span&gt;&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>HAT, not CAP: Introducing Highly Available Transactions</title>
   <link href="http://bailis.org/blog//hat-not-cap-introducing-highly-available-transactions/"/>
   <updated>2013-02-05T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//hat-not-cap-introducing-highly-available-transactions</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: &lt;a href=&quot;http://arxiv.org/pdf/1302.0309.pdf&quot;&gt;Highly Available
 Transactions&lt;/a&gt; show it’s
 possible to achieve many of the transactional guarantees of today’s
 databases without sacrificing high availability and low latency.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id=&quot;cap-and-acid&quot;&gt;CAP and ACID&lt;/h4&gt;

&lt;p&gt;Distributed systems designers face hard trade-offs between factors
like latency, availability, and consistency. Perhaps most famously,
the &lt;a href=&quot;http://en.wikipedia.org/wiki/CAP_theorem&quot;&gt;CAP Theorem&lt;/a&gt; dictates
that it is impossible to achieve “consistency” while remaining
available in the presence of network and system partitions.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cap-note&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; Further, even without
partitions, there is &lt;a href=&quot;http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html&quot;&gt;a trade-off between response time and
consistency&lt;/a&gt;. These
fundamental limitations mean distributed databases can’t have it all,
and the limitations aren’t simply theoretical: across datacenters, the
penalties for strong consistency are on the order of &lt;a href=&quot;http://highscalability.com/numbers-everyone-should-know&quot;&gt;hundreds of
milliseconds&lt;/a&gt;
(compared to single-digit latencies for weak consistency) and, in
general, unavailability takes the form of a
&lt;a href=&quot;http://en.wikipedia.org/wiki/HTTP_404&quot;&gt;404&lt;/a&gt; or &lt;a href=&quot;http://www.whatisfailwhale.info/&quot;&gt;Fail
Whale&lt;/a&gt; on a website. Over twelve
years after Eric Brewer &lt;a href=&quot;http://www.eecs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf&quot;&gt;first stated the CAP
Theorem&lt;/a&gt;
(and after &lt;a href=&quot;http://www.rfc-editor.org/rfc/rfc677.txt&quot;&gt;decades of building distributed database
systems&lt;/a&gt;), data store
designers have taken CAP to heart, some choosing consistency and
others choosing availability and low latency.&lt;/p&gt;

&lt;p&gt;While the CAP Theorem is fairly well understood, the relationship
between CAP and &lt;a href=&quot;http://en.wikipedia.org/wiki/ACID&quot;&gt;ACID transactions&lt;/a&gt;
is not. If we consider the current lack of highly available systems
providing arbitrary multi-object operations with ACID-like semantics,
it appears that CAP and transactions are incompatible. This is partly
due to the historical design of distributed database systems, which
typically chose consistency over high availability. Standard database
techniques like &lt;a href=&quot;http://en.wikipedia.org/wiki/Two-phase_locking&quot;&gt;two-phase
locking&lt;/a&gt; and
&lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter4.pdf&quot;&gt;multi-version concurrency
control&lt;/a&gt;
do not typically perform well in the event of partial failure, and the
master-based (i.e., master-per-shard) and overlapping quorum-based
techniques often adopted by &lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter8.pdf&quot;&gt;many distributed database
designs&lt;/a&gt;
are similarly unavailable if users are partitioned from the anointed
primary copies.&lt;/p&gt;

&lt;h4 id=&quot;hats-for-everyone&quot;&gt;HATs for Everyone&lt;/h4&gt;

&lt;p&gt;&lt;a href=&quot;http://arxiv.org/pdf/1302.0309.pdf&quot;&gt;In recent research at UC Berkeley&lt;/a&gt;,
we show that high availability and transactions are &lt;em&gt;not&lt;/em&gt; mutually
exclusive: it is possible to match the semantics provided by many of
today’s “ACID” and “NewSQL” databases without sacrificing high
availability. While these Highly Available Transactions (HATs) do not
provide serializability—which cannot be achieved with high
availability under arbitrary read/write transactions—&lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;as I blogged about last
week&lt;/a&gt;, many ACID databases
provide only a weaker form of isolation anyway. The problem is that these
databases do not &lt;em&gt;implement&lt;/em&gt; their guarantees using highly available
algorithms. However, as our recent results demonstrate, we &lt;em&gt;can&lt;/em&gt;
implement these guarantees and achieve other useful properties without
giving up high availability or having to incur cross-replica (or, in a
georeplicated scenario, cross-datacenter) latencies.&lt;/p&gt;

&lt;p&gt;At a high level, HATs provide several guarantees that can be achieved
with high availability&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#availability-note&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; for arbitrary read/write
transactions across a given set of data items, irrespective of data
layout:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Transactional atomicity across arbitrary data items (e.g., see all
or none of a transaction’s updates, or “A” in “ACID”), regardless of
how many shards a transaction accesses and without using a master.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;ANSI-compliant Read Committed and Repeatable Read isolation levels&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#ansi-note&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;
(“I” in “ACID” matching many existing databases).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Session guarantees including read-your-writes, monotonic reads
(i.e., time doesn’t go backwards), and causality within and across
transactions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Eventual consistency, meaning that, if writes to a data item stop,
all transaction reads will eventually return the last written value.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
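To give a flavor of guarantee 3, here is a minimal, illustrative sketch (not the algorithms from the paper) of two session guarantees that remain achievable with high availability: read-your-writes and monotonic reads. The client remembers the highest version it has observed per key and skips replicas that are staler than that; all class and method names are invented for illustration.

```python
# Toy versioned replicas plus a session-aware client. The session guarantee
# comes entirely from client-side bookkeeping, with no master or coordination.

class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def write(self, key, version, value):
        cur_version, _ = self.store.get(key, (0, None))
        if version > cur_version:        # keep only the newest version
            self.store[key] = (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))

class SessionClient:
    def __init__(self, replicas):
        self.replicas = replicas
        self.seen = {}                    # key -> highest version observed
        self.clock = 0                    # per-client version counter

    def write(self, key, value):
        self.clock += 1
        self.seen[key] = self.clock
        # Send to a single replica; propagation to the rest may lag.
        self.replicas[-1].write(key, self.clock, value)

    def read(self, key):
        # Accept the first replica at least as fresh as this session has
        # already seen; a real system might instead block or answer from a
        # locally cached copy rather than fall through to a stale value.
        best_version, best_value = 0, None
        for r in self.replicas:
            version, value = r.read(key)
            if version >= self.seen.get(key, 0):
                self.seen[key] = version
                return value
            if version > best_version:
                best_version, best_value = version, value
        return best_value                 # degraded: freshest reachable copy
```

For example, after `c.write("x", "v1")`, `c.read("x")` returns `"v1"` even when the first replica contacted has not yet received the write, because the session's version bookkeeping forces the read past the stale copy.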

&lt;p&gt;We believe that this is the strongest set of guarantees that has been
provided with high availability, and many of the algorithms—like those for the
atomicity and isolation guarantees—are brand new, largely because
they don’t use masters or other coordination on transactions’ fast
paths. &lt;a href=&quot;http://arxiv.org/abs/1302.0309&quot;&gt;The brief report we just
released&lt;/a&gt; runs slightly over five
pages and includes proof-of-concept algorithms for each guarantee.&lt;/p&gt;

&lt;h4 id=&quot;trade-offs&quot;&gt;Trade-offs&lt;/h4&gt;

&lt;p&gt;Of course, there are several guarantees that HATs cannot provide. Not
even the best of marketing teams can produce a real database that
“beats CAP”; HATs cannot make guarantees on data recency during
partitions, although, in the absence of partitions, data &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;may not be
very stale&lt;/a&gt;. HATs cannot be “100%
ACID compliant” as they cannot guarantee serializability, yet they
meet the default and sometimes maximum guarantees of many “ACID”
databases. HATs cannot guarantee global integrity constraints
(e.g., uniqueness constraints across data items) but can perform local
checking of predicates (e.g., per-record integrity maintenance like
null value checks). In the report, we classify many of these anomalies
in terms of previously documented isolation levels.&lt;/p&gt;

&lt;p&gt;Are these guarantees worthwhile? If users need high availability or
low latency, HATs provide a set of semantics that is stronger than any
existing highly available data store. If users need strong consistency
guarantees, they will need to accept the possibility of unavailability
and expect to pay at least one round trip time for each of their
operations. As an example, people often ask me about &lt;a href=&quot;http://www.wired.com/wiredenterprise/2012/11/google-spanner-time/&quot;&gt;Spanner, from
Google&lt;/a&gt;. Spanner
provides strong consistency and typically low latency read-only
transactions. Users that are partitioned from the majority of Spanner
nodes will experience unavailability and read-write transactions will
incur WAN latencies due to Spanner’s two-phase locking
mechanism. Spanner’s authors don’t hide these facts—for example,
&lt;a href=&quot;http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf&quot;&gt;look at Table 6 on page 12 of the
paper&lt;/a&gt;:
read/write transactions are between 8.3 and 11.9 times slower than
read-only transactions. For Google, which has optimized its WAN
networks, atomic clocks, and infrastructure engineering, and whose
workload (also in Table 6) consists of over 98% read-only
transactions, Spanner makes sense. When high availability and
guaranteed low latency matter, even Google might choose a different
architecture.&lt;/p&gt;

&lt;h4 id=&quot;coming-soon&quot;&gt;Coming soon&lt;/h4&gt;

&lt;p&gt;Our work on HATs at Berkeley is just beginning. We’re benchmarking a
HAT prototype and are tuning our algorithms for performance and
scalability. Once the algorithms are better explored, I would
personally like to help integrate HATs into existing data stores, much
as we recently did with our &lt;a href=&quot;http://www.bailis.org/blog/using-pbs-in-cassandra-1.2.0/&quot;&gt;PBS work in
Cassandra&lt;/a&gt;. It’d be
interesting to port an application running on Oracle Database to a
NoSQL store and provide the same semantic guarantees with
substantially improved performance, availability, and cost
effectiveness. We’re also working on additional theoretical results to
further explain HATs in the context of CAP. I plan to share these
results as we develop them further.&lt;/p&gt;

&lt;p&gt;In the meantime, we’d welcome feedback on our work so far and are
curious where HATs make sense in your stack. If you’re an application
developer who wishes she had transactional atomicity or weak
isolation, a distributed database developer interested in HATs, or you
just think HATs are cool, let us know. We’re always looking for
anecdotes, workloads, and good conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is Part Two of a two part series on Transactions and
Availability.&lt;br /&gt; &lt;a href=&quot;http://www.bailis.org/blog/when-is-acid-acid-rarely/&quot;&gt;Part One: When is ACID
ACID?&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This research is joint work with &lt;a href=&quot;http://www.cs.usyd.edu.au/~fekete&quot;&gt;Alan
 Fekete&lt;/a&gt;, &lt;a href=&quot;http://www.cs.berkeley.edu/~alig/&quot;&gt;Ali
 Ghodsi&lt;/a&gt;, &lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot;&gt;Joe
 Hellerstein&lt;/a&gt;, and &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/&quot;&gt;Ion
 Stoica&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;span id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;cap-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cap-note&quot;&gt;[1]&lt;/a&gt; &lt;a href=&quot;http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf&quot;&gt;As formally
proven by Gilbert and
Lynch&lt;/a&gt;,
the CAP Theorem states that
&lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;linearizability&lt;/a&gt; and
high availability are incompatible. Linearizability is often called
“atomicity,” yet “atomicity” means something different in database
parlance, namely, we see all or none of a transaction’s updates. For
clarity, I’ll call “ACID atomicity” “transactional atomicity.”&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;availability-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#availability-note&quot;&gt;[2]&lt;/a&gt;&amp;nbsp;As we
discuss in Section 2 of the paper, we have to be careful how to define
&quot;high availability&quot;: a system that always aborts all transactions is,
in a sense, &quot;available,&quot; if not very useful. In short, we say that a
system provides high availability if every transaction that can
contact at least one server for each data item in the transaction
eventually commits (or alternatively, aborts itself due to an internal
integrity constraint violation). The system is not allowed to
indefinitely abort transactions for the purposes of maintaining
availability.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;ansi-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#ansi-note&quot;&gt;[3]&lt;/a&gt;&amp;nbsp;If you&#39;re a
database nut, you may object that the ANSI SQL definitions are
&lt;a href=&quot;http://ftp.research.microsoft.com/pub/tr/tr-95-51.pdf&quot;&gt;notoriously
underspecified&lt;/a&gt;. However,
rest assured that HAT &quot;Read Committed&quot; matches all of the definitions
we&#39;ve found in the literature, including those by &lt;a href=&quot;http://ftp.research.microsoft.com/pub/tr/tr-95-51.pdf&quot;&gt;Berenson et
al. (SIGMOD 1995)&lt;/a&gt; and
&lt;a href=&quot;http://www.pmg.lcs.mit.edu/~adya/pubs/phd.pdf&quot;&gt;Adya (MIT Ph.D. thesis, ICDE
2000)&lt;/a&gt;. HAT &quot;Repeatable
Read&quot;—and &quot;Repeatable Read&quot; interpretations in general—is more
complicated; HAT &quot;Repeatable Read&quot; does match the ANSI spec, and we
provide a detailed discussion in Section 4 of the paper.&lt;/span&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>When is "ACID" ACID? Rarely.</title>
   <link href="http://bailis.org/blog//when-is-acid-acid-rarely/"/>
   <updated>2013-01-22T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//when-is-acid-acid-rarely</id>
   <content type="html">&lt;style&gt;

table {
border: 1px solid black;
border-spacing:0px;
width: 100%;
}

td.serializable {
background-color: #EEE;
}

td {
padding: 4px;
text-align: center;
border-bottom: 1px solid black;
padding-right:10px;
}

.dbname {
text-align: left;
padding-right: 24px;
}

a.tablelink:link {text-decoration: none; color: black; }
a.tablelink:hover {text-decoration: none; color: #666;}

#legendbox {
font-style: italic;
text-align: left;
width: 420px;
}

#legendlabel {
font-weight: bold;
text-align: left;
}

&lt;/style&gt;

&lt;p&gt;&lt;em&gt;tl;dr: ACID and NewSQL databases rarely provide true ACID guarantees
 by default, if they are supported at all. See &lt;a href=&quot;#acidtable&quot;&gt;the
 table&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Many databases today differentiate themselves from their NoSQL
counterparts by claiming to support &lt;a class=&quot;no-decorate&quot; href=&quot;http://www.nuodb.com/explore/sql-cloud-database-product/&quot;&gt;“100%
ACID”&lt;/a&gt; transactions or by &lt;a class=&quot;no-decorate&quot; href=&quot;http://www.aerospike.com/performance/acid-compliance/&quot;&gt;“guaranteeing
strong consistency (ACID).”&lt;/a&gt; In reality, few of these
databases—including traditional “big iron” systems like
Oracle—provide formal ACID guarantees, &lt;a href=&quot;http://docs.oracle.com/cd/E11882_01/server.112/e10713/transact.htm#i1666&quot;&gt;even when they claim to do so&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The textbook definition of ACID Isolation is &lt;a class=&quot;no-decorate&quot; href=&quot;http://en.wikipedia.org/wiki/Serializability&quot;&gt;serializability&lt;/a&gt;
(e.g., &lt;a href=&quot;http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf&quot;&gt;Architecture
of a Database System&lt;/a&gt;, Section 6.2), which states that the outcome
of executing a set of transactions should be equivalent to some serial
execution of those transactions. This means that each transaction gets
to operate on the database as if it were running by itself, which &lt;a class=&quot;no-decorate&quot; href=&quot;http://research.microsoft.com/en-us/people/philbe/chapter1.pdf&quot;&gt;ensures
database correctness, or consistency&lt;/a&gt;. A database with
serializability (“I” in ACID), provides arbitrary read/write
transactions and guarantees consistency (“C” in ACID), or correctness,
of the database. Without serializability, ACID, particularly
consistency, is generally&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#arbitrary-note&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; not guaranteed.&lt;/p&gt;
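The textbook definition above has a standard mechanical test: build a precedence (conflict) graph with an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj on the same item; the schedule is conflict-serializable iff that graph is acyclic. The sketch below models a schedule as a list of (transaction, operation, item) tuples, with operations in {"r", "w"}.

```python
# Conflict-serializability check via precedence-graph cycle detection.

def conflicts(op1, op2):
    # r/w, w/r, and w/w pairs conflict; r/r pairs do not.
    return op1 == "w" or op2 == "w"

def is_conflict_serializable(schedule):
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and conflicts(op1, op2):
                edges.add((t1, t2))       # t1's op precedes a conflicting op of t2
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    visiting, done = set(), set()
    def has_cycle(node):                  # depth-first cycle detection
        if node in visiting:
            return True
        if node in done:
            return False
        visiting.add(node)
        if any(has_cycle(n) for n in graph.get(node, ())):
            return True
        visiting.discard(node)
        done.add(node)
        return False
    return not any(has_cycle(n) for n in list(graph))

# T2 overwrites x between T1's read and T1's write: no serial order of
# T1 and T2 produces the same conflicts, so the check reports False.
print(is_conflict_serializable(
    [("T1", "r", "x"), ("T2", "w", "x"), ("T1", "w", "x")]))  # False
```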

&lt;p&gt;Nevertheless, most publicly available databases (often claiming to
provide “ACID” transactions) do not provide serializability. I’ve
compiled the isolation guarantees provided by 18 popular databases
below (sources hyperlinked). Only three of 18 databases provide
serializability by default, and only nine provide serializability as an
option at all (shaded):&lt;/p&gt;

&lt;center&gt;
&lt;table id=&quot;acidtable&quot;&gt;
&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;b&gt;Database&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;Default Isolation&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;Maximum Isolation&lt;/b&gt;&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://docs.actian.com/ingres/10s/database-administrator-guide/2349-isolation-levels&quot;&gt;Actian Ingres 10.0/10S&lt;/a&gt;&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.aerospike.com/performance/acid-compliance/&quot;&gt;Aerospike&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.akiban.com/ak-docs/admin/persistit/Transactions.html&quot;&gt;Akiban Persistit&lt;/a&gt;&lt;/td&gt;&lt;td&gt;SI&lt;/td&gt;&lt;td&gt;SI&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.clustrix.com/Portals/146389/docs/Clustrix_System_Administrators_Guide_v4.1.pdf&quot;&gt;Clustrix CLX 4100&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RR&lt;/td&gt;&lt;td&gt;?&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://media.gpadmin.me/wp-content/uploads/2012/11/GPDBAGuide.pdf&quot;&gt;Greenplum 4.1&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=%2Fcom.ibm.db2z10.doc.perf%2Fsrc%2Ftpc%2Fdb2z_chooseisolationoption.htm&quot;&gt;IBM DB2 10 for z/OS&lt;/a&gt;&lt;/td&gt;&lt;td&gt;CS&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://publib.boulder.ibm.com/infocenter/idshelp/v115/index.jsp?topic=%2Fcom.ibm.sqls.doc%2Fids_sqs_1161.htm&quot;&gt;IBM Informix 11.50&lt;/a&gt;&lt;/td&gt;&lt;td&gt;Depends&lt;/td&gt;&lt;td&gt;RR&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://dev.mysql.com/doc/refman/5.6/en/set-transaction.html&quot;&gt;MySQL 5.6&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RR&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://developers.memsql.com/docs/1b/isolationlevel.html&quot;&gt;MemSQL 1b&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://msdn.microsoft.com/en-us/library/ms173763.aspx&quot;&gt;MS SQL Server 2012&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.nuodb.com/nuodb-online-documentation/references/r_Lang/r_Transactions.html&quot;&gt;NuoDB&lt;/a&gt;&lt;/td&gt;&lt;td&gt;CR&lt;/td&gt;&lt;td&gt;CR&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://docs.oracle.com/cd/B28359_01/server.111/b28318/consist.htm#autoId8&quot;&gt;Oracle 11g&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td&gt;SI&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://docs.oracle.com/cd/E17277_02/html/TransactionGettingStarted/isolation.html&quot;&gt;Oracle Berkeley DB&lt;/a&gt;&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://docs.oracle.com/cd/E17277_02/html/TransactionGettingStarted/isolation.html&quot;&gt;Oracle Berkeley DB JE&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RR&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.postgresql.org/docs/9.2/static/transaction-iso.html&quot;&gt;Postgres 9.2.2&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://help.sap.com/hana/html/sql_set_transaction.html&quot;&gt;SAP HANA&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td&gt;SI&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;http://www.scaledb.com/pdfs/ScaleDB_Cluster_Manual.pdf&quot;&gt;ScaleDB 1.02&lt;/a&gt;&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;&lt;td&gt;RC&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td class=&quot;dbname&quot;&gt;&lt;a class=&quot;tablelink&quot; href=&quot;https://voltdb.com/&quot;&gt;VoltDB&lt;/a&gt;&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;&lt;td class=&quot;serializable&quot;&gt;S&lt;/td&gt;
&lt;/tr&gt;

&lt;tr&gt;
&lt;td id=&quot;legendlabel&quot;&gt;&lt;span id=&quot;legendlabel&quot;&gt;Legend&lt;/span&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; id=&quot;legendbox&quot;&gt; RC: read committed, RR: repeatable read, S: serializability,&lt;br /&gt;SI: snapshot isolation, CS: cursor stability, CR: consistent read&lt;/td&gt;
&lt;/tr&gt;

&lt;/table&gt;
&lt;/center&gt;

&lt;p&gt;Instead of providing serializability, many of these databases provide one
of several weaker variants,&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#weak-note&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; often even when marketing material and
documentation claim otherwise.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#oracle-note&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; There is no &lt;em&gt;fundamental&lt;/em&gt; reason why a
database shouldn’t &lt;em&gt;support&lt;/em&gt; serializability—&lt;a href=&quot;http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx&quot;&gt;we have the
algorithms&lt;/a&gt;,
and we’ve made great strides in improving ACID scalability.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#research-note&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; So why not
provide serializability by default, or, at the very least, offer it
as an option? One key factor is performance:
serializable isolation can limit concurrency; traditional techniques
such as two-phase locking are expensive compared to, say, &lt;a class=&quot;no-decorate&quot; href=&quot;http://diaswww.epfl.ch/courses/adms07/papers/GrayLocks.pdf&quot;&gt;taking
short read locks on data items&lt;/a&gt;. Additionally, it is &lt;a href=&quot;http://www.cs.cornell.edu/courses/CS614/2004sp/papers/DGS85.pdf&quot;&gt;impossible to
simultaneously achieve high availability and
serializability&lt;/a&gt;
(though most of these database implementations are not highly
available anyway, even when providing weaker models). A third reason
is that transactions may be less likely to deadlock or abort due to
conflicts under weaker isolation. However, these benefits aren’t free:
the consistency anomalies that arise from the weak levels shown above
are &lt;a href=&quot;http://www.cse.iitb.ac.in/dbms/Data/Courses/CS632/2009/Papers/p492-fekete.pdf&quot;&gt;well-understood&lt;/a&gt;
and &lt;a href=&quot;http://www.vldb.org/pvldb/2/vldb09-185.pdf&quot;&gt;quantifiable&lt;/a&gt;.&lt;/p&gt;
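One of those well-understood anomalies is the classic lost update. Here is a deterministic toy simulation (not tied to any particular database) contrasting a read-committed-style interleaving, in which both transactions read the same initial value, with a serial execution.

```python
# Lost update: two deposits read the same balance before either writes,
# so the second write silently discards the first.

def interleaved_increments(balance, amounts):
    # Weak-isolation-style interleaving: every txn reads before any writes.
    snapshots = [balance for _ in amounts]          # all txns read the same value
    for snap, amt in zip(snapshots, amounts):
        balance = snap + amt                        # last writer wins
    return balance

def serial_increments(balance, amounts):
    for amt in amounts:
        balance = balance + amt                     # each txn sees the prior write
    return balance

print(interleaved_increments(100, [10, 10]))  # 110: one deposit is lost
print(serial_increments(100, [10, 10]))       # 120: the intended result
```

Serializable execution (or an explicit lock, as in SELECT FOR UPDATE) forces the second transaction to observe the first one's write, producing 120 rather than 110.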

&lt;p&gt;Where’s the silver lining? We &lt;em&gt;can&lt;/em&gt; get real ACID in some of our
databases (if not by default). And, despite the fact that many other
“ACID” databases don’t provide ACID properties—at least according to
decades of research and development and formally proven guarantees
regarding database correctness (although &lt;a href=&quot;https://twitter.com/CurtMonash/status/292120597947895808&quot;&gt;perhaps marketing has
rewritten the
books&lt;/a&gt;)—we
can still &lt;a href=&quot;http://www.oracle.com/us/corporate/customers/customersearch/sabre-holdings-1-gg-ss-1849966.html&quot;&gt;reserve
travel tickets&lt;/a&gt;, &lt;a href=&quot;http://www.oracle.com/us/corporate/customers/customersearch/bank-of-baroda-1-db-ss-1875825.html&quot;&gt;use
our bank accounts&lt;/a&gt;, and &lt;a href=&quot;http://www.oracle.com/us/corporate/press/1871463&quot;&gt;fight
crime&lt;/a&gt;. How? One possibility is that anomalies are rare and the
performance benefits of weak isolation outweigh the cost of
inconsistencies. Another possibility is that applications are
performing their own concurrency control external to the database;
database programmers can use commands like &lt;a href=&quot;http://dev.mysql.com/doc/refman/5.5/en/innodb-locking-reads.html&quot;&gt;SELECT FOR
UPDATE&lt;/a&gt;,
&lt;a href=&quot;http://dev.mysql.com/doc/refman/5.6/en/lock-tables.html&quot;&gt;manual LOCK
TABLE&lt;/a&gt;, and
&lt;a href=&quot;http://www.postgresql.org/docs/8.1/static/ddl-constraints.html&quot;&gt;UNIQUE
constraints&lt;/a&gt;
to manually perform their own synchronization. The answer is likely a
mix of each, but, stepping back, these strategies should remind you of
what’s often done today in NoSQL-style data infrastructure: &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;“good
enough” consistency&lt;/a&gt; and some
hand-rolled, application-specific concurrency control. Perhaps there’s a better
question: when is “ACID” NoSQL?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is Part One of a two part series on Transactions and
Consistency.&lt;br /&gt; &lt;a href=&quot;http://www.bailis.org/blog/hat-not-cap-introducing-highly-available-transactions&quot;&gt;Part Two: recent research on Highly Available
Transactions
(HATs).&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to &lt;a href=&quot;http://www.neilconway.org/&quot;&gt;Neil Conway&lt;/a&gt;, &lt;a href=&quot;http://www.cs.berkeley.edu/~alig/&quot;&gt;Ali
 Ghodsi&lt;/a&gt;, and &lt;a href=&quot;http://www.cs.usyd.edu.au/~fekete&quot;&gt;Alan
 Fekete&lt;/a&gt; for early feedback on this
 post.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;&lt;span id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/span&gt;&lt;/p&gt;

&lt;p&gt; &lt;span class=&quot;footnote&quot; id=&quot;arbitrary-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#arbitrary-note&quot;&gt;[1]&lt;/a&gt; There’s a considerable
amount of research focusing on how to provide ACID consistency without
serializability. As an example, we can restrict the types of
operations that transactions can perform, as in
&lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.3821&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;escrow&lt;/a&gt;
and
&lt;a href=&quot;http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1701989&quot;&gt;read-only&lt;/a&gt;
transactions and with &lt;a href=&quot;http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper35.pdf&quot;&gt;monotonic
logic&lt;/a&gt;. We
can also consider hypothetical databases that introduce dummy
transactions to fill in anomalous behavior in the serial schedule,
which would be silly but technically serializable. The systems in
question don’t (usually) provide these sorts of “special-case”
ACID-compliant transactions as features.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;weak-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#weak-note&quot;&gt;[2]&lt;/a&gt; There are a
bunch of different weak isolation models to consider, but their
definitions often vary depending on where you look. In this table,
when necessary, I’ve mapped the stated guarantees back to a known
model (e.g.,
&lt;a href=&quot;http://www.aerospike.com/performance/acid-compliance/&quot;&gt;Aerospike&lt;/a&gt;); in
any event, only the databases marked as such provide
serializability. The best vendor documentation will tell you exactly
what is implemented, even if the description doesn’t match the
name (see &lt;a href=&quot;#oracle-note&quot;&gt;Footnote 3&lt;/a&gt;). If you like database theory,
the best description of these levels I’ve seen, describing both
multi-version and lock-based databases, is &lt;a href=&quot;http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TR-786.pdf&quot;&gt;Atul Adya’s MIT
Ph.D. thesis from 1999&lt;/a&gt;.&lt;/span&gt; &lt;/p&gt;

&lt;p&gt; &lt;span class=&quot;footnote&quot; id=&quot;oracle-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#oracle-note&quot;&gt;[3]&lt;/a&gt; As a detailed
example of what can happen, consider Oracle 11g. (Admittedly, I’m
picking on Oracle, due mostly to the wealth of available information.)
11g’s strongest isolation level is called “serializable,” while &lt;a href=&quot;http://docs.oracle.com/cd/B28359_01/server.111/b28318/consist.htm#BABIJEJI&quot;&gt;its
description&lt;/a&gt;
matches &lt;a href=&quot;http://en.wikipedia.org/wiki/Snapshot_isolation&quot;&gt;snapshot
isolation&lt;/a&gt;. This
behavior is well-documented in both the &lt;a href=&quot;http://www.cse.iitb.ac.in/dbms/Data/Courses/CS632/2009/Papers/p492-fekete.pdf&quot;&gt;academic
literature&lt;/a&gt;
and &lt;a href=&quot;http://iggyfernandez.wordpress.com/2010/09/20/dba-101-what-does-serializable-really-mean/&quot;&gt;by
practitioners&lt;/a&gt;. For
more fun, try to figure out what can happen when you &lt;a href=&quot;http://docs.oracle.com/cd/E11882_01/server.112/e25494/ds_txnman010.htm&quot;&gt;execute
distributed
transactions&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;research-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#research-note&quot;&gt;[4]&lt;/a&gt; As an
example, check out &lt;a href=&quot;http://en.wikipedia.org/wiki/Michael_Stonebraker&quot;&gt;Michael
Stonebraker&lt;/a&gt; and
&lt;a href=&quot;http://cs.brown.edu/~pavlo/&quot;&gt;Andy Pavlo&lt;/a&gt;’s research on the 
&lt;a href=&quot;http://hstore.cs.brown.edu/&quot;&gt;HStore project&lt;/a&gt; (commercialized via VoltDB) or
&lt;a href=&quot;http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf&quot;&gt;Google’s
Spanner&lt;/a&gt;. Each
of these systems makes trade-offs (e.g., Spanner still uses two-phase
locking for read-write transactions, which is expensive over wide-area
networks, and doesn’t support transaction-level read-your-write semantics)
but is pushing the limits of true ACID scalability.&lt;/span&gt;&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Using PBS in Cassandra 1.2.0</title>
   <link href="http://bailis.org/blog//using-pbs-in-cassandra-1.2.0/"/>
   <updated>2013-01-14T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//using-pbs-in-cassandra-1.2.0</id>
   <content type="html">&lt;p&gt;With the help of the Cassandra community, we &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-4261&quot;&gt;recently
released&lt;/a&gt; PBS
consistency predictions as a feature in the official &lt;a href=&quot;https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-1.2.0&quot;&gt;Cassandra 1.2.0
stable
release&lt;/a&gt;. In
case you aren’t familiar, &lt;a href=&quot;http://pbs.cs.berkeley.edu#demo&quot;&gt;PBS (Probabilistically Bounded Staleness)
predictions&lt;/a&gt; help answer questions
like: how eventual is eventual consistency? how consistent is eventual
consistency? These predictions help you profile your existing
Cassandra cluster and determine which configuration of N, R, and W is
the best fit for your application, expressed quantitatively in terms
of latency, consistency, and durability (see &lt;a href=&quot;#pbsoutput&quot;&gt;output below&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;There are several resources for understanding the theory behind PBS,
including &lt;a href=&quot;http://vimeo.com/37758648&quot;&gt;talks&lt;/a&gt;, &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;a
demo&lt;/a&gt;,
&lt;a href=&quot;http://www.bailis.org/talks/twitter-pbs.pdf&quot;&gt;slides&lt;/a&gt;, and an
&lt;a href=&quot;http://www.bailis.org/papers/pbs-vldb2012.pdf&quot;&gt;academic paper&lt;/a&gt;. We’ve
used PBS to look at the effect of SSDs and disks, wide-area networks,
and compare different web services’ data store deployments. My goal in
this post is to show how to profile an existing cluster and briefly
explain what’s going on behind the scenes. If you prefer, you can
download a (mostly) &lt;a href=&quot;http://www.bailis.org/blog/post_data/2013-01-14/pbs-1.2.0-demo.sh&quot;&gt;fully automated demo script&lt;/a&gt; instead.&lt;/p&gt;

&lt;h2 id=&quot;step-one-get-a-cassandra-cluster&quot;&gt;Step One: Get a Cassandra cluster.&lt;/h2&gt;

&lt;p&gt;The PBS predictor provides custom consistency and latency predictions
based on observed latencies in deployed clusters. To gather data for
predictions, we need a cluster to profile. If you have a cluster
running 1.2.0, you can skip these instructions.&lt;/p&gt;

&lt;p&gt;The easiest way to spin up a cluster for testing is to use
&lt;a href=&quot;https://github.com/pcmanus/ccm&quot;&gt;&lt;code&gt;ccm&lt;/code&gt;&lt;/a&gt;. Let’s start a 5-node
Cassandra cluster running on localhost:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;git clone https://github.com/pcmanus/ccm.git
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;ccm &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; sudo ./setup.py install
ccm create pbstest -v 1.2.0
ccm populate -n 5
ccm start
&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CASS_HOST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;127.0.0.1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If &lt;code&gt;ccm start&lt;/code&gt; fails, you might need to initialize more loopback
interfaces (e.g., &lt;code&gt;sudo ifconfig lo0 alias 127.0.0.2&lt;/code&gt;)—see the &lt;a href=&quot;http://www.bailis.org/blog/post_data/2013-01-14/pbs-1.2.0-demo.sh&quot;&gt;script&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;step-two-enable-pbs-metrics-on-a-cassandra-server&quot;&gt;Step Two: Enable PBS metrics on a Cassandra server.&lt;/h2&gt;

&lt;p&gt;The PBS predictor works by profiling message latencies that it sees in
a production cluster. You only need to enable logging on a single
node, and all reads and writes that the node performs will be used in
predictions.&lt;/p&gt;

&lt;p&gt;The prediction module logs latencies in a circular buffer with a FIFO
eviction policy (default: 20,000 reads and writes). By default, this
logging is turned off, saving about 300 KB of memory. To turn it on, use
a JMX tool to call the &lt;code&gt;org.apache.cassandra.service.PBSPredictor&lt;/code&gt;
MBean’s &lt;code&gt;enableConsistencyPredictionLogging&lt;/code&gt; method. You can use
&lt;code&gt;jconsole&lt;/code&gt;&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#jconsole-note&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; or use a command-line JMX interface
like &lt;a href=&quot;http://wiki.cyclopsgroup.org/jmxterm/download&quot;&gt;&lt;code&gt;jmxterm&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;wget http://downloads.sourceforge.net/cyclops-group/jmxterm-1.0-alpha-4-uber.jar
&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&amp;quot;run -b org.apache.cassandra.service:type=PBSPredictor enableConsistencyPredictionLogging&amp;quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;|&lt;/span&gt; java -jar jmxterm-1.0-alpha-4-uber.jar -l &lt;span class=&quot;nv&quot;&gt;$CASS_HOST&lt;/span&gt;:7100&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
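
&lt;p&gt;For intuition, the logging structure described above behaves like a
bounded FIFO buffer. Here’s a sketch in Python (the real implementation
is Java inside Cassandra’s &lt;code&gt;PBSPredictor&lt;/code&gt;; the names below
are mine):&lt;/p&gt;

```python
from collections import deque

# Sketch of a circular latency log with FIFO eviction: it keeps the most
# recent 20,000 operation latencies and silently drops the oldest entry
# once the buffer is full.
latency_log = deque(maxlen=20000)

def record(latency_ms):
    latency_log.append(latency_ms)  # oldest entry is evicted when full
```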

&lt;h2 id=&quot;step-three-run-a-workload&quot;&gt;Step Three: Run a Workload&lt;/h2&gt;

&lt;p&gt;The PBS predictor is entirely passive: it profiles the reads and
writes that are already occurring in the cluster. This means that
predictions don’t interfere with live requests but also means that we
need a workload running in order to get results.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#prediction-note&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;We can use the Cassandra stress tool; below, we execute 10,000 write
and then 10,000 read requests with a replication factor of three.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; ~/.ccm/repository/1.2.0/
chmod +x tools/bin/cassandra-stress
tools/bin/cassandra-stress -d &lt;span class=&quot;nv&quot;&gt;$CASS_HOST&lt;/span&gt; -l &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt; -n &lt;span class=&quot;m&quot;&gt;10000&lt;/span&gt; -o insert
tools/bin/cassandra-stress -d &lt;span class=&quot;nv&quot;&gt;$CASS_HOST&lt;/span&gt; -l &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt; -n &lt;span class=&quot;m&quot;&gt;10000&lt;/span&gt; -o &lt;span class=&quot;nb&quot;&gt;read&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id=&quot;step-four-run-predictions&quot;&gt;Step Four: Run predictions.&lt;/h2&gt;

&lt;p&gt;We can now connect to the node performing the profiling and have it
perform some Monte Carlo analysis for us. The consistency prediction
is triggered via JMX, but this time using the &lt;code&gt;nodetool&lt;/code&gt;
administration interface packaged with Cassandra:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;bin/nodetool -h &lt;span class=&quot;nv&quot;&gt;$CASS_HOST&lt;/span&gt; -p &lt;span class=&quot;m&quot;&gt;7100&lt;/span&gt; predictconsistency &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;100&lt;/span&gt; 1&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Here’s some sample output from a run on one of our clusters. You can
vary the replication factor, the amount of time you’d like to consider
after writes, and even multi-versioned staleness. Remember that, aside
from taking up some CPU on the predicting node, this profiling doesn’t
affect query performance:&lt;/p&gt;

&lt;div class=&quot;boundedbox20&quot; id=&quot;pbsoutput&quot;&gt;&lt;pre&gt;&lt;code&gt;Performing consistency prediction
100ms after a given write, with maximum version staleness of k=1
N=3, R=1, W=1
Probability of consistent reads: 0.678900
Average read latency: 5.377900ms (99.900th %ile 40ms)
Average write latency: 36.971298ms (99.900th %ile 294ms)

N=3, R=1, W=2
Probability of consistent reads: 0.791600
Average read latency: 5.372500ms (99.900th %ile 39ms)
Average write latency: 303.630890ms (99.900th %ile 357ms)

N=3, R=1, W=3
Probability of consistent reads: 1.000000
Average read latency: 5.426600ms (99.900th %ile 42ms)
Average write latency: 1382.650879ms (99.900th %ile 629ms)

N=3, R=2, W=1
Probability of consistent reads: 0.915800
Average read latency: 11.091000ms (99.900th %ile 348ms)
Average write latency: 42.663101ms (99.900th %ile 284ms)

N=3, R=2, W=2
Probability of consistent reads: 1.000000
Average read latency: 10.606800ms (99.900th %ile 263ms)
Average write latency: 310.117615ms (99.900th %ile 335ms)

N=3, R=3, W=1
Probability of consistent reads: 1.000000
Average read latency: 52.657501ms (99.900th %ile 565ms)
Average write latency: 39.949799ms (99.900th %ile 237ms)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2 id=&quot;conclusions-and-caveats&quot;&gt;Conclusions and Caveats&lt;/h2&gt;

&lt;p&gt;Once configured, the PBS predictions are both easy and fast to
run. The great thing about predictions is that they can be run
entirely off of the fast path; our PBS code module performs simple
message profiling (timestamp logging), then, when prompted, performs
forward prediction of how the system might behave in different
scenarios in the background. This is a fundamental algorithmic
property of the prediction problem, and, provided all nodes in the
system attach the required timestamps to messages, only one node has
to log data and perform predictions.&lt;/p&gt;
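
&lt;p&gt;As a rough illustration of the prediction step (this is a toy model
I’m writing for exposition, not the actual WARS model from our paper,
and &lt;code&gt;predict_consistency&lt;/code&gt; is a made-up name):&lt;/p&gt;

```python
import random

# Toy Monte Carlo consistency prediction: resample logged write latencies
# to model when the write reached each of the N replicas, then check
# whether a read of R replicas issued t_ms after the write would see it.
def predict_consistency(write_lats, n, r, t_ms, trials=10000):
    consistent = 0
    for _ in range(trials):
        # one sampled write-arrival latency per replica (with replacement)
        arrivals = random.choices(write_lats, k=n)
        # the read contacts r randomly chosen replicas
        contacted = random.sample(arrivals, r)
        # the read is fresh if any contacted replica got the write in time
        if any(t_ms >= a for a in contacted):
            consistent += 1
    return consistent / trials
```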

&lt;p&gt;Before I end, there are a few caveats to the current
implementation. (Warning: this is a bit technical.) First, we only
simulate non-local operations. In Cassandra, a node can act as a
coordinator and as a replica for a given operation. We only collect
data for operations for which the predicting node was a coordinator,
not a replica. This means that, for example, if the predicting node
serves all reads locally, we won’t have enough data for accurate
predictions. We made this choice because we’d otherwise have to
model coordinator and data accesses, which gets tricky in a running
cluster. Second, we don’t consider failures or hinted handoff (we do,
however, capture slow-node behavior). Third, we don’t differentiate between
column families or different data items. This (like the rest) was an
engineering decision that I’m sure we could change in future releases.&lt;/p&gt;

&lt;p&gt;Despite these limitations, I think the current functionality is useful
for getting a sense of how clusters are behaving and the potential
impact of replication parameters. Moreover, I’m confident that we can
fix the above issues if there’s enough interest. If you’re interested
in using, further developing, or learning more about this
functionality, please let me know and &lt;a href=&quot;http://www.bailis.org/pubs.html#pbs-talks&quot;&gt;we can
talk&lt;/a&gt;. We built this
implementation because we care about real-world research impact; let
us know what you think.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to &lt;a href=&quot;https://github.com/shivaram/&quot;&gt;Shivaram Venkataraman&lt;/a&gt;, who
 co-authored our patch, and the Cassandra community, particularly
 &lt;a href=&quot;https://twitter.com/spyced&quot;&gt;Jonathan Ellis&lt;/a&gt;, for being so
 accommodating.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/div&gt;

&lt;p&gt;
&lt;span class=&quot;footnote&quot; id=&quot;prediction-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#prediction-note&quot;&gt;[1]&lt;/a&gt; You &lt;em&gt;can&lt;/em&gt; run predictions without workloads, just not
within Cassandra. Take a look at &lt;a href=&quot;http://www.bailis.org/papers/pbs-vldb2012.pdf&quot;&gt;our
paper&lt;/a&gt; or some old
&lt;a href=&quot;https://github.com/pbailis/cassandra-pbs/blob/trunk/pbs/analyze_pbs.py&quot;&gt;Python
code&lt;/a&gt;.
&lt;/span&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;footnote&quot; id=&quot;jconsole-note&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#jconsole-note&quot;&gt;[2]&lt;/a&gt; This is ugly, so I put the instructions down
here. Run &lt;code&gt;jconsole&lt;/code&gt; (if you used CCM, your 127.0.0.1 node will likely
have the lowest PID), click &lt;code&gt;MBeans&lt;/code&gt;, then
&lt;code&gt;org.apache.cassandra.service&lt;/code&gt; (bottom of the menu), &lt;code&gt;PBSPredictor&lt;/code&gt;,
&lt;code&gt;Operations&lt;/code&gt;, &lt;code&gt;enableConsistencyPredictionLogging&lt;/code&gt;, then click the
&lt;code&gt;enableConsistencyPredictionLogging&lt;/code&gt; button (screenshot
&lt;a href=&quot;http://www.bailis.org/blog/post_data/2013-01-14/enable-pbs-jmx.png&quot;&gt;here&lt;/a&gt;).
&lt;/span&gt;
&lt;/p&gt;

</content>
 </entry>
 
 <entry>
   <title>Doing Redundant Work to Speed Up Distributed Queries</title>
   <link href="http://bailis.org/blog//doing-redundant-work-to-speed-up-distributed-queries/"/>
   <updated>2012-09-20T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//doing-redundant-work-to-speed-up-distributed-queries</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: In distributed data stores, redundant operations can
 dramatically drop tail latency at the expense of increased system
 load; different Dynamo-style stores handle this trade-off
 differently, and there’s room for improvement.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 10/2013: Cassandra has since added support for &lt;a href=&quot;http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2&quot;&gt;“speculative
  retry”&lt;/a&gt;–effectively,
  Dean’s suggestions applied to Dynamo reads, as described below.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update 9/2014: Akka 2.3.5 &lt;a href=&quot;https://twitter.com/jboner/status/499857543704096768&quot;&gt;introduced support&lt;/a&gt; for this kind of speculative retry.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At scale, tail latencies matter. When serving high volumes of traffic,
even a minuscule fraction of requests corresponds to a large number of
operations.  Latency &lt;a href=&quot;http://perspectives.mvdirona.com/2009/10/31/TheCostOfLatency.aspx&quot;&gt;
has a huge impact on service quality&lt;/a&gt;, and looking at the &lt;em&gt;average&lt;/em&gt;
service latency alone is often insufficient. Instead, folks running
high-performance systems at places like Amazon and Google look to the
tail when measuring their performance.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#tailnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; High variance often hides in distribution
tails: in a &lt;a href=&quot;http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf&quot;&gt;talk
at Berkeley last spring&lt;/a&gt;, Jeff Dean reported a 95th-percentile
latency of 24ms and a 99.9th percentile latency of 994ms in Google’s
BigTable service—a 42x difference!&lt;/p&gt;

&lt;p&gt;In distributed systems, there’s a subtle and somewhat underappreciated
strategy for reducing tail latencies: doing redundant work. If you
send the same request to multiple servers, (all else equal) you’re
going to get an answer back faster than waiting for a single
server. Waiting for, say, one of three servers to reply is often
faster than waiting for one of one to reply. The basic cause is
variance in modern service components: requests take different amounts
of time in the network and on different servers at different times.&lt;a class=&quot;no-decorate&quot; href=&quot;#poweroftwonote&quot;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; In Dean’s
experiments, BigTable’s 99.9th percentile latency dropped to 50ms when
he sent out a second, redundant request if the initial request hadn’t
come back in 10ms—a 40x improvement. While there’s a cost associated
with redundant work—increased service load—the load increase may
be modest. In the example I’ve mentioned, Dean recorded only a 5%
total increase in number of requests.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#deannote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
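
&lt;p&gt;The effect is easy to reproduce with a few lines of simulation. The
lognormal distribution here is my own stand-in for real latency
measurements, and &lt;code&gt;tail_latency&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```python
import random

# Toy model of redundant requests: per-request latency is drawn from a
# heavy-tailed lognormal distribution (an assumption, not real data).
# Sending the same request to k servers and taking the fastest reply
# replaces one draw from the tail with the minimum of k draws.
def percentile(xs, p):
    xs = sorted(xs)
    return xs[int(p * (len(xs) - 1))]

def tail_latency(k, p=0.999, trials=100000, seed=42):
    rng = random.Random(seed)
    samples = [min(rng.lognormvariate(1.5, 1.0) for _ in range(k))
               for _ in range(trials)]
    return percentile(samples, p)
```

&lt;p&gt;Under these (made-up) parameters, waiting for the fastest of three
replies cuts the 99.9th percentile severalfold relative to a single
request.&lt;/p&gt;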

&lt;p&gt;Learning about Google’s systems is instructional, but we can also
observe the trade-off between tail latency and load in several
publicly-available distributed data stores patterned on Amazon’s
influential Dynamo data store. In the &lt;a href=&quot;http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&quot;&gt;original
paper&lt;/a&gt;, Dynamo sends a client’s read and write requests to all
replicas for a given key. For writes, the system needs to update all
replicas anyway. For reads, requests are idempotent, so the system
doesn’t necessarily &lt;em&gt;need&lt;/em&gt; to contact all replicas—should it?
Sending read requests to all replicas results in a linear increase in
load compared to sending to the minimum required number of
replicas.&lt;sup&gt;&lt;a href=&quot;#consistencynote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; For
read-dominated workloads (like many internet applications), this
optimization has a cost. When is it worthwhile?&lt;/p&gt;

&lt;p&gt;Open-source Dynamo-style stores have different answers. Apache
Cassandra originally sent reads to all replicas, but
&lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-930&quot;&gt;CASSANDRA-930&lt;/a&gt;
and
&lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-982&quot;&gt;CASSANDRA-982&lt;/a&gt;
changed this: one commenter &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-982?focusedCommentId=12973721&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12973721&quot;&gt;argued
that&lt;/a&gt;
“in IO overloaded situations” it was better to send read requests only
to the minimum number of replicas. By default, Cassandra now sends
reads to the minimum number of replicas 90% of the time and to all
replicas 10% of the time, primarily for consistency purposes.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cassandrainternalsnote&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;
(Surprisingly, the relevant JIRA issues don’t even mention the latency
impact.)  LinkedIn’s Voldemort also uses a
&lt;a href=&quot;https://github.com/voldemort/voldemort/blob/master/src/java/voldemort/store/routed/PipelineRoutedStore.java#L186&quot;&gt;send-to-minimum&lt;/a&gt;
strategy (and has evidently done so &lt;a href=&quot;https://github.com/voldemort/voldemort/blob/fbd0f95d62ac2c5e97e5a4df5a732e9342d60da1/src/java/voldemort/store/routed/RoutedStore.java#L230&quot;&gt;since it was
open-sourced&lt;/a&gt;). In
contrast, Basho Riak chooses the “true” Dynamo-style &lt;a href=&quot;https://github.com/basho/riak_kv/blob/42eb6951b369e3fd9a42f7f54fb7618a40f1a9fb/src/riak_kv_get_fsm.erl#L153&quot;&gt;send-to-all&lt;/a&gt;
read policy.&lt;/p&gt;

&lt;p&gt;Who’s right? What do these choices mean for a real NoSQL deployment?
We can do a back-of-the-envelope analysis pretty easily. For &lt;a href=&quot;http://www.bailis.org/papers/pbs-vldb2012.pdf&quot;&gt;one of
our recent papers&lt;/a&gt; on
latency-consistency trade-offs in Dynamo style systems, we obtained
latency data from Yammer’s Riak clusters. If we run some simple Monte
Carlo analysis (script available
&lt;a href=&quot;https://github.com/pbailis/bailis.org-blog/blob/master/post_data/2012-09-20/dynamo-montecarlo.py&quot;&gt;here&lt;/a&gt;),
we see that—perhaps unsurprisingly—redundant work can have a big
effect on latencies. For example, at the 99.9th percentile, sending a
single read request to two servers instead of one is 17x faster than
sending to one—maybe worth the 2x load increase. Sending reads to
three servers and waiting for one is 30x faster. Pretty good!&lt;/p&gt;

&lt;center&gt;
&lt;table cellpadding=&quot;6&quot;&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td colspan=&quot;6&quot; align=&quot;center&quot;&gt;Requests sent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td rowspan=&quot;7&quot; align=&quot;center&quot; style=&quot;padding-right:10px;&quot;&gt;Responses&lt;br /&gt;waited for&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;1&lt;/b&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;2&lt;/b&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;3&lt;/b&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;4&lt;/b&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;5&lt;/b&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;1&lt;/b&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;170.0&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;10.7&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;5.6&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;4.8&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;4.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;2&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;200.6&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;33.9&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;6.5&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;5.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;3&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;218.2&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;50.0&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;7.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;4&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;231.1&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;59.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&quot;right&quot;&gt;&lt;b&gt;5&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td align=&quot;right&quot;&gt;242.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;div style=&quot;margin-top: 5px;&quot;&gt;&lt;b&gt;99.9th percentile read latencies (in ms) for the Yammer Dynamo-style latency model.&lt;/b&gt;&lt;/div&gt;
&lt;/center&gt;

&lt;p&gt;The numbers above assume you send both requests at the same time, but
this need not be the case. For &lt;a href=&quot;https://github.com/pbailis/bailis.org-blog/blob/master/post_data/2012-09-20/dynamo-multirequest-montecarlo.py&quot;&gt;example&lt;/a&gt;, sending a second request if
the first hasn’t come back within 8ms results in a modest 4.2%
increase in requests sent and a 99.9th percentile read latency of
11.0ms. This is due to the long tail of the latency distributions we
see in Yammer’s clusters—we only have to speed up a small fraction
of queries to improve the overall performance.&lt;/p&gt;
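
&lt;p&gt;Here’s a sketch of that delayed-second-request policy (a toy model
under an assumed latency distribution, not the linked script;
&lt;code&gt;hedged&lt;/code&gt; is my own name):&lt;/p&gt;

```python
import random

# Toy model of a hedged second request: fire a backup only if the first
# reply has not arrived within hedge_ms. The effective latency is then
# min(first, hedge_ms + second); the latency distribution is whatever
# sampler the caller supplies, not real cluster data.
def hedged(draw, hedge_ms, trials=100000, seed=7):
    rng = random.Random(seed)
    lats, extra = [], 0
    for _ in range(trials):
        first = draw(rng)
        if first > hedge_ms:
            extra += 1  # a redundant request was sent
            lats.append(min(first, hedge_ms + draw(rng)))
        else:
            lats.append(first)
    lats.sort()
    p999 = lats[int(0.999 * (trials - 1))]
    return p999, extra / trials
```

&lt;p&gt;With a heavy-tailed sampler, only a small fraction of requests
trigger the backup, yet the 99.9th percentile drops substantially,
matching the pattern described above.&lt;/p&gt;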

&lt;p&gt;To preempt any dissatisfaction, I’ll admit that this analysis is
simplistic. First, I’m not considering the increased load on each
server due to sending multiple requests. The increased load may in
turn increase latencies, which would decrease the benefits we see
here. This effect depends on the system and workload. Second, I’m
assuming that each request’s latency is independently and identically
distributed. This means that each server behaves the same (according
to the Yammer latency distribution we have). This models a system of
equally loaded, equally powerful servers, but this too may be
different in practice.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cassandrasnitchnote&quot;&gt;6&lt;/a&gt;&lt;/sup&gt; Third, with a different
latency distribution, the numbers will change. Real-world benchmarking
is the best source of truth, but this analysis is a starting point,
and you can easily play with different distributions of your own
either with the provided script or in your browser using &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;an older demo&lt;/a&gt; I built.&lt;/p&gt;

&lt;p&gt;Ultimately, the balance between redundant work and tail latency
depends on the application. However, in latency-sensitive environments
(particularly when there are serial dependencies between requests)
this redundant work has a massive impact. And we’ve only begun: we
don’t need to send to &lt;em&gt;all&lt;/em&gt; replicas to see benefits—even one
extra request can help—while delay and cancellation mechanisms like the ones
that Jeff Dean hints at can further reduce load penalties. There’s a
large amount of hard work and research to be done designing,
implementing, and battle-testing these strategies, but I suspect that
these kinds of techniques will have a substantial impact on future
large-scale distributed data systems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thanks to &lt;a href=&quot;https://github.com/shivaram/&quot;&gt;Shivaram Venkataraman&lt;/a&gt;, &lt;a href=&quot;http://www.cs.berkeley.edu/~alig/&quot;&gt;Ali
 Ghodsi&lt;/a&gt;, &lt;a href=&quot;http://www.eecs.berkeley.edu/~keo/&quot;&gt;Kay
 Ousterhout&lt;/a&gt;, &lt;a href=&quot;http://www.pwendell.com/&quot;&gt;Patrick
 Wendell&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/seancribbs&quot;&gt;Sean
 Cribbs&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com/argv0&quot;&gt;Andy
 Gross&lt;/a&gt; for assistance and feedback that
 contributed to this post.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;tailnote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#tailnote&quot;&gt;[1]&lt;/a&gt; There are &lt;a href=&quot;http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it&quot;&gt;many
studies&lt;/a&gt; of the importance of latency for different services. For
example, an &lt;a href=&quot;http://www.scribd.com/doc/4970486/Make-Data-Useful-by-Greg-Linden-Amazoncom&quot;&gt;often-cited
statistic&lt;/a&gt; is that an additional 100ms of latency cost Amazon 1% of
sales. In the systems community, David Andersen attributes this
sensitivity to tail latencies to both the original Dynamo paper and
Werner Vogels&#39;s subsequent evangelizing (buried in the comments &lt;a href=&quot;https://plus.google.com/115237092509505721130/posts/cf4sbedNd2W&quot;&gt;here&lt;/a&gt;).&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;poweroftwonote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#poweroftwonote&quot;&gt;[2]&lt;/a&gt; There&#39;s a large body of &lt;a href=&quot;http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf&quot;&gt;theoretical
research&lt;/a&gt; on &quot;the power of two choices&quot; that&#39;s related to this
phenomenon: if you select the less loaded of two randomly chosen
servers instead of randomly picking one, you can exponentially improve
a cluster&#39;s load balance. Theoreticians might quibble with this
analogy: after all, here, we&#39;re usually still sending requests to both
of the servers, and the original power of two work focuses on only
sending requests to the lighter-loaded server. However, this research
is still interesting to consider as a precursor to many of the
techniques here and as the start of a more rigorous understanding of
these trade-offs.&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;deannote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#deannote&quot;&gt;[3]&lt;/a&gt; Dean also describes the effect of different
delay and cancellation mechanisms. Cancellation seems tricky to get
right depending on the application. Intercepting and canceling a
lightweight read request is harder (i.e., needs to be faster) than,
say, canceling a slower, more complex query. &lt;a href=&quot;https://www.usenix.org/conference/hotcloud12/why-let-resources-idle-aggressive-cloning-jobs-dolly&quot;&gt;Recent
work&lt;/a&gt; from some of my colleagues at UC Berkeley demonstrates
alternative algorithms for limiting the overhead of redundant work in
general-purpose cluster computing frameworks like Hadoop.&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;consistencynote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#consistencynote&quot;&gt;[4]&lt;/a&gt; Determining the minimum number of
replicas to read from depends on the desired consistency of the
operation. This is a complicated subject and is related to (but too
involved for) the main discussion here. In short, if we denote the
number of replicas as &lt;em&gt;N&lt;/em&gt;, the number of replicas to block for
during reads as &lt;em&gt;R&lt;/em&gt; and equivalently for writes as &lt;em&gt;W&lt;/em&gt;,
&lt;em&gt;R&lt;/em&gt;+&lt;em&gt;W&lt;/em&gt; &amp;gt; &lt;em&gt;N&lt;/em&gt; means you&#39;ll read your writes (each
key acts as a &lt;a href=&quot;http://stackoverflow.com/a/8872960&quot;&gt;regular
register&lt;/a&gt;), while anything less gives you weak guarantees. For more
info, check out &lt;a href=&quot;http://pbs.cs.berkeley.edu/#demo&quot;&gt;some
research we recently did&lt;/a&gt; on these latency-consistency
trade-offs.&lt;/div&gt;
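
&lt;p&gt;The quorum-intersection rule in the footnote above is easy to
sanity-check by brute force (illustrative code, not from the post):&lt;/p&gt;

```python
from itertools import combinations

# Brute-force check of the R+W quorum rule: with N replicas, every
# R-sized read quorum intersects every W-sized write quorum exactly
# when R + W exceeds N.
def quorums_intersect(n, r, w):
    replicas = range(n)
    return all(set(rq).intersection(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))
```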

&lt;div class=&quot;footnote&quot; id=&quot;cassandrainternalsnote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cassandrainternalsnote&quot;&gt;[5]&lt;/a&gt; For readers
into hardcore Cassandra internals: currently, Cassandra fetches the
data from one server and requests &quot;digests,&quot; or value hashes, from the
remaining (&lt;em&gt;R-1&lt;/em&gt;) servers. In the case that the digests don&#39;t
match, Cassandra will perform another round of requests for the actual
values. Now, if a mechanism called &lt;a href=&quot;http://wiki.apache.org/cassandra/ReadRepair&quot;&gt;&lt;em&gt;read
repair&lt;/em&gt;&lt;/a&gt; is enabled, then Cassandra will randomly (as a
configurable parameter) send digest requests to all replicas (as
opposed to just the &lt;em&gt;R-1&lt;/em&gt;). In the original patch, the
&lt;em&gt;default&lt;/em&gt; probability of sending to all &lt;a href=&quot;https://github.com/apache/cassandra/blob/e5477338458c3a0229d4fbe659231002ac154583/src/java/org/apache/cassandra/config/CFMetaData.java#L52&quot;&gt;was
100%&lt;/a&gt; (all the time); &lt;a href=&quot;https://github.com/apache/cassandra/blob/a500e2835748f19d0c11bc3dfcecc71c50d9cf7e/src/java/org/apache/cassandra/config/CFMetaData.java#L67&quot;&gt;it
is now 10%&lt;/a&gt; (due&amp;#8212;somewhat cryptically&amp;#8212;to &lt;a href=&quot;https://issues.apache.org/jira/browse/CASSANDRA-3169&quot;&gt;CASSANDRA-3169&lt;/a&gt;). (Sources:
original patch &lt;a href=&quot;https://github.com/apache/cassandra/commit/e5477338458c3a0229d4fbe659231002ac154583#L5L394&quot;&gt;here&lt;/a&gt;;
latest code &lt;a href=&quot;https://github.com/apache/cassandra/blob/b38ca2879cf1cbf5de17e1912772b6588eaa7de6/src/java/org/apache/cassandra/service/StorageProxy.java#L859&quot;&gt;here&lt;/a&gt;;
filtering of endpoints done &lt;a href=&quot;https://github.com/apache/cassandra/blob/3a2faf9424769cfee5fdad25f4513611820ca980/src/java/org/apache/cassandra/service/ReadCallback.java#L97&quot;&gt;here&lt;/a&gt;)
This digest-based scheme is a substantial deviation from the Dynamo
design and exposes a number of other interesting trade-offs that
probably deserve further examination.&lt;/div&gt;
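The read path this footnote describes can be sketched as follows (an illustrative toy model only; the helper names and the max-timestamp reconciliation are my assumptions, not Cassandra's actual internals):

```python
import hashlib
import random

def digest(value):
    """Value hash standing in for Cassandra's digest (hypothetical)."""
    return hashlib.md5(value.encode()).hexdigest()

def digest_read(replicas, r, read_repair_chance=0.1):
    """Fetch data from one replica and digests from the other R-1.
    If any digest disagrees, do a second round for the full values."""
    data_replica, rest = replicas[0], replicas[1:r]
    value = data_replica["value"]
    mismatch = any(digest(rep["value"]) != digest(value) for rep in rest)
    if mismatch:
        # Second round: reconcile full values (here: keep the max timestamp).
        value = max(replicas[:r], key=lambda rep: rep["ts"])["value"]
    if read_repair_chance > random.random():
        pass  # placeholder: would send digest requests to all N replicas
    return value
```

Usage: `digest_read([{"value": "old", "ts": 1}, {"value": "new", "ts": 2}, {"value": "new", "ts": 2}], 3)` detects the digest mismatch and falls back to the full-value round.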

&lt;div class=&quot;footnote&quot; id=&quot;cassandrasnitchnote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#cassandrasnitchnote&quot;&gt;[6]&lt;/a&gt; For example, Cassandra&#39;s
&quot;endpoint snitches&quot; keep track of which nodes are &quot;closest&quot; according
to several different, configurable dimensions, including &lt;a href=&quot;https://github.com/apache/cassandra/blob/a500e2835748f19d0c11bc3dfcecc71c50d9cf7e/src/java/org/apache/cassandra/locator/DynamicEndpointSnitch.java&quot;&gt;historical
latencies&lt;/a&gt;. Depending on the configuration, if a given node is slow
or overloaded, Cassandra may choose not to read from it. I haven&#39;t
seen a performance analysis of this strategy, but, at first glance, it
seems reasonable.&lt;/div&gt;

</content>
 </entry>
 
 <entry>
   <title>Safety and Liveness: Eventual Consistency Is Not Safe</title>
   <link href="http://bailis.org/blog//safety-and-liveness-eventual-consistency-is-not-safe/"/>
   <updated>2012-03-27T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//safety-and-liveness-eventual-consistency-is-not-safe</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: Eventual consistency is a liveness property—not a safety property—and is trivially satisfiable by itself. Liveness and safety properties should be taken together.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Safety and liveness are two important kinds of properties provided by &lt;a href=&quot;http://pi1.informatik.uni-mannheim.de/filepool/teaching/dependablesystems-2007/PDS_20070306.pdf&quot;&gt;all distributed systems&lt;/a&gt;. Informally, safety guarantees promise that nothing bad happens, while liveness guarantees promise that something good eventually happens. Every distributed system makes some form of safety and liveness guarantees, and some are stronger than others. For example, &lt;a href=&quot;http://en.wikipedia.org/wiki/Linearizability&quot;&gt;atomic consistency&lt;/a&gt; guarantees that operations will appear to happen instantaneously across the system (safety) but operations won’t always succeed in the presence of network partitions (liveness, in the form of availability).&lt;/p&gt;

&lt;p&gt;Many of today’s distributed systems promise &lt;a href=&quot;http://en.wikipedia.org/wiki/Eventual_consistency&quot;&gt;eventual consistency&lt;/a&gt;: after some period of time, all participants in the system agree on the same value. This is a useful property: good things will eventually happen without the need for intervention, even in the presence of partitions. However, under our definitions of safety and liveness, eventual consistency only provides liveness guarantees, not safety: Which value is eventually chosen? What values may be returned before participants “eventually” agree?&lt;/p&gt;

&lt;p&gt;As &lt;a href=&quot;http://www.cs.utexas.edu/users/princem/papers/cac-tr.pdf&quot;&gt;recent work from UT Austin&lt;/a&gt; points out, it’s easy to satisfy liveness without being useful. If all replicas always return the initial state, the system is eventually consistent. If all replicas return the value 42 in response to every request (even if you didn’t write the value of 42), the system is eventually consistent. If replicas accept only every thousandth write, the system is eventually consistent. These guarantees are plainly not what we want, but they satisfy our definition of eventual consistency. Moreover, as the authors explain, accepting more read/write combinations doesn’t necessarily translate to &lt;em&gt;stronger consistency&lt;/em&gt;. We’d like some notion of &lt;em&gt;convergence&lt;/em&gt; that captures both agreement on a common shared state and the exchange of writes.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#strengthnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
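To make the point concrete, here is a deliberately useless toy store (mine, not from the UT Austin paper) that nonetheless satisfies the letter of eventual consistency:

```python
# A deliberately useless key-value "store" that is trivially eventually
# consistent: every replica ignores writes and always returns 42, so all
# replicas agree at all times -- yet the store tells you nothing useful.
class FortyTwoStore:
    def write(self, key, value):
        pass            # drop every write on the floor

    def read(self, key):
        return 42       # replicas trivially "converge" on 42

s = FortyTwoStore()
s.write("x", "important data")
assert s.read("x") == 42   # agreement achieved; safety absent
```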

&lt;p&gt;Today’s eventually consistent systems do provide some form of safety properties, even if they don’t say so explicitly. For instance, in Riak, Cassandra, and DynamoDB, timestamp ordering is often used to decide which version of a data item to keep. Moreover, these data stores won’t return any values you haven’t written to them, and replicas will converge to the last written value for each key. In short, many “eventually consistent” stores really offer something like “eventually last-writer-wins, and read-the-last-observed-value in the meantime” consistency. This is both more descriptive and more useful than a vanilla “eventual consistency” guarantee.&lt;sup&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#vendornote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
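The "eventually last-writer-wins" behavior can be sketched in a few lines (an illustrative model, not any particular store's implementation):

```python
# Sketch of last-writer-wins convergence: each replica holds a
# (timestamp, value) pair, and merging keeps the highest timestamp.
def lww_merge(a, b):
    """Return the winning (timestamp, value) pair."""
    return max(a, b)   # tuples compare by timestamp first

replica1 = (10, "hello")
replica2 = (12, "world")
# Merge order doesn't matter: both replicas converge to the latest write.
assert lww_merge(replica1, replica2) == (12, "world")
assert lww_merge(replica2, replica1) == (12, "world")
```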

&lt;p&gt;It’s worth noting that safety without convergence also leads to problems. Read-your-writes, PRAM/monotonic writes, and causal consistency guarantees are trivially achievable using only local storage and no communication: simply keep a local copy of every key you update, and serve every read from that copy. This is not a convergent implementation. However, it satisfies &lt;a href=&quot;http://www.allthingsdistributed.com/2008/12/eventually_consistent.html&quot;&gt;each of these consistency models&lt;/a&gt; because they make safety but not liveness guarantees. If we were to add in our liveness requirement of convergence, our implementation would have to propagate writes between replicas.&lt;/p&gt;
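The trivial, non-convergent implementation looks like this (a toy sketch; names are mine):

```python
# Sketch of the trivial local-only implementation: a client that serves
# reads from its own local copy satisfies read-your-writes (and monotonic
# reads/writes) without ever communicating with other replicas -- which is
# exactly why safety properties alone are not enough.
class LocalOnlyClient:
    def __init__(self):
        self.local = {}

    def write(self, key, value):
        self.local[key] = value   # never propagated to any other replica

    def read(self, key):
        return self.local.get(key)

c = LocalOnlyClient()
c.write("x", 1)
assert c.read("x") == 1   # read-your-writes holds; convergence does not
```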

&lt;p&gt;Next time someone tells you their system is “eventually consistent,” ask them two questions: What versions of a data item can be returned at any time? What version will the system eventually choose to return? And remember: consider safety and liveness properties together. Otherwise, you probably have a trivially satisfiable requirement.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was influenced in large part by discussions with &lt;a href=&quot;http://www.sics.se/~ali/&quot;&gt;Ali Ghodsi&lt;/a&gt;, &lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot;&gt;Joe Hellerstein&lt;/a&gt;, and &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/&quot;&gt;Ion Stoica&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div id=&quot;footnotetitle&quot;&gt;Footnotes&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;strengthnote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#strengthnote&quot;&gt;[1]&lt;/a&gt; Eventual convergence is likely the strongest convergence property we can guarantee given unbounded partition durations. Any system guaranteeing non-trivial convergence within a fixed amount of time would violate its liveness guarantees if partitioned for a longer period of time.&lt;/div&gt;

&lt;div class=&quot;footnote&quot; id=&quot;vendornote&quot;&gt;&lt;a class=&quot;no-decorate&quot; href=&quot;#vendornote&quot;&gt;[2]&lt;/a&gt; In their technical documentation, vendors are usually forthcoming about these details, though they can be difficult to pick out. However, in promotional material and especially when making superficial comparisons, these distinctions are often omitted or glossed over.&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>A Running List: Writing, Speaking, and Research Advice</title>
   <link href="http://bailis.org/blog//a-running-list-writing-talking-and-research-advice/"/>
   <updated>2012-03-17T00:00:00-07:00</updated>
   <id>http://bailis.org/blog//a-running-list-writing-talking-and-research-advice</id>
   <content type="html">&lt;p&gt;This is a growing list of other people’s advice that I’ve found
useful, posted here mostly for my own reference. (As if there
weren’t enough of these already.)&lt;/p&gt;

&lt;h2 id=&quot;on-writing&quot;&gt;On Writing&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/The_Elements_of_Style&quot;&gt;The Elements of Style&lt;/a&gt; by Strunk and White&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; A classic on good prose and English writing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.eecs.harvard.edu/margo/writing.html&quot;&gt;Pet Peeves for Writing&lt;/a&gt; by Margo Seltzer&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Read it. It’s short. Don’t do these things.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.cs.ucla.edu/~kohler/latex.html&quot;&gt;LaTeX Usage Notes&lt;/a&gt; by Eddie Kohler&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; A (beautifully typeset) document on proper LaTeX formatting, typography, and writing tips.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://matt.might.net/articles/shell-scripts-for-passive-voice-weasel-words-duplicates/&quot;&gt;Shell Scripts for Editing&lt;/a&gt; by Matt Might&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Automatically highlight passive voice, weasel words, and lexical illusions using these scripts. (&lt;a href=&quot;https://github.com/devd/Academic-Writing-Check&quot;&gt;Also good&lt;/a&gt;) Thanks to Colin Scott and Shaddi Hasan.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;on-speaking&quot;&gt;On Speaking&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://zachholman.com/posts/slide-design-for-developers/&quot;&gt;Slide Design for Developers&lt;/a&gt; by Zach Holman&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Practical tips for making your slides better: use color and huge fonts, and treat your slides as a prop, not a crutch.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
&lt;p&gt;&lt;a href=&quot;http://www.wired.com/wired/archive/11.09/ppt2.html&quot;&gt;The Cognitive Style of PowerPoint&lt;/a&gt; by Edward Tufte&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Tufte provides many examples of how not to make slides, and his perspective on the “projector operating system” is valuable.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;on-research&quot;&gt;On Research&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;p&gt;&lt;a href=&quot;http://www.cs.virginia.edu/~robins/YouAndYourResearch.html&quot;&gt;You and Your Research&lt;/a&gt; by Richard Hamming&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; How to ask the right questions and scope your research for maximum impact by a guy who did both.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.lhup.edu/~DSIMANEK/cargocul.htm&quot;&gt;Cargo Cult Science&lt;/a&gt; by Richard Feynman&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Don’t let PR, hype, or zeitgeist interfere with real science.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://cs.berkeley.edu/~pattrsn/talks/BadCareer.pdf&quot;&gt;How to Have a Bad Career in Research/Academia&lt;/a&gt; by Dave Patterson&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Good advice on pitfalls of graduate school/academia and how to avoid them.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://matt-welsh.blogspot.com/2011/11/software-is-not-science.html&quot;&gt;Software is not science&lt;/a&gt; by Matt Welsh&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; Systems research is about principles, not artifacts.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.1614&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Database Metatheory: Asking the Big Queries&lt;/a&gt;
by Christos Papadimitriou&lt;br /&gt;
&lt;em&gt;tl;dr:&lt;/em&gt; An awesome reflection on theory versus practice by one of the great CS theoreticians.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>What's Wrong with Amazon's DynamoDB Pricing?</title>
   <link href="http://bailis.org/blog//whats-wrong-with-amazons-dynamodb-pricing/"/>
   <updated>2012-03-04T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//whats-wrong-with-amazons-dynamodb-pricing</id>
   <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: There’s no good reason why strong consistency should cost double what eventual consistency costs. Strong consistency in DynamoDB shouldn’t cost Amazon anywhere near double and wouldn’t cost you double if you ran your own data store. While the benefit to your application may not be worth this high price, hiding consistency statistics encourages you to overpay.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Amazon recently released an open beta for &lt;a href=&quot;http://aws.amazon.com/dynamodb/&quot;&gt;DynamoDB&lt;/a&gt;, a hosted, “fully managed NoSQL database service.”  DynamoDB automatically handles database scaling, configuration, and replica maintenance under a pay-for-throughput model. Interestingly, DynamoDB costs &lt;a href=&quot;http://aws.amazon.com/dynamodb/pricing/&quot;&gt;twice as much for consistent reads&lt;/a&gt; compared to &lt;a href=&quot;http://en.wikipedia.org/wiki/Eventual_consistency&quot;&gt;eventually consistent&lt;/a&gt; reads. This means that if you want to be guaranteed to read the data you last wrote, you need to pay double what you could be paying otherwise.&lt;/p&gt;

&lt;p&gt;Now, I can’t come up with any good &lt;em&gt;technical&lt;/em&gt; explanation for why consistent reads cost 2x (and this is what I’m &lt;a href=&quot;http://bailis.org/research.html&quot;&gt;handsomely compensated&lt;/a&gt; to do all day). As far as I can tell, this is purely a &lt;em&gt;business&lt;/em&gt; decision that’s &lt;strong&gt;bad for users&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The cost of strong consistency to Amazon is low, if not zero. To you? 2x.&lt;/li&gt;
  &lt;li&gt;If you were to run your own distributed database, you wouldn’t incur this cost (although you’d have to factor in hardware and ops costs).&lt;/li&gt;
  &lt;li&gt;Offering a “consistent write” option instead would save you money and latency.&lt;/li&gt;
  &lt;li&gt;If Amazon provided SLAs so users knew how well eventual consistency worked, users could make more informed decisions about their app requirements and DynamoDB. However, Amazon probably wouldn’t be able to charge so much for strong consistency. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’d love if someone (read: from Amazon) would tell me why I’m wrong.&lt;/p&gt;

&lt;h2 id=&quot;the-cost-of-consistency-for-amazon-is-low-if-not-zero-to-you-2x&quot;&gt;The cost of consistency for Amazon is low, if not zero. To you? 2x.&lt;/h2&gt;

&lt;p&gt;Regardless of the replication model Amazon chose, I don’t think it’s possible that it’s costing them 2x to perform consistent reads compared to eventually consistent reads:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamo-style replication.&lt;/strong&gt; If Amazon decided to stick to their original &lt;a href=&quot;http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&quot;&gt;Dynamo&lt;/a&gt; architecture, then all reads and writes are sent to all replicas in the system. When you perform an eventually consistent read, it means that DynamoDB gives you an answer when the first replica (or first few replicas) reply. This cuts down on latency: after all, waiting for the first of three replicas to reply is faster than waiting for a single replica to respond! However, all of the replicas will eventually respond to your request, whether you wait for them or not.&lt;/p&gt;

&lt;p&gt;This means that eventually consistent reads in DynamoDB would use the same number of messages, amount of bandwidth, and processing power as strongly consistent reads.  It shouldn’t cost &lt;em&gt;Amazon&lt;/em&gt; any more to perform consistent reads, but it costs &lt;em&gt;you&lt;/em&gt; double.&lt;/p&gt;
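A sketch of this "send to all N, block for the first R" read path (hypothetical helpers, not DynamoDB's or Dynamo's API). Note that the same N requests go out either way; the consistency level only changes how many responses we wait for:

```python
import concurrent.futures

def quorum_read(replica_fns, r):
    """Send a read to every replica; return once the first r respond.
    replica_fns is a list of N zero-argument callables, one per replica."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn) for fn in replica_fns]  # N requests sent
        responses = []
        for f in concurrent.futures.as_completed(futures):
            responses.append(f.result())
            if len(responses) >= r:   # block for only R of the N replies
                break
        return responses
```

With `r=1` (eventually consistent) and `r=2` of three (strongly consistent under R+W>N with W=2), the message count is identical; only the wait differs.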

&lt;p&gt;If you ran your own Dynamo-style NoSQL database, like &lt;a href=&quot;http://basho.com/&quot;&gt;Riak&lt;/a&gt;, &lt;a href=&quot;http://cassandra.apache.org/&quot;&gt;Cassandra&lt;/a&gt;, or &lt;a href=&quot;http://project-voldemort.com/&quot;&gt;Voldemort&lt;/a&gt;, you wouldn’t have this artificial cost increase when choosing between eventual and strongly consistent reads.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Possible explanations:&lt;/em&gt; Maybe Amazon didn’t adopt the Dynamo design’s “send-to-all” semantics. This would save bandwidth but cause a &lt;em&gt;significant&lt;/em&gt; latency hit. However, Amazon might save in terms of messages and I/O without compromising latency if they chose not to send read requests to remote sites (e.g., across availability zones) for eventually consistent reads. Another possibility is that, because eventual consistency is faster, transient effects like queuing are reduced, which helps with their back-end provisioning. Regardless, none of these possibilities necessitate a 2x price increase. Note that the Dynamo model covers most variants of quorum replication plus anti-entropy. &lt;strong&gt;edit: As I mentioned, it’s possible that weak consistency saves an extra I/O per read. However, I’m still unconvinced that this leads to a 2x TCO increase.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cost to Amazon: probably zero&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Master-slave replication.&lt;/strong&gt; If Amazon has chosen a master-slave replication model in DynamoDB, eventually consistent reads could go to a slave, reducing load on the master. However, this would mean Amazon is performing some kind of &lt;em&gt;monetary-cost-based&lt;/em&gt; load-balancing, which seems strange and hard to enforce. Even if this is the case, is doubling the price really the right setting for proper load-balancing at all times and for all data? Is the 2x cost bump really necessary? I’m not convinced.&lt;/p&gt;

&lt;p&gt;If you ran your own NoSQL store that used master-slave replication, like &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt; or &lt;a href=&quot;http://www.mongodb.org/&quot;&gt;MongoDB&lt;/a&gt;, you wouldn’t be faced with this 2x cost increase for strong consistency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cost to Amazon: increased master load. 2x load? Probably not, given proper load balancing.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;read-vs-write-cost&quot;&gt;Read vs. Write Cost&lt;/h2&gt;

&lt;p&gt;Amazon decided to place this extra charge on the read path. Instead, they could have offered a “consistent write” option, where all subsequent reads would return the data you wrote. This would slow down consistent writes (and speed up reads). I’d wager that the vast majority of DynamoDB operations are reads, so this “consistent write” option would decrease Amazon’s revenue compared to the current “consistent read” option. So, compared to a consistent write option, you’re currently getting charged more, and the majority of your DynamoDB operations are slower.&lt;/p&gt;

&lt;h2 id=&quot;fud-helps-sell-warranties&quot;&gt;FUD Helps Sell Warranties&lt;/h2&gt;

&lt;p&gt;Amazon is vague about eventual consistency in DynamoDB. Amazon &lt;a href=&quot;http://aws.amazon.com/dynamodb/faqs/#What_is_the_consistency_model_of_Amazon_DynamoDB&quot;&gt;says that&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Consistency across all copies of data is usually reached within a second. Repeating a[n eventually consistent] read after a short time should return the updated data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What does “usually” mean? What’s “a short time”? These are imprecise metrics, and they certainly aren’t guarantees. This may be intentional.&lt;/p&gt;

&lt;p&gt;Best Buy tries to sell you an in-house extended warranty when you buy a new gadget. Most of the time, you don’t buy the warranty because you make a judgment that the chance of failure isn’t worth the price of the warranty. With DynamoDB, you have no idea what the likelihood of inconsistency is, except that “a second” is “usually” long enough. What if you don’t want to wait?  What about the best case? Or the worst? Amazon &lt;em&gt;could&lt;/em&gt; release a distribution for these delays so you could make an informed decision, but they don’t. Why not?&lt;/p&gt;

&lt;p&gt;It’s not for technical reasons. Amazon has to know this distribution. I’ve spent the last six months thinking about &lt;a href=&quot;http://bailis.org/projects/pbs/#demo&quot;&gt;how to provide SLAs for consistency&lt;/a&gt;. Our new &lt;a href=&quot;http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-4.pdf&quot;&gt;techniques&lt;/a&gt; show that you can make predictions with arbitrary precision simply by measuring message delays. Even without our work, I’d bet all my chips on the fact that Amazon has tested the hell out of their service and know exactly how &lt;em&gt;eventual&lt;/em&gt; eventual consistency is in DynamoDB. After all, if the window of inconsistency was prohibitively long, it’s unlikely that Amazon would offer eventual consistency as an option in the first place.&lt;/p&gt;

&lt;h2 id=&quot;with-more-information-most-customers-probably-wouldnt-pay-2x&quot;&gt;With more information, most customers probably wouldn’t pay 2x&lt;/h2&gt;

&lt;p&gt;So why doesn’t Amazon release this data to customers? There are several business reasons that disincentivize them from doing so, but the basic idea is that &lt;em&gt;it’s not clear that strong consistency delivers a 2x value to the customer in most cases&lt;/em&gt;. However, without the data, customers can’t make this call for themselves without a lot of benchmarking. And profiling doesn’t buy you any guarantees like an SLA.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
&lt;p&gt;If users knew what “usually” really meant, they could make intelligent decisions about pricing. How many people would pay double for strong consistency if they’d only have to wait a few tens of milliseconds on &lt;em&gt;average&lt;/em&gt; for a consistent read or, conversely, knew that data would be at most a few tens of milliseconds stale? What if 99.9% of reads were fresh within 100ms? 200ms? 500ms? 1000ms? Without the distribution, it’s hard to judge whether eventual consistency works for a given app.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Related: if your probability of consistent reads is sufficiently high (say, 99.999%) after normal client-side round-trip times (which are often long), do you really care about strong consistency? If you did, you could also intentionally block your readers until they would read consistent data with a high enough probability.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you have “&lt;a href=&quot;http://www.allthingsdistributed.com/2008/12/eventually_consistent.html&quot;&gt;stickiness&lt;/a&gt;” and your requests are always sent to a given DynamoDB instance (which &lt;em&gt;you&lt;/em&gt; can’t enforce but is a likely design choice!), then you may &lt;em&gt;never&lt;/em&gt; read inconsistent data, even under eventual consistency. Even if this happens only some of the time, you’ll see less stale data. This is due to what are known as &lt;a href=&quot;http://www.cs.utexas.edu/~dahlin/Classes/GradOS/papers/SessionGuaranteesPDIS.pdf&quot;&gt;session guarantees&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;It’s not specified how much slower a consistent read is compared to an eventually consistent read, so it’s possible that the wait time until (a high probability of) consistency plus the latency of an eventually consistent read is actually &lt;em&gt;lower&lt;/em&gt; than the latency of a strongly consistent read.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you store timestamps with your data, you can guarantee you don’t read old data (another session guarantee) and, if you’re dissatisfied, you can repeat your eventually consistent read. You can calculate the expected cost of reading consistent data across multiple eventually consistent reads—if you have this data.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
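The expected cost in the last bullet is easy to work out (all prices and probabilities here are hypothetical):

```python
# Back-of-the-envelope for retrying eventually consistent reads: if a read
# is fresh with probability p, retries form a geometric distribution, so
# the expected number of reads until a fresh one is 1/p.
def expected_cost(p_fresh, eventual_price, consistent_price):
    """Return (expected spend on retried eventual reads, consistent price)."""
    expected_reads = 1.0 / p_fresh
    return expected_reads * eventual_price, consistent_price

# Example: 90% of eventually consistent reads are already fresh, and a
# consistent read costs 2x. Retrying is cheaper in expectation:
eventual, consistent = expected_cost(0.9, 1.0, 2.0)
assert consistent > eventual   # about 1.11 units vs. 2 units
```

The break-even point is p = eventual_price / consistent_price, which is exactly the kind of call you cannot make without knowing the distribution.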

&lt;p&gt;I don’t think many people would pay 2x for their reads if they were provided with this data or some consistency SLA.  Maybe some total &lt;em&gt;rockstars&lt;/em&gt; are okay with this vagueness, but it sure seems hard to design a reliable service without any consistency guarantees from your backend. By only giving users an extremely conservative upper bound on the &lt;em&gt;eventuality&lt;/em&gt; of reads (say, if your clients switch between data centers between reads and writes &lt;em&gt;and&lt;/em&gt; DynamoDB experiences extremely high message delays), Amazon may be scaring the average (prudent) user into paying more than they probably should. &lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Strong consistency in DynamoDB probably doesn’t cost Amazon much (if anything) but it costs you twice as much as eventual consistency. It’s highly likely that Amazon has quantitative metrics for eventual consistency in DynamoDB, but keeping those numbers private makes you more likely to pay extra for guaranteed consistency. Without those numbers, it’s much harder for you to reason about your application’s behavior under average-case eventual consistency.&lt;/p&gt;

&lt;p&gt;Sometimes you absolutely need strong consistency. Pay in those cases. However, especially for web data, eventual is often good enough. The current problem with DynamoDB is that, because customers don’t have access to quantitative metrics about their window of inconsistency, it’s easy for Amazon to set prices irrationally high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer: I (perhaps obviously) have no privileged knowledge of Amazon’s AWS architecture (or any other parts of Amazon, for that matter). &lt;a href=&quot;http://bailis.org/research.html&quot;&gt;I just happen to spend a lot of time thinking about and working on cool distributed systems problems&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Huchra's Seven Characteristics of a Successful Scientist</title>
   <link href="http://bailis.org/blog//huchra-s-seven-characteristics-of-a-successful-scientist/"/>
   <updated>2012-02-10T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//huchra-s-seven-characteristics-of-a-successful-scientist</id>
   <content type="html">In my freshman fall at Harvard, I was fortunate to enroll in a seminar on &lt;a href=&quot;http://en.wikipedia.org/wiki/Cosmology&quot;&gt;cosmology&lt;/a&gt; with &lt;a href=&quot;http://en.wikipedia.org/wiki/John_Huchra&quot;&gt;John Huchra&lt;/a&gt;. In case you&#39;re not an astrophysics buff, among other things, Huchra was one of the first astronomers to experimentally observe&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Great_Wall_(astronomy)&quot;&gt;large-scale structures in the universe&lt;/a&gt;&amp;nbsp;using wide-ranging galaxy surveys and, until shortly before his &lt;a href=&quot;http://www.nytimes.com/2010/10/14/us/14huchra.html&quot;&gt;untimely death&lt;/a&gt;, he served as the president of the &lt;a href=&quot;http://en.wikipedia.org/wiki/American_Astronomical_Society&quot;&gt;American Astronomical Society&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;While my formal study of astronomy ended after Huchra&#39;s seminar (although I do lug my telescope out when visiting my parents in Nebraska), Huchra left a strong impression on me. There are certain individuals who exemplify my ideal of a great scientist, who are not only successful but demonstrate a sense of real&amp;nbsp;&lt;i&gt;purpose&lt;/i&gt; in their research and enthusiasm for their subject. Huchra&#39;s weekly seminar and our often extended discussions afterward were probably my first extended contact with a real-life, practicing professional scientist. 
Huchra&#39;s excitement about his field, his ability to communicate heady concepts such as galaxy formation and cosmic microwave background radiation to a group of university freshmen, and his passion for science heavily influenced my own career path, even if I didn&#39;t realize it at the time.&lt;br /&gt;&lt;br /&gt;Some time ago, I found a highly accessible (draft) essay Huchra wrote titled &lt;a href=&quot;https://www.cfa.harvard.edu/~dfabricant/huchra/mapmaker.pdf&quot;&gt;&quot;Mapmaker, Mapmaker Make Me a Map&quot;&lt;/a&gt;, describing his career path and research. Huchra&#39;s writing captures his personality well and is worth a read in itself, but I like the essay also for his formulation of the characteristics of a &quot;successful scientist&quot;:&lt;br /&gt;&lt;blockquote class=&quot;tr_bq&quot;&gt;…an individual&#39;s success in the [science] game could be predicted by their characteristics in a seven vector space. Each vector measured a critical personal characteristic or set of characteristics such as intelligence, taste and luck, and the ability to tell one&#39;s story (public relations)…&lt;/blockquote&gt;&lt;blockquote class=&quot;tr_bq&quot;&gt;Looking back on this I&#39;ve come to realize that being nearly a unit vector in any one of the important characteristics pretty much guarantee&#39;s[sic] you a tenured job, two are good for membership in the National Academy, and three put you in contention for the Nobel Prize.&lt;/blockquote&gt;I haven&#39;t seen Huchra&#39;s seven characteristics elsewhere, but I find them particularly interesting:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Raw Intelligence&lt;/li&gt;&lt;li&gt;Knowledge&lt;/li&gt;&lt;li&gt;Public Relations&lt;/li&gt;&lt;li&gt;Creativity&lt;/li&gt;&lt;li&gt;Taste&lt;/li&gt;&lt;li&gt;Effectiveness&lt;/li&gt;&lt;li&gt;Competence&lt;/li&gt;&lt;/ul&gt;Huchra also adds that &quot;many people would want to add &#39;luck&#39; to the list, but our learned conclusion was that luck is a product of at least 
three of the above vectors and not an attribute in and of itself.&quot;&amp;nbsp;&amp;nbsp;Moreover, &quot;some vectors are worth more than others, [sic] for example [sic] taste and creativity are probably more important than knowledge.&quot;&lt;br /&gt;&lt;br /&gt;Now, as a greenhorn in Computer Science, I&#39;ll not speculate like Huchra does in his essay as to who in my own field constitutes a &quot;unit vector&quot; for each characteristic, nor will I attempt to publicly analyze my own strengths and weaknesses. However, I think this is a thought-provoking taxonomy that deserves reflection. After all, being the smartest person in the room can get you pretty far, but, as with most life pursuits, (perhaps thankfully for most of us) there&#39;s often a lot more to success than raw intelligence by itself.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;A version of Huchra&#39;s essay also appears as a chapter of&amp;nbsp;&lt;u&gt;Our Universe: The Thrill of Extragalactic Exploration as Told by Leading Experts&lt;/u&gt;, edited by Alan Stern (Cambridge University Press, 2001).&lt;/i&gt;
</content>
 </entry>
 
 <entry>
   <title>CCC Post: Why am I in graduate school?</title>
   <link href="http://bailis.org/blog//ccc-post-why-am-i-in-graduate-school/"/>
   <updated>2011-12-19T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//ccc-post-why-am-i-in-graduate-school</id>
   <content type="html">&lt;div&gt;&lt;i&gt;For the 2011-12 school year, I&#39;m serving as a student blogger for the Computing Community Consortium&#39;s &quot;A Day in the Life of a Computer Science Graduate Student&quot; blog. &amp;nbsp;In the interest of archiving my posts, I&#39;m cross-posting them here.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://cra-ccc.org/lifeofacsgs/why-am-i-in-graduate-school&quot;&gt;Link to original post&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Why am I in graduate school? &amp;nbsp;Different students have different answers to this question--passion, research, lifestyle, startup incubation, you name it--and others may not have an answer. &amp;nbsp;In the final lecture of CS262, our entry-level graduate systems seminar, our professor, Eric Brewer, emphasized the importance of having a purpose in one&#39;s graduate education. &amp;nbsp;I agree, and I also think it&#39;s important for someone considering graduate studies to think about why they&#39;d like to spend five or more years on research.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Graduate school may be the logical &quot;next step&quot; for you, but, if you&#39;re considering a Ph.D. in Computer Science, you likely have other options. &amp;nbsp;Someone aptly described the financial opportunity cost of five years of graduate school to me as roughly costing a house. &amp;nbsp;From a strictly financial perspective, a grad student makes $30,000 or less per year (not including summer internships and external employment), while the going rate for engineers in the Bay Area is easily $80-100,000+ per year. &amp;nbsp;Over five years, that&#39;s at least $250,000, and that&#39;s not including salary increases, benefits, stock appreciation, and bonuses. &amp;nbsp;At the least, that number alone should make you think twice about grad school. 
&amp;nbsp;Of course, there are salary benefits from getting a Ph.D., but it&#39;s doubtful whether they&#39;re financially worthwhile (see the link to Matt&#39;s blog at the end of this post).&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I thought I&#39;d share three of my personal reasons here. &amp;nbsp;Maybe after five years I&#39;ll be jaded, but I hope to retain my optimism and will do my best to make it last.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;I get to work on (almost) whatever I want.&lt;/b&gt; As a grad student, you can work on whatever you can convince your advisor to let you work on. If a project has research potential and intellectual merit, you can probably convince someone to let you work on it. If you&#39;re able to choose your advisor, your options are as broad as your department. Moreover, while your advisor typically funds you, external fellowships and funding give you additional freedom. &amp;nbsp;This freedom is a double-edged sword: if you don&#39;t like what you&#39;re doing, it means you (likely) chose poorly! &amp;nbsp;I consider myself lucky because there are a ton of exciting projects at Berkeley and because I received external funding. &amp;nbsp;In fact, I changed my initial plans to defer enrollment for a year (and, in doing so, turned down enticing compensation from an industry position) to keep my external funding and the freedom it provides me.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;The people are great.&lt;/b&gt; Being around passionate people is invigorating. &amp;nbsp;Learning from and debating with other people in my field who are thinking about problems I also find stimulating is irreplaceable. &amp;nbsp;I think you can find this kind of concentrated community in many areas of society, but, at its core, I see science as a collective effort towards greater human knowledge that (at its best) lends itself to interaction and collaboration. 
&amp;nbsp;In CS in particular, researchers have chosen to sacrifice monetary gain (especially at the top of the field, often working as hard as or harder than they would in industry) to be in academia. &amp;nbsp;Maybe this means we&#39;re all a little crazy, but my colleagues give me a lot of energy.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;I get to (try to) change the world.&lt;/b&gt; This is quite optimistic, but, in the very &lt;a href=&quot;http://en.wikipedia.org/wiki/Turing_Award#Recipients&quot;&gt;long&lt;/a&gt; &lt;a href=&quot;http://en.wikipedia.org/wiki/List_of_Nobel_laureates#Laureates&quot;&gt;tail&lt;/a&gt; of the distribution of researchers, you&#39;ll find people who changed the way we view and interact with the world. I don&#39;t anticipate being one of these select few, but I want to try my best to make a difference. Academic CS is in an odd position where there&#39;s an industry making money on the field as well, often coming up with better ideas than academia does. &amp;nbsp;However, our goals are different: in academia, we don&#39;t need to make a profit, meet a bottom line, or ship production code. &amp;nbsp;I think this affords us the opportunity for longer-term research and impact. As academics, we don&#39;t always get this right, but I think the struggle is worthwhile. &amp;nbsp;I do believe that one can make meaningful contributions and have real-world impact in industry--it&#39;s probably easier to do so there as well! &amp;nbsp;In fact, having good entrepreneurship skills is quite similar to having good research taste and execution ability. 
&amp;nbsp;However, even if I don&#39;t change the world directly, I have faith that my (would-be) future students will!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I&#39;m sure these reasons will evolve as I continue in graduate school, but, to me, having reasons and understanding why I&#39;m here is fundamental to getting what I want out of the experience.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Matt Welsh has a particularly good overview of some additional reasons why or why not to go to graduate school on his &lt;a href=&quot;http://matt-welsh.blogspot.com/2010/09/so-you-want-to-go-to-grad-school.html&quot;&gt;blog&lt;/a&gt;.&lt;/div&gt;
</content>
 </entry>
 
 <entry>
   <title>Tasting My Proverbial Academic Foot</title>
   <link href="http://bailis.org/blog//tasting-my-proverbial-academic-foot/"/>
   <updated>2011-12-01T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//tasting-my-proverbial-academic-foot</id>
   <content type="html">&lt;b&gt;tl;dr: Researchers I really respect read a rambling, ranting, accidentally-public review of their paper I wrote for class and then blogged about it.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Apparently private-but-really-public memos are&amp;nbsp;&lt;a href=&quot;http://eng.yammer.com/blog/2011/11/30/scala-at-yammer.html&quot;&gt;all the rage&lt;/a&gt;&amp;nbsp;this week, so I shouldn&#39;t have been surprised when&amp;nbsp;I received an interesting&amp;nbsp;&lt;a href=&quot;https://twitter.com/#!/michaelfreedman/status/142128313454952448&quot;&gt;tweet&lt;/a&gt;&amp;nbsp;from&amp;nbsp;&lt;a href=&quot;http://www.cs.princeton.edu/~mfreed/&quot;&gt;Mike Freedman&lt;/a&gt;&amp;nbsp;last night. &amp;nbsp;Mike informed me that&amp;nbsp;&lt;a href=&quot;http://www.cs.cmu.edu/~dga/&quot;&gt;Dave Andersen&lt;/a&gt;&amp;nbsp;had posted a &lt;a href=&quot;http://dga.livejournal.com/45496.html&quot;&gt;reply&lt;/a&gt;&amp;nbsp;to a paper review I&#39;d recently posted on a private blog as homework for &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/&quot;&gt;Ion Stoica&lt;/a&gt;&#39;s &lt;a href=&quot;http://www.cs.berkeley.edu/~istoica/classes/cs294/11/&quot;&gt;cloud computing seminar&lt;/a&gt;&amp;nbsp;I&#39;m enrolled in at Berkeley. &amp;nbsp;This was surprising in part because I wasn&#39;t aware of any public links to my personal review blog (having disabled Google indexing and public references) except for the link apparently buried on the course site. Apparently, my roommate mentioned our class to Mike, his old advisor, and I&#39;d recently de-anonymized myself when tweaking my Blogger profile for this blog. However, the tweet was even more surprising because Dave and Mike were co-authors on the&amp;nbsp;&lt;a href=&quot;http://www-cgi.cs.cmu.edu/~dga/papers/cops-sosp2011.pdf&quot;&gt;paper&lt;/a&gt;&amp;nbsp;that Dave blogged about, which I had reviewed unnecessarily harshly. 
&amp;nbsp;My review of this paper (describing a system called COPS that provides a variant of causal consistency called causal+ consistency) was definitely not intended for public consumption--and certainly not by the authors, who I genuinely greatly respect.&lt;br /&gt;&lt;br /&gt;To comment directly on Dave&#39;s post, the substantive critical portions of my review stemmed from what I incorrectly perceived as a claimed major contribution of the paper: the development of causal+ consistency, which has existed in various flavors in related work but wasn&#39;t previously formally defined. &amp;nbsp;As Dave clarifies in his post, defining causal+ is a&amp;nbsp;contribution, but the paper&#39;s main contribution is the design of a causal+ system, including technical details like garbage collection and making it run fast. &amp;nbsp;This is an interesting distinction that I&#39;m slowly learning about between the systems community, with which I am more familiar, and the database community. &amp;nbsp;To paraphrase&amp;nbsp;&lt;a href=&quot;http://www.cs.berkeley.edu/~brewer/&quot;&gt;Eric Brewer&lt;/a&gt;&amp;nbsp;in&amp;nbsp;&lt;a href=&quot;http://www.cs.berkeley.edu/~brewer/cs262/&quot;&gt;CS262 lecture&lt;/a&gt;, traditionally, the database community focuses on providing useful invariants, then building systems that obey those invariants. &amp;nbsp;On the other hand, the OS/systems community builds from the ground up, providing useful abstraction and modularity along the way. &amp;nbsp;Brewer and&amp;nbsp;&lt;a href=&quot;http://db.cs.berkeley.edu/jmh/&quot;&gt;Joe Hellerstein&lt;/a&gt;&amp;nbsp;make a rough&amp;nbsp;&lt;a href=&quot;http://www.cs.berkeley.edu/~brewer/cs262/systemr.html&quot;&gt;categorization&lt;/a&gt;&amp;nbsp;between OS/systems research (&quot;elegance of system&quot;) and database research (&quot;elegance of semantics&quot;) styles along these lines. 
&amp;nbsp;COPS is a huge win in terms of system elegance compared to prior work--it screams, it&#39;s scalable, and it&#39;s practical--even if the semantics which derailed my brief review have been previously explored (if not formally stated). &amp;nbsp;As I determine a field for my dissertation, it&#39;s important that I continue aligning and calibrating my rapidly-evolving tastes with the values of the research community I will join.&lt;br /&gt;&lt;br /&gt;Dave quotes me mercifully in his post, but I had some unnecessarily strong words in my &quot;(surprisingly angry)&quot; review that I wouldn&#39;t stand by publicly or even privately. &amp;nbsp;In my (would-be) private class reviews and discussions--often written quickly, in the eleventh hour--I&#39;m typically pretty harsh and at times deliberately inflammatory. This helps me get feedback from my advisors and peers about the uncertainties and opinions I have about research in addition to encouraging debate--in private. I&#39;m an admirer of the philosophy that one should put a stake firmly in the ground and see what&#39;s there, but be open and ready to move the stake just as easily. This is one of the great parts of academia: ideas are currency, yet judgments are rarely irreversible, so making judgments about ideas and rethinking them is both fun &lt;i&gt;and&lt;/i&gt;&amp;nbsp;encouraged! &amp;nbsp;In the context of a write-once-read-never paper review between me and my instructor, I lean towards making over-the-top statements and often voice my doubts about our field and what to expect from my own research. Additionally, in full disclosure, the fact that&amp;nbsp;I&#39;m working on similar data consistency issues to their paper probably didn&#39;t help this particular review (although, to be fair, we&#39;re looking at how far we can push existing models--but more on that soon). 
&amp;nbsp;I&#39;m not the most socially adept person I know, but I definitely wouldn&#39;t have knowingly publicly published this review. &amp;nbsp;You can imagine my horror&amp;nbsp;when I learned the COPS team had read it.&lt;br /&gt;&lt;br /&gt;Public content on the internet is serious business. &amp;nbsp;There are too many examples of public content being posted that really shouldn&#39;t be posted (even if it&#39;s a conscious decision: in CS,&amp;nbsp;&lt;a href=&quot;http://infoweekly.blogspot.com/2009/09/loser-awards.html&quot;&gt;Mihai Patrascu&#39;s job offer debacle&lt;/a&gt; comes shudderingly to mind). &amp;nbsp;In the internet era, the &quot;newspaper rule&quot; of not doing anything you wouldn&#39;t want on the front page of the New York Times applies to all content you post; a hyperlink can spread virally. &amp;nbsp;Moreover, the internet (more or less) lasts forever, so there are no easy &quot;undo&quot; or &quot;delete&quot; operations (unless we eventually have some combination of ubiquitous&amp;nbsp;&lt;a href=&quot;http://www.cs.cornell.edu/~djwill/pubs/nexus-sosp.pdf&quot;&gt;attestations/TPMs&lt;/a&gt;, &lt;a href=&quot;http://people.csail.mit.edu/nickolai/papers/zeldovich-dstar.pdf&quot;&gt;DIFC&lt;/a&gt;, and &lt;a href=&quot;http://www.stanford.edu/class/cs240/readings/89-leases.pdf&quot;&gt;leases&lt;/a&gt;).&amp;nbsp;&amp;nbsp;I personally make a concerted effort to control what content I release publicly, and this error is not one I&#39;ll forget. I&#39;m well aware that&amp;nbsp;&lt;a href=&quot;http://www.schneier.com/essay-056.html&quot;&gt;security through obscurity isn&#39;t security&lt;/a&gt;, but this experience truly brings the lesson home.&lt;br /&gt;&lt;br /&gt;Ironically, this semester&#39;s classes ended yesterday, and I was planning to shut down the blog shortly thereafter. 
&amp;nbsp;After this incident,&amp;nbsp;I considered keeping the post publicly readable, but Dave has picked out the salient points and re-posted what little is worth reposting (in addition to answering my questions!). &amp;nbsp;I&#39;m a big proponent of publicly baring weaknesses in one&#39;s own research and have recently thought hard about posting the blind paper reviews that my work receives and independently providing lists of deficiencies for each publication I write. &amp;nbsp;However, because this review wasn&#39;t intended to be public (and, of course, for the sake of my academic career), I&#39;ve decided to take it down. (Dave has graciously agreed to remove the copy of my post that he uploaded after I removed my blog.)&lt;br /&gt;&lt;br /&gt;After a panicked late-night phone call with my advisor, a powwow with two wonderful, anonymous late-night inhabitants of Soda Hall, and some sleep, I still feel embarrassed, but I think I&#39;ve learned a few good lessons here. &amp;nbsp;I trust that Dave, Mike, and the COPS team won&#39;t hold this against me in five years if I make a trip around the academic job circuit. &amp;nbsp;And, secretly, I look forward to the day when naïve first-year Ph.D. students write ridiculous commentaries on my work, especially if they&#39;re as bombastic as I was in my review.
</content>
 </entry>
 
 <entry>
   <title>NoseSQL and SenseDB: New Paradigms for Crowdsourced Databases</title>
   <link href="http://bailis.org/blog//nosesql-and-sensedb-new-paradigms-for-crowdsourced-databases/"/>
   <updated>2011-11-11T00:00:00-08:00</updated>
   <id>http://bailis.org/blog//nosesql-and-sensedb-new-paradigms-for-crowdsourced-databases</id>
   <content type="html">Introductory note: Given the high &lt;a href=&quot;http://business.financialpost.com/2011/11/09/the-last-frontier-for-computers-scent/&quot;&gt;risk&lt;/a&gt; of being &lt;a href=&quot;https://twitter.com/#!/marcua/status/134837175580758017&quot;&gt;scooped&lt;/a&gt;, I&#39;ve decided to unveil my vision for the future of human computer interaction and crowdsourced databases. &amp;nbsp;In light of the impending explosion of research in this area, I have deviated from my plans to submit to a proper database conference and have instead chosen to publicly lay claim to these ideas in this post. &amp;nbsp;Because you&#39;ll wonder, I am serious about these ideas and believe there are interesting problems in this space. &amp;nbsp;I&#39;d even entertain proposals for collaboration.&lt;br /&gt;&lt;br /&gt;(Edit: To be clear, I&#39;m kidding about getting scooped. &amp;nbsp;This isn&#39;t my day-to-day&amp;nbsp;&lt;a href=&quot;http://cs.berkeley.edu/~pbailis/research.html&quot;&gt;research&lt;/a&gt;, but I do like the idea.)&lt;br /&gt;&lt;br /&gt;Current crowdsourced databases are incapable of answering three major classes of queries. &amp;nbsp;Database systems such as&amp;nbsp;&lt;a href=&quot;http://www.eecs.berkeley.edu/~rxin/papers/crowddb_sigmod2011.pdf&quot;&gt;CrowdDB&lt;/a&gt; and &lt;a href=&quot;http://db.csail.mit.edu/pubs/mturk.pdf&quot;&gt;Qurk&lt;/a&gt;&amp;nbsp;leverage human-powered computation to answer queries that computers cannot typically answer, such as performing complex image classification or processing uncertain or underspecified queries. &amp;nbsp;These systems are generic but have thus far focused on processing known&amp;nbsp;&lt;i&gt;information&lt;/i&gt;&amp;nbsp;about entities in the outside world. &amp;nbsp;However, to the best of my knowledge, crowdsourced databases have overlooked a large part of the human experience: our &lt;i&gt;senses&lt;/i&gt;. 
&amp;nbsp;In the remainder of this post, I will outline crowdsourcing extensions that represent an improvement over existing databases: the ability to query over scents, tastes, and tactile sensations.&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.nist.gov/public_affairs/techbeat/tb2008_1028.htm#nose&quot;&gt;Olfaction&lt;/a&gt;, &lt;a href=&quot;http://www.digikey.com/us/en/techzone/sensors/resources/articles/The-Five-Senses-of-SensorsTaste.html&quot;&gt;taste&lt;/a&gt;, and &lt;a href=&quot;http://persci.mit.edu/pub_pdfs/retrographic_sensor.pdf&quot;&gt;touch/texture &lt;/a&gt;sensors are immature and are relatively specialized. Computers cannot reliably answer a wide range of pressing questions about raw sensory input and our interactions with the physical world. &amp;nbsp;Electronic sensors can detect specialized inputs, such as chemical presence (e.g., &lt;a href=&quot;http://www.aps.org/about/governance/task-force/counter-terrorism/coffey.cfm&quot;&gt;explosives&lt;/a&gt;) and some flavors (e.g., selected features of&amp;nbsp;&lt;a href=&quot;http://www.sciencedirect.com/science/article/pii/S0925400500003786&quot;&gt;wine&lt;/a&gt;) but, to the best of my knowledge, are not generally applicable or widely available.&lt;br /&gt;&lt;br /&gt;Online databases can answer questions about particular sensing domains such as &lt;a href=&quot;http://beeradvocate.com/&quot;&gt;beer&lt;/a&gt;&amp;nbsp;and &lt;a href=&quot;http://www.dishola.com/&quot;&gt;food&lt;/a&gt;&amp;nbsp;tasting and &lt;a href=&quot;http://en.wikipedia.org/wiki/Pandora_Radio&quot;&gt;musical preferences&lt;/a&gt;. &amp;nbsp;These databases contain knowledge of high-level, narrowly-constrained semantic interpretations of the raw sensory data. 
&amp;nbsp;A beer rating is a condensation of multiple factors, many of which are reflections on the beer&#39;s taste, nose, and mouthfeel--but the raw taste, scent, and mouthfeel data is not available.&lt;br /&gt;&lt;br /&gt;Operating on raw data allows greater query expressivity and insight than operating on&amp;nbsp;&lt;a href=&quot;http://digitalmedia.oreilly.com/2006/08/17/inside-pandora-web-radio.html&quot;&gt;a set of features describing the data&lt;/a&gt;. We can view preferences regarding senses as functions over the set of raw stimuli in the world. &amp;nbsp;Without sensory data, it is difficult to infer connections between sensations, such as why we like the taste of&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Peanut_butter,_banana_and_bacon_sandwich&quot;&gt;peanut butter, banana, and bacon in sandwiches&lt;/a&gt;, the smell of&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Good_%26_Plenty&quot;&gt;cucumber and Good &amp;amp; Plentys&lt;/a&gt;, and the seemingly culturally universal combination of heat and steam in a&amp;nbsp;&lt;a href=&quot;http://en.wikipedia.org/wiki/Sauna&quot;&gt;sweat lodge&lt;/a&gt;. &amp;nbsp;We cannot easily make connections between even somewhat similar domains. &amp;nbsp;For example, answering questions about beer and recipe pairings requires either additional cross-domain knowledge (a database of explicit beer and recipe pairings)&amp;nbsp;&lt;i&gt;or&lt;/i&gt;&amp;nbsp;lower-level sensory data (what flavors are in each beer?) paired with filters on this data. &amp;nbsp;These solutions appear similar; however, the latter scales to more domains without requiring additional external expert input.&lt;br /&gt;&lt;br /&gt;While computers are deficient at answering sense-based queries, thankfully (and by definition), most humans come complete with detectors for all five of our senses. 
&amp;nbsp;Employing humans to power&amp;nbsp;&lt;i&gt;general-purpose&lt;/i&gt; sensory databases is a natural extension of crowdsourcing technology. &amp;nbsp;Compared to a specialized mechanical solution such as a chemical-specific detector, a human crowd is more general and likely less expensive than highly-specialized equipment when answering a wide range of queries. &amp;nbsp;Similarly, humans can be used for both lower-level sensory analysis and broader semantic-level comparisons than narrowly scoped online information aggregation sites. &amp;nbsp;Accordingly, I propose the development of a crowdsourced sense-oriented database,&amp;nbsp;&lt;b&gt;SenseDB&lt;/b&gt;. &amp;nbsp;This database no doubt needs a query language for user-defined functions, which will consist of embedded DSLs for scent, taste, and touch queries, or&amp;nbsp;&lt;b&gt;NoseSQL&lt;/b&gt;,&amp;nbsp;&lt;b&gt;FlavorSQL&lt;/b&gt;, and&amp;nbsp;&lt;b&gt;FeelSQL&lt;/b&gt;, respectively.&lt;br /&gt;&lt;br /&gt;Harnessing the power of human-powered sense-based query processing leads to several research questions:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Raw versus semantically-rich data.&lt;/b&gt;&amp;nbsp;To what extent does encoding raw sensory data aid in query processing? &amp;nbsp;Does querying a (logical) database of taste, scent, and touch details provide higher accuracy, speed, or throughput than simply presenting the question to a crowd from a high semantic perspective? &amp;nbsp;Can we better re-use raw data between queries? Do semantically rich queries impact the bias of the results?&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Encoding&lt;/b&gt;.&amp;nbsp;How do we encode the sensory details required to answer a query? 
Which aspects of the sensory experience are required to answer a query?&amp;nbsp;The degree of specificity in formulating a particular query limits the applicability of the results for future queries and analysis.&amp;nbsp;There are &lt;a href=&quot;http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_ics_browse.htm?ICS1=67&amp;amp;ICS2=240&quot;&gt;many published ISO standards&lt;/a&gt; governing sensory analysis (including which &lt;a href=&quot;http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?ics1=67&amp;amp;ics2=240&amp;amp;ics3=&amp;amp;csnumber=38031&quot;&gt;tasting glasses&lt;/a&gt; to use with olive oil), but applying these standards to a general-purpose crowdsourced query processing system remains an open problem.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;b&gt;Transmission.&amp;nbsp;&lt;/b&gt;Sights and sounds can be easily recorded and transmitted for processing, but we lack mechanisms for reliably communicating touch, taste, and smell stimuli. &amp;nbsp;One option is to use a crowd that is physically co-located with the set of objects to be queried, but this does not scale in the size of the set of objects or in the number of queries.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;b&gt;Non-human processing.&lt;/b&gt;&amp;nbsp;Are humans the most efficient computation engine for sensory queries? &amp;nbsp;Can we humanely use canines or other macrobiotic organisms to process these queries instead? &amp;nbsp;How does the throughput of a non-human compute engine compare to a human compute engine? &amp;nbsp;What about queries per second, queries per dollar, or total cost of ownership? 
Both &lt;a href=&quot;http://www.deanfriedman.com/zine-state/zine_stateoftheuniverse-02.html&quot;&gt;rats&lt;/a&gt; and &lt;a href=&quot;http://maic.jmu.edu/journal/7.3/focus/townsend2/townsend2.htm&quot;&gt;pigs&lt;/a&gt; have been successfully employed in demining scenarios; however, the generality of these mechanisms is unclear.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;These challenges are only a subset of the problems inherent in developing a sense-oriented query database. &amp;nbsp;However, given the apparent advantages of SenseDB, I believe the database community will rise to the occasion and take crowdsourcing to the next level, providing valuable insights into the human condition along the way.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;I would like to thank Joe Hellerstein, Mike Franklin, the BOOM team, and those explicitly not mentioned here for their feedback on these ideas.&lt;/i&gt;&lt;/div&gt;
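&lt;div&gt;&lt;br /&gt;&lt;i&gt;Addendum: for the curious, here is a purely illustrative sketch of what a NoseSQL query against SenseDB might look like. &amp;nbsp;The syntax, predicates, and schema below are imagined for the sake of example--none of this is implemented:&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;-- Hypothetical NoseSQL: find beers whose nose the crowd judges
-- to pair with a given recipe. SMELLS LIKE and PAIRS WITH are
-- imagined crowd-powered predicates; beers and recipes are an
-- imagined schema.
SELECT b.name
FROM beers b
WHERE CROWDSENSE(b.scent) SMELLS LIKE &#39;citrus&#39;
  AND CROWDSENSE(b.scent) PAIRS WITH
      (SELECT scent FROM recipes WHERE name = &#39;ceviche&#39;)
ORDER BY CROWD_CONFIDENCE DESC
LIMIT 10;&lt;/pre&gt;&lt;/div&gt;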
</content>
 </entry>
 
 
</feed>
