<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
	<title>Timilearning - A blog by Timi Adeniran</title>
	<subtitle>Writing about computer science topics that I am curious about, ranging from distributed systems to (eventually) bioinformatics.</subtitle>
	<link href="https://timilearning.com/feed/feed.xml" rel="self"/>
	<link href="https://timilearning.com"/>
	<updated>2022-07-26T20:31:30-00:00</updated>
	<id>https://timilearning.com/</id>
	<author>
		<name>Timi Adeniran</name>
		<email>oluwatimilehinadeniran@gmail.com</email>
	</author>
	
	<entry>
		<title>A Library for Incremental Computing</title>
		<link href="https://timilearning.com/posts/incremental-computing/"/>
		<updated>2022-07-26T20:31:30-00:00</updated>
		<id>https://timilearning.com/posts/incremental-computing/</id>
		<content type="html">&lt;p&gt;Late last year, while working on my &lt;a href=&quot;https://timilearning.com/tags/c++17/&quot;&gt;C++ series&lt;/a&gt; and on the lookout for a project to build in C++, I came across &lt;a href=&quot;https://lord.io/spreadsheets/&quot;&gt;&amp;quot;How to Recalculate a Spreadsheet&amp;quot;&lt;/a&gt;, which inspired me to build &lt;a href=&quot;https://github.com/oluwatimilehin/anchors&quot;&gt;Anchors&lt;/a&gt; — a C++ library for incremental computing.&lt;/p&gt;
&lt;p&gt;I highly recommend reading the post on &lt;a href=&quot;http://lord.io/&quot;&gt;lord.io&lt;/a&gt; if you want to learn about incremental computing, and perhaps return here if you&#39;re interested in some implementation details.&lt;/p&gt;
&lt;p&gt;But if you have just thought to yourself, &amp;quot;I&#39;m not sure I want to read more than one article on this topic, so I&#39;d rather not open a new tab&amp;quot;, I will also summarize incremental computing before getting into the implementation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#incremental-computing&quot;&gt;Incremental Computing&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#modelling-computation-as-a-graph&quot;&gt;Modelling computation as a graph&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#dirty-marking&quot;&gt;Dirty Marking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#topological-sorting&quot;&gt;Topological Sorting&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#demand-driven-incremental-computing&quot;&gt;Demand-driven Incremental Computing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#anchors&quot;&gt;Anchors&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#the-anchor-class&quot;&gt;The Anchor class&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-engine-class&quot;&gt;The Engine class&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#observing-an-anchor&quot;&gt;Observing an Anchor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#setting-an-anchor%27s-value&quot;&gt;Setting an Anchor&#39;s value&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#reading-an-anchor%27s-value&quot;&gt;Reading an Anchor&#39;s value&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#closing-thoughts&quot;&gt;Closing Thoughts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;incremental-computing&quot;&gt;Incremental Computing &lt;a class=&quot;direct-link&quot; href=&quot;#incremental-computing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Imagine we find ourselves working on a program to calculate the answer to something as difficult as &lt;a href=&quot;https://www.scientificamerican.com/article/for-math-fans-a-hitchhikers-guide-to-the-number-42/&quot;&gt;“the ultimate question of life, the universe, and everything”&lt;/a&gt;. Typically, a question so grand will require a complex formula to solve. But in this alternate universe, we have been told that we can solve this by simply plugging in values for &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; into the formula: &lt;code&gt;z = (x + y) * 42&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We sigh with relief as this problem is now simpler. But our relief is short-lived: we are told that the values of &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; each take up to a day to compute, because their formulas are so complex, and our program will have to wait.&lt;/p&gt;
&lt;p&gt;Still, we&#39;re happy to wait and eventually, we receive our values, plug them into the formula, and go out to celebrate our new discovery.&lt;/p&gt;
&lt;p&gt;Midway through our celebrations, we get interrupted and are told that the universe has received an update to &lt;code&gt;x&lt;/code&gt;&#39;s formula, so we would need to recompute it, as well as the value of &lt;code&gt;z&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since only the value of &lt;code&gt;x&lt;/code&gt; has changed and &lt;code&gt;y&lt;/code&gt; hasn&#39;t, we know we can reuse the previously computed value of &lt;code&gt;y&lt;/code&gt; and simply plug the new &lt;code&gt;x&lt;/code&gt; value into our formula to get the result.&lt;/p&gt;
&lt;p&gt;But while it may seem obvious to us that there&#39;s no need to recompute &lt;code&gt;y&lt;/code&gt; because its formula has not changed, &lt;a href=&quot;https://aresluna.org/attached/computerhistory/articles/spreadsheets/tenyearsofrowsandcolumns&quot;&gt;early models of computing were not so smart&lt;/a&gt;. They were more likely to recompute the values of all of &lt;code&gt;z&lt;/code&gt;&#39;s inputs before recomputing &lt;code&gt;z&lt;/code&gt;, even if an input had received no updates.&lt;/p&gt;
&lt;p&gt;This mental model we have, where we know to only recompute what depends on either a changed formula (&lt;code&gt;x&lt;/code&gt; in our example) or a changed input (&lt;code&gt;z&lt;/code&gt; after &lt;code&gt;x&lt;/code&gt; changed), is the concept of incremental computing. From Wikipedia,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Incremental computing is a software feature which, whenever a piece of data changes, attempts to save time by only recomputing those outputs which depend on the changed data.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&#39;ve used a trivial (and far-fetched) example, but this computing model generalises to different applications, ranging from &lt;a href=&quot;https://lord.io/spreadsheets/&quot;&gt;spreadsheets&lt;/a&gt; to &lt;a href=&quot;https://github.com/janushendersonassetallocation/loman/blob/master/examples/Example%20-%20Using%20Loman%20to%20Value%20a%20Portfolio.ipynb&quot;&gt;complex financial applications&lt;/a&gt; to &lt;a href=&quot;https://www.youtube.com/watch?v=DSuX-LIAU-I&quot;&gt;rendering GUIs&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;modelling-computation-as-a-graph&quot;&gt;Modelling computation as a graph &lt;a class=&quot;direct-link&quot; href=&quot;#modelling-computation-as-a-graph&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We can achieve incremental computing by modelling data as a directed acyclic graph, where each data element is a node in the graph, and there is an edge from a node A to node B if A is an input to B. That is, node B depends on node A.&lt;/p&gt;
&lt;p&gt;We can represent our alternate universe example as the graph:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/universe.png&quot; alt=&quot;Graph with three nodes: x, y, and z. x and y have edges to z.&quot;&gt; &lt;/p&gt;
&lt;p&gt;Now, let&#39;s use a less trivial example with a larger graph. Recall the quadratic formula for solving equations of the form ax&lt;sup&gt;2&lt;/sup&gt; + bx + c = 0:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/quadratic-formula-200.png&quot; alt=&quot;Quadratic formula&quot;&gt; &lt;/p&gt;
&lt;p&gt;We can represent this as a computation graph with three initial inputs, &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, and &lt;code&gt;c&lt;/code&gt;:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/quadratic-graph-fill-2.png&quot; alt=&quot;Computation graph representing the quadratic formula. &quot;&gt; &lt;/p&gt;
&lt;p&gt;In this graph, there is an outgoing edge from a node if the node is an input to another node.&lt;/p&gt;
&lt;p&gt;If any of the inputs, &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, or &lt;code&gt;c&lt;/code&gt; changes, we would need to recalculate the outputs, &lt;code&gt;x1&lt;/code&gt; and &lt;code&gt;x2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;But if only &lt;code&gt;b&lt;/code&gt; changes, we should recalculate only the nodes that depend on &lt;code&gt;b&lt;/code&gt; either directly or indirectly, and we should calculate them in the right order: &lt;code&gt;b^2&lt;/code&gt; before &lt;code&gt;sq&lt;/code&gt;, &lt;code&gt;sq&lt;/code&gt; before &lt;code&gt;-b + sq&lt;/code&gt;, and &lt;code&gt;-b + sq&lt;/code&gt; before &lt;code&gt;x1&lt;/code&gt;, for example.&lt;/p&gt;
&lt;p&gt;We do not need to recalculate nodes &lt;code&gt;4ac&lt;/code&gt; and &lt;code&gt;2a&lt;/code&gt;, which do not depend on &lt;code&gt;b&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;From the structure of the graph, we can answer two questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;What nodes should we update if an input changes?&lt;/li&gt;
&lt;li&gt;In what order should we update them?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;While the answers may seem obvious to us, these two questions are the major challenges in writing a program that performs incremental computations.&lt;/p&gt;
&lt;p&gt;Thankfully, some smart people have gone before us and offered two solutions to these questions: &lt;em&gt;dirty marking&lt;/em&gt; and &lt;em&gt;topological sorting&lt;/em&gt;, which I will describe next.&lt;/p&gt;
&lt;h4 id=&quot;dirty-marking&quot;&gt;Dirty Marking &lt;a class=&quot;direct-link&quot; href=&quot;#dirty-marking&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Dirty marking works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When a node&#39;s value or formula changes, the program marks all the nodes that depend on it (directly or indirectly) as &lt;em&gt;dirty&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In the quadratic formula example, when &lt;code&gt;b&lt;/code&gt; changes, the marking starts from &lt;code&gt;b&lt;/code&gt; and propagates down to the leaf nodes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To bring the graph up to date, in a loop, the program finds a dirty node that has no dirty inputs (or dependencies) and recomputes its value — making it &lt;em&gt;clean&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It continues the loop until there are no more dirty nodes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dirty marking answers our two questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;What nodes should the program update if an input changes?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Only those it has marked as dirty.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In what order should the program update them?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It should compute dirty nodes with no dirty inputs before computing dirty nodes with dirty inputs, which ensures that it computes dependencies before their dependants.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;
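&lt;p&gt;The steps above can be sketched in C++ as follows (the &lt;code&gt;Node&lt;/code&gt; type and function names here are hypothetical, purely for illustration):&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;#include &amp;lt;vector&gt;

// Hypothetical node in the computation graph.
struct Node {
    int  value = 0;
    bool dirty = false;
    std::vector&amp;lt;Node*&gt; inputs;
    std::vector&amp;lt;Node*&gt; dependants;
    int (*formula)(const std::vector&amp;lt;Node*&gt;&amp;amp;) = nullptr;
};

// Mark every node that depends on the changed node, directly or
// indirectly, as dirty.
void markDirty(Node* changed) {
    for (Node* dependant : changed-&gt;dependants) {
        if (!dependant-&gt;dirty) {
            dependant-&gt;dirty = true;
            markDirty(dependant);
        }
    }
}

// In a loop, find a dirty node with no dirty inputs, recompute it, and
// mark it clean; stop when no dirty nodes remain.
void stabilize(std::vector&amp;lt;Node*&gt;&amp;amp; nodes) {
    bool recomputedAny = true;
    while (recomputedAny) {
        recomputedAny = false;
        for (Node* node : nodes) {
            if (!node-&gt;dirty) {
                continue;
            }
            bool hasDirtyInput = false;
            for (Node* input : node-&gt;inputs) {
                hasDirtyInput = hasDirtyInput || input-&gt;dirty;
            }
            if (!hasDirtyInput) {
                node-&gt;value = node-&gt;formula(node-&gt;inputs);
                node-&gt;dirty = false;
                recomputedAny = true;
            }
        }
    }
}&lt;/code&gt;&lt;/pre&gt;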
&lt;h4 id=&quot;topological-sorting&quot;&gt;Topological Sorting &lt;a class=&quot;direct-link&quot; href=&quot;#topological-sorting&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;With topological sorting, the program gives each node a &lt;em&gt;height&lt;/em&gt; and uses this height, together with a min-heap, to answer our questions. It works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The program gives a node with no inputs a height of 0.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If a node has inputs, its height is max(height of inputs) + 1, which guarantees that a node always has a greater height than any of its inputs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When a node&#39;s value or formula changes, the program adds the node to the minimum heap.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To bring the graph up to date, the program removes the node with the smallest height from the heap and recomputes it.  If the node&#39;s value has changed after recomputing, it adds the node&#39;s &lt;em&gt;dependants&lt;/em&gt; to the heap.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It continues with the previous step until the heap is empty.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Topological sorting answers our questions in the following ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;What nodes should the program update if an input changes?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Only those it has added to the heap.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In what order should the program update them?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It should recompute nodes with a smaller height before those with a larger height, ensuring that it recomputes a node&#39;s inputs before the node itself.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;
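&lt;p&gt;A sketch of this loop in C++, using a hypothetical &lt;code&gt;Node&lt;/code&gt; type and &lt;code&gt;std::priority_queue&lt;/code&gt; with a reversed comparator as the min-heap:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;#include &amp;lt;queue&gt;
#include &amp;lt;vector&gt;

// Hypothetical node with a height: 0 if it has no inputs,
// max(height of inputs) + 1 otherwise.
struct Node {
    int value  = 0;
    int height = 0;
    std::vector&amp;lt;Node*&gt; inputs;
    std::vector&amp;lt;Node*&gt; dependants;
    int (*formula)(const std::vector&amp;lt;Node*&gt;&amp;amp;) = nullptr;
};

// Comparator that puts the node with the smallest height on top.
struct GreaterHeight {
    bool operator()(const Node* a, const Node* b) const {
        return a-&gt;height &gt; b-&gt;height;
    }
};

using RecomputeHeap =
    std::priority_queue&amp;lt;Node*, std::vector&amp;lt;Node*&gt;, GreaterHeight&gt;;

// Repeatedly recompute the node with the smallest height; if its value
// changed, add its dependants to the heap. Stop when the heap is empty.
void stabilize(RecomputeHeap&amp;amp; heap) {
    while (!heap.empty()) {
        Node* node = heap.top();
        heap.pop();
        int oldValue = node-&gt;value;
        node-&gt;value  = node-&gt;formula(node-&gt;inputs);
        if (node-&gt;value != oldValue) {
            for (Node* dependant : node-&gt;dependants) {
                heap.push(dependant);
            }
        }
    }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A real implementation would also avoid pushing duplicate nodes onto the heap, which is what the recompute set described later in this post is for.&lt;/p&gt;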
&lt;h3 id=&quot;demand-driven-incremental-computing&quot;&gt;Demand-driven Incremental Computing &lt;a class=&quot;direct-link&quot; href=&quot;#demand-driven-incremental-computing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;So far, I have described a form of incremental computing in which a program recomputes the values of all affected nodes in the graph to bring them up to date when an input changes.&lt;/p&gt;
&lt;p&gt;Let&#39;s say we have a scenario where a node &lt;code&gt;A&lt;/code&gt; is an input to many nodes, but we are only interested in the value of one of these nodes, &lt;code&gt;B&lt;/code&gt;, at a time. In the model I have described, when node &lt;code&gt;A&lt;/code&gt; changes, the program will recompute node &lt;code&gt;B&lt;/code&gt; and any other nodes that depend on &lt;code&gt;A&lt;/code&gt;, even though we don&#39;t care about them.&lt;/p&gt;
&lt;p&gt;In a graph with complex formulas, performing these unnecessary computations could be costly.&lt;/p&gt;
&lt;p&gt;Ideally, we would want to specify that a program should only perform computations that are necessary for node(s) we are interested in. We want computations to be &lt;em&gt;demand-driven&lt;/em&gt; or &lt;em&gt;lazy.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This requirement has led to the creation of libraries for demand-driven incremental computing, such as &lt;a href=&quot;https://docs.rs/adapton/latest/adapton/&quot;&gt;Adapton&lt;/a&gt;, which uses dirty marking, and &lt;a href=&quot;https://opensource.janestreet.com/incremental/&quot;&gt;Incremental&lt;/a&gt;, which uses topological sorting.&lt;/p&gt;
&lt;p&gt;They support lazy computations by allowing clients to &lt;em&gt;observe&lt;/em&gt; nodes they&#39;re interested in. When an input changes and a client wants to bring observed nodes up to date, the programs only recompute nodes that affect the observed nodes.&lt;/p&gt;
&lt;p&gt;The next section covers how you might build a program for demand-driven incremental computing.&lt;/p&gt;
&lt;h2 id=&quot;anchors&quot;&gt;Anchors &lt;a class=&quot;direct-link&quot; href=&quot;#anchors&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://oluwatimilehin.github.io/anchors/&quot;&gt;Anchors&lt;/a&gt; is a C++ library for demand-driven incremental computing, inspired by the &lt;a href=&quot;https://github.com/lord/anchors&quot;&gt;Rust library&lt;/a&gt; of the same name—yes, I got permission before using the name.&lt;/p&gt;
&lt;p&gt;The Rust version implements a hybrid algorithm, described &lt;a href=&quot;https://lord.io/spreadsheets/&quot;&gt;here&lt;/a&gt;, that combines dirty marking and topological sorting, but the C++ version currently implements only topological sorting. Implementing the hybrid algorithm is on the roadmap.&lt;/p&gt;
&lt;p&gt;The rest of this section will cover the C++ implementation, which is based on Jane Street&#39;s Incremental library.&lt;/p&gt;
&lt;p&gt;Two classes make up the core of Anchors: an &lt;code&gt;Anchor&lt;/code&gt; class, which represents a node in the graph, and an &lt;code&gt;Engine&lt;/code&gt;, which handles the logic. The program below shows their basic usage by performing a simple addition:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// First create an Engine object&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;Engine d_engine&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// Create two input Anchors A and B&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;anchorA&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Anchors&lt;/span&gt;&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;anchorB&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Anchors&lt;/span&gt;&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// Create the function to map from the inputs to the output&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;auto&lt;/span&gt; sum &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; a&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; a &lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// Create an Anchor C, using A and B as inputs, &lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// whose value is the sum of the input Anchors&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;token 
function&quot;&gt;anchorC&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Anchors&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;token generic-function&quot;&gt;&lt;span class=&quot;token function&quot;&gt;map2&lt;/span&gt;&lt;span class=&quot;token generic class-name&quot;&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;anchorA&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; anchorB&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; sum&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// Observe Anchor C and verify that its value is correct&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;d_engine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;observe&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;anchorC&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token function&quot;&gt;EXPECT_EQ&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;d_engine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;(&lt;/span&gt;anchorC&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token comment&quot;&gt;// Update one of the inputs, A, and verify that Anchor C is kept up to date.&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;d_engine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;anchorA&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token function&quot;&gt;EXPECT_EQ&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;d_engine&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;anchorC&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;the-anchor-class&quot;&gt;The Anchor class &lt;a class=&quot;direct-link&quot; href=&quot;#the-anchor-class&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;An &lt;code&gt;Anchor&lt;/code&gt; represents a node in the computation graph. As shown in the example above, you can create an &lt;code&gt;Anchor&lt;/code&gt; with a value:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;auto&lt;/span&gt; anchorA &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Anchors&lt;/span&gt;&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or from one or more input &lt;code&gt;Anchor&lt;/code&gt; objects and a function that maps the inputs to an output value:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;auto&lt;/span&gt; anchorC &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; Anchors&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;token generic-function&quot;&gt;&lt;span class=&quot;token function&quot;&gt;map2&lt;/span&gt;&lt;span class=&quot;token generic class-name&quot;&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;anchorA&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; anchorB&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; sum&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An &lt;code&gt;Anchor&lt;/code&gt;&#39;s  state includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;its current value.&lt;/li&gt;
&lt;li&gt;height — 0 if it has no inputs, max(height of inputs) + 1 otherwise.&lt;/li&gt;
&lt;li&gt;whether it is necessary. An &lt;code&gt;Anchor&lt;/code&gt; is necessary if a client marks it as observed, or if it is a dependency (direct or indirect) of an observed &lt;code&gt;Anchor&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;a recompute id, which indicates when the &lt;code&gt;Anchor&lt;/code&gt; was last brought up to date.&lt;/li&gt;
&lt;li&gt;a change id, representing when the &lt;code&gt;Anchor&lt;/code&gt;&#39;s value last changed. An &lt;code&gt;Anchor&lt;/code&gt; can be brought up to date without its value changing.&lt;/li&gt;
&lt;li&gt;its dependencies (inputs) and dependants, if any.&lt;/li&gt;
&lt;li&gt;an updater function to compute the &lt;code&gt;Anchor&lt;/code&gt;&#39;s value from its dependencies.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The class exposes getters and setters for these elements, as well as a &lt;code&gt;compute(id)&lt;/code&gt; function, to the &lt;code&gt;Engine&lt;/code&gt; class. The next section describes how the &lt;code&gt;Engine&lt;/code&gt; derives the &lt;code&gt;id&lt;/code&gt; it passes as the argument.&lt;/p&gt;
&lt;p&gt;When called, the &lt;code&gt;compute(id)&lt;/code&gt; function invokes the updater function, passing the &lt;code&gt;Anchor&lt;/code&gt;&#39;s dependencies as arguments to bring the &lt;code&gt;Anchor&lt;/code&gt;&#39;s value up to date. It also sets the recompute id to &lt;code&gt;id&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If the newly computed value differs from the old, the &lt;code&gt;compute(id)&lt;/code&gt; function sets the change id of the &lt;code&gt;Anchor&lt;/code&gt; to &lt;code&gt;id&lt;/code&gt;.&lt;/p&gt;
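&lt;p&gt;Ignoring the templates and the exact API, the behaviour of &lt;code&gt;compute(id)&lt;/code&gt; described above amounts to something like this sketch (the member names here are illustrative, not the real internals):&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;#include &amp;lt;vector&gt;

// Illustrative sketch only; the real class is templated on the value
// type and hides these members behind its API.
struct Anchor {
    int value       = 0;
    int recomputeId = 0;
    int changeId    = 0;
    std::vector&amp;lt;Anchor*&gt; dependencies;
    int (*updater)(const std::vector&amp;lt;Anchor*&gt;&amp;amp;) = nullptr;

    void compute(int id) {
        recomputeId  = id;  // brought up to date now
        int newValue = updater(dependencies);
        if (newValue != value) {
            value    = newValue;  // the value changed, so record
            changeId = id;        // the change id as well
        }
    }
};&lt;/code&gt;&lt;/pre&gt;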
&lt;h3 id=&quot;the-engine-class&quot;&gt;The Engine class &lt;a class=&quot;direct-link&quot; href=&quot;#the-engine-class&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;Engine&lt;/code&gt; is the brain of the Anchors library, through which clients interact with &lt;code&gt;Anchor&lt;/code&gt; objects. Its state includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;a recompute heap, containing the &lt;code&gt;Anchor&lt;/code&gt; objects that need to be recomputed, ordered by height.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;a recompute set, to prevent adding duplicates to the heap. An &lt;code&gt;Anchor&lt;/code&gt; could potentially keep track of whether it&#39;s in the recompute heap, but that&#39;s not a solution I&#39;ve explored yet.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;a set of observed nodes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;a monotonically increasing stabilization number, which indicates when the &lt;code&gt;Anchor&lt;/code&gt; objects were last brought up to date, or when an &lt;code&gt;Anchor&lt;/code&gt;&#39;s value last changed. The &lt;code&gt;Engine&lt;/code&gt; passes this number as the argument to the &lt;code&gt;compute(id)&lt;/code&gt; function.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Its API exposes functions to &lt;code&gt;observe()&lt;/code&gt; an &lt;code&gt;Anchor&lt;/code&gt;, as well as &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;set()&lt;/code&gt; an &lt;code&gt;Anchor&lt;/code&gt;&#39;s value.&lt;/p&gt;
&lt;h4 id=&quot;observing-an-anchor&quot;&gt;Observing an Anchor &lt;a class=&quot;direct-link&quot; href=&quot;#observing-an-anchor&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;observe(anchor)&lt;/code&gt; function does the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Add the &lt;code&gt;Anchor&lt;/code&gt; to the set of observed nodes and mark it as necessary.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the &lt;code&gt;Anchor&lt;/code&gt; is &lt;em&gt;stale&lt;/em&gt;, add it to the recompute heap. An &lt;code&gt;Anchor&lt;/code&gt; is stale if it has never been computed, or if its recompute id is less than the change id of one of its dependencies. That is, if the &lt;code&gt;Engine&lt;/code&gt; has not recomputed the &lt;code&gt;Anchor&lt;/code&gt; since any of its dependencies changed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Walk the graph to mark the necessary &lt;code&gt;Anchor&lt;/code&gt; nodes, including those that the &lt;code&gt;Engine&lt;/code&gt; should recompute. For each dependency of the given &lt;code&gt;Anchor&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add the given &lt;code&gt;Anchor&lt;/code&gt; as a dependant of the dependency.&lt;/li&gt;
&lt;li&gt;Mark the dependency as necessary.&lt;/li&gt;
&lt;li&gt;If the dependency is stale and not already in the recompute heap, add it to the recompute heap.&lt;/li&gt;
&lt;li&gt;Repeat step 3 using this dependency as the new &amp;quot;given&amp;quot; &lt;code&gt;Anchor&lt;/code&gt; until there are no more dependencies to process.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 id=&quot;setting-an-anchor&#39;s-value&quot;&gt;Setting an Anchor&#39;s value &lt;a class=&quot;direct-link&quot; href=&quot;#setting-an-anchor&#39;s-value&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;You can update an &lt;code&gt;Anchor&lt;/code&gt;&#39;s value using &lt;code&gt;set(anchor, newValue)&lt;/code&gt; on the &lt;code&gt;Engine&lt;/code&gt; class, which does the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Update the &lt;code&gt;Anchor&lt;/code&gt;&#39;s value using the &lt;code&gt;setValue()&lt;/code&gt; function in the &lt;code&gt;Anchor&lt;/code&gt; class. The &lt;code&gt;Anchor&lt;/code&gt; class exposes this function only to the &lt;code&gt;Engine&lt;/code&gt; class.&lt;/li&gt;
&lt;li&gt;If the &lt;code&gt;Anchor&lt;/code&gt;&#39;s value has changed, increase the stabilization number and set the &lt;code&gt;Anchor&lt;/code&gt;&#39;s change id to the new number.&lt;/li&gt;
&lt;li&gt;If the &lt;code&gt;Anchor&lt;/code&gt; is necessary, add all its dependants to the recompute heap.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 id=&quot;reading-an-anchor&#39;s-value&quot;&gt;Reading an Anchor&#39;s value &lt;a class=&quot;direct-link&quot; href=&quot;#reading-an-anchor&#39;s-value&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;Engine&lt;/code&gt; class exposes a &lt;code&gt;get(anchor)&lt;/code&gt; function to read an &lt;code&gt;Anchor&lt;/code&gt;&#39;s value. Anchors guarantees that reading an &lt;em&gt;observed&lt;/em&gt; &lt;code&gt;Anchor&lt;/code&gt; will return its most up-to-date value.&lt;/p&gt;
&lt;p&gt;It achieves this through a process called &lt;em&gt;stabilization,&lt;/em&gt; which involves the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Increase the current stabilization number.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Remove the &lt;code&gt;Anchor&lt;/code&gt; with the smallest height from the recompute heap.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the &lt;code&gt;Anchor&lt;/code&gt; is stale, recompute its value by calling &lt;code&gt;compute(id)&lt;/code&gt; on the &lt;code&gt;Anchor&lt;/code&gt;, passing the new stabilization number as the argument.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the value of the current &lt;code&gt;Anchor&lt;/code&gt; changed after recomputing, i.e. its change id is equal to the stabilization number, add its dependants to the recompute heap.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Repeat steps 2-4 until the recompute heap is empty.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When you call &lt;code&gt;get(anchor)&lt;/code&gt; on any observed &lt;code&gt;Anchor&lt;/code&gt;, the &lt;code&gt;Engine&lt;/code&gt; class will run the stabilization process provided the recompute heap is not empty.&lt;/p&gt;
&lt;p&gt;To summarize:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Anchors minimizes wasteful computations by only adding necessary nodes to the recompute heap.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;By computing nodes in increasing order of height, Anchors ensures that it brings a node&#39;s dependencies up to date before the node itself.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;closing-thoughts&quot;&gt;Closing Thoughts &lt;a class=&quot;direct-link&quot; href=&quot;#closing-thoughts&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The examples in this article are trivial compared to real-world use cases of incremental computing, where nodes typically have many more inputs and more complex formulas. For an example of using an incremental computing library to value a financial portfolio, see &lt;a href=&quot;https://github.com/janushendersonassetallocation/loman/blob/master/examples/Example%20-%20Using%20Loman%20to%20Value%20a%20Portfolio.ipynb&quot;&gt;this Loman notebook&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Anchors is still a work in progress, and I eventually intend to bring its algorithm closer to the hybrid one in the Rust library it is based on.&lt;/p&gt;
&lt;p&gt;The hardest part of building this so far (conveniently ignoring the time I spent deciphering some C++ error messages) has been getting started: figuring out the API I wanted for the library and outlining the implementation details. If you&#39;re thinking of building something similar, I hope this post makes it a little easier for you.&lt;/p&gt;
&lt;p&gt;Finally, part of my motivation for writing this post was to share the code and get feedback. So, if you go through the &lt;a href=&quot;https://github.com/oluwatimilehin/anchors&quot;&gt;code&lt;/a&gt; and have any suggestions you think I&#39;ll find interesting, please let me know either through the form below or by creating a pull request.&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://lord.io/spreadsheets/&quot;&gt;How to Recalculate a Spreadsheet&lt;/a&gt; on &lt;a href=&quot;http://lord.io/&quot;&gt;lord.io&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/lord/anchors&quot;&gt;lord/anchors&lt;/a&gt; - the Rust Anchors implementation, featuring a hybrid algorithm that combines dirty marking and topological sorting.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/janestreet/incremental&quot;&gt;janestreet/incremental&lt;/a&gt; - OCaml library for incremental computing which uses a topological sort.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.rs/adapton/0.3.31/adapton/&quot;&gt;adapton - Rust&lt;/a&gt; - Rust library for incremental computing using dirty marking.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/fsprojects/Incremental.NET&quot;&gt;fsprojects/Incremental.NET&lt;/a&gt; - F# library for incremental computing.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/oluwatimilehin/anchors&quot;&gt;Anchors: C++ library for Incremental Computing&lt;/a&gt; - Project repository containing (hopefully) well-documented code and more concrete code examples, including the &lt;a href=&quot;https://github.com/oluwatimilehin/anchors#a-quadratic-formula-calculator&quot;&gt;Quadratic Formula Calculator&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Many thanks to&lt;/em&gt; &lt;a href=&quot;https://uk.linkedin.com/in/oluwatobi-adeoye&quot;&gt;&lt;em&gt;Tobi Adeoye&lt;/em&gt;&lt;/a&gt; &lt;em&gt;and&lt;/em&gt; &lt;a href=&quot;https://www.linkedin.com/in/tofunmiogungbaigbe/?originalSubdomain=uk&quot;&gt;&lt;em&gt;Tofunmi Ogungbaigbe&lt;/em&gt;&lt;/a&gt; &lt;em&gt;for providing feedback on an earlier draft of this article.&lt;/em&gt;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Learning C++ from Java - Pointers and References</title>
		<link href="https://timilearning.com/posts/learning-cpp/pointers-and-references/"/>
		<updated>2021-12-23T15:53:49-00:00</updated>
		<id>https://timilearning.com/posts/learning-cpp/pointers-and-references/</id>
		<content type="html">&lt;p&gt;This is a continuation of the series on C++ topics I&#39;ve found interesting. You can read the earlier parts &lt;a href=&quot;https://timilearning.com/posts/learning-cpp/introduction&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://timilearning.com/posts/learning-cpp/header-files&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Before I started learning C++, I had read that Java is pass-by-value rather than pass-by-reference, but I still found it easier to think of objects as being passed around by reference.&lt;/p&gt;
&lt;p&gt;This post begins by summarizing pointers and references in C++, before describing the different semantics for passing arguments to a function.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#pointers&quot;&gt;Pointers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#references&quot;&gt;References&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#passing-arguments-to-a-function&quot;&gt;Passing arguments to a function&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#pass-by-value&quot;&gt;Pass-by-Value&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#pass-by-reference&quot;&gt;Pass-by-Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#pass-by-address&quot;&gt;Pass-by-Address&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#an-aside&quot;&gt;An aside&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#pass-by-reference-or-pass-by-address&quot;&gt;Pass-by-Reference or Pass-by-Address&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#java-is-pass-by-value&quot;&gt;Java is Pass-by-Value&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;pointers&quot;&gt;Pointers &lt;a class=&quot;direct-link&quot; href=&quot;#pointers&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A pointer in C++ is a variable that holds the address of another variable. We use the address-of (&lt;code&gt;&amp;amp;&lt;/code&gt;) operator to get the address of a variable, and we can store that address in a pointer.&lt;/p&gt;
&lt;p&gt;For example, we can declare an integer and print its memory address, as shown below.&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; width &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The address of the width variable is &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;width &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span 
class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;widthPtr &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;width&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// We use the asterisk to declare a pointer.&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The address of the width variable is &quot;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;              &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; widthPtr &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the example, we print the address of the &lt;code&gt;width&lt;/code&gt; variable using the address-of operator directly and the &lt;code&gt;widthPtr&lt;/code&gt; pointer. This prints:&lt;/p&gt;
&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;The address of the width variable is 0x7ffcaf841c3c&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;The address of the width variable is 0x7ffcaf841c3c&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have declared a pointer, we may want to access the value stored at the address it holds. To do that, we use the indirection or dereference (&lt;code&gt;*&lt;/code&gt;) operator:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; width &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;widthPtr &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;width&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The value of 
the width variable is &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; width &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The address of the width variable is: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; widthPtr &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;widthPtr &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;93&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The value of the width variable is &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; width &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token 
string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The value of the width variable accessed via the pointer is &quot;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;              &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;widthPtr &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which gives the output below:&lt;/p&gt;
&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;The value of the width variable is &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;The address of the width variable is: 0x7ffc882a0f0c&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;The value of the width variable is &lt;span class=&quot;token number&quot;&gt;93&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;The value of the width variable accessed via the pointer is &lt;span class=&quot;token number&quot;&gt;93&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It might confuse you to see the same asterisk (&lt;code&gt;*&lt;/code&gt;) operator used both to declare and to dereference a pointer. A rule of thumb: the operator declares a pointer when it follows a type, just as when declaring a regular variable, and dereferences one when there is no type before it.&lt;/p&gt;
&lt;h3 id=&quot;references&quot;&gt;References &lt;a class=&quot;direct-link&quot; href=&quot;#references&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A reference variable is an alias to an existing variable. You use the ampersand &lt;code&gt;&amp;amp;&lt;/code&gt; symbol to declare a reference. For example, &lt;code&gt;widthRef&lt;/code&gt; is an alias to &lt;code&gt;width&lt;/code&gt; in the example below:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; width &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;widthRef &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; width&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, the ampersand &lt;code&gt;&amp;amp;&lt;/code&gt; symbol means &amp;quot;reference to&amp;quot; and not &amp;quot;address of&amp;quot;. A rule of thumb here is that the operator means &amp;quot;reference to&amp;quot; when it appears on the left-hand side of the equals sign and &amp;quot;address of&amp;quot; when it appears on the right.&lt;/p&gt;
&lt;p&gt;After creating a reference, you can use it to access the variable it is aliasing:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; width &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;widthRef &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; width&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The value of the width variable is &quot;&lt;/span&gt; &lt;span 
class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; width &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    widthRef &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;93&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// Notice the missing ampersand&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The value of the width variable is &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; width &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The value of the width variable accessed via the reference is &quot;&lt;/span&gt; &lt;/mark&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; widthRef &lt;span class=&quot;token 
operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which prints:&lt;/p&gt;
&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;The value of the width variable is &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;The value of the width variable is &lt;span class=&quot;token number&quot;&gt;93&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;The value of the width variable accessed via the reference is &lt;span class=&quot;token number&quot;&gt;93&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some rules make references safer to use than pointers: unlike pointers, references cannot hold a null value; you must initialize a reference when you declare it; and once initialized, a reference cannot be reseated to alias a different variable.&lt;/p&gt;
&lt;p&gt;Using an example from &lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/references/&quot;&gt;Learn C++&lt;/a&gt;:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; value1&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; value2&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;ref&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;value1&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// okay, ref is now an alias for value1&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    ref &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; value2&lt;span class=&quot;token 
punctuation&quot;&gt;;&lt;/span&gt;     &lt;span class=&quot;token comment&quot;&gt;// assigns 6 (the value of value2) to value1 -- does NOT change the reference!&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;passing-arguments-to-a-function&quot;&gt;Passing arguments to a function &lt;a class=&quot;direct-link&quot; href=&quot;#passing-arguments-to-a-function&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There are three ways of passing arguments to C++ functions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pass-by-Value&lt;/li&gt;
&lt;li&gt;Pass-by-Reference&lt;/li&gt;
&lt;li&gt;Pass-by-Address&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To show these methods, I&#39;ll use a simple &lt;code&gt;Book&lt;/code&gt; class, which takes a title in its constructor and has functions to get and set the title:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;string&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Book&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;public&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token function&quot;&gt;Book&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;string title&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        d_title &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; title&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token 
double-colon punctuation&quot;&gt;::&lt;/span&gt;string &lt;span class=&quot;token function&quot;&gt;getTitle&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; d_title&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;setTitle&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;string newTitle&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        d_title &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; newTitle&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;private&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span 
class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;string d_title&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;pass-by-value&quot;&gt;Pass-by-Value &lt;a class=&quot;direct-link&quot; href=&quot;#pass-by-value&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When you pass an argument by value, the function receives a copy of the argument, so modifying the parameter inside the function leaves the original argument unchanged. For example:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;string&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;passValue&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Book b&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    b&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;setTitle&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;The Thing Around Your Neck&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span 
class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    Book b1 &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Book&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Americanah&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;b1 has the title: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; b1&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getTitle&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    &lt;span class=&quot;token 
function&quot;&gt;passValue&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;b1&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;After calling passValue(), b1 has the title: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; b1&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getTitle&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, &lt;code&gt;passValue()&lt;/code&gt; takes a copy of &lt;code&gt;b1&lt;/code&gt; and modifies the copy, resulting in the output:&lt;/p&gt;
&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;b1 has the title: Americanah&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;After calling passValue&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;, b1 has the title: Americanah&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use pass-by-value for simple data types like &lt;code&gt;int&lt;/code&gt; and &lt;code&gt;float&lt;/code&gt;, where the cost of copying the argument is low.&lt;/p&gt;
&lt;h4 id=&quot;pass-by-reference&quot;&gt;Pass-by-Reference &lt;a class=&quot;direct-link&quot; href=&quot;#pass-by-reference&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;With pass-by-reference, the parameter acts as an alias for the argument variable, which means you can change the value of the referenced variable directly. For example:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;string&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;passRef&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Book &lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;bk&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;string newTitle &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Half of a Yellow Sun&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    bk &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token 
function&quot;&gt;Book&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;newTitle&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    Book b1 &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Book&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Americanah&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;b1 has the title: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; b1&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getTitle&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    &lt;span class=&quot;token function&quot;&gt;passRef&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;b1&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// Notice how we pass b1 as normal&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;After calling passRef(), b1 has the title: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; b1&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getTitle&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;passRef()&lt;/code&gt;, we are not reassigning &lt;code&gt;bk&lt;/code&gt; to reference a new variable (recall that references cannot be reseated). Instead, we are changing the value of the variable it currently references, &lt;code&gt;b1&lt;/code&gt; in our example.&lt;/p&gt;
&lt;p&gt;This gives the output:&lt;/p&gt;
&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;b1 has the title: Americanah&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;After calling passRef&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;, b1 has the title: Half of a Yellow Sun&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;pass-by-address&quot;&gt;Pass-by-Address &lt;a class=&quot;direct-link&quot; href=&quot;#pass-by-address&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;With this approach, you are passing the address of the argument rather than the argument value itself. You use a pointer as the function parameter to store the address, and can dereference the pointer to access the value at the address.&lt;/p&gt;
&lt;p&gt;When you pass by address, the address itself is passed by value, i.e., the function only gets a copy of the address. This means that if you reassign the parameter in the function, you are only making that local copy point to a different memory address.&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;string&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;passAddress&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Book &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;bk&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    bk &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Book&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Half of a Yellow Sun&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// bk now points to a new 
address&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    Book b1 &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;Book&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Americanah&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;b1 has the title: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; b1&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getTitle&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span 
class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;    &lt;span class=&quot;token function&quot;&gt;passAddress&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&amp;amp;&lt;/span&gt;b1&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;After calling passAddress(), b1 has the title: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; b1&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getTitle&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which produces:&lt;/p&gt;
&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;b1 has the title: Americanah&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;After calling passAddress&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;, b1 has the title: Americanah&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the example, we are assigning a new address to &lt;code&gt;bk&lt;/code&gt; in &lt;code&gt;passAddress&lt;/code&gt; and leaving the value of &lt;code&gt;b1&lt;/code&gt; untouched.&lt;/p&gt;
&lt;p&gt;If we want to access the members of &lt;code&gt;b1&lt;/code&gt; in &lt;code&gt;passAddress()&lt;/code&gt;, we can use the dereference (&lt;code&gt;*&lt;/code&gt;) operator or the shorthand arrow (&lt;code&gt;-&amp;gt;&lt;/code&gt;) operator:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;passAddressAlt&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Book &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;bk&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;bk has the title: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt;bk&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getTitle&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;bk has the title: &quot;&lt;/span&gt; &lt;span class=&quot;token 
operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; bk&lt;span class=&quot;token operator&quot;&gt;-&gt;&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;getTitle&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h5 id=&quot;an-aside&quot;&gt;An aside &lt;a class=&quot;direct-link&quot; href=&quot;#an-aside&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;When you create an object in C++ using the &lt;code&gt;new&lt;/code&gt; keyword, you must explicitly deallocate it from the heap using the &lt;code&gt;delete&lt;/code&gt; keyword to prevent a memory leak.&lt;/p&gt;
&lt;p&gt;This means the &lt;code&gt;passAddress()&lt;/code&gt; function above will cause a memory leak, since there is no way to access the new object created in the function after it returns. I&#39;m keeping things simple here for demonstration.&lt;/p&gt;
&lt;p&gt;Modern C++ introduced &lt;a href=&quot;https://en.cppreference.com/book/intro/smart_pointers&quot;&gt;smart pointers&lt;/a&gt;, which automatically delete the objects they own once those objects are no longer needed, but I&#39;m leaving them out of this post.&lt;/p&gt;
&lt;hr&gt;
&lt;h4 id=&quot;pass-by-reference-or-pass-by-address&quot;&gt;Pass-by-Reference or Pass-by-Address &lt;a class=&quot;direct-link&quot; href=&quot;#pass-by-reference-or-pass-by-address&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;For complex data types where copying may be too expensive, the choice between pass-by-reference and pass-by-address is not straightforward.&lt;/p&gt;
&lt;p&gt;Learn C++, for example, recommends sticking with pass-by-reference, but a more common recommendation I have found is to use pass-by-address when you are modifying the parameter in the function and pass by &lt;code&gt;const&lt;/code&gt; reference otherwise.&lt;/p&gt;
&lt;p&gt;With the latter approach, the &lt;code&gt;&amp;amp;&lt;/code&gt; at the call site makes it more explicit that an argument may be modified.&lt;/p&gt;
&lt;h4 id=&quot;java-is-pass-by-value&quot;&gt;Java is Pass-by-Value &lt;a class=&quot;direct-link&quot; href=&quot;#java-is-pass-by-value&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In Java, arguments are only ever passed by value, but for object types this has the same semantics as pass-by-address in C++.&lt;/p&gt;
&lt;p&gt;I mentioned earlier that with pass-by-address, you are actually passing the argument&#39;s address by value, i.e. passing a copy of the address. This is the same in Java: you are passing a copy of a variable&#39;s address.&lt;/p&gt;
&lt;p&gt;If you have the following method signature in your Java program:&lt;/p&gt;
&lt;pre class=&quot;language-java&quot;&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;passAddress&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Book&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It is equivalent to the following in C++:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;passAddress&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;Book&lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;  &lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is why, if you implement the Java method as below, it will leave the original argument unchanged.&lt;/p&gt;
&lt;pre class=&quot;language-java&quot;&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;passAddress&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Book&lt;/span&gt; b&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    b &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Book&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token string&quot;&gt;&quot;Purple Hibiscus&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/introduction-to-pointers/&quot;&gt;Introduction to pointers&lt;/a&gt; - Learn C++&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/references/&quot;&gt;Reference variables&lt;/a&gt; - Learn C++&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/passing-arguments-by-value/&quot;&gt;Passing arguments by value&lt;/a&gt; - Learn C++&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/passing-arguments-by-reference/&quot;&gt;Passing arguments by reference&lt;/a&gt; - Learn C++&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/passing-arguments-by-address/&quot;&gt;Passing arguments by address&lt;/a&gt; - Learn C++&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.javadude.com/articles/passbyvalue.htm&quot;&gt;Java is Pass-by-Value, Dammit!&lt;/a&gt; - &lt;a href=&quot;http://javadude.com/&quot;&gt;Javadude.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/a/40523/5430313&quot;&gt;Is Java &amp;quot;pass-by-reference&amp;quot; or &amp;quot;pass-by-value&amp;quot;?&lt;/a&gt; - Stack Overflow answer&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>Learning C++ from Java - Header files</title>
		<link href="https://timilearning.com/posts/learning-cpp/header-files/"/>
		<updated>2021-12-04T18:45:32-00:00</updated>
		<id>https://timilearning.com/posts/learning-cpp/header-files/</id>
		<content type="html">&lt;p&gt;This is a continuation of the series on C++ topics that I&#39;ve found interesting, coming from a Java background. You can read the first post &lt;a href=&quot;https://timilearning.com/posts/learning-cpp/introduction/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&#39;ll start this post by describing forward declarations in C++ before talking about header files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#forward-declarations&quot;&gt;Forward declarations&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#working-with-multiple-files&quot;&gt;Working with multiple files&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#multiple-declarations%2C-single-definition&quot;&gt;Multiple declarations, single definition&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#header-files&quot;&gt;Header files&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#header-guards&quot;&gt;Header guards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#header-files-and-linkage&quot;&gt;Header files and linkage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Say we write the program below to print the elements in a list:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; sizes&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;23&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;32&lt;/span&gt;&lt;span 
class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;39&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;43&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sizes&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token 
string&quot;&gt;&quot;Your integer list contains the numbers: &quot;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;              &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This simple program will not compile. My compiler produces the error:&lt;/p&gt;
&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;‘print’ was not declared &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; this scope&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; did you mean ‘printf’?&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is because the compiler, which parses code from the top down, needs to know about an identifier before it encounters the identifier&#39;s usage.&lt;/p&gt;
&lt;p&gt;In the above example, the compiler encounters &lt;code&gt;print()&lt;/code&gt;&#39;s usage in &lt;code&gt;main()&lt;/code&gt; before its declaration (and definition), and does not know what &lt;code&gt;print()&lt;/code&gt; is yet.&lt;/p&gt;
&lt;p&gt;There are two ways to fix this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reorder the function definitions so that &lt;code&gt;print()&lt;/code&gt; comes before &lt;code&gt;main()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use a forward declaration, which is the approach I&#39;ll focus on here.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&quot;forward-declarations&quot;&gt;Forward declarations &lt;a class=&quot;direct-link&quot; href=&quot;#forward-declarations&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Quoting &lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/forward-declarations/&quot;&gt;Learn C++&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A forward declaration allows us to tell the compiler about the existence of an identifier &lt;em&gt;before&lt;/em&gt; actually defining the identifier.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We forward declare a function by specifying its &lt;em&gt;prototype&lt;/em&gt;, which comprises the function&#39;s name, return type, and parameters. We can also forward declare variables and user-defined types.&lt;/p&gt;
&lt;p&gt;Rewriting the previous example to use a forward declaration:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// forward declaration of print()&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; sizes&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;23&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;39&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;43&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sizes&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token 
keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Your integer list contains the numbers: &quot;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;              &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        std&lt;span class=&quot;token double-colon 
punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The compiler now knows about &lt;code&gt;print()&lt;/code&gt; before its usage in the &lt;code&gt;main()&lt;/code&gt; function, and all is perfect.&lt;/p&gt;
&lt;h4 id=&quot;working-with-multiple-files&quot;&gt;Working with multiple files &lt;a class=&quot;direct-link&quot; href=&quot;#working-with-multiple-files&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Forward declarations also apply when working with multiple files. We can split the previous example into two files:&lt;/p&gt;
&lt;p&gt;print.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Your integer list contains the numbers: &quot;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;              &lt;span class=&quot;token 
operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And main.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; sizes&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;23&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;32&lt;/span&gt;&lt;span 
class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;39&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;43&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sizes&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similarly, this program will not compile unless you forward declare &lt;code&gt;print()&lt;/code&gt; in main.cpp as before:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; sizes&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;23&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;39&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;43&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sizes&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;multiple-declarations%2C-single-definition&quot;&gt;Multiple declarations, single definition &lt;a class=&quot;direct-link&quot; href=&quot;#multiple-declarations%2C-single-definition&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;You can declare an identifier as many times as you like, across as many files as you like, but your program can contain only one definition of it. This is the &lt;a href=&quot;https://en.cppreference.com/w/cpp/language/definition&quot;&gt;&#39;One Definition Rule&#39;&lt;/a&gt; in C++.&lt;/p&gt;
&lt;p&gt;If you declare an identifier in a file but don&#39;t define it anywhere in your program, the compiler will compile the file, but the &lt;a href=&quot;https://timilearning.com/posts/learning-cpp/introduction/#linking&quot;&gt;linker&lt;/a&gt; will fail. For example, after removing the &lt;code&gt;print()&lt;/code&gt; definition in print.cpp above and attempting to build the program, I got the error:&lt;/p&gt;
&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;undefined reference to `print&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std::vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;int&lt;span class=&quot;token punctuation&quot;&gt;..&lt;/span&gt;.&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&#39;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id=&quot;header-files&quot;&gt;Header files &lt;a class=&quot;direct-link&quot; href=&quot;#header-files&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Extending the previous example, let&#39;s say we don&#39;t just want to &lt;code&gt;print()&lt;/code&gt; an integer list, but we also want to assign it a score based on how many even numbers it contains.&lt;/p&gt;
&lt;p&gt;We can rename print.cpp to container.cpp with the following content:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;evenScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; evenCount&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token 
keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;listItem &lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;            evenCount&lt;span class=&quot;token operator&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; evenCount &lt;span class=&quot;token 
operator&quot;&gt;/&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Your integer list contains the numbers: &quot;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;              &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And then forward declare both functions in main.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;evenScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token 
operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; sizes&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;23&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;39&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;43&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token 
function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sizes&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The evenScore of the container with sizes is: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;evenScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sizes&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;%\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This works fine, but imagine how tedious it would be to forward declare every function we need if container.cpp had ten functions.&lt;/p&gt;
&lt;p&gt;Thankfully, C++ has header files, which simplify this process. A header file typically has a &lt;code&gt;.h&lt;/code&gt; or &lt;code&gt;.hpp&lt;/code&gt; extension and holds declarations that any source file can share.&lt;/p&gt;
&lt;p&gt;We can extend our running example by splitting container.cpp into a header and source file:&lt;/p&gt;
&lt;p&gt;container.h:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;evenScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token 
keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We use container.h by &lt;code&gt;#include&lt;/code&gt;-ing it in the source files:&lt;/p&gt;
&lt;p&gt;container.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;container.h&quot;&lt;/span&gt;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;evenScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; evenCount&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span 
class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;listItem &lt;span class=&quot;token operator&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;            evenCount&lt;span class=&quot;token operator&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;evenCount &lt;span class=&quot;token 
operator&quot;&gt;/&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Your list has the following content: &quot;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;              &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;main.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;container.h&quot;&lt;/span&gt;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; sizes&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;23&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token 
number&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;39&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;43&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sizes&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The evenScore of the container with sizes is: &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;evenScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;sizes&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;%\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;#include&lt;/code&gt; is a preprocessor directive that tells the &lt;a href=&quot;https://timilearning.com/posts/learning-cpp/introduction/#preprocessing&quot;&gt;preprocessor&lt;/a&gt; to paste the content of another file into the current file.  In our example, main.cpp and container.cpp will contain the declarations in container.h after the preprocessor runs.&lt;/p&gt;
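&lt;p&gt;As a rough sketch of what that pasting produces (with the standard library headers similarly expanded, elided here), main.cpp effectively becomes:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;// main.cpp after preprocessing (simplified):&lt;br&gt;// the #include directive is gone, replaced by container.h&#39;s contents&lt;br&gt;double evenScore(std::vector&amp;lt;int&gt; list);&lt;br&gt;void print(std::vector&amp;lt;int&gt; list);&lt;br&gt;&lt;br&gt;int main()&lt;br&gt;{&lt;br&gt;    // ... body unchanged ...&lt;br&gt;}&lt;/code&gt;&lt;/pre&gt;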
&lt;p&gt;Note that I didn&#39;t need to include container.h in container.cpp, but it is commonly recommended that you include a header file in its matching source file; you can find an excellent demonstration of why &lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/header-files/#comment-398571&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
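&lt;p&gt;A quick sketch of the kind of mistake this catches (the mismatch below is hypothetical, not from our code): when container.cpp includes container.h, the compiler can compare each definition against its declaration within the same translation unit and flag a mismatch at compile time rather than leaving it to surface as a link error.&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;// container.h declares:&lt;br&gt;double evenScore(std::vector&amp;lt;int&gt; list);&lt;br&gt;&lt;br&gt;// container.cpp accidentally defines:&lt;br&gt;int evenScore(std::vector&amp;lt;int&gt; list) { /* ... */ }&lt;br&gt;&lt;br&gt;// With container.h included, the compiler rejects this immediately:&lt;br&gt;// functions that differ only in their return type cannot be overloaded.&lt;/code&gt;&lt;/pre&gt;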
&lt;h4 id=&quot;header-guards&quot;&gt;Header guards &lt;a class=&quot;direct-link&quot; href=&quot;#header-guards&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Say, just for fun, that we are writing a program to compare Tidal and Spotify with the following rubric:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Assign a score to each platform based on the average length of the top K most played songs on the site, where K is user-provided and must be at least 2.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We can introduce a new function in container.h to calculate the average value and define a global variable for the minimum value of K.&lt;/p&gt;
&lt;h5 id=&quot;container&quot;&gt;Container &lt;a class=&quot;direct-link&quot; href=&quot;#container&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;container.h:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; minimumSize &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span 
class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;container.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;container.h&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; total &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span 
class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        total &lt;span class=&quot;token operator&quot;&gt;+=&lt;/span&gt; listItem&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; total &lt;span class=&quot;token operator&quot;&gt;/&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token 
operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;Your list has the following content: &quot;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;              &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;:&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; listItem &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we add the platform-specific header and source files:&lt;/p&gt;
&lt;h5 id=&quot;spotify&quot;&gt;Spotify &lt;!-- omit in toc --&gt; &lt;a class=&quot;direct-link&quot; href=&quot;#spotify&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;spotify.h:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;container.h&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;namespace&lt;/span&gt; Spotify&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;calculateTopKScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; k&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;spotify.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;spotify.h&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;namespace&lt;/span&gt; Spotify&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;calculateTopKScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; k&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;k &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt; minimumSize&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;            &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token 
number&quot;&gt;1.0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token comment&quot;&gt;// Assuming that k = 4, after getting the data from Spotify we have this list:&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; songLengthsInMinutes&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;2.49&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;5.53&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;4.46&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3.35&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;songLengthsInMinutes&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h5 id=&quot;tidal&quot;&gt;Tidal &lt;!-- omit in toc --&gt; &lt;a class=&quot;direct-link&quot; href=&quot;#tidal&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;tidal.h:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;container.h&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;namespace&lt;/span&gt; Tidal&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;calculateTopKScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; k&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;tidal.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;tidal.h&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;namespace&lt;/span&gt; Tidal&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;calculateTopKScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; k&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;k &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt; minimumSize&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;            &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token 
number&quot;&gt;1.0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token comment&quot;&gt;// Assuming that k = 4, after getting the data from Tidal we have this list:&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; songLengthsInMinutes&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;3.06&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;4.17&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;6.44&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3.07&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;songLengthsInMinutes&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these in place, we can write our main.cpp as:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;spotify.h&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;tidal.h&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; k &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span 
class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; spotifyScore &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Spotify&lt;/span&gt;&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;calculateTopKScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;k&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; tidalScore &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;Tidal&lt;/span&gt;&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;calculateTopKScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;k&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;The top &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; k &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot; Spotify songs have an average length of &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; spotifyScore&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;             
 &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot; minutes, while Tidal songs have an average length of &quot;&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; tidalScore &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot; minutes. \n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Can you tell why this program will not compile?&lt;/p&gt;
&lt;p&gt;.&lt;br&gt;
.&lt;br&gt;
.&lt;/p&gt;
&lt;p&gt;I get the error below, which says that &lt;code&gt;minimumSize&lt;/code&gt; is redefined.&lt;/p&gt;
&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;In &lt;span class=&quot;token function&quot;&gt;file&lt;/span&gt; included from tidal.h:1,&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;                 from main.cpp:3:&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;container.h:4:11: error: redefinition of ‘const int minimumSize’&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; const int minimumSize &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;      &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;           ^~~~~~~~~~~&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;In &lt;span class=&quot;token function&quot;&gt;file&lt;/span&gt; included from spotify.h:1,&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;                 from main.cpp:2:&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;container.h:4:11: note: ‘const int minimumSize’ previously defined here&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt; const int minimumSize &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;      &lt;span class=&quot;token operator&quot;&gt;|&lt;/span&gt;           ^~~~~~~~~~~&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is because when the preprocessor runs, it &lt;code&gt;#include&lt;/code&gt;s the contents of spotify.h and tidal.h into main.cpp, and both of those header files in turn &lt;code&gt;#include&lt;/code&gt; container.h.&lt;/p&gt;
&lt;p&gt;Since container.h contains a definition of &lt;code&gt;minimumSize&lt;/code&gt;, main.cpp will have two definitions of &lt;code&gt;minimumSize&lt;/code&gt; after the preprocessor completes, which violates the One Definition Rule.&lt;/p&gt;
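&lt;p&gt;To make this concrete, here is a simplified sketch (not the exact compiler output) of what main.cpp contains after preprocessing, with the standard library headers omitted. Note that the repeated &lt;em&gt;declarations&lt;/em&gt; of &lt;code&gt;average&lt;/code&gt; and &lt;code&gt;print&lt;/code&gt; are fine; only the repeated &lt;em&gt;definition&lt;/em&gt; is an error:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;// Expanded from spotify.h, which includes container.h:
const int minimumSize = 2; // first definition
double average(std::vector&amp;lt;double&gt; list);  // declaration: fine
void print(std::vector&amp;lt;double&gt; list);      // declaration: fine

namespace Spotify
{
    double calculateTopKScore(int k);
}

// Expanded from tidal.h, which also includes container.h:
const int minimumSize = 2; // second definition: error
double average(std::vector&amp;lt;double&gt; list);  // redeclaration: fine
void print(std::vector&amp;lt;double&gt; list);      // redeclaration: fine

namespace Tidal
{
    double calculateTopKScore(int k);
}

int main()
{
    // ...
}&lt;/code&gt;&lt;/pre&gt;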
&lt;p&gt;But if you rewrite container.h to contain the code below, the program will compile fine.&lt;/p&gt;
&lt;p&gt;container.h:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;ifndef&lt;/span&gt; &lt;span class=&quot;token expression&quot;&gt;CONTAINER_H &lt;/span&gt;&lt;span class=&quot;token comment&quot;&gt;// Note that CONTAINER_H can be replaced with any unique name.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;define&lt;/span&gt; &lt;span class=&quot;token macro-name&quot;&gt;CONTAINER_H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; minimumSize &lt;span class=&quot;token 
operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;endif&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The difference is that container.h now contains a &lt;em&gt;header guard&lt;/em&gt;, which tells the preprocessor to first check whether &lt;code&gt;CONTAINER_H&lt;/code&gt; is defined in the current &lt;a href=&quot;https://timilearning.com/posts/learning-cpp/introduction/#preprocessing&quot;&gt;translation unit&lt;/a&gt; (&lt;code&gt;#ifndef&lt;/code&gt;), and to define it and include the header file&#39;s content only if it is not.&lt;/p&gt;
&lt;p&gt;In main.cpp, when the preprocessor reaches the container.h included from tidal.h, it sees that &lt;code&gt;CONTAINER_H&lt;/code&gt; was already defined in the translation unit while processing spotify.h, and so it skips container.h&#39;s content.&lt;/p&gt;
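&lt;p&gt;As a rough sketch of the preprocessor&#39;s view (you can inspect the real output with &lt;code&gt;g++ -E main.cpp&lt;/code&gt;), main.cpp now expands to something like this, with container.h&#39;s content appearing only once:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;// From spotify.h: CONTAINER_H is not yet defined, so container.h is included:
const int minimumSize = 2;
double average(std::vector&amp;lt;double&gt; list);
void print(std::vector&amp;lt;double&gt; list);

namespace Spotify
{
    double calculateTopKScore(int k);
}

// From tidal.h: CONTAINER_H is already defined, so container.h expands to nothing.
namespace Tidal
{
    double calculateTopKScore(int k);
}

int main()
{
    // ...
}&lt;/code&gt;&lt;/pre&gt;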
&lt;p&gt;This simple example involves a variable, but header files may contain function definitions too, and header guards likewise prevent multiple definitions of those functions.&lt;/p&gt;
&lt;p&gt;In summary, header guards prevent a translation unit from having multiple definitions of an identifier.&lt;/p&gt;
&lt;h4 id=&quot;header-files-and-linkage&quot;&gt;Header files and linkage &lt;a class=&quot;direct-link&quot; href=&quot;#header-files-and-linkage&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;I wrote about linkage in the &lt;a href=&quot;https://timilearning.com/posts/learning-cpp/introduction/#linkage&quot;&gt;previous post&lt;/a&gt; and will give another example here.&lt;/p&gt;
&lt;p&gt;We can introduce a linking error in our music streaming application by adding the &lt;code&gt;extern&lt;/code&gt; keyword before the variable in container.h:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;ifndef&lt;/span&gt; &lt;span class=&quot;token expression&quot;&gt;CONTAINER_H &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;define&lt;/span&gt; &lt;span class=&quot;token macro-name&quot;&gt;CONTAINER_H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; minimumSize &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; 
&lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;endif&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I get an error saying &lt;code&gt;multiple definition of &#39;minimumSize&#39;&lt;/code&gt;. This is because the &lt;code&gt;minimumSize&lt;/code&gt; variable initially has internal linkage since it&#39;s a &lt;code&gt;const&lt;/code&gt; global variable, meaning each translation unit gets its own copy of the variable.&lt;/p&gt;
&lt;p&gt;But by making it &lt;code&gt;extern&lt;/code&gt; and giving it external linkage, we&#39;re telling the linker that every usage of the variable refers to the same instance. Since a definition of the variable now appears in every translation unit that includes container.h, we are violating the One Definition Rule.&lt;/p&gt;
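&lt;p&gt;One way to see the difference between the two kinds of linkage (this check is not part of the original example) is to compare the variable&#39;s address across translation units. With internal linkage, each translation unit has its own copy of &lt;code&gt;minimumSize&lt;/code&gt;, so the addresses will typically differ; with external linkage and a single definition, every translation unit refers to the same object:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;// Hypothetical helpers, added only for illustration.

// In spotify.cpp:
const int* spotifyMinimumSizeAddress() { return &amp;amp;minimumSize; }

// In tidal.cpp:
const int* tidalMinimumSizeAddress() { return &amp;amp;minimumSize; }

// In main.cpp, with declarations of both helpers in scope:
// typically false with the internal-linkage &#39;const int minimumSize = 2;&#39;,
// true once minimumSize has external linkage and a single definition.
bool sameObject = spotifyMinimumSizeAddress() == tidalMinimumSizeAddress();&lt;/code&gt;&lt;/pre&gt;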
&lt;p&gt;We can maintain the external linkage property and fix the error by replacing the definition in the header file with a declaration, and defining the variable in only one of the source files we include the header in, as shown below:&lt;/p&gt;
&lt;p&gt;container.h:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;ifndef&lt;/span&gt; &lt;span class=&quot;token expression&quot;&gt;CONTAINER_H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;define&lt;/span&gt; &lt;span class=&quot;token macro-name&quot;&gt;CONTAINER_H&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;vector&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; minimumSize&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span 
class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; list&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;endif&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;tidal.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;tidal.h&quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;mark class=&quot;highlight-line highlight-line-active&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; minimumSize &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/mark&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;namespace&lt;/span&gt; Tidal&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;calculateTopKScore&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; k&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token keyword&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;k &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt; minimumSize&lt;span class=&quot;token 
punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;            &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1.0&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; songLengthsInSeconds&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;3.06&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;4.17&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;6.44&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3.07&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;        &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;(&lt;/span&gt;songLengthsInSeconds&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There is now only one definition of &lt;code&gt;minimumSize&lt;/code&gt; in the program, and because the header declares it &lt;code&gt;extern&lt;/code&gt;, every translation unit that includes the header refers to that single definition.&lt;/p&gt;
&lt;h3 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/questions/333889/why-have-header-files-and-cpp-files&quot;&gt;Why have header files and .cpp files?&lt;/a&gt; - Stack Overflow question with useful answers.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/forward-declarations/&quot;&gt;Forward declarations&lt;/a&gt; - Learn C++.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/programs-with-multiple-code-files/&quot; title=&quot;https://www.learncpp.com/cpp-tutorial/programs-with-multiple-code-files/&quot;&gt;Programs with multiple code files&lt;/a&gt; - Learn C++.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/header-files/&quot; title=&quot;https://www.learncpp.com/cpp-tutorial/header-files/&quot;&gt;Header files&lt;/a&gt; - Learn C++.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>Learning C++ from Java - Building, Namespaces, Linkage, and more</title>
		<link href="https://timilearning.com/posts/learning-cpp/introduction/"/>
		<updated>2021-11-13T19:35:49-00:00</updated>
		<id>https://timilearning.com/posts/learning-cpp/introduction/</id>
		<content type="html">&lt;p&gt;I recently had to learn C++ for work and have done most of that learning so far through &lt;a href=&quot;https://www.learncpp.com/&quot;&gt;Learn C++&lt;/a&gt;. This series of posts will highlight what I have found interesting about C++, especially given my Java background.&lt;/p&gt;
&lt;p&gt;Note that these posts are not meant to be tutorials on writing C++; I recommend visiting &lt;a href=&quot;https://www.learncpp.com/&quot;&gt;Learn C++&lt;/a&gt; if you want a thorough C++ tutorial.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#building-a-c%2B%2B-program&quot;&gt;Building a C++ Program&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#preprocessing&quot;&gt;Preprocessing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#compilation&quot;&gt;Compilation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#linking&quot;&gt;Linking&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#initializing-variables&quot;&gt;Initializing Variables&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#copy-initialization&quot;&gt;Copy Initialization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#direct-initialization&quot;&gt;Direct Initialization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#list-initialization&quot;&gt;List Initialization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#value-initialization-and-zero-initialization&quot;&gt;Value Initialization and Zero Initialization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#default-initialization&quot;&gt;Default Initialization&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#namespaces&quot;&gt;Namespaces&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#namespace-aliases&quot;&gt;Namespace Aliases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#linkage&quot;&gt;Linkage&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#linkage-is-not-scope&quot;&gt;Linkage is not scope&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#storage-duration&quot;&gt;Storage Duration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-%27static%27-keyword&quot;&gt;The &#39;static&#39; keyword&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;building-a-c%2B%2B-program&quot;&gt;Building a C++ Program &lt;a class=&quot;direct-link&quot; href=&quot;#building-a-c%2B%2B-program&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A C++ program comprises one or more source files and header files, and the steps to create an executable from source code are preprocessing, compilation, and linking.&lt;/p&gt;
&lt;h4 id=&quot;preprocessing&quot;&gt;Preprocessing &lt;a class=&quot;direct-link&quot; href=&quot;#preprocessing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;This involves handling all &lt;em&gt;preprocessor directives&lt;/em&gt;. Preprocessor directives are instructions that tell the preprocessor to perform text manipulation tasks. An example is the &lt;code&gt;#include&lt;/code&gt; directive, which tells the preprocessor to insert the contents of a header file into the current file.&lt;/p&gt;
&lt;p&gt;The output of the preprocessor is one or more &lt;em&gt;translation units&lt;/em&gt;. A translation unit is a source file after the preprocessor has processed it, i.e. with each directive handled and the contents of all included header files expanded in place.&lt;/p&gt;
&lt;h4 id=&quot;compilation&quot;&gt;Compilation &lt;a class=&quot;direct-link&quot; href=&quot;#compilation&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;This is where the compiler converts each translation unit into an object file containing machine code. These object files cannot be run yet but can be stored for reuse later on.&lt;/p&gt;
&lt;h4 id=&quot;linking&quot;&gt;Linking &lt;a class=&quot;direct-link&quot; href=&quot;#linking&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;This is the final stage, where a &lt;em&gt;linker&lt;/em&gt; combines all the object files to form an executable program. This stage involves linking any needed library code and resolving all cross-file dependencies.&lt;/p&gt;
&lt;p&gt;C++ compilers typically come bundled with separate programs to perform all three steps.&lt;/p&gt;
&lt;h3 id=&quot;initializing-variables&quot;&gt;Initializing Variables &lt;a class=&quot;direct-link&quot; href=&quot;#initializing-variables&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There are different ways to initialize a variable in C++. The differences between these methods matter most when initializing complex types; for primitive types, they all have a similar effect.&lt;/p&gt;
&lt;p&gt;This section will focus primarily on initializing primitive types, with a discussion on complex types to come in a later post.&lt;/p&gt;
&lt;h4 id=&quot;copy-initialization&quot;&gt;Copy Initialization &lt;a class=&quot;direct-link&quot; href=&quot;#copy-initialization&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Copy initialization takes place when you initialize a variable using an equals sign.&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; width  &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;10.23&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; height &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, the program copies the value on the right-hand side of the equals sign into the memory allocated for the variable on the left.&lt;/p&gt;
&lt;p&gt;Copying can be an expensive operation for large objects, but it is efficient for primitive types.&lt;/p&gt;
&lt;h4 id=&quot;direct-initialization&quot;&gt;Direct Initialization &lt;a class=&quot;direct-link&quot; href=&quot;#direct-initialization&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Direct initialization is initialization using non-empty parentheses.&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;width&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10.23&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;height&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Direct initialization and copy initialization do the same thing for primitive types, but have different behaviours for class objects, which you can learn more about &lt;a href=&quot;https://stackoverflow.com/a/1051468/5430313&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&quot;list-initialization&quot;&gt;List Initialization &lt;a class=&quot;direct-link&quot; href=&quot;#list-initialization&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;List initialization (also called uniform initialization or brace initialization) can occur in both direct initialization and copy initialization contexts as &lt;em&gt;direct-list-initialization&lt;/em&gt; and &lt;em&gt;copy-list-initialization&lt;/em&gt;.&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; width&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;10.23&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// Direct-list-initialization.&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; height &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// Copy-list-initialization.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One difference between using list or brace initialization and the other forms of initialization for primitive types is that list initialization is stricter. Quoting &lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/variable-assignment-and-initialization/&quot;&gt;Learn C++&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Brace initialization has the added benefit of disallowing “narrowing” conversions. This means that if you try to use brace initialization to initialize a variable with a value it can not safely hold, the compiler will throw a warning or an error.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So if you attempt to initialize a variable like &lt;code&gt;int speed{43.1}&lt;/code&gt;, list initialization will throw an error, while copy and direct initialization will simply drop the fractional part.&lt;/p&gt;
&lt;p&gt;Unlike direct and copy initialization, you can also use list initialization to populate containers (&lt;em&gt;collections&lt;/em&gt; in Java) with elements during initialization like:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; widths &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;12.3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;10.2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3.4&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;4.7&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;vector&lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;token operator&quot;&gt;&gt;&lt;/span&gt; heights&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span 
class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;value-initialization-and-zero-initialization&quot;&gt;Value Initialization and Zero Initialization &lt;a class=&quot;direct-link&quot; href=&quot;#value-initialization-and-zero-initialization&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Value initialization takes place when you initialize a variable with empty braces. For primitive types, this will initialize the variable with a zero value equivalent for the type.&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; width&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// The value is 0.0.&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; height&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// The value is 0.&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;h4 id=&quot;default-initialization&quot;&gt;Default Initialization &lt;a class=&quot;direct-link&quot; href=&quot;#default-initialization&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Default initialization takes place when you declare a variable with no initializer.&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;double&lt;/span&gt; width&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;token comment&quot;&gt;// The value is undefined.&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; height&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The C++ standard does not define what the values of &lt;code&gt;width&lt;/code&gt; and &lt;code&gt;height&lt;/code&gt; in the above example should be; it leaves it up to the compiler implementation. In many compilers, the values will be whatever garbage is in the memory address allocated for the variables.&lt;/p&gt;
&lt;h3 id=&quot;namespaces&quot;&gt;Namespaces &lt;a class=&quot;direct-link&quot; href=&quot;#namespaces&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Namespaces in C++ help to prevent naming collisions by providing scope to identifiers. They are similar to Java packages in that they both prevent naming collisions, but you can nest namespaces in C++ and can declare multiple namespaces in a single C++ file, among &lt;a href=&quot;https://stackoverflow.com/a/41504984/5430313&quot;&gt;other&lt;/a&gt; differences.&lt;/p&gt;
&lt;p&gt;For example, if you define a namespace like&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;namespace&lt;/span&gt; Math &lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;  &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; x&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; y&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;      &lt;span class=&quot;token keyword&quot;&gt;return&lt;/span&gt; x&lt;span class=&quot;token operator&quot;&gt;+&lt;/span&gt;y&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    &lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can call the &lt;code&gt;add&lt;/code&gt; function by prefixing it with its namespace &lt;code&gt;Math::add(1,2)&lt;/code&gt;, and the compiler won&#39;t confuse it with an &lt;code&gt;add&lt;/code&gt; function defined in a separate namespace.&lt;/p&gt;
&lt;p&gt;You can also declare functions and other identifiers in C++ outside an explicit namespace. Such identifiers become part of the global namespace of your program. Using the above example, if you define an &lt;code&gt;add&lt;/code&gt; function in the global namespace, you can refer to it as &lt;code&gt;::add(3,4)&lt;/code&gt; or just &lt;code&gt;add(3,4)&lt;/code&gt; if there is no conflicting &lt;code&gt;add()&lt;/code&gt; function defined.&lt;/p&gt;
&lt;p&gt;C++ also lets you declare the same namespace in multiple files, provided there&#39;s only one definition of each identifier in the namespace. These declarations are all treated as part of one logical namespace, and the linker resolves cross-file references to its identifiers.&lt;/p&gt;
&lt;h4 id=&quot;namespace-aliases&quot;&gt;Namespace Aliases &lt;a class=&quot;direct-link&quot; href=&quot;#namespace-aliases&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;I mentioned earlier that you can nest namespaces in C++, so it&#39;s not uncommon to see code like:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;School&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;Subject&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;Maths&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;Numbers&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;token class-name&quot;&gt;Arithmetic&lt;/span&gt;&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;34&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Okay, I exaggerate and I hope it is uncommon, but my point is namespaces can be deeply nested and C++ provides a convenient way to refer to nested namespaces through &lt;em&gt;namespace aliases.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You can define a namespace alias for the above example like:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;namespace&lt;/span&gt; MathsArithmetic &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; School&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;Subject&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;Maths&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;Numbers&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;Arithmetic&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And invoke the add function using:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token class-name&quot;&gt;MathsArithmetic&lt;/span&gt;&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;token function&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;345&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, &lt;code&gt;cout&lt;/code&gt; is a global output stream object (not a function) that belongs to the &lt;code&gt;std&lt;/code&gt; namespace.&lt;/p&gt;
&lt;h3 id=&quot;linkage&quot;&gt;Linkage &lt;a class=&quot;direct-link&quot; href=&quot;#linkage&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Linkage is a property of an identifier that specifies whether it is visible outside its translation unit.&lt;/p&gt;
&lt;p&gt;If an identifier has &lt;em&gt;internal linkage&lt;/em&gt;, it can only be referenced from within its own translation unit; it is not exported as a symbol for the linker to resolve across files. &lt;code&gt;const&lt;/code&gt; global variables have internal linkage by default.&lt;/p&gt;
&lt;p&gt;Quoting &lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/internal-linkage/&quot;&gt;Learn C++&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An identifier’s linkage determines whether other declarations of that name refer to the same object or not...This means that if two files have identically named identifiers with internal linkage, those identifiers will be treated as independent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For example, if I have two source files, &lt;code&gt;fruit.cpp&lt;/code&gt; and &lt;code&gt;vegetable.cpp&lt;/code&gt; containing the following definitions:&lt;/p&gt;
&lt;p&gt;fruit.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;string&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;string favouriteFruit &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;carrot&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; favouriteFruit &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token 
string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;vegetable.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;string&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;const&lt;/span&gt; std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;string favouriteFruit &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;carrot&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;printFruit&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; favouriteFruit &lt;span class=&quot;token 
operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because &lt;code&gt;const&lt;/code&gt; global variables have internal linkage, there will be no issues here during the linking phase. Each translation unit will get its own copy of &lt;code&gt;favouriteFruit&lt;/code&gt;, avoiding any conflicts.&lt;/p&gt;
&lt;p&gt;An identifier with &lt;em&gt;external linkage&lt;/em&gt; is visible from any translation unit in the program. All functions and non-const global variables implicitly have external linkage; to give a &lt;code&gt;const&lt;/code&gt; global variable external linkage, declare it with the &lt;code&gt;extern&lt;/code&gt; keyword.&lt;/p&gt;
&lt;p&gt;Using the previous example,&lt;/p&gt;
&lt;p&gt;fruit.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;string&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;string favouriteFruit &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;carrot&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;main&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; favouriteFruit &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token 
punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;vegetable.cpp:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;string&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token macro property&quot;&gt;&lt;span class=&quot;token directive-hash&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;token directive keyword&quot;&gt;include&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&amp;lt;iostream&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;string favouriteFruit &lt;span class=&quot;token operator&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;carrot&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token keyword&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;token function&quot;&gt;printFruit&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;{&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;    std&lt;span class=&quot;token double-colon punctuation&quot;&gt;::&lt;/span&gt;cout &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; favouriteFruit &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token 
string&quot;&gt;&quot;\n&quot;&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;br&gt;&lt;span class=&quot;highlight-line&quot;&gt;&lt;span class=&quot;token punctuation&quot;&gt;}&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, both definitions of &lt;code&gt;favouriteFruit&lt;/code&gt; have external linkage, so the linker sees multiple definitions of the same identifier and fails, in violation of the &lt;a href=&quot;https://en.cppreference.com/w/cpp/language/definition&quot;&gt;one definition rule&lt;/a&gt;. To give such an identifier internal linkage instead, declare it with the &lt;code&gt;static&lt;/code&gt; keyword.&lt;/p&gt;
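&lt;p&gt;Alternatively, if the two files were meant to share a single &lt;code&gt;favouriteFruit&lt;/code&gt;, &lt;code&gt;extern&lt;/code&gt; lets one definition have external linkage. Here is a minimal sketch, collapsed into one file for brevity (in a real program the declaration would live in a shared header):&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;#include &amp;lt;iostream&gt;
#include &amp;lt;string&gt;

// Declaration only: &#39;extern&#39; marks the const global as having external linkage.
// In a real program, this line would live in a header included by both files.
extern const std::string favouriteFruit;

int main() {
    std::cout &amp;lt;&amp;lt; favouriteFruit &amp;lt;&amp;lt; &quot;\n&quot;;
}

// The single definition that every translation unit&#39;s declaration refers to.
extern const std::string favouriteFruit = &quot;carrot&quot;;
&lt;/code&gt;&lt;/pre&gt;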
&lt;p&gt;Local variables have no linkage, which means all declarations of local variables with the same name refer to different objects.&lt;/p&gt;
&lt;p&gt;The next post on header files will include a broader discussion on linkage with more usage examples, and I recommend reading &lt;a href=&quot;http://www.goldsborough.me/c/c++/linker/2016/03/30/19-34-25-internal_and_external_linkage_in_c++/&quot;&gt;this&lt;/a&gt; article if you want to learn more before then.&lt;/p&gt;
&lt;h4 id=&quot;linkage-is-not-scope&quot;&gt;Linkage is not scope &lt;a class=&quot;direct-link&quot; href=&quot;#linkage-is-not-scope&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;While scope and linkage may seem similar, they mean different things. An identifier&#39;s scope defines where it is visible within its translation unit, while linkage determines whether declarations of the same name in other translation units refer to the same identifier or not.&lt;/p&gt;
&lt;p&gt;An &lt;a href=&quot;https://www.ibm.com/docs/en/i/7.4?topic=reference-scope-linkage&quot;&gt;article&lt;/a&gt; from IBM also distinguishes them as follows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Scope and linkage are distinguishable in that scope is for the benefit of the compiler, whereas linkage is for the benefit of the linker. During the translation of a source file to object code, the compiler keeps track of the identifiers that have external linkage and eventually stores them in a table within the object file. The linker is thereby able to determine which names have external linkage, but is unaware of those with internal or no linkage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&quot;storage-duration&quot;&gt;Storage Duration &lt;a class=&quot;direct-link&quot; href=&quot;#storage-duration&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;All variables in a C++ program have a &lt;em&gt;storage duration,&lt;/em&gt; which determines the rules for when the program creates and destroys them. Two forms of this duration are &lt;em&gt;automatic&lt;/em&gt; and &lt;em&gt;static&lt;/em&gt; duration.&lt;/p&gt;
&lt;p&gt;When a variable has automatic duration, it means the program allocates its storage at the point of the variable&#39;s definition and deallocates it when the program exits the enclosing code block. Local variables have automatic duration by default.&lt;/p&gt;
&lt;p&gt;For a variable with static duration, its storage is allocated when the program starts and deallocated when the program ends. Only one instance of a variable with static duration exists. Global variables have static duration.&lt;/p&gt;
&lt;p&gt;You can also use the &lt;code&gt;static&lt;/code&gt; keyword to change the duration of a local variable to static duration. When you make a local variable static, the program only initializes the variable the first time it encounters the initialization. Subsequent calls to its enclosing function will reuse the already initialized instance.&lt;/p&gt;
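&lt;p&gt;As a small illustration of a static local variable, consider this sketch (&lt;code&gt;nextId&lt;/code&gt; is a made-up helper):&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;#include &amp;lt;cassert&gt;

int nextId() {
    static int counter = 0; // initialized only the first time execution reaches here
    return ++counter;       // the same instance persists across calls
}

int main() {
    assert(nextId() == 1);
    assert(nextId() == 2); // reuses the counter from the first call
    assert(nextId() == 3);
}
&lt;/code&gt;&lt;/pre&gt;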
&lt;p&gt;Variables can also have &lt;em&gt;dynamic duration&lt;/em&gt;, which means the program creates and destroys them on programmer request, as in dynamically allocated variables created with the &lt;code&gt;new&lt;/code&gt; keyword.&lt;/p&gt;
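&lt;p&gt;A minimal sketch of dynamic duration, contrasting a raw &lt;code&gt;new&lt;/code&gt;/&lt;code&gt;delete&lt;/code&gt; pair with a &lt;code&gt;std::unique_ptr&lt;/code&gt; that ties the deallocation to scope exit:&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;#include &amp;lt;cassert&gt;
#include &amp;lt;memory&gt;

int main() {
    int* raw = new int(42); // created on request...
    assert(*raw == 42);
    delete raw;             // ...and destroyed on request

    // std::unique_ptr still allocates dynamically, but the deallocation
    // happens automatically at scope exit, so it cannot be forgotten:
    auto managed = std::make_unique&amp;lt;int&gt;(7);
    assert(*managed == 7);
} // &#39;managed&#39; frees its int here
&lt;/code&gt;&lt;/pre&gt;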
&lt;h3 id=&quot;the-&#39;static&#39;-keyword&quot;&gt;The &#39;static&#39; keyword &lt;a class=&quot;direct-link&quot; href=&quot;#the-&#39;static&#39;-keyword&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;So far, I&#39;ve mentioned the &lt;code&gt;static&lt;/code&gt; keyword as both a way to denote internal linkage and static storage duration. This might be confusing, so note that the &lt;code&gt;static&lt;/code&gt; keyword in C++ can appear in three contexts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When used in declaring a &lt;em&gt;global variable&lt;/em&gt;, the &lt;code&gt;static&lt;/code&gt; keyword acts as a &lt;a href=&quot;https://en.cppreference.com/w/cpp/language/storage_duration&quot;&gt;storage class specifier&lt;/a&gt; and gives the variable internal linkage (global variables already have static duration). On a &lt;em&gt;local variable&lt;/em&gt;, it changes the variable&#39;s duration to static, as described above.&lt;/li&gt;
&lt;li&gt;When used with a &lt;em&gt;function&lt;/em&gt; that&#39;s not a member of a class, the &lt;code&gt;static&lt;/code&gt; keyword gives it internal linkage.&lt;/li&gt;
&lt;li&gt;For class member variables and functions, as in Java, making them &lt;code&gt;static&lt;/code&gt; means we can use them without creating an instance of the class.&lt;/li&gt;
&lt;/ul&gt;
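&lt;p&gt;A short sketch of the third context (&lt;code&gt;Counter&lt;/code&gt; is a made-up class):&lt;/p&gt;
&lt;pre class=&quot;language-cpp&quot;&gt;&lt;code class=&quot;language-cpp&quot;&gt;#include &amp;lt;cassert&gt;

struct Counter {
    static int count;                    // one instance shared by the whole class
    static void increment() { ++count; } // callable without creating a Counter
};

int Counter::count = 0; // a static data member is defined once, outside the class

int main() {
    Counter::increment();
    Counter::increment();
    assert(Counter::count == 2); // accessed through the class, not an instance
}
&lt;/code&gt;&lt;/pre&gt;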
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I went into this thinking C++ would be mostly like Java except with pointers and references, and I have only been proven wrong so far. Though they bear some similarities, especially in their syntax, I&#39;ve been surprised by how different they are, and I&#39;m looking forward to exploring that further in subsequent posts.&lt;/p&gt;
&lt;h3 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/variable-assignment-and-initialization/&quot;&gt;Variable assignment and initialization&lt;/a&gt; - Learn C++.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/cpp/cpp/initializers?view=msvc-160&quot;&gt;Initializers&lt;/a&gt; - Microsoft C++ Docs.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.cppreference.com/w/cpp/language/initialization&quot;&gt;Initialization&lt;/a&gt; - &lt;a href=&quot;http://cppreference.com/&quot;&gt;cppreference.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.learncpp.com/cpp-tutorial/introduction-to-the-compiler-linker-and-libraries/&quot;&gt;Introduction to the compiler, linker, and libraries&lt;/a&gt; - Learn C++.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/a/6264256/5430313&quot;&gt;How does the compilation/linking process work?&lt;/a&gt; - From Stack Overflow.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.goldsborough.me/c/c++/linker/2016/03/30/19-34-25-internal_and_external_linkage_in_c++/&quot;&gt;Internal and External Linkage in C++&lt;/a&gt; by Peter Goldsborough.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.cppreference.com/w/cpp/language/storage_duration&quot;&gt;Storage class specifiers&lt;/a&gt; - &lt;a href=&quot;http://cppreference.com/&quot;&gt;cppreference.com&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 20 -  Blockstack</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-20-blockstack/"/>
		<updated>2020-12-23T12:15:46-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-20-blockstack/</id>
		<content type="html">&lt;p&gt;The final post in this lecture series is about Blockstack. Blockstack is a network for building decentralized applications based on blockchain. I find the idea of decentralized applications appealing because of its promise to give users more ownership and control of their data.&lt;/p&gt;
&lt;p&gt;Blockstack is also interesting as it&#39;s a non-cryptocurrency use of blockchain, which I covered in the &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-19-bitcoin&quot;&gt;previous post&lt;/a&gt;. I&#39;ll start this post with an overview of how a decentralized application might work, before describing Blockstack&#39;s approach.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#decentralization&quot;&gt;Decentralization&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#a-decentralized-architecture&quot;&gt;A decentralized architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#a-decentralized-application&quot;&gt;A decentralized application&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#decentralization-can-be-painful&quot;&gt;Decentralization can be painful&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#blockstack&quot;&gt;Blockstack&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#naming&quot;&gt;Naming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-blockchain&quot;&gt;The Blockchain&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-peer-network&quot;&gt;The Peer Network&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#storage&quot;&gt;Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#putting-them-all-together&quot;&gt;Putting them all together&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;decentralization&quot;&gt;Decentralization &lt;a class=&quot;direct-link&quot; href=&quot;#decentralization&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Some of the most popular applications today, like Gmail, Facebook, and Twitter, are run by companies that own and manage their users&#39; data and expose an interface through which users access it. These apps are &lt;em&gt;centralized&lt;/em&gt; in that the companies that run them store and manage all the user data.&lt;/p&gt;
&lt;p&gt;While this model has been very successful for both the companies and users, it has come with its downsides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Companies can use their users&#39; data for &lt;a href=&quot;https://techcrunch.com/2019/01/29/facebook-project-atlas/&quot;&gt;nefarious&lt;/a&gt; &lt;a href=&quot;https://en.wikipedia.org/wiki/Facebook%E2%80%93Cambridge_Analytica_data_scandal&quot;&gt;reasons&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Employees of these companies can &lt;a href=&quot;https://www.theguardian.com/technology/2019/nov/06/twitter-spy-saudi-arabia-workers-charged&quot;&gt;snoop&lt;/a&gt; on private user data.&lt;/li&gt;
&lt;li&gt;Most users have to go through the application&#39;s UI to access their data, and they can only do what the UI supports with their data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For these reasons and more, there has been a trend towards building decentralized applications which move ownership and agency of data back into users&#39; hands.&lt;/p&gt;
&lt;h3 id=&quot;a-decentralized-architecture&quot;&gt;A decentralized architecture &lt;a class=&quot;direct-link&quot; href=&quot;#a-decentralized-architecture&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In centralized applications, there is a tight coupling between the application code and the way they store data. For example, the Twitter app knows how to interact with Twitter&#39;s databases. But in a decentralized app, we can separate the app code from user data.&lt;/p&gt;
&lt;p&gt;In this architecture, we can have a storage service that is independent of the applications that interact with it. This service will store data on a per-user basis instead of a per-app basis, and each user on the network can own and control their data.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/decentralized-arch.png&quot; alt=&quot;Figure 1: A decentralized architecture.&quot;&gt;&lt;/p&gt;&lt;p align=&quot;center&quot;&gt; Figure 1: A decentralized architecture. &lt;/p&gt;
&lt;p&gt;In designing this architecture, the storage service must meet the following requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;General-purpose&lt;/strong&gt;. Similar to the file system on your computer, it must have an API that allows multiple applications to interact with it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud-based&lt;/strong&gt;, so users can access their data from anywhere.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fully controllable by a user&lt;/strong&gt;. It must support mechanisms for securing the data and controlling who can access it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supports sharing between users&lt;/strong&gt; for apps where one user might read another user&#39;s information.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This architecture will provide the following benefits to users:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Better ownership and control over their data and how it&#39;s used.&lt;/li&gt;
&lt;li&gt;Better security and data privacy, assuming app owners implement end-to-end encryption.&lt;/li&gt;
&lt;li&gt;Improved ability for users to switch between similar applications, since data storage is independent of the applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;a-decentralized-application&quot;&gt;A decentralized application &lt;a class=&quot;direct-link&quot; href=&quot;#a-decentralized-application&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With this architecture, using an application like Facebook will involve:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Running the Facebook app, which will have no associated servers, on your computer.&lt;/li&gt;
&lt;li&gt;The Facebook app reading from and writing to your data store.&lt;/li&gt;
&lt;li&gt;The Facebook application reading from your friends&#39; data stores to display their information on your feed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key thing here is that the application running on your computer contains all the logic it needs to interact directly with a general-purpose storage service, with no servers involved.&lt;/p&gt;
&lt;p&gt;This is similar to how many installed desktop applications work: they interact with your local file system without needing to talk to a server.&lt;/p&gt;
&lt;h3 id=&quot;decentralization-can-be-painful&quot;&gt;Decentralization can be painful &lt;a class=&quot;direct-link&quot; href=&quot;#decentralization-can-be-painful&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This decentralized architecture comes with its limitations for both users and developers. It can be significantly more challenging for developers to build decentralized apps, especially since a per-user general-purpose storage service is less flexible than a dedicated database.&lt;/p&gt;
&lt;p&gt;Users may not want to manage the security of their data, especially when it involves more complex security mechanisms. There&#39;s also the social challenge of convincing users to even consider using decentralized apps, especially since the current centralized architecture works so well.&lt;/p&gt;
&lt;h2 id=&quot;blockstack&quot;&gt;Blockstack &lt;a class=&quot;direct-link&quot; href=&quot;#blockstack&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Blockstack is an open-source approach to building a decentralized internet, using blockchain as the underlying infrastructure.  Similar to the architecture discussed so far, each user on the Blockstack network has their private data store, and applications running on the network interact with these data stores.&lt;/p&gt;
&lt;p&gt;According to the &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/blockstack-2017.pdf&quot;&gt;2017 paper&lt;/a&gt; which this lecture is based on, the authors built Blockstack with three design goals:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Decentralized Naming &amp;amp; Discovery&lt;/strong&gt;: End-users should be able to (a) register and use human-readable names and (b) discover network resources mapped to human-readable names without trusting any remote parties.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decentralized Storage&lt;/strong&gt;: End-users should be able to use decentralized storage systems where they can store their data without revealing it to any remote parties.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Comparable Performance&lt;/strong&gt;: The end-to-end performance of the new architecture (including name/resource lookups, storage access, etc.) should be comparable to the traditional internet with centralized services.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To achieve these goals, Blockstack&#39;s architecture comprises the different components shown below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/blockstack-architecture.png&quot; alt=&quot;Figure 2: Overview of the Blockstack architecture.&quot;&gt;&lt;/p&gt;&lt;p align=&quot;center&quot;&gt; Figure 2: Overview of the Blockstack architecture.&lt;/p&gt;
&lt;p&gt;The paper&#39;s authors further divide these components into layers. I&#39;ll describe these layers and how they all fit together soon.&lt;/p&gt;
&lt;h3 id=&quot;naming&quot;&gt;Naming &lt;a class=&quot;direct-link&quot; href=&quot;#naming&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Names are important in Blockstack for identifying users, applications, and domains.  A Blockstack name can map to a user&#39;s public key, the location of their data store, an IP address, etc.&lt;/p&gt;
&lt;p&gt;When designing Blockstack, there were three properties that the authors desired for a name:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Human-readable&lt;/strong&gt;: Instead of using a hash to identify users, Blockstack uses human-readable names to provide a good user experience.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Globally unique&lt;/strong&gt;: There should be only one owner of the name in the network.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decentralized allocation&lt;/strong&gt;: There should be no central service in charge of allocating names.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;According to the paper, before the invention of blockchains, it was only possible to get two of these three properties at a time. This is a computer science limitation called &lt;a href=&quot;https://en.wikipedia.org/wiki/Zooko%27s_triangle&quot;&gt;Zooko&#39;s triangle&lt;/a&gt;. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Email addresses&lt;/em&gt; are unique and human-readable but not decentralized, as the company you register with controls the namespace.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Public keys&lt;/em&gt; are unique and decentralized as users can generate them without a central service, but they are not human-readable.&lt;/li&gt;
&lt;li&gt;The names on your contact list are decentralized and human-readable, but not unique.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The next section will cover how a blockchain helps achieve these properties.&lt;/p&gt;
&lt;h3 id=&quot;the-blockchain&quot;&gt;The Blockchain &lt;a class=&quot;direct-link&quot; href=&quot;#the-blockchain&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Blockstack uses a blockchain layer to get all the three desired properties of names. This layer comprises two components: the &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-19-bitcoin/#the-bitcoin-blockchain&quot;&gt;Bitcoin blockchain&lt;/a&gt; and a &lt;em&gt;virtualchain&lt;/em&gt;, which work together to form the Blockchain Naming System (BNS). The BNS replaces &lt;a href=&quot;https://www.cloudflare.com/learning/dns/what-is-dns/#&quot;&gt;DNS&lt;/a&gt; in the network, except that there are no central root servers involved.&lt;/p&gt;
&lt;p&gt;Claiming a name in BNS requires Bitcoin transactions, which Blockstack embeds with information about the name. Since the Bitcoin blockchain produces an ordered chain of blocks, we can determine who claimed a name first and ensure that names are unique. Using the blockchain to claim names also means that allocation is decentralized by design.&lt;/p&gt;
&lt;p&gt;For example, if you&#39;re claiming a new Blockstack name, the associated Bitcoin transaction will contain the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your desired name.&lt;/li&gt;
&lt;li&gt;Your public key.&lt;/li&gt;
&lt;li&gt;The hash of a BNS &lt;em&gt;zone file&lt;/em&gt;. Blockstack creates a zone file for each name in the network and this file contains the routing information for that name, i.e., what resource the name points to.&lt;/li&gt;
&lt;/ul&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/blockstack-blockchain.png&quot; alt=&quot;Figure 3: The blockchain layer.&quot;&gt;&lt;/p&gt;&lt;p align=&quot;center&quot;&gt; Figure 3: The Blockchain layer.&lt;/p&gt;
&lt;p&gt;The virtualchain component sits on top of the Bitcoin blockchain and parses the transaction records to create &lt;em&gt;name records&lt;/em&gt;. It then stores these name records in a name database, of which each peer on the network has a copy. With this, users can look up the information for a name without having to search the underlying Bitcoin blockchain.&lt;/p&gt;
&lt;h3 id=&quot;the-peer-network&quot;&gt;The Peer Network &lt;a class=&quot;direct-link&quot; href=&quot;#the-peer-network&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Blockstack uses a peer network called &lt;em&gt;Atlas&lt;/em&gt; for users to discover routing information on the network. Atlas enables Blockstack to separate the task of discovering where data is stored from the actual storage of data, allowing for multiple storage providers to coexist.&lt;/p&gt;
&lt;p&gt;Each peer stores a table of &lt;em&gt;zone records&lt;/em&gt;. A zone record contains a zone file and its hash. For each name record on the virtualchain, there is a corresponding zone record in the peer&#39;s local database.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/blockstack-atlas.png&quot; alt=&quot;Figure 4: Peer Network.&quot;&gt;&lt;/p&gt;&lt;p align=&quot;center&quot;&gt; Figure 4: Peer Network.&lt;/p&gt;
&lt;p&gt;Any new peers joining the network will communicate with existing ones to get up-to-date information about the zone records.&lt;/p&gt;
&lt;h3 id=&quot;storage&quot;&gt;Storage &lt;a class=&quot;direct-link&quot; href=&quot;#storage&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The final layer is the storage layer called Gaia, which enables users to interact with existing cloud storage providers like Dropbox, Google Drive, and Amazon S3. When a user creates an identity on the blockchain, Blockstack associates that identity with a corresponding data store in Gaia.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/blockstack-storage.png&quot; alt=&quot;Figure 5: Storage.&quot;&gt;&lt;/p&gt;&lt;p align=&quot;center&quot;&gt; Figure 5: Storage&lt;/p&gt;
&lt;p&gt;Gaia exposes data as a key-value store, and users can choose which storage providers they want to use. It provides a uniform API for applications to access user data regardless of the underlying storage provider.&lt;/p&gt;
&lt;p&gt;Before a user writes to their store in Gaia, they encrypt and sign the data with their cryptographic keys. Thus, even though data is stored with existing cloud storage providers, they have no visibility into the data.&lt;/p&gt;
&lt;h3 id=&quot;putting-them-all-together&quot;&gt;Putting them all together &lt;a class=&quot;direct-link&quot; href=&quot;#putting-them-all-together&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Quoting the paper, if a Blockstack application wants to look up data for a name, it works as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Lookup the &lt;em&gt;name&lt;/em&gt; in the virtualchain to get the (&lt;em&gt;name, hash&lt;/em&gt;) pair.&lt;/li&gt;
&lt;li&gt;Lookup the &lt;em&gt;hash(name)&lt;/em&gt; in the Atlas network to get the respective zone file (all peers in the Atlas network have the full replica of all zone files).&lt;/li&gt;
&lt;li&gt;Get the storage backend URI from the zone file and lookup the URI to connect to the storage backend.&lt;/li&gt;
&lt;li&gt;Read the data (decrypt it if needed and if you have the access rights) and verify the respective signature or hash.&lt;/li&gt;
&lt;/ol&gt;
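&lt;p&gt;The four-step lookup can be sketched with in-memory stand-ins for each layer. Everything here (the &lt;code&gt;alice.id&lt;/code&gt; name, the Gaia URI, and the dictionaries standing in for the virtualchain, Atlas, and the storage backend) is hypothetical and only illustrates the shape of the flow:&lt;/p&gt;

```python
import hashlib

def sha256(data):
    return hashlib.sha256(data.encode()).hexdigest()

# Hypothetical in-memory stand-ins for the three layers.
profile = '{"name": "alice"}'
storage_backend = {"https://gaia.example/alice": profile}   # Gaia
zone_file = "URI: https://gaia.example/alice"
atlas = {sha256(zone_file): zone_file}                      # Atlas peers
virtualchain = {"alice.id": sha256(zone_file)}              # (name, hash) pairs

def lookup(name):
    zone_hash = virtualchain[name]      # 1. name -> hash of the zone file
    zf = atlas[zone_hash]               # 2. hash -> zone file from Atlas
    assert sha256(zf) == zone_hash      #    integrity check against the chain
    uri = zf.split("URI: ")[1]          # 3. extract the storage backend URI
    return storage_backend[uri]         # 4. read the data from the backend

print(lookup("alice.id"))
```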
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Blockstack is a system in use today, and you can learn about some decentralized apps that have been built &lt;a href=&quot;https://www.app.co/&quot;&gt;here&lt;/a&gt;. I&#39;ve also left out some details about Blockstack in this post, and I recommend reading the sites linked in the next section to learn more about its implementation. This post is mainly an exploration of how the internet could be different and perhaps better, using Blockstack as an example.&lt;/p&gt;
&lt;p&gt;But blockchain-based apps are not the only approach to building decentralized applications. &lt;a href=&quot;https://crdt.tech/&quot;&gt;CRDTs&lt;/a&gt; are another approach being actively explored today, and you can find a great overview &lt;a href=&quot;https://www.inkandswitch.com/local-first.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Overall, I&#39;m intrigued by the decentralization vision for building applications and while it&#39;s still some way off being fully realized, the ongoing research is promising.&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-blockstack.txt&quot;&gt;Lecture 20: Blockstack&lt;/a&gt; - MIT 6.824 Lecture Notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/blockstack-faq.txt&quot;&gt;Blockstack FAQ&lt;/a&gt; - Additional material from 6.824.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/blockstack-2017.pdf&quot;&gt;Blockstack: A New Internet for Decentralized Applications&lt;/a&gt; - Blockstack Technical Whitepaper.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.blockstack.org/how-blockstack-works&quot;&gt;How Blockstack works&lt;/a&gt; - Official Blockstack documentation.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.wired.com/story/the-decentralized-internet-is-here-with-some-glitches/&quot;&gt;The Decentralized Internet Is Here, With Some Glitches&lt;/a&gt; - Wired article on Blockstack by Tom Simonite.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://tokeneconomy.co/breaking-down-blockstack-whitepaper-review-3c828788f3e9&quot;&gt;Breaking Down Blockstack  - Whitepaper Review&lt;/a&gt; by Nick Neuman.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.inkandswitch.com/local-first.html&quot;&gt;Local-first software&lt;/a&gt; by Ink &amp;amp; Switch.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://withblue.ink/2020/11/12/maybe-we-shouldnt-want-a-fully-decentralized-web.html&quot;&gt;Maybe we shouldn&#39;t want a fully decentralized web&lt;/a&gt; by Alessandro Segala.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 19 -  Bitcoin</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-19-bitcoin/"/>
		<updated>2020-12-11T20:55:59-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-19-bitcoin/</id>
		<content type="html">&lt;p&gt;Following the lecture on &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-18-certificate-transparency/&quot;&gt;Certificate Transparency&lt;/a&gt;, we are exploring Bitcoin, another open system comprising mutually untrustworthy components.&lt;/p&gt;
&lt;p&gt;Bitcoin is a digital currency for making online payments. I&#39;ll start this post by making a case for digital currencies, before describing Bitcoin and how it solves the double-spending problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#digital-currencies&quot;&gt;Digital Currencies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#bitcoin&quot;&gt;Bitcoin&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#limitations-of-the-model-so-far&quot;&gt;Limitations of the model so far&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#an-attacker-can-steal-a-private-key&quot;&gt;An attacker can steal a private key&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-current-owner-can-double-spend&quot;&gt;The current owner can double-spend&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#addressing-double-spend&quot;&gt;Addressing double-spend&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#publishing-to-a-log&quot;&gt;Publishing to a log&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-bitcoin-blockchain&quot;&gt;The Bitcoin blockchain&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#adding-a-new-block&quot;&gt;Adding a new block&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#validation-checks&quot;&gt;Validation Checks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#temporary-double-spending-is-possible&quot;&gt;Temporary double-spending is possible&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#scenario-%231&quot;&gt;Scenario #1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#scenario-%232&quot;&gt;Scenario #2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#peers-will-abandon-the-shorter-branch&quot;&gt;Peers will abandon the shorter branch&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#faq&quot;&gt;FAQ&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#where-are-new-coins-from%3F&quot;&gt;Where are new coins from?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#can-an-attacker-change-an-existing-block-in-the-middle-of-the-blockchain%3F&quot;&gt;Can an attacker change an existing block in the middle of the blockchain?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#can-an-attacker-start-a-fork-from-an-old-block%3F&quot;&gt;Can an attacker start a fork from an old block?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#resources&quot;&gt;Resources&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;digital-currencies&quot;&gt;Digital Currencies &lt;a class=&quot;direct-link&quot; href=&quot;#digital-currencies&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The majority of e-commerce activity today relies on trusted third parties like banks and other financial institutions to process payments. These third parties offer some protection from fraud at the cost of increased transaction fees.&lt;/p&gt;
&lt;p&gt;But with digital currencies like Bitcoin, two people can make payments directly without a trusted third party involved. These payments rely on other methods to prevent fraud, such as cryptographic proofs. Digital currencies offer several advantages, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Lower transaction fees&lt;/strong&gt; from not needing a third party involved.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fewer risks for merchants&lt;/strong&gt;: Transactions are irreversible, which protects merchants from losses caused by fraud or fraudulent chargebacks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Payment freedom&lt;/strong&gt;: You can make and receive payments from anywhere in the world.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Digital currencies, however, come with their own technical and social challenges. One major technical challenge is solving the double-spending problem, i.e., &lt;em&gt;how does one prevent a digital coin from being spent more than once&lt;/em&gt;?&lt;/p&gt;
&lt;p&gt;With traditional currencies, we can rely on banks to prevent double-spending, but different digital currencies have their unique ways of tackling this problem.&lt;/p&gt;
&lt;p&gt;This post will largely focus on Bitcoin&#39;s approach.&lt;/p&gt;
&lt;h2 id=&quot;bitcoin&quot;&gt;Bitcoin &lt;a class=&quot;direct-link&quot; href=&quot;#bitcoin&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Bitcoin is a decentralized digital currency for making payments. &lt;em&gt;Decentralized&lt;/em&gt; here means there is no single authority or entity like a central bank on the Bitcoin network. The network is run by many &lt;em&gt;peers&lt;/em&gt;, which are computers that collaborate to make any necessary decisions. Anyone can add a peer to the network.&lt;/p&gt;
&lt;p&gt;The Bitcoin network comprises coins, each owned by someone. A coin is a chain of &lt;em&gt;transaction records&lt;/em&gt;, where each record represents a transfer of the coin from one owner to the next as payment. The latest transaction record in the chain shows the coin&#39;s current owner.&lt;/p&gt;
&lt;p&gt;Each coin owner has a public/private key pair which the network uses to verify the integrity of transactions. I&#39;ll go over how that works soon, but you can read the &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-18-certificate-transparency/&quot;&gt;previous post on Certificate Transparency&lt;/a&gt; or &lt;a href=&quot;https://www.cloudflare.com/learning/ssl/how-does-public-key-encryption-work/&quot;&gt;this article on public-key cryptography&lt;/a&gt; for more detail.&lt;/p&gt;
&lt;p&gt;When the current owner of a coin wants to transfer the coin to a new owner, they create a transaction record which contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The public key of the coin&#39;s new owner.&lt;/li&gt;
&lt;li&gt;A hash of the previous transaction record in the chain.&lt;/li&gt;
&lt;li&gt;A signature of the above hash signed with the current owner&#39;s private key.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This information allows the new owner to verify that it received the coin from the right owner. To illustrate this, if a user Y owns a coin that they received from user X, the latest transaction record in the coin will look like:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/transaction-record-small.png&quot; alt=&quot;Figure 1: A sample transaction record.&quot;&gt;&lt;/p&gt;&lt;p align=&quot;center&quot;&gt; Figure 1: Let the latest transaction be T&lt;sub&gt;1&lt;/sub&gt;, the record will then contain the hash for T&lt;sub&gt;0&lt;/sub&gt;. &lt;/p&gt;
&lt;p&gt;If user Y then transfers the same coin to another user Z in a new transaction, T&lt;sub&gt;2&lt;/sub&gt;, the coin will have a chain which now includes a new transaction record:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/transaction-small.png&quot; alt=&quot;Figure 2: User Y transferring a coin to user Z&quot;&gt;&lt;/p&gt;&lt;p align=&quot;center&quot;&gt; Figure 2: User Y transferring a coin to user Z &lt;/p&gt;
&lt;p&gt;User Z will verify that the coin actually belongs to user Y by checking that the signature in T&lt;sub&gt;2&lt;/sub&gt; was produced by the private key matching the public key recorded in T&lt;sub&gt;1&lt;/sub&gt;.&lt;/p&gt;
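&lt;p&gt;Here is a minimal sketch of this chain-of-records structure. Python&#39;s standard library has no public-key crypto, so the &lt;code&gt;sign&lt;/code&gt; callbacks below are labelled stand-ins for real signatures; the hash linking between records, however, works as described:&lt;/p&gt;

```python
import hashlib
import json

def sha256(obj):
    # Hash a record by hashing its canonical JSON form.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

def make_record(prev_record, new_owner_pubkey, sign):
    prev_hash = sha256(prev_record) if prev_record else None
    record = {"new_owner": new_owner_pubkey, "prev_hash": prev_hash}
    record["signature"] = sign(prev_hash)   # current owner signs the hash
    return record

# Hypothetical users; the lambdas stand in for signing with a private key.
t0 = make_record(None, "pubkey_Y", lambda h: f"signed_by_X({h})")
t1 = make_record(t0, "pubkey_Z", lambda h: f"signed_by_Y({h})")

# User Z checks the chain: T1 must reference T0's hash, and the signature
# must come from T0's recorded owner (a string check in this toy version).
assert t1["prev_hash"] == sha256(t0)
assert t1["signature"].startswith("signed_by_Y")
```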
&lt;h3 id=&quot;limitations-of-the-model-so-far&quot;&gt;Limitations of the model so far &lt;a class=&quot;direct-link&quot; href=&quot;#limitations-of-the-model-so-far&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;h4 id=&quot;an-attacker-can-steal-a-private-key&quot;&gt;An attacker can steal a private key &lt;a class=&quot;direct-link&quot; href=&quot;#an-attacker-can-steal-a-private-key&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;As shown in Figure 2, the current owner of a coin uses their private key to sign the next transaction. If an attacker steals the private key from the owner&#39;s computer, they can spend the coin. This is a &lt;a href=&quot;https://www.qredo.com/blog/proofofkeys-7-ways-private-keys-have-been-compromised-and-how-you-can-protect-yourself&quot;&gt;real possibility&lt;/a&gt; and is a hard problem to solve.&lt;/p&gt;
&lt;h4 id=&quot;the-current-owner-can-double-spend&quot;&gt;The current owner can double-spend &lt;a class=&quot;direct-link&quot; href=&quot;#the-current-owner-can-double-spend&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In the previous example, there&#39;s nothing stopping user Y from spending the same coin for both user Z and another user Q. User Y can create two transactions with the same coin using the hash of the previous record.&lt;/p&gt;
&lt;p&gt;Trusted third parties like banks shine here since they can protect a payee from a double-spending payer. But if Bitcoin is to operate without a third party, a payee needs to know that the previous owner of a coin did not sign any earlier transactions.&lt;/p&gt;
&lt;h3 id=&quot;addressing-double-spend&quot;&gt;Addressing double-spend &lt;a class=&quot;direct-link&quot; href=&quot;#addressing-double-spend&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Similar to Certificate Transparency, we could introduce a log which all peers must publish transactions to, ensuring that all the transactions in the network are visible to all peers. This log must have the same &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-18-certificate-transparency/#everyone-must-see-the-same-logs&quot;&gt;requirements&lt;/a&gt; as a certificate log:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Everyone must see the same log in the same order.&lt;/li&gt;
&lt;li&gt;No one should be able to un-publish a transaction.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If all the transactions are visible to all peers and no peer can delete a transaction, a peer will detect when a coin has been previously spent on an earlier transaction.&lt;/p&gt;
&lt;p&gt;Using the above double-spending example where a user Y attempts to spend the same coin for users Z and Q, these requirements will ensure that if Y-&amp;gt;Z happened before Y-&amp;gt;Q, then:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;User Z will see Y-&amp;gt;Z came before Y-&amp;gt;Q and will accept Y-&amp;gt;Z.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;User Q will see Y-&amp;gt;Z came before Y-&amp;gt;Q and will reject Y-&amp;gt;Q.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
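&lt;p&gt;A sketch of how an ordered, append-only log makes this detection mechanical. The transaction fields are hypothetical; the key idea is that the second spend of the same previous transaction is rejected by everyone:&lt;/p&gt;

```python
# A shared, append-only log that every peer sees in the same order.
log = []
seen_prev_hashes = set()

def try_publish(tx):
    if tx["prev_hash"] in seen_prev_hashes:
        return False                      # coin already spent: reject
    seen_prev_hashes.add(tx["prev_hash"])
    log.append(tx)
    return True

# Y tries to spend the same coin (same previous transaction) twice.
y_to_z = {"from": "Y", "to": "Z", "prev_hash": "hash_of_T1"}
y_to_q = {"from": "Y", "to": "Q", "prev_hash": "hash_of_T1"}

assert try_publish(y_to_z) is True    # everyone sees Y->Z first and accepts it
assert try_publish(y_to_q) is False   # everyone sees the earlier spend and rejects Y->Q
```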
&lt;h4 id=&quot;publishing-to-a-log&quot;&gt;Publishing to a log &lt;a class=&quot;direct-link&quot; href=&quot;#publishing-to-a-log&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;One challenge with this log approach is in determining what gets published to the log and in what order. For example, we could have a central log server or a leader that decides the order of transactions like &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-6-7-fault-tolerance-raft&quot;&gt;Raft&lt;/a&gt;, but this goes against Bitcoin&#39;s decentralization goal.&lt;/p&gt;
&lt;p&gt;Another way to manage the log is to send new transactions to all peers and have them vote on which transaction to append to the log, with the majority winning the vote. To determine the majority, we could count one vote per IP address, but an attacker can forge IP addresses to claim a majority and vote multiple times. The impact of this will be that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When user Z asks, the attacker&#39;s majority says, &amp;quot;Y-&amp;gt;Z is in the log before Y-&amp;gt;Q&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When user Q asks, the attacker&#39;s majority says, &amp;quot;Y-&amp;gt;Q is in the log before Y-&amp;gt;Z&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Bitcoin addresses the double-spending problem by using a &lt;em&gt;blockchain,&lt;/em&gt; and I&#39;ll describe that next.&lt;/p&gt;
&lt;h2 id=&quot;the-bitcoin-blockchain&quot;&gt;The Bitcoin blockchain &lt;a class=&quot;direct-link&quot; href=&quot;#the-bitcoin-blockchain&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The Bitcoin blockchain is a sequence of blocks that acts as a public ledger containing all the transactions on every coin in the network. Each block is identified with a hash and contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The hash of the previous block in the chain.&lt;/li&gt;
&lt;li&gt;A set of transactions.&lt;/li&gt;
&lt;li&gt;A &lt;em&gt;nonce&lt;/em&gt;, which I&#39;ll explain soon.&lt;/li&gt;
&lt;/ul&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/blockchain.png&quot; alt=&quot;Figure 3: A blockchain is a sequence of blocks.&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 3: Blockchain representation. &lt;/p&gt;
&lt;p&gt;Each peer in the network has a complete copy of the chain. When a peer wants to add a new block to the chain, it broadcasts the block to all the peers. Any new transactions also get flooded to all the peers. All the blocks and transactions on the Bitcoin network are publicly visible &lt;a href=&quot;https://www.blockchain.com/explorer&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When you make a payment, the payee won&#39;t accept it until the transaction is in the blockchain. And since all transactions are in the blockchain, the payee will find out if the coin has been spent before.&lt;/p&gt;
&lt;p&gt;The challenge posed in the previous section was on determining what gets added to the log. The next section will cover Bitcoin&#39;s approach.&lt;/p&gt;
&lt;h3 id=&quot;adding-a-new-block&quot;&gt;Adding a new block &lt;a class=&quot;direct-link&quot; href=&quot;#adding-a-new-block&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When a peer receives new transactions, it collects them into a block. Before it can add the block to the blockchain, it needs to do actual CPU work. This is called a &lt;em&gt;proof-of-work&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;To explain this, assume we have a peer S with an unpublished block containing a list of transactions and the previous block&#39;s hash. S needs to create a hash to identify the unpublished block using its contents, and that involves solving a hard computational puzzle. The puzzle is this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Given that peer S has a list of transactions (let&#39;s call this &lt;em&gt;l&lt;/em&gt;) and the previous block&#39;s hash (&lt;em&gt;hp&lt;/em&gt;), S must find a value &lt;em&gt;x&lt;/em&gt; such that when it applies a hash function to the combination of &lt;em&gt;l&lt;/em&gt;, &lt;em&gt;hp&lt;/em&gt;, and &lt;em&gt;x&lt;/em&gt;, it gets an output that begins with a long run of zeros.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Stated another way, S must find a value &lt;em&gt;x&lt;/em&gt; such that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; hash(l + hp + x) = 000000000000000...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This value of &lt;em&gt;x&lt;/em&gt; is known as the &lt;em&gt;nonce&lt;/em&gt; and the difficult process of finding this nonce is called &lt;em&gt;mining&lt;/em&gt;.&lt;/p&gt;
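&lt;p&gt;A toy version of this mining loop, using SHA-256 and a small number of required leading (hex) zeros so that it finishes quickly; real Bitcoin requires far more work per block:&lt;/p&gt;

```python
import hashlib
import itertools

def mine(transactions, prev_hash, zeros=4):
    """Search for a nonce x so that hash(l + hp + x) starts with `zeros` zeros."""
    prefix = "0" * zeros
    for x in itertools.count():
        digest = hashlib.sha256(f"{transactions}{prev_hash}{x}".encode()).hexdigest()
        if digest.startswith(prefix):
            return x, digest   # x is the nonce; digest identifies the block

nonce, block_hash = mine("Y pays Z 1 coin", "000abc...", zeros=4)
print(nonce, block_hash)
```

&lt;p&gt;The only way to find the nonce is brute force: each extra required zero multiplies the expected number of hash attempts by 16, which is what makes mining expensive.&lt;/p&gt;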
&lt;p&gt;The exact number of leading zeros required in the output varies with the speed at which peers generate new blocks. If peers are generating new blocks too quickly, the network makes the proof-of-work more difficult by increasing the number of leading zeros required.&lt;/p&gt;
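&lt;p&gt;Real Bitcoin retargets a 256-bit numeric difficulty target every 2016 blocks rather than counting zeros, but a simplified adjustment rule in the spirit of the description might look like this (the thresholds here are hypothetical):&lt;/p&gt;

```python
def adjust(zeros, actual_minutes, expected_minutes=2016 * 10):
    # Hypothetical rule: if the last 2016 blocks arrived in half the expected
    # time or less, the puzzle is too easy, so require one more leading zero.
    if expected_minutes >= actual_minutes * 2:
        return zeros + 1
    # If they took at least twice as long, make the puzzle easier.
    if actual_minutes >= expected_minutes * 2:
        return max(1, zeros - 1)
    return zeros

print(adjust(4, actual_minutes=5000))    # blocks came fast: difficulty rises
print(adjust(4, actual_minutes=50000))   # blocks came slowly: difficulty drops
```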
&lt;p&gt;Finding this proof-of-work for a block is a costly operation as it requires major CPU power, and so this system essentially limits peers to one vote per CPU.&lt;/p&gt;
&lt;p&gt;It takes 10 minutes on average for a peer to mine a new block, which means that the parties involved in a transaction have to wait for about 10 minutes before it appears on the blockchain.&lt;/p&gt;
&lt;p&gt;A peer broadcasts a block to all peers after finding its proof of work.&lt;/p&gt;
&lt;h4 id=&quot;validation-checks&quot;&gt;Validation Checks &lt;a class=&quot;direct-link&quot; href=&quot;#validation-checks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When a peer receives a new block, it validates the block by checking that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The block&#39;s hash has the required number of leading zeros.&lt;/li&gt;
&lt;li&gt;The previous block&#39;s hash exists in the chain. If the previous block doesn&#39;t exist, the peer will request it from the network.&lt;/li&gt;
&lt;li&gt;All the transactions in the block are valid.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The peer validates each transaction by checking that:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;No other transaction has spent the same previous transaction (Recall that a transaction record contains a hash of the previous transaction).&lt;/li&gt;
&lt;li&gt;The transaction is signed with the private key corresponding to the public key in the previous transaction, as described earlier.&lt;/li&gt;
&lt;/ol&gt;
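&lt;p&gt;These checks can be sketched as one validation function over a toy block format. The field names are hypothetical, and &lt;code&gt;zeros=0&lt;/code&gt; is used in the demo so the toy block needs no mining; transaction validation is reduced to a double-spend set check:&lt;/p&gt;

```python
import hashlib

def block_hash(block):
    content = f"{block['prev_hash']}{block['transactions']}{block['nonce']}"
    return hashlib.sha256(content.encode()).hexdigest()

def validate(block, chain, spent, zeros=2):
    # 1. The block's hash must carry the required number of leading zeros.
    if not block_hash(block).startswith("0" * zeros):
        return False
    # 2. The previous block must exist in our copy of the chain.
    if block["prev_hash"] not in chain:
        return False
    # 3. No transaction may reuse an already-spent previous transaction.
    return all(tx not in spent for tx in block["transactions"])

chain = {"hash_of_B6"}   # hashes of blocks this peer already holds
block = {"prev_hash": "hash_of_B6", "transactions": ["tx_Y_to_Z"], "nonce": 42}
print(validate(block, chain, spent=set(), zeros=0))
```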
&lt;p&gt;If the block is valid, the peer shows its acceptance by working on creating the next block in the chain—using the accepted block&#39;s hash as the previous hash.&lt;/p&gt;
&lt;h3 id=&quot;temporary-double-spending-is-possible&quot;&gt;Temporary double-spending is possible &lt;a class=&quot;direct-link&quot; href=&quot;#temporary-double-spending-is-possible&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;h4 id=&quot;scenario-%231&quot;&gt;Scenario #1 &lt;a class=&quot;direct-link&quot; href=&quot;#scenario-%231&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;One possible scenario in the Bitcoin network is that two peers A and B, both mining the next block in the chain, find a valid nonce at the same time and broadcast their blocks to all peers. Because of network delays, some peers accept the block from peer A before B&#39;s block, while others accept B&#39;s block first.&lt;/p&gt;
&lt;p&gt;This causes a &lt;em&gt;fork&lt;/em&gt; on the blockchain where it now has two branches.&lt;/p&gt;
&lt;h4 id=&quot;scenario-%232&quot;&gt;Scenario #2 &lt;a class=&quot;direct-link&quot; href=&quot;#scenario-%232&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Another possible scenario is that a peer sends one transaction to a subset of peers and another one to a different subset using the same coin.&lt;/p&gt;
&lt;p&gt;For example, a peer Y could tell some peers about a transaction Y-&amp;gt;Z and others about Y-&amp;gt;Q, which both use the same coin. This will create a fork as illustrated below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/fork-small.png&quot; alt=&quot;Figure 4: There is a fork caused by B6 having two successors. &quot;&gt;&lt;/p&gt;&lt;p align=&quot;center&quot;&gt; Figure 4: There is a fork caused by B6 having two successors. &lt;/p&gt;
&lt;h4 id=&quot;peers-will-abandon-the-shorter-branch&quot;&gt;Peers will abandon the shorter branch &lt;a class=&quot;direct-link&quot; href=&quot;#peers-will-abandon-the-shorter-branch&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When a peer receives two different successors to the same block, it will start working on the first one it received but save the other branch in case it gets longer. A branch will get longer when a peer finds the next proof-of-work.&lt;/p&gt;
&lt;p&gt;If a peer is working on a branch and sees that another branch has gotten longer, it will abandon its current branch and switch to the longer one.  This will also cause any transactions on the shorter branch to get abandoned.&lt;/p&gt;
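&lt;p&gt;The switching rule itself is tiny: a peer simply prefers whichever branch is longer, keeping its current branch on a tie. A sketch with hypothetical block names:&lt;/p&gt;

```python
def preferred_branch(current, other):
    # Keep working on the current branch unless the other has grown longer.
    return other if len(other) > len(current) else current

branch_a = ["B6", "B7a"]            # the branch this peer saw first
branch_b = ["B6", "B7b", "B8b"]     # another peer extended this branch first

print(preferred_branch(branch_a, branch_b))   # the peer switches to branch_b
```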
&lt;p&gt;During this period where different peers are seeing different branches of the chain, an attacker will be able to double-spend a coin. But what makes Bitcoin work is that the shorter branch will eventually get abandoned and only one transaction will remain on the blockchain.&lt;/p&gt;
&lt;p&gt;This possibility of a fork is why careful Bitcoin clients wait until there are a few successor blocks (typically six) to the one that contains their transaction before believing it was successful. If a block has many successor blocks, it is unlikely that a dubious fork will overtake it.&lt;/p&gt;
&lt;h3 id=&quot;faq&quot;&gt;FAQ &lt;a class=&quot;direct-link&quot; href=&quot;#faq&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;h4 id=&quot;where-are-new-coins-from%3F&quot;&gt;Where are new coins from? &lt;a class=&quot;direct-link&quot; href=&quot;#where-are-new-coins-from%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When a peer mines a block and the other peers accept it, the peer receives a reward of newly created bitcoins (6.25 bitcoins at the time of writing). This reward is both how new coins enter circulation and an incentive for people to operate Bitcoin peers.&lt;/p&gt;
&lt;h4 id=&quot;can-an-attacker-change-an-existing-block-in-the-middle-of-the-blockchain%3F&quot;&gt;Can an attacker change an existing block in the middle of the blockchain? &lt;a class=&quot;direct-link&quot; href=&quot;#can-an-attacker-change-an-existing-block-in-the-middle-of-the-blockchain%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;An attacker may want to do this so they can delete a spend of their coin from the blockchain and spend it again.&lt;/p&gt;
&lt;p&gt;The Bitcoin network prevents this because deleting a transaction changes the block&#39;s hash, so the previous-hash field stored in the next block will no longer match, and the peers will detect the mismatch.&lt;/p&gt;
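&lt;p&gt;A small demonstration of why such an edit is detectable, using hypothetical blocks: once the attacker removes a transaction from a block, the recorded previous-hash in its successor no longer matches:&lt;/p&gt;

```python
import hashlib
import json

def h(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

b1 = {"prev_hash": "genesis", "transactions": ["Y pays Z"], "nonce": 7}
b2 = {"prev_hash": h(b1), "transactions": ["Z pays Q"], "nonce": 3}

# The attacker deletes the spend from b1. b1's hash changes, so b2's
# stored prev_hash no longer matches and every peer can detect the edit.
b1["transactions"] = []
print(b2["prev_hash"] == h(b1))   # the link is broken
```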
&lt;h4 id=&quot;can-an-attacker-start-a-fork-from-an-old-block%3F&quot;&gt;Can an attacker start a fork from an old block? &lt;a class=&quot;direct-link&quot; href=&quot;#can-an-attacker-start-a-fork-from-an-old-block%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;If an attacker wants to double-spend a coin, they may start a fork from the block that precedes the one with the first spend of the coin and mine new blocks until the forked branch is the longest branch of the chain.&lt;/p&gt;
&lt;p&gt;For this to be successful, the attacker must have enough CPU power to come from behind and mine blocks faster than all the honest peers.&lt;/p&gt;
&lt;p&gt;If the attacker can create the longest branch, everyone will switch to it and so the attacker can double-spend a coin. But if an attacker has that much CPU power to mine blocks faster than all the honest peers, they might as well use it to generate new coins instead of reusing an old one.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There&#39;s a lot more to learn about Bitcoin and I&#39;ve offered a one-sided view so far, but my goal here is to present an idea of how it works. Some downsides of using Bitcoin are that the proof-of-work takes too much power and the 10-minute confirmation wait is too long, among &lt;a href=&quot;https://cs.stanford.edu/people/eroberts/courses/cs181/projects/2010-11/DigitalCurrencies/disadvantages/index.html&quot;&gt;other points&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In summary, having an auditable public ledger is a great idea, especially in building open systems. Extra points if there are incentives in place to keep people honest.&lt;/p&gt;
&lt;h2 id=&quot;resources&quot;&gt;Resources &lt;a class=&quot;direct-link&quot; href=&quot;#resources&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/bitcoin.pdf&quot;&gt;Bitcoin: A Peer-to-Peer Electronic Cash System&lt;/a&gt; - Original Bitcoin paper.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://michaelnielsen.org/ddi/how-the-bitcoin-protocol-actually-works/&quot;&gt;How the Bitcoin protocol actually works&lt;/a&gt; by Michael Nielsen.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.blockchain.com/explorer&quot;&gt;Blockchain Explorer&lt;/a&gt; - View the latest blocks and transactions in the network.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.bitcoin.it/wiki/Proof_of_work&quot;&gt;Proof of work&lt;/a&gt; - Further explanation of the Proof of work and what makes it difficult.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-bitcoin.txt&quot;&gt;Lecture 19: Bitcoin &lt;/a&gt; - MIT 6.824 lecture notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/bitcoin-faq.txt&quot;&gt;Bitcoin FAQ&lt;/a&gt; - Additional material from 6.824.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Last updated on December 12, 2020.&lt;/em&gt;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 18 - Certificate Transparency</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-18-certificate-transparency/"/>
		<updated>2020-12-02T00:05:30-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-18-certificate-transparency/</id>
		<content type="html">&lt;p&gt;This lecture is about building systems out of mutually untrustworthy components—using the Web as a case study. The systems we have seen so far are closed systems for which we have assumed that all the participants are trustworthy. But in an open system like the Web where anyone can take part, and there is no universally trusted authority, trust and security are top-level issues to address.&lt;/p&gt;
&lt;p&gt;A fundamental challenge in building open systems is verifying the identity of each component involved. We can frame that challenge as each computer in the system asking: &lt;em&gt;Am I talking to the right computer?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Certificate Transparency (CT) aims to help answer this question, but before going into CT, I&#39;ll give a brief tour of the evolution of security on the Web.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#without-https-%28man-in-the-middle-attacks%29&quot;&gt;Without HTTPS (Man-in-the-middle attacks)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#https&quot;&gt;HTTPS&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#how-https-works&quot;&gt;How HTTPS works&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#certificate-authorities-protect-us...&quot;&gt;Certificate Authorities protect us...&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#...but-they-can-go-rogue&quot;&gt;...But they can go rogue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#can-we-have-an-online-database-of-valid-certificates%3F&quot;&gt;Can we have an online database of valid certificates?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#certificate-transparency&quot;&gt;Certificate Transparency&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#certificate-logs&quot;&gt;Certificate Logs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#monitors&quot;&gt;Monitors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#auditors&quot;&gt;Auditors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#everyone-must-see-the-same-logs&quot;&gt;Everyone must see the same logs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;without-https-(man-in-the-middle-attacks)&quot;&gt;Without HTTPS (Man-in-the-middle attacks) &lt;a class=&quot;direct-link&quot; href=&quot;#without-https-(man-in-the-middle-attacks)&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A man-in-the-middle attack happens when a third party intercepts a connection between a user and an application, as illustrated in the figure below&lt;sup&gt;&lt;a name=&quot;1&quot;&gt;&lt;a href=&quot;#1-link&quot;&gt;[1]&lt;/a&gt;&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/man-in-the-middle-attack.png&quot; alt=&quot;Figure 1: Man-in-the-middle attack.&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 1: Man-in-the-middle attack. &lt;/p&gt;
&lt;p&gt;When an attacker intercepts an HTTP connection, they can read and change any packets sent over the network. These packets may contain anything from passwords to bank details to other private information that a user never intended for an intruder to see.&lt;/p&gt;
&lt;h2 id=&quot;https&quot;&gt;HTTPS &lt;a class=&quot;direct-link&quot; href=&quot;#https&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;HTTPS was invented to make communication over the internet more secure. It takes the original HTTP protocol and adds a layer of security known as SSL/TLS. I&#39;ll refer to this layer as TLS for the rest of the post.&lt;/p&gt;
&lt;p&gt;With TLS (Transport Layer Security), you only send and receive encrypted data over the network, and only a secret key agreed on by your computer and the site you&#39;re visiting can decrypt this data. Thus, while an attacker can still intercept your HTTPS connection, they cannot make sense of the transmitted packets.&lt;/p&gt;
&lt;h3 id=&quot;how-https-works&quot;&gt;How HTTPS works &lt;a class=&quot;direct-link&quot; href=&quot;#how-https-works&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;TLS is based on public-key cryptography, which means here that the server has a public/private key pair.  The server exposes the public key and keeps the private key &lt;em&gt;private&lt;/em&gt;. When a client encrypts data using the server&#39;s public key, only the private key can decrypt it.&lt;/p&gt;
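&lt;p&gt;The idea can be illustrated with a deliberately tiny RSA sketch. This is an insecure toy with small primes (real servers use 2048-bit RSA or elliptic-curve keys), but it shows the core property: data encrypted with the public key can only be recovered with the private key.&lt;/p&gt;

```python
# Insecure toy RSA with tiny primes, for illustration only.
p, q = 61, 53
n = p * q                            # public modulus
e = 17                               # public exponent: (n, e) is the public key
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent, kept secret

def encrypt(message, public_key):
    n, e = public_key
    return pow(message, e, n)

def decrypt(ciphertext, private_key):
    n, d = private_key
    return pow(ciphertext, d, n)

ciphertext = encrypt(42, (n, e))
assert ciphertext != 42                    # unreadable without the private key
assert decrypt(ciphertext, (n, d)) == 42   # only the private key recovers it
```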
&lt;p&gt;The first step in enabling HTTPS for a server is to get a &lt;em&gt;certificate&lt;/em&gt; from a &lt;em&gt;Certificate Authority&lt;/em&gt; (CA). This certificate is an ID for the server that contains its domain name, information about its owners, the server&#39;s public key, the CA&#39;s identity, and a digital signature signed by the CA.&lt;/p&gt;
&lt;p&gt;At a high level, when your browser connects to an HTTPS server:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The server responds with its certificate to prove its identity to the browser.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;Your browser then checks the validity of this certificate using the digital signature from the CA. I&#39;ll describe how this works soon.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;After verifying the certificate, your browser will generate a random key and encrypt it using the server&#39;s public key. It will then send this encrypted key to the server as a challenge. The challenge is for the server to prove that it has the private key equivalent for the public key by decrypting the encrypted message.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once the server decrypts this random key, it means each party is happy that they are talking to the right party and they agree to use the key for subsequent communication.&lt;/p&gt;
&lt;p&gt;These steps make up the &lt;em&gt;TLS handshake&lt;/em&gt;. After the handshake is complete, both parties encrypt HTTP requests and responses using the key they agreed on, which only the other party can decrypt.&lt;/p&gt;
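&lt;p&gt;Once the handshake completes, both sides encrypt traffic with the agreed key. As a rough illustration (not how TLS actually encrypts; real TLS uses vetted ciphers like AES-GCM or ChaCha20-Poly1305), here is a toy symmetric scheme that derives a keystream from the shared key:&lt;/p&gt;

```python
import hashlib

def keystream(key, length):
    # Derive a deterministic byte stream from the shared key (toy PRF).
    out = b""
    blocks = (length + 31) // 32
    for counter in range(blocks):
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
    return out[:length]

def xor_encrypt(key, message):
    # XOR with the keystream; applying it twice decrypts.
    stream = keystream(key, len(message))
    return bytes(a ^ b for a, b in zip(message, stream))

shared_key = b"key-agreed-during-handshake"
ciphertext = xor_encrypt(shared_key, b"GET /account HTTP/1.1")
plaintext = xor_encrypt(shared_key, ciphertext)
assert plaintext == b"GET /account HTTP/1.1"
```

An attacker who intercepts the connection sees only the ciphertext; without the shared key, they cannot reproduce the keystream.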
&lt;h3 id=&quot;certificate-authorities-protect-us...&quot;&gt;Certificate Authorities protect us... &lt;a class=&quot;direct-link&quot; href=&quot;#certificate-authorities-protect-us...&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Without the digital signature from a Certificate Authority, anyone can create a certificate falsely claiming to be, say, &#39;&lt;a href=&quot;http://netflix.com/&quot;&gt;netflix.com&lt;/a&gt;&#39; and get your browser to trust them. Note that your browser will trust a fake server as long as it can decrypt a message encrypted with the public key in its certificate. To prevent this, only a small set of authorized CAs may issue certificates.&lt;/p&gt;
&lt;p&gt;Like servers, CAs also have a public/private key pair. When a CA issues a certificate, it computes a hash of the certificate&#39;s contents and encrypts that hash with its private key; the result is the certificate&#39;s digital signature. Anyone can decrypt the signature using the CA&#39;s public key.&lt;/p&gt;
&lt;p&gt;Each browser ships with a pre-installed list of the public keys of all the CAs it trusts. When your browser receives a server&#39;s certificate, it first checks whether it trusts the issuing CA and then verifies the digital signature using that CA&#39;s public key. If the signature checks out against the certificate&#39;s contents, your browser is sure that a valid CA issued the certificate and continues the TLS handshake.&lt;/p&gt;
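&lt;p&gt;The sign-and-verify flow can be sketched with a miniature, insecure RSA toy. The tiny primes, the hash reduction, and the certificate string below are all illustrative assumptions, not real CA parameters:&lt;/p&gt;

```python
import hashlib

# The toy CA key pair: insecure small primes, for illustration only.
p, q = 61, 53
n = p * q
e = 17                               # the CA public key is (n, e)
d = pow(e, -1, (p - 1) * (q - 1))    # the CA private exponent

def tiny_hash(data):
    # Reduce a real hash into the toy key range.
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n

def ca_sign(certificate):
    # The CA encrypts the hash of the certificate with its private key.
    return pow(tiny_hash(certificate), d, n)

def browser_verify(certificate, signature):
    # The browser decrypts the signature with the CA public key and
    # compares the result against its own hash of the certificate.
    return pow(signature, e, n) == tiny_hash(certificate)

cert = b"domain=netflix.com;pubkey=...;issuer=ToyCA"
sig = ca_sign(cert)
assert browser_verify(cert, sig)
# A certificate tampered with after signing would fail this check.
```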
&lt;h3 id=&quot;...but-they-can-go-rogue&quot;&gt;...But they can go rogue &lt;a class=&quot;direct-link&quot; href=&quot;#...but-they-can-go-rogue&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Unfortunately, CAs can get compromised or go rogue and end up issuing &amp;quot;bogus&amp;quot; certificates, i.e., a CA may issue the certificate for a domain name to the wrong owner. This has &lt;a href=&quot;https://threatpost.com/what-you-need-know-about-diginotar-hack-090211/75611/&quot;&gt;happened&lt;/a&gt; &lt;a href=&quot;https://threatpost.com/malaysian-ca-digicert-revokes-certs-weak-keys-mozilla-moves-revoke-trust-110311/75847/&quot;&gt;before&lt;/a&gt;. Since any CA can issue a certificate for any domain name, the least secure CA limits the overall security of the certificate mechanism.&lt;/p&gt;
&lt;p&gt;Thus, while HTTPS can increase our confidence that we are talking to the right computers, it is not enough.&lt;/p&gt;
&lt;h3 id=&quot;can-we-have-an-online-database-of-valid-certificates%3F&quot;&gt;Can we have an online database of valid certificates? &lt;a class=&quot;direct-link&quot; href=&quot;#can-we-have-an-online-database-of-valid-certificates%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;To limit the effect of a bogus certificate, it would be ideal if our browsers could somehow detect and reject bogus certificates. One way this could work is if there were a database of all the valid certificates in existence that our browsers could query. This raises several questions, though:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Given that there&#39;s no single authority that the entire world trusts, who would run this database?&lt;/li&gt;
&lt;li&gt;How do we decide who owns a domain name?&lt;/li&gt;
&lt;li&gt;How do we handle situations where people change their CAs, renew their certificates or lose their private key and have to request a new one? These will all look like a second certificate for an existing domain name.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Certificate Transparency is an approach to answering these questions, and we&#39;ll look at it next.&lt;/p&gt;
&lt;h2 id=&quot;certificate-transparency&quot;&gt;Certificate Transparency &lt;a class=&quot;direct-link&quot; href=&quot;#certificate-transparency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Certificate Transparency (CT) is a system for making the existence of all certificates publicly available to domain owners, CAs, and browsers. This way, when a rogue CA issues a certificate for &#39;&lt;a href=&quot;http://netflix.com/&quot;&gt;netflix.com&lt;/a&gt;&#39; to the wrong person, the certificate is immediately visible to the right owners for them to act on it.&lt;/p&gt;
&lt;p&gt;CT works by introducing three components to the certificate system: certificate logs, monitors, and auditors.&lt;/p&gt;
&lt;h4 id=&quot;certificate-logs&quot;&gt;Certificate Logs &lt;a class=&quot;direct-link&quot; href=&quot;#certificate-logs&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Certificate logs contain an append-only record of certificates. Anyone can add a certificate to the logs, though typically only certificate authorities do. When a CA issues a new certificate, it must add it to the logs. Anyone can also query a log to verify that it contains a certificate.&lt;/p&gt;
&lt;p&gt;Certificate logs are hosted on groups of servers spread around the world and can be managed independently by a CA or any other interested party.&lt;/p&gt;
&lt;h4 id=&quot;monitors&quot;&gt;Monitors &lt;a class=&quot;direct-link&quot; href=&quot;#monitors&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Monitors are servers that periodically check the certificate logs for suspicious certificates issued for the domain names they care about. They are typically hosted by organizations that manage a set of domain names. For example, a company like Netflix could host its own monitors and periodically check the certificate logs for any suspicious certificates for &#39;&lt;a href=&quot;http://netflix.com/&quot;&gt;netflix.com&lt;/a&gt;&#39;.&lt;/p&gt;
&lt;p&gt;When a monitor detects a suspicious certificate, I believe there is a manual step where a human checks whether the certificate is legitimate or was wrongly issued. There is a certificate revocation process for getting rid of bad certificates.&lt;/p&gt;
&lt;h4 id=&quot;auditors&quot;&gt;Auditors &lt;a class=&quot;direct-link&quot; href=&quot;#auditors&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;An auditor runs in a web browser and checks whether the certificate it receives from a server has been registered in the certificate logs.&lt;/p&gt;
&lt;p&gt;These components work together to bring openness to the SSL certificate system. Quoting the lecture notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If browsers and monitors see the same log, and monitors raise an alarm if there&#39;s a bogus cert in the log, and browsers require that each cert they use is in the log, then browsers can feel safe using any cert that&#39;s in the log.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note that auditors and monitors also communicate with each other to exchange information about the logs.&lt;/p&gt;
&lt;h3 id=&quot;everyone-must-see-the-same-logs&quot;&gt;Everyone must see the same logs &lt;a class=&quot;direct-link&quot; href=&quot;#everyone-must-see-the-same-logs&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;For certificate transparency to work, there are two critical requirements for the logs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No deletion:&lt;/strong&gt;  This requirement prevents a situation where a log server claims that a bogus certificate is in the log and shows it to the browser, but then the log operator deletes it from the log before the monitor can detect it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No equivocation:&lt;/strong&gt;  All parties must see the same log content; otherwise, a log server could show browsers a log containing the bogus certificate while showing the monitor a log without it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With these requirements, if a CA issues a bogus certificate for a domain name, it must add the certificate to the log. And since a log operator can&#39;t delete it, the domain name&#39;s owner will eventually see it. But meeting this requirement is difficult because, like CAs, log operators can also get compromised and may even conspire with malicious CAs.&lt;/p&gt;
&lt;p&gt;To show that it isn&#39;t violating any of the requirements, a certificate log must be able to prove two things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;That a particular certificate is in the log.&lt;/li&gt;
&lt;li&gt;That if it shows a version with new certificates added, that version is consistent with the previous one. Proving this confirms that the log operator has not modified any certificates and that the log has never been branched or forked.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;It does this by storing the certificates in a &lt;a href=&quot;https://research.swtch.com/tlog&quot;&gt;Merkle Tree&lt;/a&gt; data structure. I won&#39;t go into the details of that here, but I recommend reading &lt;a href=&quot;https://www.certificate-transparency.org/log-proofs-work&quot;&gt;this&lt;/a&gt; post if you&#39;re interested in that.&lt;/p&gt;
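&lt;p&gt;To give a flavour of how such a proof works, here is a minimal Merkle-tree sketch: the log hashes each certificate into a leaf, combines the leaves pairwise up to a root, and an inclusion proof is the short path of sibling hashes from a leaf to that root. The hashing scheme and padding rule below are simplified assumptions, not the exact constructions CT logs use:&lt;/p&gt;

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(b"leaf:" + leaf) for leaf in leaves]
    while len(level) != 1:
        if len(level) % 2 == 1:
            level.append(level[-1])              # pad odd-sized levels
        level = [h(b"node:" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

def inclusion_proof(leaves, index):
    # Collect the sibling hash, and which side our node is on, per level.
    proof = []
    level = [h(b"leaf:" + leaf) for leaf in leaves]
    while len(level) != 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        if index % 2 == 0:
            proof.append((level[index + 1], True))   # our node is the left child
        else:
            proof.append((level[index - 1], False))  # our node is the right child
        level = [h(b"node:" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify_inclusion(leaf, proof, root):
    node = h(b"leaf:" + leaf)
    for sibling, node_is_left in proof:
        if node_is_left:
            node = h(b"node:" + node + sibling)
        else:
            node = h(b"node:" + sibling + node)
    return node == root

certs = [b"cert-for-netflix.com", b"cert-for-example.org", b"cert-for-a.dev"]
root = merkle_root(certs)                       # the log publishes this root
proof = inclusion_proof(certs, 1)               # the log serves this proof
assert verify_inclusion(certs[1], proof, root)  # the auditor checks it
```

The proof is logarithmic in the size of the log, so an auditor can check inclusion without downloading every certificate.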
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The key property of Certificate Transparency is that everyone sees the same logs. Because of this, domain owners can detect when CAs have issued bogus certificates for their domain names, and browsers can be confident that any certificate in the log is approved by its owner, which means they are talking to the right servers.&lt;/p&gt;
&lt;p&gt;Finally, note that Certificate Transparency does not completely eliminate the effect of bogus certificates: there is still a window in which a bogus certificate may dupe a browser before the monitors detect it. What CT offers is a system for detecting these certificates more quickly, which limits the damage they can do.&lt;/p&gt;
&lt;p&gt;&lt;a name=&quot;1-link&quot;&gt; &lt;a href=&quot;#1&quot;&gt;[1]&lt;/a&gt;: Image lifted from &lt;a href=&quot;https://www.imperva.com/learn/application-security/man-in-the-middle-attack-mitm/&quot;&gt;this&lt;/a&gt; post by Imperva. &lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-ct.txt&quot;&gt;Lecture 18: Certificate Transparency&lt;/a&gt; - MIT 6.824 lecture notes&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/ct-faq.txt&quot;&gt;Certificate Transparency FAQ&lt;/a&gt; - Additional material from 6.824&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.thesslstore.com/blog/man-in-the-middle-attack-2&quot;&gt;Executing a Man-in-the-Middle Attack in just 15 Minutes&lt;/a&gt; by Patrick Nohe&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://robertheaton.com/2014/03/27/how-does-https-actually-work/&quot;&gt;How does SSL actually work?&lt;/a&gt; by Robert Heaton&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.certificate-transparency.org/what-is-ct&quot;&gt;What is Certificate Transparency?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.certificate-transparency.org/how-ct-works&quot;&gt;How Certificate Transparency Works&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://research.swtch.com/tlog&quot;&gt;Transparent Logs for Skeptical Clients&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.certificate-transparency.org/log-proofs-work&quot;&gt;How Log Proofs Work&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 17 - Causal Consistency, COPS</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-17-cops/"/>
		<updated>2020-11-23T23:40:24-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-17-cops/</id>
		<content type="html">&lt;p&gt;In studying distributed systems, I&#39;ve come across systems like &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-13-spanner&quot;&gt;Spanner&lt;/a&gt;, which incurs additional latency for strong consistency, and &lt;a href=&quot;https://aws.amazon.com/dynamodb/&quot;&gt;DynamoDB&lt;/a&gt;, which sacrifices strong consistency for low latency in responding to requests. This latency vs consistency tradeoff is one that many systems have to make, and &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/cops.pdf&quot;&gt;COPS&lt;/a&gt;—this lecture&#39;s focus—is no exception.&lt;/p&gt;
&lt;p&gt;What the COPS (Cluster of Order-Preserving Servers) system offers, though,  is a geo-replicated database with a consistency model that&#39;s closer to strong consistency while offering performance similar to low latency databases. This consistency model is called &lt;em&gt;causal+ consistency&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;COPS provides low latency querying by directing clients&#39; requests to their local data centres and ensuring that these requests (both reads and writes) can proceed without waiting for or talking to other data centres. This is in contrast to a system like Spanner, where at least one other data centre must acknowledge writes.&lt;/p&gt;
&lt;p&gt;The rest of this post will describe the causal+ consistency model and how COPS works.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#causal%2B-consistency&quot;&gt;Causal+ Consistency&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#causal-consistency&quot;&gt;Causal Consistency&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#potential-causality&quot;&gt;Potential Causality&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#convergent-conflict-handling&quot;&gt;Convergent Conflict Handling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#causal%2B-vs-eventual-and-external-consistency&quot;&gt;Causal+ vs Eventual and External Consistency&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#cops&quot;&gt;COPS&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#overview&quot;&gt;Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#cops-clients-maintain-a-context&quot;&gt;COPS clients maintain a context&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#lamport-timestamps-provide-a-global-order&quot;&gt;Lamport timestamps provide a global order&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#writing-values-in-cops&quot;&gt;Writing values in COPS&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#writes-to-the-local-cluster&quot;&gt;Writes to the local cluster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-replication-between-clusters.&quot;&gt;Write replication between clusters.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#reading-values-in-cops&quot;&gt;Reading values in COPS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#limitations&quot;&gt;Limitations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;causal%2B-consistency&quot;&gt;Causal+ Consistency &lt;a class=&quot;direct-link&quot; href=&quot;#causal%2B-consistency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Causal+ consistency combines &lt;em&gt;causal consistency&lt;/em&gt; with &lt;em&gt;convergent conflict handling&lt;/em&gt;. I&#39;ll describe those next.&lt;/p&gt;
&lt;h3 id=&quot;causal-consistency&quot;&gt;Causal Consistency &lt;a class=&quot;direct-link&quot; href=&quot;#causal-consistency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Two operations are causally related if we can say that one &lt;em&gt;happened before&lt;/em&gt; the other. Any system that implements causal consistency guarantees it will preserve this order across all replicas. If an operation &lt;em&gt;a&lt;/em&gt; happens before an operation &lt;em&gt;b,&lt;/em&gt; no replica should see the effect of operation &lt;em&gt;b&lt;/em&gt; before it has seen the effect of operation &lt;em&gt;a&lt;/em&gt;. Here, we say operation &lt;em&gt;b&lt;/em&gt; is causally dependent on operation &lt;em&gt;a&lt;/em&gt; or &lt;em&gt;a&lt;/em&gt; is a dependency of &lt;em&gt;b&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;For example, let&#39;s assume we&#39;re building an e-commerce application and considering a merchant Ade and a customer Seyi.  Here, Ade is trying to share a new item in her inventory with Seyi, which involves the following sequence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, Ade uploads the item onto the platform and then adds it to her online inventory.&lt;/li&gt;
&lt;li&gt;Seyi then checks Ade&#39;s inventory, expecting to see the new item added.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under causal consistency, if the inventory has a reference to the new item, then Seyi must be able to see the item. Weaker consistency models may not guarantee this. In eventually consistent systems, the operations to upload the item and then add it to the inventory may get reordered during replication and lead to a situation where Seyi sees a reference to the item but not the item itself.&lt;/p&gt;
&lt;h4 id=&quot;potential-causality&quot;&gt;Potential Causality &lt;a class=&quot;direct-link&quot; href=&quot;#potential-causality&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;More formally, the paper mentions three rules that the authors used to define potential causality between operations, denoted &lt;code&gt;-&amp;gt;&lt;/code&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Execution Thread&lt;/strong&gt;. If a and b are two operations in a single thread of execution, then a &lt;code&gt;-&amp;gt;&lt;/code&gt; b if operation a happens before operation b.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gets From&lt;/strong&gt;. If a is a put operation and b is a get operation that returns the value written by a, then a &lt;code&gt;-&amp;gt;&lt;/code&gt; b.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transitivity&lt;/strong&gt;. For operations a, b, and c, if a &lt;code&gt;-&amp;gt;&lt;/code&gt; b and b &lt;code&gt;-&amp;gt;&lt;/code&gt; c, then a &lt;code&gt;-&amp;gt;&lt;/code&gt; c&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;p&gt;The execution in Figure 1 illustrates these rules.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/causality.png&quot; alt=&quot;Figure 1 - Graph showing the causal relationship between operations at a replica. An edge from a to b indicates that a happened before b, or b depends on a.&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 1 - Graph showing the causal relationship between operations at a replica. An edge from a to b shows that a happened before b, or b depends on a.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;execution thread&lt;/em&gt; rule gives &lt;code&gt;get(y)=2&lt;/code&gt; -&amp;gt; &lt;code&gt;put(x,4)&lt;/code&gt;; the &lt;em&gt;gets from&lt;/em&gt; rule gives &lt;code&gt;put(y,2)&lt;/code&gt; -&amp;gt; &lt;code&gt;get(y)=2&lt;/code&gt;; and the &lt;em&gt;transitivity&lt;/em&gt; rule gives &lt;code&gt;put(y,2)&lt;/code&gt; -&amp;gt; &lt;code&gt;put(x,4)&lt;/code&gt;.&lt;/p&gt;
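&lt;p&gt;These rules can be sketched in a few lines: record the direct edges given by the &lt;em&gt;execution thread&lt;/em&gt; and &lt;em&gt;gets from&lt;/em&gt; rules, and derive &lt;em&gt;transitivity&lt;/em&gt; by searching the resulting graph. The operation labels below follow Figure 1:&lt;/p&gt;

```python
# Direct edges from Figure 1: (a, b) means a happened before b.
edges = {
    ("put(y,2)", "get(y)=2"),   # "gets from": the get returns the put value
    ("get(y)=2", "put(x,4)"),   # "execution thread": same client, in order
}

def happened_before(a, b):
    # "Transitivity": a happened before b if some chain of direct
    # edges leads from a to b.
    frontier = [a]
    seen = set()
    while frontier:
        op = frontier.pop()
        for x, y in edges:
            if x == op and y == b:
                return True
            if x == op and y not in seen:
                seen.add(y)
                frontier.append(y)
    return False

assert happened_before("put(y,2)", "put(x,4)")      # by transitivity
assert not happened_before("put(x,4)", "put(y,2)")
```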
&lt;p&gt;If you&#39;re wondering how a system can determine causal relationships between operations at different replicas, you&#39;ll find out in a &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-17-cops/#lamport-timestamps-provide-a-global-order&quot;&gt;later section.&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&quot;convergent-conflict-handling&quot;&gt;Convergent Conflict Handling &lt;a class=&quot;direct-link&quot; href=&quot;#convergent-conflict-handling&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Causal consistency does not order concurrent operations. We say that two operations are concurrent if we cannot tell that one happened before the other. A system can replicate two unrelated &lt;code&gt;put&lt;/code&gt; operations in any order, but when there are concurrent &lt;code&gt;put&lt;/code&gt; operations to the same key, we say they &lt;em&gt;conflict&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Quoting the &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/cops.pdf&quot;&gt;lecture paper&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Conflicts are undesirable for two reasons. First, because they are unordered by causal consistency, conflicts allow replicas to diverge forever. For instance, if a is &lt;code&gt;put(x,1)&lt;/code&gt; and b is &lt;code&gt;put(x,2)&lt;/code&gt;, then causal consistency allows one replica to forever return 1 for x and another replica to forever return 2 for x.&lt;/p&gt;
&lt;p&gt;Second, conflicts may represent an exceptional condition that requires special handling. For example, in a shopping cart application, if two people logged in to the same account concurrently add items to their cart, the desired result is to end up with both items in the cart.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Convergent conflict handling requires that a causal+ system handles all conflicting &lt;code&gt;puts&lt;/code&gt; in the same way across all replicas through a handler function &lt;em&gt;h.&lt;/em&gt;  The &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-5/#last-write-wins-(discarding-concurrent-writes)&quot;&gt;last-writer-wins&lt;/a&gt; rule is commonly used in handler functions to ensure that replicas eventually converge.&lt;/p&gt;
&lt;h3 id=&quot;causal%2B-vs-eventual-and-external-consistency&quot;&gt;Causal+ vs Eventual and External Consistency &lt;a class=&quot;direct-link&quot; href=&quot;#causal%2B-vs-eventual-and-external-consistency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With the above properties, causal+ consistency differs from eventual consistency in that an eventually consistent system may not preserve the causal order of operations, leaving clients to deal with the inconsistencies that may arise.&lt;/p&gt;
&lt;p&gt;Also, unlike in &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-13-spanner/#spanner-guarantees-external-consistency&quot;&gt;external consistency&lt;/a&gt;, which always returns the most up-to-date version, a causal+ system may return stale versions of a value. What causal+ guarantees though is that those stale values are consistent with a causal order of operations.&lt;/p&gt;
&lt;p&gt;Let&#39;s now look at COPS, a system which implements this causal+ consistency model.&lt;/p&gt;
&lt;h2 id=&quot;cops&quot;&gt;COPS &lt;a class=&quot;direct-link&quot; href=&quot;#cops&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;overview&quot;&gt;Overview &lt;a class=&quot;direct-link&quot; href=&quot;#overview&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;COPS (Cluster of Order-Preserving Servers) is a geo-replicated key-value storage system that guarantees causal+ consistency. It comprises two software components: &lt;em&gt;a client library&lt;/em&gt; and the &lt;em&gt;key-value store&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Each data centre involved has a local COPS cluster which maintains its copy of the entire dataset. A COPS &lt;em&gt;client&lt;/em&gt; is an application that uses the client library to interact with the key-value store. Clients interact only with their local COPS cluster running in the same data centre.&lt;/p&gt;
&lt;p&gt;COPS shards the stored data across the nodes in a cluster, with each key belonging to a &lt;em&gt;primary&lt;/em&gt; node in each cluster. This primary node receives the writes for a key. After a write completes, the primary node in the local cluster replicates it to the primary nodes in the other clusters.&lt;/p&gt;
&lt;p&gt;Each key also has &lt;em&gt;versions&lt;/em&gt;, which represent different values for that key. COPS guarantees that once a replica has returned a version of a key, the replica will only return that version or a causally later version in subsequent requests.&lt;/p&gt;
&lt;h3 id=&quot;cops-clients-maintain-a-context&quot;&gt;COPS clients maintain a context &lt;a class=&quot;direct-link&quot; href=&quot;#cops-clients-maintain-a-context&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Each client maintains a &lt;em&gt;context&lt;/em&gt; to represent the order of its operations. Think of this context as a list that holds items. After each operation, a client adds an item to its context. The order of these items in the list captures the dependencies between versions.&lt;/p&gt;
&lt;p&gt;This works in line with the &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-17-cops/#potential-causality&quot;&gt;earlier section&lt;/a&gt; on potential causality. Using this context, a client can compute the dependencies for a version.&lt;/p&gt;
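&lt;p&gt;A minimal sketch of such a context might look like the following. The class and method names are my own assumptions, not the paper&#39;s API:&lt;/p&gt;

```python
class ClientContext:
    # Ordered record of the versions this client has read or written.
    def __init__(self):
        self.versions = []

    def record_get(self, key, version):
        # "Gets from": the client now depends on the version it read.
        self.versions.append((key, version))

    def record_put(self, key, version):
        # "Execution thread": later operations depend on this write.
        self.versions.append((key, version))

    def dependencies(self):
        # Every version seen so far must be committed at a cluster
        # before that cluster commits the next put from this client.
        return list(self.versions)

ctx = ClientContext()
ctx.record_put("y", 2)     # put(y, 2)
ctx.record_get("y", 2)     # get(y) = 2
deps = ctx.dependencies()  # the next put(x, 4) carries these dependencies
assert deps == [("y", 2), ("y", 2)]
```

The real library also prunes this list (nearer dependencies imply earlier ones transitively), which this sketch omits.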
&lt;h3 id=&quot;lamport-timestamps-provide-a-global-order&quot;&gt;Lamport timestamps provide a global order &lt;a class=&quot;direct-link&quot; href=&quot;#lamport-timestamps-provide-a-global-order&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It is easy for a COPS client to determine the order of operations on a key in a local cluster based on its context, but when there are concurrent operations to the same key in different clusters—a conflict—we need another way to determine that order.&lt;/p&gt;
&lt;p&gt;COPS uses &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-1/#lamport-timestamps&quot;&gt;Lamport timestamps&lt;/a&gt; to derive a &lt;em&gt;global order&lt;/em&gt; over all writes for each key. With Lamport timestamps, all the replicas will agree on which operation happened before the other.&lt;/p&gt;
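&lt;p&gt;A small sketch of the idea: each version number is a (counter, node id) pair, where the counter advances past anything the node has observed and the node id breaks ties. Comparing the pairs lexicographically gives every replica the same total order, which is exactly what a last-writer-wins rule needs:&lt;/p&gt;

```python
def next_timestamp(local_counter, observed_counter, node_id):
    # Advance past the highest counter this node has seen, then tag the
    # result with the node id so no two nodes produce the same pair.
    counter = max(local_counter, observed_counter) + 1
    return (counter, node_id)

# Two concurrent writes to the same key in different clusters.
v1 = next_timestamp(0, 0, node_id=1)
v2 = next_timestamp(0, 0, node_id=2)

# Tuples compare by counter first, then node id, so every replica that
# applies last-writer-wins picks the same winner.
winner = max(v1, v2)
assert winner == (1, 2)
```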
&lt;h3 id=&quot;writing-values-in-cops&quot;&gt;Writing values in COPS &lt;a class=&quot;direct-link&quot; href=&quot;#writing-values-in-cops&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;h4 id=&quot;writes-to-the-local-cluster&quot;&gt;Writes to the local cluster &lt;a class=&quot;direct-link&quot; href=&quot;#writes-to-the-local-cluster&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When a client calls &lt;code&gt;put&lt;/code&gt; for a key, the library computes the dependencies for that key based on its context and sends that information to the local primary storage node. This storage node will not commit the key&#39;s value &lt;em&gt;until&lt;/em&gt; the COPS cluster has written all the computed dependencies.&lt;/p&gt;
&lt;p&gt;After committing the value, the primary storage node assigns it a unique version number using a Lamport timestamp and immediately returns that number to the client.&lt;/p&gt;
&lt;p&gt;By not waiting for the replication to complete, COPS eliminates most of the latency incurred by systems with stronger consistency guarantees.&lt;/p&gt;
&lt;h4 id=&quot;write-replication-between-clusters.&quot;&gt;Write replication between clusters. &lt;a class=&quot;direct-link&quot; href=&quot;#write-replication-between-clusters.&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The primary storage node asynchronously replicates a write to the other clusters after committing a write locally. The node includes information about the write&#39;s dependencies when replicating it.&lt;/p&gt;
&lt;p&gt;When a node in another cluster receives this write, it checks whether the local nodes in its cluster have satisfied all the dependencies. The receiving node does this by issuing a &lt;em&gt;dependency check&lt;/em&gt; request to the local nodes responsible for those dependencies.&lt;/p&gt;
&lt;p&gt;If a local node has not written the dependency value, it blocks the request until it writes the value. Otherwise, it will respond immediately.&lt;/p&gt;
&lt;p&gt;In summary, COPS guarantees causal+ consistency by computing the dependencies of a write, and not committing the write in a cluster until the cluster has committed all the dependencies.&lt;/p&gt;
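&lt;p&gt;The replication rule can be sketched as follows. The function names and the retry-queue mechanics are illustrative assumptions; the real system blocks the dependency check until the dependency arrives rather than polling:&lt;/p&gt;

```python
committed = {}   # (key, version) pairs committed at this cluster

def dependency_check(dep):
    # Report whether a dependency is already committed locally.
    return dep in committed

def apply_replicated_write(key, version, deps, pending):
    # Commit a replicated write only once every dependency is present.
    if all(dependency_check(d) for d in deps):
        committed[(key, version)] = True
    else:
        pending.append((key, version, deps))   # retry once deps arrive

pending = []
# The write of x depends on the write of y, which has not arrived yet.
apply_replicated_write("x", 4, deps=[("y", 2)], pending=pending)
assert ("x", 4) not in committed

# Once the write of y is committed, the write of x can be applied.
apply_replicated_write("y", 2, deps=[], pending=pending)
for key, version, deps in list(pending):
    apply_replicated_write(key, version, deps, pending)
assert ("x", 4) in committed
```

This ordering is what prevents a remote reader from ever seeing put(x,4) before put(y,2) in the earlier example.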
&lt;h3 id=&quot;reading-values-in-cops&quot;&gt;Reading values in COPS &lt;a class=&quot;direct-link&quot; href=&quot;#reading-values-in-cops&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;COPS also satisfies reads in the local cluster. A COPS client can specify whether they want to read the latest version of a key or a specific older one. When the client library receives the response for a read, it adds the operation to its context to capture potential causality (See &amp;quot;Execution Thread&amp;quot; and &amp;quot;Gets From&amp;quot; &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-17-cops/#potential-causality&quot;&gt;above&lt;/a&gt;).&lt;/p&gt;
&lt;h2 id=&quot;limitations&quot;&gt;Limitations &lt;a class=&quot;direct-link&quot; href=&quot;#limitations&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;While causal+ consistency is a popular research idea, it has some limitations. Two major ones are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It cannot capture external causal dependencies. The classic example is a phone call: if I do action A, call my friend on another continent to tell her about A, and then she does action B, the system cannot capture the causal link between A and B.
&lt;/li&gt;
&lt;li&gt;Managing conflicts can be difficult, especially when last-writer-wins isn&#39;t sufficient.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The authors don&#39;t compare COPS with other systems in terms of performance or ease of programming in the evaluation section, which I found surprising given that the paper&#39;s central thesis is that COPS offers a better tradeoff between ease of programming and performance.&lt;/p&gt;
&lt;p&gt;I&#39;ve also left out some details about COPS here around fault tolerance and how it handles transactions, but I hope you&#39;ve gotten a good idea of causal+ consistency and how one might implement it. I recommend reading the paper linked below if you want to know more.&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/cops.pdf&quot;&gt;Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS&lt;/a&gt; - Original 2011 paper on COPS.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-cops.txt&quot;&gt;Causal Consistency, COPS&lt;/a&gt; - MIT 6.824 lecture notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.usenix.org/system/files/login/articles/08_lloyd_41-43_online.pdf&quot;&gt;A Short Primer on Causal Consistency&lt;/a&gt; - Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 16 - Scaling Memcache at Facebook</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-16-memcache-at-facebook/"/>
		<updated>2020-11-07T21:30:55-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-16-memcache-at-facebook/</id>
		<content type="html">&lt;p&gt;This lecture is about building systems at scale. The associated &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/memcache-fb.pdf&quot;&gt;2013 paper&lt;/a&gt; from Facebook doesn&#39;t present any new ideas per se, but I found it interesting to see how some ideas this course has covered so far on replication, partitioning and consistency play out in such a large scale system.&lt;/p&gt;
&lt;p&gt;The paper is about how Facebook uses memcached as a building block for a distributed key-value store. &lt;a href=&quot;https://www.memcached.org/&quot;&gt;Memcached&lt;/a&gt; is an in-memory data store used for caching. Many applications today benefit from its quick response times and simple API.&lt;/p&gt;
&lt;p&gt;In this post, I&#39;ll describe how a website&#39;s architecture might evolve to cope with increasing load, before describing Facebook&#39;s use of memcached to support the world&#39;s largest social network.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#memcached-at-facebook&quot;&gt;Memcached at Facebook&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#architecture&quot;&gt;Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#in-a-cluster%3A-latency-and-load&quot;&gt;In a cluster: Latency and Load&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#reducing-latency&quot;&gt;Reducing latency&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#memcache-clients-parallelise-and-batch-requests&quot;&gt;Memcache clients parallelise and batch requests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#udp-for-reads-and-tcp-for-writes&quot;&gt;UDP for reads and TCP for writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#memcache-clients-implement-flow-control&quot;&gt;Memcache clients implement flow control&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#reducing-load&quot;&gt;Reducing load&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#leases-and-stale-sets&quot;&gt;Leases and stale sets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#leases-and-thundering-herds&quot;&gt;Leases and thundering herds&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#in-a-region%3A-replication&quot;&gt;In a region: Replication&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#regional-invalidation&quot;&gt;Regional Invalidation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#regional-pools&quot;&gt;Regional Pools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#cold-cluster-warmup&quot;&gt;Cold Cluster Warmup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#across-regions%3A-consistency&quot;&gt;Across Regions: Consistency&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#writes-from-a-primary-region&quot;&gt;Writes from a primary region&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#writes-from-a-secondary-region&quot;&gt;Writes from a secondary region&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;evolution-of-web-architectures&quot;&gt;Evolution of web architectures &lt;a class=&quot;direct-link&quot; href=&quot;#evolution-of-web-architectures&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Say you&#39;re building a website for users to upload pictures to. At the start, you might have your application code, web server, and database running on the same machine, as shown below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/memcache-simple.png&quot; alt=&quot;Figure 1 - Evolution of a web architecture: simple, single machine running the application code, web server, and database server.&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 1 - Evolution of a web architecture: simple, single machine running the application code, web server, and database server.&lt;/p&gt;
&lt;p&gt;But as you get more users, the load on your server increases and the application code will likely take too much of the CPU time. Your solution might be to get more CPU power for your application by running a bunch of &lt;em&gt;frontend servers&lt;/em&gt;, which will host the web server and the application code while connecting to a single database server as in Figure 2. Connecting to a single database server gives you the advantage of being certain that all your users will see the same data, even though their requests are served by different frontend servers.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/memcache-multiple-fe.png&quot; alt=&quot;Figure 2 - Evolution of a web architecture: multiple frontend servers to one database server&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 2 - Evolution of a web architecture: multiple frontend servers to one database server.&lt;/p&gt;
&lt;p&gt;As your application grows, the single database server might become overloaded as it can receive requests from an unlimited number of frontend servers. You may address this by adding multiple database servers and &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-6/&quot;&gt;sharding&lt;/a&gt; your data over those servers as shown in Figure 3. This comes with its challenges—especially around sharding the data efficiently, managing the membership of the different database servers, and running &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-12-distributed-transactions/&quot;&gt;distributed transactions&lt;/a&gt;—but it could work as a solution to the problem.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/memcache-multiple-db.png&quot; alt=&quot;Figure 3 - Evolution of a web architecture: multiple frontend servers to multiple database servers&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 3 - Evolution of a web architecture: multiple frontend servers to multiple database servers.&lt;/p&gt;
&lt;p&gt;But databases are slow. Reading data from disk can be up to 80x slower than reading data stored in memory. As your application&#39;s user base skyrockets, one way to reduce this latency in database requests is by adding a cache between your frontend servers and the database servers. With this setup, read requests will first go to the cache and only redirect to the database layer when there&#39;s a cache miss.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/memcache-cache.png&quot; alt=&quot;Figure 4 - Evolution of a web architecture: inserting a cache between the frontend and the database&quot;&gt; &lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 4 - Evolution of a web architecture: inserting a cache between the frontend and the database.&lt;/p&gt;
&lt;p&gt;Maintaining a cache is hard, though. You must keep the cache in sync with the database and work out how to prevent cache misses from overloading the database servers.&lt;/p&gt;
&lt;p&gt;This architecture in Figure 4 is similar to Facebook&#39;s setup, and the rest of this post will be on Facebook&#39;s memcache architecture and how they maintain a cache effectively.&lt;/p&gt;
&lt;h1 id=&quot;memcached-at-facebook&quot;&gt;Memcached at Facebook &lt;a class=&quot;direct-link&quot; href=&quot;#memcached-at-facebook&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Facebook uses memcached to reduce the read load on their databases. Facebook&#39;s workload is dominated by reads and memcached prevents them from hitting the database for every request. They use memcached as a &lt;a href=&quot;https://tanzu.vmware.com/content/blog/an-introduction-to-look-aside-vs-inline-caching-patterns&quot;&gt;look-aside cache&lt;/a&gt;. This means that when a web server needs data, it first attempts to fetch the data from the cache. If the value is not in the cache, the web server will fetch the data from the database and then populate the cache with the data.&lt;/p&gt;
&lt;p&gt;For writes, the web server will send the new value for a key to the database and then send another request to the cache to delete the key. Subsequent reads for the key will fetch the latest data from the database.&lt;/p&gt;
&lt;p&gt;This is illustrated below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/memcache-look-aside.png&quot; alt=&quot;Figure 5 - Memcache as a demand-filled look-aside cache. The left half illustrates the read path for a web server on a cache miss. The right half illustrates the write path.&quot;&gt; &lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 5 - Memcache as a demand-filled look-aside cache. The left half illustrates the read path for a web server on a cache miss. The right half illustrates the write path.&lt;/p&gt;
&lt;p&gt;Note that the paper uses &lt;em&gt;memcached&lt;/em&gt; to refer to the open source library and &lt;em&gt;memcache&lt;/em&gt; to refer to the distributed system built on top of memcached at Facebook. I&#39;ll use memcache for the rest of this post.&lt;/p&gt;
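&lt;p&gt;The look-aside read and write paths above can be sketched in a few lines of Python. This is a toy model, with a dict standing in for both the database and the cache, but it captures the protocol: reads demand-fill on a miss, and writes update the database and then delete the cached key:&lt;/p&gt;

```python
# Minimal look-aside (demand-fill) cache sketch.
class LookAsideCache:
    def __init__(self, db):
        self.db = db      # a dict standing in for the storage layer
        self.cache = {}

    def read(self, key):
        if key in self.cache:       # cache hit
            return self.cache[key]
        value = self.db[key]        # miss: fetch from the database...
        self.cache[key] = value     # ...and demand-fill the cache
        return value

    def write(self, key, value):
        self.db[key] = value        # 1. send the new value to the database
        self.cache.pop(key, None)   # 2. delete the key from the cache
```

&lt;p&gt;Deleting on write (rather than updating the cached value in place) is what makes delete operations idempotent and lets the next reader fetch the latest data from the database.&lt;/p&gt;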
&lt;h2 id=&quot;architecture&quot;&gt;Architecture &lt;a class=&quot;direct-link&quot; href=&quot;#architecture&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Facebook&#39;s architecture comprises multiple web, memcache, and database servers. A collection of web and memcache servers make up a &lt;em&gt;frontend cluster&lt;/em&gt;, and multiple frontend clusters make up a data centre. These data centres are called &lt;em&gt;regions&lt;/em&gt; in the paper. The frontend clusters in a region share the same storage cluster. Facebook replicates clusters in different regions around the world, designating one region as the primary and the others as secondary regions.&lt;/p&gt;
&lt;p&gt;The architecture diagram below illustrates these components:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/memcache-architecture.png&quot; alt=&quot;Figure 6 - Overall architecture consisting of one primary region that contains multiple frontend clusters and replicates data to the secondary region.&quot;&gt; &lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 6 - Overall architecture &lt;/p&gt;
&lt;p&gt;Each layer in this architecture comes with its set of challenges, and I&#39;ll cover those next.&lt;/p&gt;
&lt;h2 id=&quot;in-a-cluster%3A-latency-and-load&quot;&gt;In a cluster: Latency and Load &lt;a class=&quot;direct-link&quot; href=&quot;#in-a-cluster%3A-latency-and-load&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;For performance in a cluster, the designers of this system focused on two things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reducing the latency of memcache&#39;s response.&lt;/li&gt;
&lt;li&gt;Reducing the load on the database when there&#39;s a cache miss.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&quot;reducing-latency&quot;&gt;Reducing latency &lt;a class=&quot;direct-link&quot; href=&quot;#reducing-latency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Facebook&#39;s efforts to reduce latency in a cluster focused on optimising the memcache client. The memcache client runs on each web server and interacts with the memcache servers in its frontend cluster. This client is responsible for request routing, request batching, error handling, serialization, and so on.  Next, let&#39;s see some optimizations made in the client.&lt;/p&gt;
&lt;h4 id=&quot;memcache-clients-parallelise-and-batch-requests&quot;&gt;Memcache clients parallelise and batch requests &lt;a class=&quot;direct-link&quot; href=&quot;#memcache-clients-parallelise-and-batch-requests&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Each web server constructs a directed acyclic graph (DAG) of all the data dependencies it needs for a page. The memcache client then uses this DAG to batch and fetch the required keys concurrently from the memcache servers. This reduces the number of network round trips needed to load a Facebook page.&lt;/p&gt;
&lt;h4 id=&quot;udp-for-reads-and-tcp-for-writes&quot;&gt;UDP for reads and TCP for writes &lt;a class=&quot;direct-link&quot; href=&quot;#udp-for-reads-and-tcp-for-writes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Memcache clients use UDP for &lt;em&gt;get&lt;/em&gt; requests and TCP for &lt;em&gt;set&lt;/em&gt; and &lt;em&gt;delete&lt;/em&gt; requests to the servers. I&#39;ve written about the differences between UDP and TCP in an &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8/#aside-tcp-vs-udptransmission-control-protocol-vs-user-datagram-protocol&quot;&gt;earlier post&lt;/a&gt;, but what&#39;s relevant here is that UDP has lower latency and overhead than TCP, at the cost of reliable delivery.&lt;/p&gt;
&lt;p&gt;When the UDP implementation detects that packets are dropped or received out of order (using sequence numbers), it returns an error to the client, which treats the operation as a cache miss. But unlike a standard cache miss (i.e. one not caused by dropped packets), where the web server would fetch the data from the database and then populate the cache, here the web server will not attempt to populate the cache with the fetched data. This avoids putting extra load on a potentially overloaded network or server.&lt;/p&gt;
&lt;h4 id=&quot;memcache-clients-implement-flow-control&quot;&gt;Memcache clients implement flow control &lt;a class=&quot;direct-link&quot; href=&quot;#memcache-clients-implement-flow-control&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Facebook partitions data across hundreds of memcache servers in a cluster using &lt;a href=&quot;https://en.wikipedia.org/wiki/Consistent_hashing&quot;&gt;consistent hashing&lt;/a&gt;.  Thus, a web server may need to communicate with many memcache servers to satisfy a user&#39;s request for a page. This leads to the problem of &lt;em&gt;incast congestion.&lt;/em&gt; The paper describes this problem:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When a client requests a large number of keys, the responses can overwhelm components such as rack and cluster switches if those responses arrive all at once.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Memcache clients address this by using a &lt;a href=&quot;http://progtutorials.tripod.com/sliding_window.htm&quot;&gt;sliding window mechanism&lt;/a&gt; similar to TCP&#39;s to limit the number of outstanding requests. A client can make a limited number of requests at a time and will send the next one only when it has received a response from an in-flight one.&lt;/p&gt;
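&lt;p&gt;Here is a small sketch of that sliding-window idea, under my own naming (&lt;code&gt;WindowLimiter&lt;/code&gt; is invented, and the real client deals with per-destination windows, timeouts, and retries that I&#39;ve left out): at most &lt;code&gt;window&lt;/code&gt; requests are in flight, and a response frees a slot for the next queued request:&lt;/p&gt;

```python
# Sketch of sliding-window flow control for outstanding requests.
from collections import deque


class WindowLimiter:
    def __init__(self, window):
        self.window = window
        self.in_flight = 0
        self.queue = deque()  # requests waiting for a free slot

    def submit(self, request):
        if self.in_flight >= self.window:  # window full: hold the request
            self.queue.append(request)
            return False
        self.in_flight += 1                # slot free: send immediately
        return True

    def on_response(self):
        self.in_flight -= 1
        if self.queue:                     # a response frees a slot...
            self.queue.popleft()
            self.in_flight += 1            # ...so send the next queued request
```

&lt;p&gt;As with TCP, the window size is a balance: too small and requests queue up behind idle capacity; too large and the responses can again overwhelm the rack and cluster switches.&lt;/p&gt;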
&lt;h3 id=&quot;reducing-load&quot;&gt;Reducing load &lt;a class=&quot;direct-link&quot; href=&quot;#reducing-load&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;As I wrote earlier, Facebook uses memcache to reduce the read load on their databases. But when data is missing from the cache, the web servers must send requests to these databases. Facebook had to take extra care when designing the system to prevent the databases from getting overloaded when there are many cache misses. They use a mechanism called &lt;em&gt;leases&lt;/em&gt; to address two key problems: &lt;em&gt;stale sets&lt;/em&gt; and &lt;em&gt;thundering herds&lt;/em&gt;.&lt;/p&gt;
&lt;h4 id=&quot;leases-and-stale-sets&quot;&gt;Leases and stale sets &lt;a class=&quot;direct-link&quot; href=&quot;#leases-and-stale-sets&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A stale set occurs when a web server sets an out-of-date value for a key in memcache. This can happen when concurrent updates to a key get reordered. For example, let&#39;s consider this scenario to illustrate a stale set with two clients, C1 and C2.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  key &#39;k&#39; not in cache
  C1 get(k), misses
  C1 reads v1 from DB as the value of k
    C2 writes k = v2 in DB
    C2 delete(k)  (recall that any DB writes will invalidate key in cache)
  C1 set(k, v1)
  now mc has stale data, since delete(k) has already happened
  will stay stale indefinitely until k is next written
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Leases prevent this problem. When a client encounters a cache miss for a key, the memcache server will give it a &lt;em&gt;lease&lt;/em&gt; to set data into the cache after reading it from the DB. This lease is a 64-bit token bound to the key. When the client wants to set the data in the cache, it must provide this lease token for memcache to verify. &lt;em&gt;But&lt;/em&gt;, when memcache receives a delete request for the key, it will invalidate any existing lease tokens for that key.&lt;/p&gt;
&lt;p&gt;Therefore, in the above scenario, C1 will get a lease from mc which C2&#39;s &lt;em&gt;delete()&lt;/em&gt; will invalidate. This will lead to memcache ignoring C1&#39;s &lt;em&gt;set&lt;/em&gt;. Note that this key will be missing from the cache and the next reader has to fetch the latest data from the DB.&lt;/p&gt;
&lt;h4 id=&quot;leases-and-thundering-herds&quot;&gt;Leases and thundering herds &lt;a class=&quot;direct-link&quot; href=&quot;#leases-and-thundering-herds&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A thundering herd happens when many clients try to read the data for an invalidated key. When this happens, the clients will all have to send their requests to the database servers, which may get overloaded.&lt;/p&gt;
&lt;p&gt;To prevent the thundering herd problem, memcache servers give leases only once every 10 seconds per key. If another client requests the key within 10 seconds of the lease being issued, that request will have to wait. The idea is that the first client with a lease will usually have set the data in the cache within the 10-second window, so the waiting clients will read from the cache on retry.&lt;/p&gt;
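&lt;p&gt;Both lease behaviours, invalidating tokens on delete (stale sets) and granting at most one lease per key every 10 seconds (thundering herds), fit in one small sketch. The class and field names are mine, not memcached&#39;s, and a real server would use its clock rather than a &lt;code&gt;now&lt;/code&gt; parameter:&lt;/p&gt;

```python
# Sketch of memcache leases: a get-miss returns a lease token the client must
# present on set; a delete invalidates outstanding tokens, and at most one
# lease per key is issued every 10 seconds.
import itertools
import time

LEASE_INTERVAL = 10.0  # seconds between lease grants for a single key


class LeasingCache:
    def __init__(self):
        self.data = {}
        self.leases = {}       # key -> currently valid lease token
        self.last_grant = {}   # key -> time the last lease was issued
        self._tokens = itertools.count(1)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        if key in self.data:
            return ("hit", self.data[key])
        if now - self.last_grant.get(key, -LEASE_INTERVAL) >= LEASE_INTERVAL:
            token = next(self._tokens)   # grant a lease: caller may fill
            self.leases[key] = token
            self.last_grant[key] = now
            return ("miss", token)
        return ("wait", None)            # another client holds a recent lease

    def set(self, key, value, token):
        if self.leases.get(key) != token:
            return False                 # lease was invalidated: ignore stale set
        del self.leases[key]
        self.data[key] = value
        return True

    def delete(self, key):
        self.data.pop(key, None)
        self.leases.pop(key, None)       # kills any outstanding lease token
```

&lt;p&gt;Replaying the stale-set trace from earlier: C1&#39;s lease from its miss is invalidated by C2&#39;s &lt;em&gt;delete&lt;/em&gt;, so C1&#39;s &lt;em&gt;set(k, v1)&lt;/em&gt; is ignored, and a client arriving within the 10-second window is told to wait and retry.&lt;/p&gt;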
&lt;h2 id=&quot;in-a-region%3A-replication&quot;&gt;In a region: Replication &lt;a class=&quot;direct-link&quot; href=&quot;#in-a-region%3A-replication&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;As Facebook&#39;s load increased, they could have scaled their system by adding more memcache and web servers to a frontend cluster and further partitioning the keyset. However, this has two major limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Incast congestion will get worse as the number of memcache servers increases, since a client has to talk to more servers.&lt;/li&gt;
&lt;li&gt;Partitioning by itself does not help much if a key is very popular, as a single server will need to handle all the requests for that key. In cases like this, replicating the data helps so we can share the load among different servers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Facebook scaled this system by creating multiple replicas of a cluster within a &lt;em&gt;region&lt;/em&gt; which share common storage. These clusters are called &lt;em&gt;frontend clusters&lt;/em&gt;, with each cluster made up of web and memcache servers. This method addresses both limitations described above and provides an extra benefit: having smaller clusters instead of a single large one gives them more independent failure domains. They can lose a frontend cluster and still continue operating normally.&lt;/p&gt;
&lt;p&gt;What&#39;s interesting to me here is that there is no special replication protocol to ensure that the clusters in a region have the same data. Their thinking here is that if they randomly route users&#39; requests to any available frontend cluster, they&#39;ll all eventually have the same data.&lt;/p&gt;
&lt;p&gt;Let&#39;s now see some optimizations made within a region.&lt;/p&gt;
&lt;h4 id=&quot;regional-invalidation&quot;&gt;Regional Invalidation &lt;a class=&quot;direct-link&quot; href=&quot;#regional-invalidation&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;To keep memcache content in the different frontend clusters consistent with the database, the storage cluster sends invalidations to the memcache servers after a write from a web server. As an optimization, when a web server changes data in the storage cluster, it must also send the invalidations to the memcache servers in its local frontend cluster.&lt;/p&gt;
&lt;p&gt;The paper points out that this guarantees &lt;a href=&quot;https://medium.com/@avik.das/scalability-concepts-read-after-write-consistency-3ff70b71e1d1&quot;&gt;read-after-write consistency&lt;/a&gt; for a single user request. My understanding here is that they randomly route each user&#39;s request to a frontend cluster in a region, but that routing is consistent across all the user&#39;s subsequent requests.&lt;/p&gt;
&lt;p&gt;The storage cluster batches the changes and sends them to a set of dedicated servers in each frontend cluster. These dedicated servers then unpack the changes and route the invalidations to the right memcache servers. This mechanism results in fewer packets than if the storage cluster was sending each invalidation for a key directly to the memcache server holding that key.&lt;/p&gt;
&lt;h4 id=&quot;regional-pools&quot;&gt;Regional Pools &lt;a class=&quot;direct-link&quot; href=&quot;#regional-pools&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Large items that are accessed infrequently rarely need to be replicated. For those keys, there is an optimization in place to store only one copy per region.&lt;/p&gt;
&lt;p&gt;Facebook stores these keys in a &lt;em&gt;regional pool&lt;/em&gt;, which contains a set of memcache servers that are shared by multiple frontend clusters. This is more memory efficient than over-replicating items with a low access rate.&lt;/p&gt;
&lt;h4 id=&quot;cold-cluster-warmup&quot;&gt;Cold Cluster Warmup &lt;a class=&quot;direct-link&quot; href=&quot;#cold-cluster-warmup&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When a new frontend cluster is being brought online, any requests to it will result in a cache miss, and this could lead to overloading the database. Facebook has a mechanism called &lt;em&gt;Cold Cluster Warmup&lt;/em&gt; to mitigate this.&lt;/p&gt;
&lt;p&gt;Quoting the paper to describe the solution:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Cold Cluster Warmup mitigates this by allowing clients in the “cold cluster” (i.e. the frontend cluster that has an empty cache) to retrieve data from the “warm cluster” (i.e. a cluster that has caches with normal hit rates) rather than the persistent storage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&quot;across-regions%3A-consistency&quot;&gt;Across Regions: Consistency &lt;a class=&quot;direct-link&quot; href=&quot;#across-regions%3A-consistency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Facebook deploys regions across geographic locations worldwide. This has a few advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Web servers can be closer to users, which reduces the latency in responding to requests.&lt;/li&gt;
&lt;li&gt;Better fault tolerance since we can withstand natural disasters or power failures in one region.&lt;/li&gt;
&lt;li&gt;New locations can provide cheaper power and other economic benefits.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Facebook designates one region to hold the primary database which all writes must go to, and the other regions to contain read-only replicas. They use MySQL&#39;s replication mechanism to keep the replica databases in sync with the primary. The key challenge here is in keeping the data in memcache consistent with the primary database, which may be in another region.&lt;/p&gt;
&lt;p&gt;With the information stored on Facebook—friend lists, status, posts, likes, photos—it is not critical for users to always see fresh data. Users will typically tolerate seeing slightly stale data for these things. Thus, Facebook&#39;s setup allows for users in a secondary region to see slightly stale data for the sake of better performance. The goal here, though, is to reduce that window of staleness and ensure that the data across all regions is eventually consistent.&lt;/p&gt;
&lt;p&gt;There are two major considerations here:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When writes come from a primary region.&lt;/li&gt;
&lt;li&gt;When writes come from a secondary region.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3 id=&quot;writes-from-a-primary-region&quot;&gt;Writes from a primary region &lt;a class=&quot;direct-link&quot; href=&quot;#writes-from-a-primary-region&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This follows the mechanism described earlier. Writes go directly to the storage cluster in the region, which then replicates them to the secondary regions. Clients in secondary regions may read stale data for any key modified here if there is a lag in replicating the changes to those regions.&lt;/p&gt;
&lt;h3 id=&quot;writes-from-a-secondary-region&quot;&gt;Writes from a secondary region &lt;a class=&quot;direct-link&quot; href=&quot;#writes-from-a-secondary-region&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Consider the following race that can happen when a client C1 updates the database from a secondary region:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  Key k starts with value v1
  C1 is in a secondary region
  C1 updates k=v2 in primary DB
  C1 delete(k)  (in local region)
  C1 get(k), miss
  C1 reads local DB  -- sees v1, not v2!
  later, v2 arrives from primary DB (replication lag)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This violates the read-after-write consistency guarantee, and Facebook prevents this scenario by using &lt;em&gt;remote markers&lt;/em&gt;.  With this mechanism, when a web server in a secondary region wants to update data that affects a key &lt;em&gt;k&lt;/em&gt;, it must:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Set a remote marker &lt;em&gt;r&lt;sub&gt;k&lt;/sub&gt;&lt;/em&gt; in the regional pool. Think of &lt;em&gt;r&lt;sub&gt;k&lt;/sub&gt;&lt;/em&gt; as a memcache record that represents extra information for key &lt;em&gt;k&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Send the write to the primary region and include &lt;em&gt;r&lt;sub&gt;k&lt;/sub&gt;&lt;/em&gt; in the request, so that the primary knows to invalidate &lt;em&gt;r&lt;sub&gt;k&lt;/sub&gt;&lt;/em&gt; when it replicates the write.&lt;/li&gt;
&lt;li&gt;Delete &lt;em&gt;k&lt;/em&gt; in the local cluster.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By doing this, the web server&#39;s next request for &lt;em&gt;k&lt;/em&gt; will result in a cache miss, after which it will check the regional pool to find &lt;em&gt;r&lt;sub&gt;k&lt;/sub&gt;&lt;/em&gt;. If &lt;em&gt;r&lt;sub&gt;k&lt;/sub&gt;&lt;/em&gt; exists, it means the data in the local region is stale and the server will direct the read to the primary region. Otherwise, it will read from the local region.&lt;/p&gt;
&lt;p&gt;Here, Facebook trades additional latency when there&#39;s a cache miss for a lower probability of reading stale data.&lt;/p&gt;
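&lt;p&gt;The remote-marker write and read paths can be sketched as follows. This is a toy model under invented names (&lt;code&gt;SecondaryRegion&lt;/code&gt;, &lt;code&gt;replication_arrives&lt;/code&gt;), with dicts standing in for the primary database, the lagging replica, and the regional pool:&lt;/p&gt;

```python
# Sketch of the remote-marker read path in a secondary region.
class SecondaryRegion:
    def __init__(self, primary_db):
        self.primary_db = primary_db
        self.replica_db = dict(primary_db)  # lags behind the primary
        self.cache = {}
        self.regional_pool = set()          # holds remote markers r_k

    def write(self, key, value):
        self.regional_pool.add(key)    # 1. set remote marker r_k
        self.primary_db[key] = value   # 2. send the write to the primary region
        self.cache.pop(key, None)      # 3. delete k in the local cluster

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        if key in self.regional_pool:    # marker present: local data is stale,
            return self.primary_db[key]  # pay the latency and read the primary
        value = self.replica_db[key]     # no marker: the local replica is fine
        self.cache[key] = value
        return value

    def replication_arrives(self, key):
        # The replicated write from the primary also invalidates r_k.
        self.replica_db[key] = self.primary_db[key]
        self.regional_pool.discard(key)
```

&lt;p&gt;Between the local delete and the arrival of the replicated write, reads for &lt;em&gt;k&lt;/em&gt; are redirected to the primary region, which is exactly the latency-for-freshness trade described above.&lt;/p&gt;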
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;I&#39;ve often thought of caches primarily as a way to reduce latency, but this lecture has been an eye-opener in also thinking of them as vital for a system to survive its load. I&#39;ve left out some bits from the paper on fault tolerance and single-server improvements, which I&#39;ll encourage you to read up on. Also, the paper doesn&#39;t say much about this, but I&#39;d be interested in learning more about what pages are cached on Facebook.&lt;/p&gt;
&lt;p&gt;I suspect that this paper is severely outdated and a &lt;a href=&quot;https://www.usenix.org/system/files/osdi20-shi.pdf&quot;&gt;recent paper&lt;/a&gt; from Facebook makes me believe that memcache is no longer used there (this is subject to confirmation, though), but the ideas in here are still very relevant.&lt;/p&gt;
&lt;h1 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/memcache-fb.pdf&quot;&gt;Scaling Memcache at Facebook&lt;/a&gt; - Original paper from Facebook.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-memcached.txt&quot;&gt;Scaling Memcache at Facebook&lt;/a&gt; - MIT 6.824 lecture notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://engineering.fb.com/core-data/scaling-memcached-at-facebook/&quot;&gt;Scaling memcached at Facebook&lt;/a&gt; - Post on Facebook&#39;s engineering blog.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 15 - Spark</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-15-spark/"/>
		<updated>2020-10-16T21:03:01-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-15-spark/</id>
		<content type="html">&lt;p&gt;In the first lecture of this series, I wrote about &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-1-mapreduce/&quot;&gt;MapReduce&lt;/a&gt; as a distributed computation framework. MapReduce partitions the input data across worker nodes, which process data in two stages: &lt;em&gt;map&lt;/em&gt; and &lt;em&gt;reduce.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;While MapReduce was innovative, it came with some limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Running iterative operations like &lt;a href=&quot;https://www.geeksforgeeks.org/page-rank-algorithm-implementation/&quot;&gt;PageRank&lt;/a&gt; in MapReduce involves chaining multiple MapReduce jobs together. Since a MapReduce job writes its output to disk, these sequential operations incur heavy disk I/O and high latency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Similarly, interactive queries where a user runs multiple ad-hoc queries on the same subset of data need to fetch data from the disk for each query.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;MapReduce&#39;s API is restrictive: programmers must express every computation as a map and a reduce operation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence, MapReduce is inefficient for applications that reuse intermediate results across multiple computations. Researchers at UC Berkeley invented Spark as a more efficient framework for executing such applications. It deals with these limitations by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Caching intermediate data in the main memory to reduce disk I/O.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Generalizing the MapReduce model into a more flexible model with support for more operations than just &lt;em&gt;map&lt;/em&gt; and &lt;em&gt;reduce&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#resilient-distributed-datasets-%28rdds%29&quot;&gt;Resilient Distributed Datasets (RDDs)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#spark-programming-interface&quot;&gt;Spark Programming Interface&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#representing-rdds&quot;&gt;Representing RDDs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#fault-tolerance&quot;&gt;Fault Tolerance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;resilient-distributed-datasets-(rdds)&quot;&gt;Resilient Distributed Datasets (RDDs) &lt;a class=&quot;direct-link&quot; href=&quot;#resilient-distributed-datasets-(rdds)&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;At the heart of Spark is the &lt;em&gt;Resilient Distributed Datasets (RDDs)&lt;/em&gt; abstraction. RDDs enable programmers to perform in-memory computations on large clusters in a fault-tolerant manner. Quoting the &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/zaharia-spark.pdf&quot;&gt;original paper&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We can create RDDs from operations on either data on disk or other RDDs. These operations on other RDDs are called &lt;em&gt;transformations.&lt;/em&gt; Examples of transformations are &lt;em&gt;map&lt;/em&gt;, &lt;em&gt;filter&lt;/em&gt;, and &lt;em&gt;join.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Instead of holding the materialized data in memory, an RDD keeps track of all the transformations that have led to its current state, i.e. how the RDD was derived from other datasets. This chain of transformations is called a &lt;em&gt;lineage&lt;/em&gt;. By tracking the lineage of RDDs, we save memory and can reconstruct an RDD after a failure.&lt;/p&gt;
&lt;p&gt;There&#39;s another class of operations in Spark called &lt;em&gt;actions&lt;/em&gt;. Until we call an action, invoking transformations in Spark only creates the lineage graph. Actions are what cause the computation to execute. Examples of actions are &lt;em&gt;count&lt;/em&gt; and &lt;em&gt;collect&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Let&#39;s look at the following example of a Spark program from the paper.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  lines = spark.textFile(&amp;quot;hdfs://...&amp;quot;)
  errors = lines.filter(_.startsWith(&amp;quot;ERROR&amp;quot;))
  errors.cache()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On line 1, we create an RDD backed by a file in &lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS&quot;&gt;HDFS&lt;/a&gt;, while line 2 derives a filtered RDD from it. Line 3 asks Spark to store the filtered RDD in memory for reuse across computations.&lt;/p&gt;
&lt;p&gt;We can then perform further transformations on the RDD and use their results as shown below.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  // Count errors mentioning MySQL:
  errors.filter(_.contains(&amp;quot;MySQL&amp;quot;)).count()

  // Return the time fields of errors mentioning HDFS as an array
  // (assuming time is field number 3 in a tab-separated format):
  errors.filter(_.contains(&amp;quot;HDFS&amp;quot;))
        .map(_.split('\t')(3))
        .collect()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It is only after the first action, &lt;em&gt;count&lt;/em&gt;, runs that Spark will execute the previous operations and store the partitions of the &lt;em&gt;errors&lt;/em&gt; RDD in memory. Note that the first RDD, &lt;em&gt;lines&lt;/em&gt;, is not loaded into RAM. This saves space, as the error messages we need may be only a small fraction of the data.&lt;/p&gt;
&lt;p&gt;Figure 1 below shows the lineage graph for the RDDs in our third query.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/spark-lineage-graph.png&quot; alt=&quot;Figure 1: Lineage graph for the third query in our example. Starts with a &#39;lines&#39; box with an arrow pointing to an &#39;errors&#39; box, with an error pointing to an &#39;HDFS errors&#39; box, with a last arrow to a &#39;time fields&#39; box. Boxes represent RDDs and arrows represent transformations. &quot;&gt; &lt;/p&gt;
&lt;p&gt;In the query, we started with &lt;em&gt;errors,&lt;/em&gt; which is an RDD based on a filter of &lt;em&gt;lines&lt;/em&gt;. We then applied two further transformations, &lt;em&gt;filter&lt;/em&gt; and &lt;em&gt;map&lt;/em&gt;, to yield our final RDD. If we lose any of the partitions of &lt;em&gt;errors&lt;/em&gt;, Spark can rebuild it by applying a filter on only the corresponding partitions of lines.&lt;/p&gt;
&lt;p&gt;This is already more efficient than MapReduce as Spark forwards data directly from one transformation to the next and can reuse intermediate data without involving the disk.&lt;/p&gt;
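To make the lazy-evaluation and lineage ideas concrete, here is a toy Python sketch (my own construction, not Spark's actual API) in which each dataset stores only the closure that derives it from its parent, so transformations build up a lineage and an action forces the computation:

```python
# Toy sketch (not Spark's API): each dataset stores only its lineage,
# a closure that recomputes it from its parent, so transformations are
# lazy and a lost result can be rebuilt by re-running the closure.

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute  # the lineage, captured as a closure

    @staticmethod
    def from_list(data):
        return ToyRDD(lambda: list(data))

    def map(self, f):
        # Transformation: extends the lineage, computes nothing yet.
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    def count(self):
        # Action: forces the whole lineage to execute.
        return len(self._compute())

    def collect(self):
        return self._compute()

lines = ToyRDD.from_list(["ERROR MySQL down", "INFO all good", "ERROR HDFS slow"])
errors = lines.filter(lambda l: l.startswith("ERROR"))  # nothing runs here
print(errors.count())  # 2 -- only now does the filter actually run
```

Losing a computed result here is harmless: calling `collect()` again simply re-runs the stored closures, which is the essence of lineage-based recovery.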
&lt;p&gt;Note that users can control how their programs store RDDs. We can choose whether we want an RDD to use in-memory storage or persist it to disk. We can also specify that an RDD&#39;s elements should be partitioned across machines based on a key in each record.&lt;/p&gt;
&lt;h3 id=&quot;spark-programming-interface&quot;&gt;Spark Programming Interface &lt;a class=&quot;direct-link&quot; href=&quot;#spark-programming-interface&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Spark code, as shown in the previous example, runs in a &lt;em&gt;driver&lt;/em&gt; machine which manages the execution and data flow to &lt;em&gt;worker&lt;/em&gt; machines. The driver defines the RDDs, invokes actions on them, and tracks their lineage. The worker machines store RDD partitions in the RAM across operations. The figure below illustrates this.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/spark-runtime.png&quot; alt=&quot;Figure 2: Spark runtime. The user&#39;s driver program launches multiple workers, which read data blocks from a distributed file system and can persist computed RDD partitions in memory. &quot;&gt; &lt;/p&gt;
&lt;p&gt;Spark supports a wide range of operations on RDDs. You can find a full list of these operations &lt;a href=&quot;https://spark.apache.org/docs/latest/api/java/index.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;representing-rdds&quot;&gt;Representing RDDs &lt;a class=&quot;direct-link&quot; href=&quot;#representing-rdds&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Spark implementation represents an RDD through an interface that exposes five pieces of information:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A set of &lt;em&gt;partitions&lt;/em&gt;, which are atomic chunks of the dataset.&lt;/li&gt;
&lt;li&gt;Data placement of the RDD.&lt;/li&gt;
&lt;li&gt;A set of &lt;em&gt;dependencies&lt;/em&gt; on parent RDDs.&lt;/li&gt;
&lt;li&gt;A function for computing the dataset based on its parents.&lt;/li&gt;
&lt;li&gt;Metadata about its partitioning scheme.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The table below describes these pieces.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/spark-rdd-interface.png&quot; alt=&quot;Interface used to represent RDDs in Spark.&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Table 1 - Interface used to represent RDDs in Spark.&lt;/p&gt;
&lt;p&gt;The paper classifies dependencies into two types: &lt;em&gt;narrow&lt;/em&gt; and &lt;em&gt;wide&lt;/em&gt; dependencies. A parent RDD and a child RDD have a narrow dependency between them if at most one partition of the child RDD uses each partition of the parent RDD. For example, the result of a &lt;em&gt;map&lt;/em&gt; operation on a parent RDD is a child RDD with the same partitions as the parent.&lt;/p&gt;
&lt;p&gt;But with wide dependencies, multiple child partitions may depend on a single partition of a parent RDD. An example of this is in a &lt;em&gt;groupByKey&lt;/em&gt; operation, which needs to look at data from all the partitions in the parent RDD since we must consider all the records for a key.&lt;/p&gt;
&lt;p&gt;The paper explains the reasons for this distinction:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;First, narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a &lt;em&gt;map&lt;/em&gt; followed by a &lt;em&gt;filter&lt;/em&gt; on an element-by-element basis. In contrast, wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.&lt;/p&gt;
&lt;p&gt;Second, recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.&lt;/p&gt;
&lt;/blockquote&gt;
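One way to picture the distinction is to ask which parent partitions each child partition reads. The following is a hypothetical sketch (not Spark code; the function names are mine):

```python
# Narrow dependency: each child partition reads exactly one parent
# partition, as in map or filter, so it can be computed and recovered
# independently of its siblings.
def map_dependencies(child_partition):
    return {child_partition}

# Wide dependency: any parent partition may hold records for any key, so
# a groupByKey child partition must read every parent partition (a shuffle).
def group_by_key_dependencies(child_partition, num_parent_partitions):
    return set(range(num_parent_partitions))

print(map_dependencies(2))              # {2}
print(group_by_key_dependencies(2, 4))  # {0, 1, 2, 3}
```

This is why recovering a lost partition under a narrow dependency touches one parent partition, while under a wide dependency it can touch them all.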
&lt;h3 id=&quot;fault-tolerance&quot;&gt;Fault Tolerance &lt;a class=&quot;direct-link&quot; href=&quot;#fault-tolerance&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When a machine crashes, we lose its memory and computation state. The Spark driver then re-runs the transformations on the crashed machine&#39;s partitions on other machines. For narrow dependencies, the driver only needs to re-execute lost partitions.&lt;/p&gt;
&lt;p&gt;With wide dependencies, since recomputing a failed partition requires information from all the partitions, we need to re-execute all the partitions from the start, even the ones that didn&#39;t fail. Spark supports checkpoints to HDFS to speed up this process. With a checkpoint, the driver only needs to recompute along the lineage from the latest checkpoint.&lt;/p&gt;
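As a toy illustration of checkpoint-based recovery (the pipeline and numbers here are made up, not Spark code), recovery from a checkpoint replays only the transformations that come after it:

```python
# A made-up three-step pipeline; recovery from a checkpoint replays only
# the transformations after the checkpointed step.

transformations = [
    lambda xs: [x * 2 for x in xs],   # step 1
    lambda xs: [x + 1 for x in xs],   # step 2
    lambda xs: [x * 10 for x in xs],  # step 3
]

source = [1, 2, 3]
checkpoint = [3, 5, 7]   # persisted output of steps 1-2
checkpoint_step = 2      # number of steps already applied at the checkpoint

def recover(use_checkpoint):
    data = checkpoint if use_checkpoint else source
    start = checkpoint_step if use_checkpoint else 0
    for transform in transformations[start:]:
        data = transform(data)
    return data

# Both paths produce the same result; the checkpointed one replays one
# step instead of three.
print(recover(False))  # [30, 50, 70]
print(recover(True))   # [30, 50, 70]
```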
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;In the evaluation performed by the paper&#39;s authors, Spark performed up to 20x faster than Hadoop for iterative applications and sped up a real-world data analytics report by 40x. But Spark isn&#39;t perfect. For one, RAM is expensive, and our computation may require a huge amount of RAM for in-memory processing. You can learn about other limitations &lt;a href=&quot;https://www.whizlabs.com/blog/apache-spark-limitations/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Overall, Spark&#39;s reuse of data in-memory and its wider set of operations make it an improvement over MapReduce for expressivity and performance.&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/zaharia-spark.pdf&quot;&gt;Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing&lt;/a&gt; - Original paper from UC Berkeley.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-spark.txt&quot;&gt;Lecture 15: Spark&lt;/a&gt; - MIT 6.824 lecture notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://spark.apache.org/docs/latest/api.html&quot;&gt;Spark API Documentation.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 14 - Optimistic Concurrency Control</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-14-occ/"/>
		<updated>2020-10-06T22:00:31-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-14-occ/</id>
		<content type="html">&lt;p&gt;This lecture on optimistic concurrency control is based on a &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/farm-2015.pdf&quot;&gt;2015 paper&lt;/a&gt; from Microsoft Research describing a system called FaRM. FaRM (Fast Remote Memory) is a main memory computing platform that provides distributed transactions with strict serializability, high performance, durability and high availability.&lt;/p&gt;
&lt;p&gt;FaRM takes advantage of two hardware trends to provide these guarantees:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Using &lt;a href=&quot;https://en.wikipedia.org/wiki/Remote_direct_memory_access&quot;&gt;Remote Direct Memory Access&lt;/a&gt; (RDMA) for faster communication between servers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An inexpensive approach to providing &lt;a href=&quot;https://en.wikipedia.org/wiki/Non-volatile_random-access_memory&quot;&gt;Non-volatile Random Access Memory (NVRAM)&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In this post, I&#39;ll explain how FaRM uses these techniques to perform faster and yield greater throughput than &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-13-spanner/&quot;&gt;Google Spanner&lt;/a&gt; for simple transactions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#overview&quot;&gt;Overview&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#performance-optimizations&quot;&gt;Performance Optimizations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#non-volatile-ram&quot;&gt;Non-volatile RAM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#networking&quot;&gt;Networking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#can-we-bypass-the-cpu%3F&quot;&gt;Can we bypass the CPU?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#optimistic-and-pessimistic-concurrency-control&quot;&gt;Optimistic and pessimistic concurrency control&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#farm-machines-maintain-logs&quot;&gt;FaRM machines maintain logs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#how-transactions-work&quot;&gt;How transactions work&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#execute&quot;&gt;Execute&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#lock&quot;&gt;Lock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#commit-primaries&quot;&gt;Commit primaries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#validate&quot;&gt;Validate&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#lock-%2B-validate-%3D-strict-serializability&quot;&gt;Lock + Validate = Strict Serializability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#examples&quot;&gt;Examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#fault-tolerance---commit-backups&quot;&gt;Fault Tolerance - Commit backups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#truncate&quot;&gt;Truncate&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;overview&quot;&gt;Overview &lt;a class=&quot;direct-link&quot; href=&quot;#overview&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;The FaRM database is intended for a setup where all the servers are managed in the same data centre. It partitions data into &lt;em&gt;regions&lt;/em&gt; stored over many servers and uses &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-4-primary-backup-replication/&quot;&gt;primary/backup replication&lt;/a&gt; to replicate these regions on one primary and &lt;em&gt;f&lt;/em&gt; backup replicas. With this configuration, a region with &lt;em&gt;f&lt;/em&gt; + 1 replicas can tolerate &lt;em&gt;f&lt;/em&gt; failures.&lt;/p&gt;
&lt;p&gt;FaRM uses &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-8-zookeeper/&quot;&gt;ZooKeeper&lt;/a&gt; as its configuration manager, which manages the primary and backup servers allocation for each region.&lt;/p&gt;
&lt;h3 id=&quot;performance-optimizations&quot;&gt;Performance Optimizations &lt;a class=&quot;direct-link&quot; href=&quot;#performance-optimizations&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;At a high level, some optimizations and tradeoffs that FaRM makes to provide its guarantees are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The data must fit in the total RAM, so there are no disk reads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It makes use of non-volatile DRAM, so there are no disk writes under normal operations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;FaRM&#39;s transaction and replication protocol take advantage of one-sided RDMA for better performance. RDMA means a server can have rapid access to another server&#39;s RAM while bypassing its CPU.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;FaRM also supports fast user-level access to the Network Interface Card (NIC).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let&#39;s now learn about these optimizations in greater detail.&lt;/p&gt;
&lt;h3 id=&quot;non-volatile-ram&quot;&gt;Non-volatile RAM &lt;a class=&quot;direct-link&quot; href=&quot;#non-volatile-ram&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A FaRM server stores data in its RAM, which eliminates the I/O bottleneck in reading from and writing data to disk. For context, it takes about 200ns to write to RAM, while SSD and hard drive writes take about 100μs and 10ms respectively &lt;sup&gt;&lt;a name=&quot;1&quot;&gt;&lt;a href=&quot;#1-link&quot;&gt;[1]&lt;/a&gt;&lt;/a&gt;&lt;/sup&gt;. Writing to RAM is roughly 500x faster than writing to SSD, and we gain a lot of performance from this optimization.&lt;/p&gt;
&lt;p&gt;But by storing data in RAM, we lose the guarantee that our data will survive a power failure. FaRM provides this guarantee by attaching its servers to a distributed UPS, which makes the data durable. When the power goes out, the distributed UPS saves the RAM&#39;s content to an SSD using the energy from its battery. This operation may take a few minutes, after which the server shuts down. When the server is restarted, FaRM loads the saved image on the SSD back into memory. Another approach to making RAM durable is to use &lt;a href=&quot;https://en.wikipedia.org/wiki/NVDIMM&quot;&gt;NVDIMMs&lt;/a&gt;, but the distributed UPS mechanism used by FaRM is cheaper.&lt;/p&gt;
&lt;p&gt;Note that this distributed UPS arrangement only works in the event of a power failure. Other faults such as CPU, memory, hardware errors, and bugs in FaRM, can cause a server to crash without persisting the RAM content. FaRM copes with this by storing data on more than one replica. When a server crashes, FaRM copies the data from the RAM of one of the failed server&#39;s replicas to another machine to ensure that there are always &lt;em&gt;f&lt;/em&gt; + 1 copies available, where &lt;em&gt;f&lt;/em&gt; represents the number of failures we can tolerate.&lt;/p&gt;
&lt;p&gt;By eliminating the bottleneck in accessing the disk, we can now focus on two other performance bottlenecks: network and CPU.&lt;/p&gt;
&lt;h3 id=&quot;networking&quot;&gt;Networking &lt;a class=&quot;direct-link&quot; href=&quot;#networking&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;At a high level, communication between two computers on a network is as shown in Figure 1 below &lt;sup&gt;&lt;a name=&quot;2&quot;&gt;&lt;a href=&quot;#2-link&quot;&gt;[2]&lt;/a&gt;&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://timilearning.com/uploads/farm-cpu-bottleneck.png&quot; alt=&quot;Machine A communicates with Machine B through the OS/Kernel layer and Network Interface Card &quot;&gt;
&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 1 - Setup for RPC between computers on a network&lt;/p&gt;
&lt;p&gt;In this setup, we see that communication between two machines A and B on a network goes through the kernel of both machines. This kernel consists of different layers through which messages must pass.&lt;/p&gt;
&lt;p&gt;Figure 2 below shows this in more detail.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/farm-networking.png&quot; alt=&quot;Image showing data flow in a network. Data buffer -&gt; Socket -&gt; TCP/UDP -&gt; IP -&gt; Ethernet -&gt; Network Adapter&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 2 - Networking stack. Message must pass through all the different layers.&lt;/p&gt;
&lt;p&gt;This stack requires a lot of expensive CPU operations—system calls, interrupts, and communication between the different layers—to transmit messages from one computer to another, and these slow the performance of network operations.&lt;/p&gt;
&lt;p&gt;FaRM uses two ideas to improve this networking performance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Kernel bypass&lt;/strong&gt;: Here, the application interacts directly with the NIC without making system calls or involving the kernel.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;One-sided RDMA&lt;/strong&gt;: With one-sided RDMA, a server can read from or write to the memory of a remote computer without involving its CPU.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Figure 3 highlights these ideas. In the figure, we see that the CPU of Machine A communicates with the NIC without the kernel involved, and the NIC of Machine B bypasses the CPU to access the RAM.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://timilearning.com/uploads/farm-rdma.png&quot; alt=&quot;Using RDMA and Kernel Bypass for communication between two machines&quot;&gt;
&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 3 - RDMA and Kernel Bypass&lt;/p&gt;
&lt;p&gt;These optimizations made in FaRM to reduce the storage, network, and CPU usage solve most of the performance bottlenecks found in many applications today.&lt;/p&gt;
&lt;h3 id=&quot;can-we-bypass-the-cpu%3F&quot;&gt;Can we bypass the CPU? &lt;a class=&quot;direct-link&quot; href=&quot;#can-we-bypass-the-cpu%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The transaction protocols we have seen in the systems discussed in earlier lectures like &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-13-spanner/&quot;&gt;Spanner&lt;/a&gt; require active server participation. They need the CPU to answer questions like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Is a database record locked?&lt;/li&gt;
&lt;li&gt;Which is the latest version of a record?&lt;/li&gt;
&lt;li&gt;Is a write committed yet?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But one-sided RDMA bypasses the CPU, so it is not immediately compatible with these protocols. How then can we ensure consistency while avoiding the CPU?&lt;/p&gt;
&lt;p&gt;The rest of this post details FaRM&#39;s approach to solving this challenge.&lt;/p&gt;
&lt;h3 id=&quot;optimistic-and-pessimistic-concurrency-control&quot;&gt;Optimistic and pessimistic concurrency control &lt;a class=&quot;direct-link&quot; href=&quot;#optimistic-and-pessimistic-concurrency-control&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;To recap an &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-12-distributed-transactions/#concurrency-control&quot;&gt;earlier post&lt;/a&gt;, two classes of concurrency control exist for transactions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pessimistic:&lt;/strong&gt; When a transaction wants to read or write an object, it first attempts to acquire a lock. If the attempt is successful, it must hold that lock until it commits or aborts. Otherwise, the transaction must wait until any conflicting transaction releases its lock on the object.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimistic:&lt;/strong&gt; Here, we can run a transaction that reads and writes objects without locking until commit time. The commit stage requires validating that no other transactions changed the data in a way that conflicted with ours. If there&#39;s a conflict, our transaction gets aborted. Otherwise, the database commits the writes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;FaRM uses Optimistic Concurrency Control (OCC) which reduces the need for active server participation when executing transactions.&lt;/p&gt;
&lt;h3 id=&quot;farm-machines-maintain-logs&quot;&gt;FaRM machines maintain logs &lt;a class=&quot;direct-link&quot; href=&quot;#farm-machines-maintain-logs&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Before going into the details of how transactions work, note that each FaRM machine maintains logs which they use as either transaction logs or message queues. These logs are used to communicate between sender-receiver pairs in a FaRM cluster. The sender appends records to the log using one-sided RDMA writes, which the receiver acknowledges via its NIC without involving the CPU. The receiver processes records by periodically polling the head of its log.&lt;/p&gt;
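A rough single-threaded simulation of this log arrangement (the class and field names here are mine, not FaRM's) shows the sender appending at the tail while the receiver consumes from its own head cursor:

```python
# Sender/receiver log pair: the sender appends records at the tail (a
# one-sided RDMA write in FaRM), and the receiver consumes them by
# polling its private head cursor; neither side blocks the other.

class PairLog:
    def __init__(self):
        self.records = []  # stands in for the receiver's log memory
        self.head = 0      # receiver's private cursor into the log

    def append(self, record):
        # In FaRM this write is acknowledged by the receiver's NIC
        # without involving its CPU.
        self.records.append(record)

    def poll(self):
        # The receiver periodically checks the head of its log.
        if self.head == len(self.records):
            return None  # no new records
        record = self.records[self.head]
        self.head += 1
        return record

log = PairLog()
log.append({"type": "LOCK", "tx": 1})
log.append({"type": "COMMIT-PRIMARY", "tx": 1})
print(log.poll())  # {'type': 'LOCK', 'tx': 1}
```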
&lt;h2 id=&quot;how-transactions-work&quot;&gt;How transactions work &lt;a class=&quot;direct-link&quot; href=&quot;#how-transactions-work&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;FaRM supports reading and writing multiple objects within a transaction, and guarantees &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-13-spanner/#spanner-guarantees-external-consistency&quot;&gt;strict serializability&lt;/a&gt; of all successfully committed transactions.&lt;/p&gt;
&lt;p&gt;Each object or record in the database has a 64-bit version number which FaRM uses for concurrency control and an address representing its location in a server.&lt;/p&gt;
&lt;p&gt;Any application thread can start a transaction and this thread then becomes the transaction coordinator. This transaction coordinator communicates directly with replicas and backups as shown in the figure below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://timilearning.com/uploads/farm-commit.png&quot; alt=&quot;FaRM commit protocol with a coordinator C, primaries on P1; P2; P3, and backups on B1; B2; B3. P1 and P2 are read and written. P3 is only read. We use dashed lines for RDMA reads, solid ones for RDMA writes, dotted ones for hardware acks, and rectangles for object data&quot;&gt; &lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 4 - FaRM commit protocol with a coordinator C, primaries on P&lt;sub&gt;1&lt;/sub&gt;; P&lt;sub&gt;2&lt;/sub&gt;; P&lt;sub&gt;3&lt;/sub&gt;, and backups on B&lt;sub&gt;1&lt;/sub&gt;; B&lt;sub&gt;2&lt;/sub&gt;; B&lt;sub&gt;3&lt;/sub&gt;. P&lt;sub&gt;1&lt;/sub&gt; and P&lt;sub&gt;2&lt;/sub&gt; are read and written. P&lt;sub&gt;3&lt;/sub&gt; is only read. We use dashed lines for RDMA reads, solid ones for RDMA writes, dotted ones for hardware acks, and rectangles for object data.&lt;/p&gt;
&lt;h3 id=&quot;execute&quot;&gt;Execute &lt;a class=&quot;direct-link&quot; href=&quot;#execute&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In the Execute phase, the transaction coordinator uses one-sided RDMA to read objects and buffers changes to objects locally. These reads and writes happen without locking any objects; this is the &lt;em&gt;optimism&lt;/em&gt;. The coordinator always sends the reads for an object to the primary copy of the region containing it. For primaries and backups on the same machine as the coordinator, it reads and writes the log using local memory accesses instead of RDMA.&lt;/p&gt;
&lt;p&gt;During transaction execution, the coordinator keeps track of the addresses and versions of all the objects it has accessed. It uses these in later phases.&lt;/p&gt;
&lt;p&gt;After executing the specified operations within a transaction, FaRM starts the commit process using the steps discussed next. Note that I&#39;ll detail them in a different order from Figure 4, but it will make sense why.&lt;/p&gt;
&lt;h3 id=&quot;lock&quot;&gt;Lock &lt;a class=&quot;direct-link&quot; href=&quot;#lock&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The first step in committing a transaction that changes one or more objects is that the coordinator sends a LOCK record to the primary of each written object. This LOCK record contains the transaction ID, the IDs of all the regions with objects written by the transaction, and addresses, versions, and new values of all objects written by the transaction that the destination is primary for. The coordinator uses RDMA to append this record to the log at each primary.&lt;/p&gt;
&lt;p&gt;When a primary receives this record, it attempts to lock each object at its specified version. This involves using an atomic &lt;a href=&quot;https://en.wikipedia.org/wiki/Compare-and-swap&quot;&gt;compare-and-swap&lt;/a&gt; operation on the high-order bit of its version number—which represents the &amp;quot;lock&amp;quot; flag. The primary then sends back a message to the coordinator reporting whether all the lock attempts were successful. Locking will fail if any of these conditions is met:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Another transaction has locked the object.&lt;/li&gt;
&lt;li&gt;The current version of the object differs from what the transaction read and sent in its LOCK record.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A failed lock attempt at any primary will cause the coordinator to abort the transaction by writing an ABORT record to all the primaries so they can release any held locks, after which the coordinator returns an error to the application.&lt;/p&gt;
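The lock attempt and the later commit step can be sketched as follows. This is a simplified, single-threaded simulation using my own encoding; real FaRM performs an atomic compare-and-swap over RDMA on the high-order bit of the 64-bit version word:

```python
# The lock flag lives in the high-order bit of the 64-bit version word.
LOCK_BIT = 2 ** 63

def try_lock(obj, expected_version):
    # Simulates the primary's compare-and-swap: succeed only if the
    # object is unlocked and still at the version the transaction read.
    if obj["version"] == expected_version:
        obj["version"] = expected_version + LOCK_BIT  # set the lock flag
        return True
    return False  # locked by another transaction, or the version changed

def commit(obj, new_value):
    # COMMIT-PRIMARY processing: install the new value, bump the
    # version, and clear the lock flag.
    obj["value"] = new_value
    obj["version"] = obj["version"] - LOCK_BIT + 1

record = {"value": "old", "version": 7}
print(try_lock(record, 7))  # True  -- locked at the version we read
print(try_lock(record, 7))  # False -- lock flag is set, so the CAS fails
commit(record, "new")
print(record)  # {'value': 'new', 'version': 8}
```

Packing the lock flag into the version word is what lets a single compare-and-swap check both failure conditions at once.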
&lt;h3 id=&quot;commit-primaries&quot;&gt;Commit primaries &lt;a class=&quot;direct-link&quot; href=&quot;#commit-primaries&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;If all the primaries report successful lock attempts, then the coordinator will decide to commit the transaction. It broadcasts this decision to the primaries by appending a COMMIT-PRIMARY record to primaries&#39; logs. This COMMIT-PRIMARY record contains the ID of the transaction to commit. A primary processes the record by copying the new value of each object in the LOCK record for the transaction into memory, incrementing the object&#39;s version number, and clearing the object&#39;s lock flag.&lt;/p&gt;
&lt;p&gt;The coordinator does not wait for the primaries to process the log entry before it reports success to the client; it only needs an RDMA acknowledgement from one primary that the entry has been received and stored in NVRAM. We&#39;ll see why this is safe later.&lt;/p&gt;
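&lt;p&gt;To make the version-word manipulation concrete, here is a minimal single-machine sketch in Python of the lock and COMMIT-PRIMARY steps. It assumes a 64-bit version word whose high-order bit is the lock flag; all names are illustrative, not FaRM&#39;s actual code.&lt;/p&gt;

```python
LOCK_FLAG = 2 ** 63  # high-order bit of the 64-bit version word

def try_lock(version_word, expected_version):
    # The compare-and-swap a primary performs when processing a LOCK
    # record. Returns a (success, new_word) pair.
    is_locked = version_word // LOCK_FLAG == 1   # high bit set
    current_version = version_word % LOCK_FLAG
    if is_locked:
        return (False, version_word)             # condition 1: already locked
    if current_version != expected_version:
        return (False, version_word)             # condition 2: stale read
    return (True, version_word + LOCK_FLAG)      # set the lock flag

def commit_primary(version_word):
    # COMMIT-PRIMARY processing: after installing the new value, bump
    # the version number and clear the lock flag.
    return version_word - LOCK_FLAG + 1
```

&lt;p&gt;A real primary performs the compare-and-swap atomically on the object&#39;s header in NVRAM; this sketch only captures the success and failure conditions.&lt;/p&gt;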
&lt;h3 id=&quot;validate&quot;&gt;Validate &lt;a class=&quot;direct-link&quot; href=&quot;#validate&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The commit steps so far describe FaRM&#39;s protocol when a transaction modifies an object. But in read-only transactions, or for objects that are read but not written by a transaction, the coordinator does not write any LOCK records.&lt;/p&gt;
&lt;p&gt;The coordinator performs commit validation in these scenarios by using one-sided RDMA reads to re-fetch each object&#39;s version number and status of its lock flag. If the lock flag is set or the version number has changed since the transaction read it, the coordinator aborts the transaction.&lt;/p&gt;
&lt;p&gt;This optimization avoids holding locks in read-only transactions, which speeds up their execution.&lt;/p&gt;
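&lt;p&gt;Assuming the same kind of 64-bit version word with a high-order lock bit as in the lock step, the validation check can be sketched as follows (a toy sketch with illustrative names, not FaRM&#39;s code):&lt;/p&gt;

```python
LOCK_FLAG = 2 ** 63  # high-order bit of the version word

def validate(read_version, current_word):
    # Re-read the object header over one-sided RDMA and abort unless
    # the object is unlocked and its version is unchanged.
    is_locked = current_word // LOCK_FLAG == 1
    current_version = current_word % LOCK_FLAG
    return (not is_locked) and current_version == read_version
```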
&lt;h4 id=&quot;lock-%2B-validate-%3D-strict-serializability&quot;&gt;Lock + Validate = Strict Serializability &lt;a class=&quot;direct-link&quot; href=&quot;#lock-%2B-validate-%3D-strict-serializability&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Under non-failure conditions, the locking and validation steps guarantee strict serializability for all committed FaRM transactions. They ensure that the result of executing concurrent transactions is the same as if the transactions had executed one after the other, in an order consistent with real time.&lt;/p&gt;
&lt;p&gt;Specifically, the steps guarantee that for a transaction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If there were no conflicting transactions, the object versions it read won&#39;t have changed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If there were conflicting transactions, it will see a changed version number or a lock on an object it accessed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let&#39;s see how this plays out using some examples from the lecture.&lt;/p&gt;
&lt;h4 id=&quot;examples&quot;&gt;Examples &lt;a class=&quot;direct-link&quot; href=&quot;#examples&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In this first example, we have two simultaneous transactions, T1 and T2.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  T1:  Rx  Ly  Vx  Cy
  T2:  Ry  Lx  Vy  Cx

(Where L and V represent the Lock and Validate stages)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, the LOCKs in both transactions will succeed while the VALIDATEs will fail: Vx fails in T1 because T2 has set x&#39;s lock bit, and Vy fails in T2 because T1 holds the lock on y. Both transactions will then abort, which satisfies our desired guarantee.&lt;/p&gt;
&lt;p&gt;This next example has a similar setup.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt; T1:  Rx  Ly  Vx      Cy
 T2:  Ry          Lx  Vy  Cx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, T1 will commit while T2 will abort, since T2&#39;s Vy will see either T1&#39;s lock on y or a higher version of y.&lt;/p&gt;
&lt;h3 id=&quot;fault-tolerance---commit-backups&quot;&gt;Fault Tolerance - Commit backups &lt;a class=&quot;direct-link&quot; href=&quot;#fault-tolerance---commit-backups&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Unfortunately, the protocol described above is not enough to guarantee serializability when there are system failures.&lt;/p&gt;
&lt;p&gt;Recall that the coordinator needs acknowledgement from only one primary for the COMMIT-PRIMARY record before it reports success to clients. This means that a committed write could become visible as soon as the COMMIT-PRIMARY is sent, since that primary will write the data and unlock the objects involved.&lt;/p&gt;
&lt;p&gt;Thus, we could have a scenario where a transaction fails after it has committed on one primary but before committing on the others. This violates transaction atomicity, which requires that if a transaction succeeds on one machine, it must succeed on all the others.&lt;/p&gt;
&lt;p&gt;FaRM achieves this atomicity through the use of COMMIT-BACKUP records. While LOCK records tell the primaries the new values, COMMIT-BACKUP records give the same information to the backups.&lt;/p&gt;
&lt;p&gt;The key thing to note is that &lt;em&gt;before&lt;/em&gt; sending the COMMIT-PRIMARY records, the coordinator sends a COMMIT-BACKUP record to the backups of the primaries involved in a transaction and waits for acknowledgement that it has been persisted at all of them. This COMMIT-BACKUP record contains the same data as the transaction&#39;s LOCK record.&lt;/p&gt;
&lt;p&gt;By also waiting for acknowledgement that a COMMIT-PRIMARY record has been stored from at least one primary, the coordinator ensures that at least one commit record survives any &lt;em&gt;f&lt;/em&gt; failures for transactions reported as committed to the application.&lt;/p&gt;
&lt;h3 id=&quot;truncate&quot;&gt;Truncate &lt;a class=&quot;direct-link&quot; href=&quot;#truncate&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The coordinator truncates the logs at primaries and backups after receiving acknowledgements from all the primaries that they have stored a commit record.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;FaRM is innovative in its use of optimistic concurrency control to allow fast one-sided RDMA reads while providing strict serializability for distributed transactions. I recommend reading the &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-13-spanner/&quot;&gt;Spanner&lt;/a&gt; post for an alternative approach to distributed transactions.&lt;/p&gt;
&lt;p&gt;As far as I know, though, FaRM is still a research prototype and has not been deployed in any production systems. For all of its performance optimizations, FaRM is limited: its data must fit in the cluster&#39;s total RAM, and replication occurs only within a data centre. Its API is also restrictive, as it does not support SQL operations like Spanner does.&lt;/p&gt;
&lt;p&gt;I&#39;ve also omitted some implementation details here on failure detection and reconfiguring the FaRM instance after recovery, but I hope you&#39;ve learned enough to pique your interest in the topic.&lt;/p&gt;
&lt;p&gt;&lt;a name=&quot;1-link&quot;&gt; &lt;a href=&quot;#1&quot;&gt;[1]&lt;/a&gt;: &lt;a href=&quot;https://www.theregister.com/2016/04/21/storage_approaches_memory_speed_with_xpoint_and_storageclass_memory/&quot;&gt;Storage with the speed of memory? XPoint, XPoint, that&#39;s our plan&lt;/a&gt; &lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a name=&quot;2-link&quot;&gt; &lt;a href=&quot;#2&quot;&gt;[2]&lt;/a&gt;: &lt;a href=&quot;https://www.mimuw.edu.pl/~iwanicki/courses/ds/2017/presentations/group-1/05_Lysiak.pdf&quot;&gt;Lecture Notes on Distributed Systems (Fall 2017)&lt;/a&gt; by Jacek Lysiak &lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/farm-2015.pdf&quot;&gt;No compromises: distributed transactions with consistency, availability, and performance&lt;/a&gt; - Original FaRM paper.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-farm.txt&quot;&gt;Lecture 14: FaRM, Optimistic Concurrency Control&lt;/a&gt; - MIT 6.824 lecture notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.cs.dartmouth.edu/~sergey/netreads/path-of-packet/Network_stack.pdf&quot;&gt;Path of a Packet in the Linux Kernel Stack&lt;/a&gt; by Ashwin Kumar Chimata.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#serializable-snapshot-isolation&quot;&gt;Serializable Snapshot Isolation&lt;/a&gt; - Earlier post on an OCC method of transaction isolation.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 13 - Spanner</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-13-spanner/"/>
		<updated>2020-09-12T12:30:49-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-13-spanner/</id>
		<content type="html">&lt;p&gt;Unlike many other databases that either choose not to support distributed transactions at all or opt for weaker consistency models, Spanner is an example of a distributed database that supports &lt;a href=&quot;https://timilearning.com/posts/consistency-models/&quot;&gt;externally consistent&lt;/a&gt; distributed transactions.&lt;/p&gt;
&lt;p&gt;This post will cover how Google Spanner implements a fault-tolerant two-phase commit protocol and how its novel TrueTime API enables it to guarantee external consistency.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#overview&quot;&gt;Overview&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#spanner-partitions-data-into-paxos-groups&quot;&gt;Spanner partitions data into Paxos groups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#spanner-guarantees-external-consistency&quot;&gt;Spanner guarantees external consistency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#external-consistency-is-challenging-to-provide&quot;&gt;External consistency is challenging to provide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#spanner%27s-read-write-transactions-use-locks&quot;&gt;Spanner&#39;s read-write transactions use locks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read-only-transactions&quot;&gt;Read-only transactions&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#why-not-read-the-latest-committed-values%3F&quot;&gt;Why not read the latest committed values?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#snapshot-isolation&quot;&gt;Snapshot Isolation&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#what-if-the-clocks-on-the-replicas-are-not-synchronized%3F&quot;&gt;What if the clocks on the replicas are not synchronized?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#truetime&quot;&gt;TrueTime&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#truetime-models-the-uncertainty-in-clocks&quot;&gt;TrueTime models the uncertainty in clocks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#start-rule&quot;&gt;Start rule&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#commit-wait&quot;&gt;Commit Wait&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#using-the-rules&quot;&gt;Using the rules&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;overview&quot;&gt;Overview &lt;a class=&quot;direct-link&quot; href=&quot;#overview&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;spanner-partitions-data-into-paxos-groups&quot;&gt;Spanner partitions data into Paxos groups &lt;a class=&quot;direct-link&quot; href=&quot;#spanner-partitions-data-into-paxos-groups&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Spanner partitions data across many servers in data centres spread all over the world. It manages the replication of a partition using Paxos, and each partition belongs to a Paxos group. Paxos is like &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-6-7-fault-tolerance-raft/&quot;&gt;Raft&lt;/a&gt; in that each Paxos group has a leader, and the leader replicates a log of operations to its followers. Each replica in a Paxos group is typically in a different data centre.&lt;/p&gt;
&lt;p&gt;This setup is great for a few reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-6/&quot;&gt;Partitioning&lt;/a&gt; the data means we can increase the total throughput via parallelism.&lt;/li&gt;
&lt;li&gt;Data centres fail independently and so Spanner can continue to serve clients even after a data centre failure.&lt;/li&gt;
&lt;li&gt;With global replication, clients can read data faster by going to the replicas closest to them.&lt;/li&gt;
&lt;li&gt;Paxos only requires a majority of the replicas to respond and can tolerate slow or distant replicas.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;spanner-guarantees-external-consistency&quot;&gt;Spanner guarantees external consistency &lt;a class=&quot;direct-link&quot; href=&quot;#spanner-guarantees-external-consistency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I&#39;ve written about external consistency in &lt;a href=&quot;https://timilearning.com/posts/consistency-models/&quot;&gt;another post&lt;/a&gt;, but to summarize, external consistency is the &amp;quot;gold standard&amp;quot; for isolation levels in distributed databases.&lt;/p&gt;
&lt;p&gt;Serializability is often called the gold standard, but while serializability only guarantees that the result of concurrent transactions will be the same as if they had executed in &lt;em&gt;some&lt;/em&gt; serial order, external consistency (sometimes referred to as &lt;em&gt;strict serializability&lt;/em&gt;) constrains what that order may be: the order chosen must be consistent with the real-time execution of those transactions.&lt;/p&gt;
&lt;p&gt;Let&#39;s use the example from the linked post, where we execute two transactions, T1 and T2, in a database. Assume that the database replicates the data across two machines, A and B, both of which are allowed to process transactions.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://timilearning.com/uploads/t1-t2.PNG&quot; alt=&quot;Sample Transactions T&lt;sub&gt;1&lt;/sub&gt;  and T&lt;sub&gt;2&lt;/sub&gt; where T&lt;sub&gt;1&lt;/sub&gt;  has steps a11, a12, a13 and T&lt;sub&gt;2&lt;/sub&gt; has a21, a22 and a23&quot;&gt;
&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Sample Transactions T&lt;sub&gt;1&lt;/sub&gt; and T&lt;sub&gt;2&lt;/sub&gt; &lt;/p&gt;
&lt;p&gt;A client issues transaction T&lt;sub&gt;1&lt;/sub&gt; to machine A which performs the operations a11, a12 and a13. After T&lt;sub&gt;1&lt;/sub&gt; completes, another client issues transaction T&lt;sub&gt;2&lt;/sub&gt; to machine B. Under serializability, it would be valid for transaction T&lt;sub&gt;2&lt;/sub&gt; to not see the latest value of Y written by T&lt;sub&gt;1&lt;/sub&gt;, even though T&lt;sub&gt;2&lt;/sub&gt; started after T&lt;sub&gt;1&lt;/sub&gt; completed.&lt;/p&gt;
&lt;p&gt;This is because the order where T&lt;sub&gt;2&lt;/sub&gt; executes before T&lt;sub&gt;1&lt;/sub&gt; is one of the two possible serial schedules for these transactions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Sa: {a11, a12, a13, a21, a22, a23}&lt;br&gt;
Sb: {a21, a22, a23, a11, a12, a13}&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This scenario could happen if the database uses asynchronous replication, and the writes from T&lt;sub&gt;1&lt;/sub&gt; have not yet replicated to machine B.&lt;/p&gt;
&lt;p&gt;External consistency guarantees that if transaction T&lt;sub&gt;1&lt;/sub&gt; commits before transaction T&lt;sub&gt;2&lt;/sub&gt; starts, then T&lt;sub&gt;2&lt;/sub&gt; will see the effects of T&lt;sub&gt;1&lt;/sub&gt; regardless of what replica it executes on.&lt;/p&gt;
&lt;p&gt;There is no difference between serializability and external consistency in a single-node database: once a transaction commits on a single node, it is visible to every transaction that starts after it.&lt;/p&gt;
&lt;h3 id=&quot;external-consistency-is-challenging-to-provide&quot;&gt;External consistency is challenging to provide &lt;a class=&quot;direct-link&quot; href=&quot;#external-consistency-is-challenging-to-provide&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;One reason distributed databases are partitioned and replicated in data centres spread across multiple locations is to allow clients to make requests to the local replicas closest to them, which yields better performance. However, this arrangement makes it harder to provide strong consistency guarantees, as it means that reads of local replicas must always return fresh data. The challenge here is that a local replica may not have been involved in a Paxos majority, which means it won&#39;t reflect the latest Paxos writes.&lt;/p&gt;
&lt;p&gt;In practice, protocols that provide external consistency have performed poorly in terms of availability, latency, and throughput, which is why many designers of databases with an arrangement similar to Spanner&#39;s have either chosen not to support distributed transactions or provided weaker guarantees than external consistency.&lt;/p&gt;
&lt;p&gt;Spanner tackles this problem by making clients specify whether a transaction is read-only or read-write before executing it, and by making optimizations tailored to each case. The rest of this post will explain how it does that.&lt;/p&gt;
&lt;h2 id=&quot;spanner&#39;s-read-write-transactions-use-locks&quot;&gt;Spanner&#39;s read-write transactions use locks &lt;a class=&quot;direct-link&quot; href=&quot;#spanner&#39;s-read-write-transactions-use-locks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Read-write transactions that involve only one Paxos group or partition use &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-12-distributed-transactions/#two-phase-locking&quot;&gt;two-phase locking&lt;/a&gt; for external consistency. The client issues writes to the leader of the Paxos group, which then manages the locks.&lt;/p&gt;
&lt;p&gt;For transactions that involve multiple Paxos groups, Spanner uses the two-phase commit protocol with long-held locks to guarantee that read-write transactions provide external consistency. In the &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-12-distributed-transactions/#two-phase-commit&quot;&gt;previous discussion&lt;/a&gt; of the two-phase commit protocol, we saw that one of its downsides is how &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-12-distributed-transactions/#the-coordinator-is-a-bottleneck&quot;&gt;the coordinator can be a bottleneck&lt;/a&gt;. I highlighted in the summary that if there is a coordinator failure, there is the risk of the transaction participants being stuck in a waiting state. Spanner solves this problem by combining two-phase commit with the Paxos algorithm. It replicates the transaction coordinator and participants into Paxos groups so it can automatically elect a new coordinator on failure, and the protocol is more fault-tolerant.&lt;/p&gt;
&lt;p&gt;The client buffers the write operations that occur in a transaction until it is ready to commit. At commit time, it chooses the leader of one of the Paxos groups to act as the transaction coordinator and sends a prepare message with the buffered writes and the identity of the coordinator to the leaders of the other participant groups. Each participant leader then acquires write locks and performs the specified operations before responding to the coordinator with the status of its mini-transaction.&lt;/p&gt;
&lt;p&gt;The client also issues reads within read-write transactions to the leader replica of the relevant group, which acquires read locks and reads the most recent data. One of the other limitations of two-phase commit highlighted in the previous lecture is its proneness to deadlocks. Spanner uses the &lt;a href=&quot;http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-woundwait.html&quot;&gt;wound-wait locking rule&lt;/a&gt; to avoid deadlocks when reading data.&lt;/p&gt;
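&lt;p&gt;The wound-wait rule itself is simple to sketch: transaction priority comes from age, where an older transaction has a smaller timestamp. The following is a hypothetical helper, not Spanner&#39;s code:&lt;/p&gt;

```python
WOUND = 1  # the holder is aborted so the requester can proceed
WAIT = 0   # the requester blocks until the holder releases the lock

def wound_wait(requester_ts, holder_ts):
    # An older requester (smaller timestamp) wounds a younger holder;
    # a younger requester waits. This cannot deadlock, because waiting
    # only ever happens in one direction: young waits for old.
    requester_is_older = (min(requester_ts, holder_ts) == requester_ts
                          and requester_ts != holder_ts)
    if requester_is_older:
        return WOUND
    return WAIT
```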
&lt;p&gt;By holding locks on the data and ensuring that only the Paxos leader of the partition it belongs to can read or write the data, read-write transactions in Spanner are externally consistent.&lt;/p&gt;
&lt;p&gt;Now, if Spanner only supported read-write transactions, then it would be fine to leave it with this protocol. But Spanner was built for a workload dominated by read-only transactions. The next section will cover how Spanner&#39;s protocol for read-only transactions achieves 10x latency improvements over read-write transactions.&lt;/p&gt;
&lt;h2 id=&quot;read-only-transactions&quot;&gt;Read-only transactions &lt;a class=&quot;direct-link&quot; href=&quot;#read-only-transactions&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Spanner makes two optimizations to achieve greater performance in read-only transactions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;It does not hold locks or use the two-phase commit protocol to serve requests.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Clients can issue reads involving multiple Paxos groups to the follower replicas of the Paxos groups, not just the leaders. This means that a client can read data from its local replica.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;But these optimizations make it more difficult to provide external consistency. For example, a read request may be sent to a stale replica which may violate external consistency if it returns its latest committed value.&lt;/p&gt;
&lt;p&gt;Something else to note is that reads in a read-only transaction only see the values that were committed &lt;em&gt;before&lt;/em&gt; the transaction started, even if a read-write transaction updates those values and commits while the read-only transaction is running.&lt;/p&gt;
&lt;h3 id=&quot;why-not-read-the-latest-committed-values%3F&quot;&gt;Why not read the latest committed values? &lt;a class=&quot;direct-link&quot; href=&quot;#why-not-read-the-latest-committed-values%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;It might seem like transactions must read the latest committed values to guarantee external consistency, but the next example from the lecture shows why this behaviour could violate the property.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Suppose we have two bank transfers, T1 and T2, and a transaction T3 that reads both.

    T1:  Wx  Wy  C
    T2:                 Wx  Wy  C
    T3:             Rx             Ry

If T3 reads the value of x from T1 and y from T2, the results won&#39;t match any serial order.
    Not T1, T2, T3 or T1, T3, T2.

We want T3 to see both of T2&#39;s writes or none.
We want T3&#39;s reads to *all* occur at the *same* point relative to T1/T2.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Read-only transactions in Spanner use snapshot isolation to prevent the violation in the example.&lt;/p&gt;
&lt;h3 id=&quot;snapshot-isolation&quot;&gt;Snapshot Isolation &lt;a class=&quot;direct-link&quot; href=&quot;#snapshot-isolation&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;With &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#snapshot-isolation-and-repeatable-read&quot;&gt;snapshot isolation&lt;/a&gt;, a database keeps multiple versions of an object, each labelled with the timestamp of the transaction that produced it. Spanner assigns a timestamp to each transaction using these rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For a read-only transaction, Spanner chooses the time at the start of the transaction as the transaction timestamp (TS). Spanner has an API layer on the machines in the cluster which uses the system clock to get the current time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The leader of the coordinator group in charge of a read-write transaction chooses its time when commit begins as the transaction timestamp.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Snapshot isolation enforces that a read-only transaction will only see the versions of a record with a timestamp less than its assigned transaction timestamp, i.e. a snapshot of the record as it was before the transaction started.&lt;/p&gt;
&lt;p&gt;This prevents the problem in the previous example as it guarantees that T3 will see either both of T2&#39;s writes or none. Here&#39;s another version of the example with snapshot isolation implemented:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                      x@10=9         x@20=8
                      y@10=11        y@20=12
    T1 @ 10:  Wx  Wy  C
    T2 @ 20:                 Wx  Wy  C
    T3 @ 15:             Rx             Ry

  &amp;quot;@ 10&amp;quot; indicates the time-stamp.

  - Now, T3&#39;s reads will both be served from the @10 versions.
    T3 won&#39;t see T2&#39;s write even though T3&#39;s read of y occurs after T2.

  - The results are now serializable: T1 T3 T2.
    The serial order is the same as the time-stamp order.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, it is okay for T3 to read an older value of y even though a newer one exists, because T2 and T3 are concurrent, and external consistency allows either order for concurrent transactions.&lt;/p&gt;
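&lt;p&gt;The snapshot rule can be sketched as: given an object&#39;s version history, serve the newest version whose timestamp does not exceed the reading transaction&#39;s timestamp. A toy Python sketch with illustrative names, not Spanner&#39;s storage layer:&lt;/p&gt;

```python
def snapshot_read(history, ts):
    # history is a list of (version_ts, value) pairs in any order.
    # Keep versions whose timestamp is at most ts, then take the newest.
    visible = [v for v in history if min(v[0], ts) == v[0]]
    return max(visible)[1]
```

&lt;p&gt;With y&#39;s history from the example, a read at timestamp 15 returns the @10 version: &lt;code&gt;snapshot_read([(10, 11), (20, 12)], 15)&lt;/code&gt; yields 11.&lt;/p&gt;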
&lt;p&gt;But there is another problem that may arise in the example. T3 can read x from a local replica that hasn&#39;t seen T1&#39;s write because the replica wasn&#39;t in the Paxos majority. To prevent this violation, each replica maintains a &lt;em&gt;safe time&lt;/em&gt; property, which is the maximum timestamp at which it is up to date. Paxos leaders send writes to followers in timestamp order and the safe time represents the most recent timestamp a replica has seen.&lt;/p&gt;
&lt;p&gt;Before serving a read at time 20, a replica must have seen Paxos writes for time &amp;gt; 20, or it will have to wait until it is up to date.&lt;/p&gt;
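&lt;p&gt;The safe-time check can be sketched as a simple guard (illustrative names; integer timestamps assumed):&lt;/p&gt;

```python
def can_serve(read_ts, safe_time):
    # The replica may serve a snapshot read at read_ts only if it has
    # already applied Paxos writes with timestamps strictly beyond
    # read_ts, i.e. safe_time exceeds read_ts. Otherwise the read waits.
    return max(safe_time, read_ts + 1) == safe_time
```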
&lt;h4 id=&quot;what-if-the-clocks-on-the-replicas-are-not-synchronized%3F&quot;&gt;What if the clocks on the replicas are not synchronized? &lt;a class=&quot;direct-link&quot; href=&quot;#what-if-the-clocks-on-the-replicas-are-not-synchronized%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Imagine a scenario where the clocks on each replica are wildly different from each other. Remember that these clocks are used to assign transaction timestamps. How do you think that will affect the correctness of transactions? Does your answer depend on whether we are considering read-only or read-write transactions? I&#39;ll answer these questions in this section, but I suggest taking a minute to think about them before moving on.&lt;/p&gt;
&lt;p&gt;Unsynchronized clocks will not affect the correctness of read-write transactions because they acquire locks for their operations and don&#39;t use snapshot isolation. As stated earlier, the locks guarantee external consistency and ensure that the transactions operate on recent data.&lt;/p&gt;
&lt;p&gt;But for a read-only transaction spanning multiple Paxos groups, the effect of unsynchronized clocks can play out in two ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If Spanner assigns a timestamp that is too large, that transaction timestamp will be higher than the safe time of the replicas in the other Paxos groups involved, and the read will block until they are up to date. Here, the result will be correct, but the execution will be slow.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the transaction timestamp is too small, the transaction will miss writes that have committed on other replicas before it started. Remember that the snapshot isolation rule mandates that a transaction can only read values below its timestamp. Let&#39;s look at the example below where T2 is assigned a timestamp less than T1, even though T1 committed first.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;  r/w T0 @  0: Wx1 C
  r/w T1 @ 10:         Wx2 C
  r/o T2 @  5:                   Rx?

(C represents a commit)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This would cause T2 to read the version of x at time 0, which was 1, even though T2 started after T1 committed (in real time). External consistency requires that T2 sees x=2.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The next section will discuss how Google uses its TrueTime service to synchronize computer clocks to high accuracy.&lt;/p&gt;
&lt;h2 id=&quot;truetime&quot;&gt;TrueTime &lt;a class=&quot;direct-link&quot; href=&quot;#truetime&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8/#unreliable-clocks&quot;&gt;Clocks are unreliable in distributed systems.&lt;/a&gt; A computer&#39;s time is defined by clocks at a collection of government labs and distributed to the computer via various protocols — &lt;a href=&quot;https://en.wikipedia.org/wiki/Network_Time_Protocol&quot;&gt;NTP&lt;/a&gt;, &lt;a href=&quot;https://en.wikipedia.org/wiki/Global_Positioning_System#Timekeeping&quot;&gt;GPS&lt;/a&gt; or &lt;a href=&quot;https://en.wikipedia.org/wiki/WWV_(radio_station)#Service&quot;&gt;WWV&lt;/a&gt;. This distribution to different computers is subject to variable network delays, and so there&#39;s always some uncertainty in what the actual time is.&lt;/p&gt;
&lt;p&gt;TrueTime is a globally distributed clock that Google provides to applications on its servers. It uses GPS and &lt;a href=&quot;https://en.wikipedia.org/wiki/Atomic_clock&quot;&gt;atomic clocks&lt;/a&gt; as its underlying time reference to ensure better clock synchronization between the participating servers than other protocols offer. For context, Google mentions an upper bound of 7ms for the clock offsets between nodes in a cluster, while using NTP for clock synchronization gives somewhere between 100ms and 250ms.&lt;/p&gt;
&lt;h3 id=&quot;truetime-models-the-uncertainty-in-clocks&quot;&gt;TrueTime models the uncertainty in clocks &lt;a class=&quot;direct-link&quot; href=&quot;#truetime-models-the-uncertainty-in-clocks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The beauty of TrueTime is that it also models the uncertainty in clocks by representing time as an interval. When you ask for the current time, it gives you an interval that represents what the earliest and latest possible values for the current time are. TrueTime&#39;s protocol guarantees that this interval is accurate and that the correct time is somewhere in the interval.&lt;/p&gt;
&lt;p&gt;Table 1 below shows the TrueTime API.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/truetime.png&quot; alt=&quot;Table 1: TrueTime API. The argument t is of type TTstamp.&quot;&gt; &lt;/p&gt;
&lt;p&gt;From the paper, the &lt;em&gt;TT.now()&lt;/em&gt; method returns a &lt;em&gt;TTinterval&lt;/em&gt; that is guaranteed to contain the absolute time during which &lt;em&gt;TT.now()&lt;/em&gt; was invoked. Note also that the endpoints of a &lt;em&gt;TTinterval&lt;/em&gt; are of type &lt;em&gt;TTstamp&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Spanner uses TrueTime to guarantee external consistency through the &lt;em&gt;start&lt;/em&gt; and &lt;em&gt;commit wait&lt;/em&gt; rules, which we&#39;ll look at next.&lt;/p&gt;
&lt;h3 id=&quot;start-rule&quot;&gt;Start rule &lt;a class=&quot;direct-link&quot; href=&quot;#start-rule&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This states that a transaction&#39;s timestamp, TS, must be no less than the value of &lt;em&gt;TT.now().latest&lt;/em&gt;. Recall from an earlier section that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For a RO transaction, Spanner assigns TS at the start of the transaction.&lt;/li&gt;
&lt;li&gt;For a RW transaction, Spanner assigns TS when commit begins.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The start rule enforces that the value chosen for TS must be greater than or equal to &lt;em&gt;TT.now().latest&lt;/em&gt;.&lt;/p&gt;
&lt;h3 id=&quot;commit-wait&quot;&gt;Commit Wait &lt;a class=&quot;direct-link&quot; href=&quot;#commit-wait&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This rule applies only to read-write transactions and ensures that a transaction does not commit until its chosen timestamp has passed, i.e. until &lt;em&gt;TT.after(TS)&lt;/em&gt; is true. In practice, this means a node must wait out the clock uncertainty, about 7ms, before it can report that a transaction has committed.&lt;/p&gt;
&lt;p&gt;This rule guarantees that any transaction that starts after a read-write transaction has committed will be given a later transaction timestamp.&lt;/p&gt;
&lt;h3 id=&quot;using-the-rules&quot;&gt;Using the rules &lt;a class=&quot;direct-link&quot; href=&quot;#using-the-rules&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Let&#39;s look at the example below, which updates the earlier examples with intervals and the commit wait rule. The scenario here is that T1 commits, then T2 starts, so T2 must see T1&#39;s writes, i.e. we need TS1 to be less than TS2.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                   |1-----------10| |11--------------20|
  r/w T1 @ 10:         Wx2 P           C
                                 |10--------12|
  r/o T2 @ 12:                           Rx?

(where P stands for the &#39;prepare&#39; phase in the two-phase protocol)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we assume the database assigns TS1 as 10, which guarantees that C will occur after time 10 due to commit wait. We also assume that Rx in T2 starts after T1 commits, and thus after time 10. T2 chooses &lt;em&gt;TT.now().latest&lt;/em&gt; as its transaction timestamp (start rule), which is no earlier than the current time, and the current time is after 10. Therefore TS2 &amp;gt; TS1.&lt;/p&gt;
&lt;p&gt;Going back to the definition of external consistency which is that a transaction that starts after an earlier one commits should see the effects of the earlier one, this protocol provides that safety because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Commit wait guarantees that, by the time a read-write transaction commits, its timestamp is in the past.&lt;/li&gt;
&lt;li&gt;The timestamp used for snapshot isolation in a read-only transaction (&lt;em&gt;TT.now().latest&lt;/em&gt;) is guaranteed to be greater than or equal to the correct time, and thus greater than or equal to the timestamp of any previously committed transaction (because of commit wait).&lt;/li&gt;
&lt;/ul&gt;
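&lt;p&gt;The interplay of the two rules can be simulated in a few lines. This toy sketch, with a made-up fixed clock uncertainty, shows that once T1 has waited out commit wait, any timestamp the start rule assigns to a later transaction T2 must be greater than TS1.&lt;/p&gt;

```python
import time

# A toy simulation of the start and commit wait rules, assuming a fixed
# clock uncertainty. Timestamps are in milliseconds.
EPSILON_MS = 4.0


def tt_now():
    t = time.time() * 1000.0
    return (t - EPSILON_MS, t + EPSILON_MS)  # (earliest, latest)


def assign_timestamp():
    # Start rule: TS must be at least TT.now().latest.
    return tt_now()[1]


def commit_wait(ts):
    # Commit wait: block until TT.after(ts), i.e. until even the
    # earliest possible current time is beyond ts.
    while not tt_now()[0] > ts:
        time.sleep(0.001)


# T1 is a read-write transaction: it picks TS1 and waits it out.
ts1 = assign_timestamp()
commit_wait(ts1)  # T1 may now report commit; ts1 is in the past

# T2 starts after T1 committed, so the start rule gives it a later timestamp.
ts2 = assign_timestamp()
assert ts2 > ts1
```

&lt;p&gt;Note that the wait loop runs for roughly twice the uncertainty bound, which is exactly the cost Spanner pays for keeping that bound small with specialized clock hardware.&lt;/p&gt;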
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Spanner has inspired the creation of other systems like &lt;a href=&quot;https://www.cockroachlabs.com/&quot;&gt;CockroachDB&lt;/a&gt; and &lt;a href=&quot;https://www.yugabyte.com/&quot;&gt;YugaByteDB&lt;/a&gt; in its use of tightly synchronized clocks to provide external consistency, but its design hasn&#39;t come without criticism. &lt;em&gt;&amp;quot;NewSQL database systems are failing to guarantee consistency,&amp;quot;&lt;/em&gt; Daniel Abadi wrote in a &lt;a href=&quot;https://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html&quot;&gt;2018 post&lt;/a&gt;, &lt;em&gt;&amp;quot;and I blame Spanner.&amp;quot;&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/spanner.pdf&quot;&gt;Spanner: Google&#39;s Globally-Distributed Database&lt;/a&gt; - Original Spanner paper.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-spanner.txt&quot;&gt;Lecture 13: Spanner&lt;/a&gt; - MIT 6.824 Lecture Notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://cloud.google.com/spanner/docs/whitepapers/life-of-reads-and-writes#top_of_page&quot;&gt;Life of Cloud Spanner Reads &amp;amp; Writes&lt;/a&gt; - Official Spanner documentation.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://fauna.com/blog/serializability-vs-strict-serializability-the-dirty-secret-of-database-isolation-levels&quot;&gt;Serializability vs “Strict” Serializability: The Dirty Secret of Database Isolation Levels&lt;/a&gt; by Daniel Abadi and Matt Freels.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.bailis.org/blog/linearizability-versus-serializability&quot;&gt;Linearizability vs Serializability&lt;/a&gt; by Peter Bailis.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://techcommunity.microsoft.com/t5/sql-server/serializable-vs-snapshot-isolation-level/ba-p/383281&quot;&gt;Serializable vs Snapshot Isolation&lt;/a&gt; by Craig Freedman.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html&quot;&gt;NewSQL database systems are failing to guarantee consistency, and I blame Spanner&lt;/a&gt; by Daniel Abadi&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://muratbuffalo.blogspot.com/2013/07/spanner-googles-globally-distributed_4.html&quot;&gt;Spanner: Google&#39;s Globally-Distributed Database&lt;/a&gt; - Murat Demirbas&#39; summary of the Spanner paper.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.cockroachlabs.com/blog/living-without-atomic-clocks/&quot;&gt;Living without Atomic Clocks&lt;/a&gt; by Spencer Kimball and Irfan Sharif on the CockroachDB blog. CockroachDB is a database based on Spanner&#39;s design.&lt;/li&gt;
&lt;li&gt;I&#39;ve written more about snapshot isolation in another &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#snapshot-isolation-and-repeatable-read&quot;&gt;post&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 12 - Distributed Transactions</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-12-distributed-transactions/"/>
		<updated>2020-08-13T23:50:35-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-12-distributed-transactions/</id>
		<content type="html">&lt;p&gt;Distributed databases typically divide their tables into &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-6/&quot;&gt;partitions&lt;/a&gt; spread across different servers which get accessed by many clients. In these databases, client transactions often span the different servers as the transactions may need to read from various partitions. A distributed transaction is a database transaction which spans multiple servers.&lt;/p&gt;
&lt;p&gt;A transaction with the correct behaviour must exhibit the following, also known as the &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#the-meaning-of-acid&quot;&gt;ACID properties&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Atomicity&lt;/strong&gt;: Either all writes in the transaction succeed or none, even in the presence of failures.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: The transaction must obey application-specific invariants.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: There must be no interference from concurrently executing transactions. The ideal isolation level is &lt;em&gt;serializable isolation&lt;/em&gt;, which guarantees that the result of executing concurrent transactions is the same as if the database executed them one after the other. I&#39;ve written about serializability more extensively &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#serializability&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://timilearning.com/posts/consistency-models/&quot;&gt;here&lt;/a&gt;, and I&#39;d recommend reading those posts if you&#39;re interested in learning more.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Durability&lt;/strong&gt;: Committed writes must be permanent.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These properties are more difficult to guarantee when a transaction involves multiple servers. For example, the transaction may succeed on some servers and fail on others. There needs to be a protocol to ensure that the database maintains atomicity even in that scenario. Also, if several clients are executing transactions concurrently, we must take extra care to control access to the shared data for those transactions.&lt;/p&gt;
&lt;p&gt;This post will focus on how distributed databases provide &lt;em&gt;atomicity&lt;/em&gt; through an atomic commit protocol known as Two-phase commit, and how concurrency control methods like Two-phase locking help to guarantee &lt;em&gt;serializability&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I&#39;ve written about some of these topics in other posts on this site, so I&#39;ll be posting links to them if you want more detail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#concurrency-control&quot;&gt;Concurrency Control&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#pessimistic-concurrency-control&quot;&gt;Pessimistic Concurrency Control&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#simple-locking&quot;&gt;Simple locking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#two-phase-locking&quot;&gt;Two-phase locking&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#atomic-commit&quot;&gt;Atomic Commit&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#two-phase-commit&quot;&gt;Two-phase commit&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#the-coordinator-is-a-bottleneck&quot;&gt;The coordinator is a bottleneck&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#two-phase-commit-and-raft&quot;&gt;Two-phase commit and Raft&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;concurrency-control&quot;&gt;Concurrency Control &lt;a class=&quot;direct-link&quot; href=&quot;#concurrency-control&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Concurrency control ensures that concurrent transactions execute &lt;em&gt;correctly&lt;/em&gt;, i.e., that they are serializable. There are two classes of concurrency control for transactions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pessimistic:&lt;/strong&gt; Here, a transaction must place locks on the shared data objects that it wants to access before doing any actual reading or writing. When another transaction wants to access any of those records, it must wait for the original transaction to release those locks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimistic:&lt;/strong&gt; In this class, transactions read or modify records without placing any locks on them. However, when it&#39;s time to commit the transaction, the system checks if the reads/writes were serializable, i.e. if the transaction&#39;s results are consistent with a serial order of execution. If not, the database aborts the transaction and retries it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Pessimistic concurrency control is faster if there are frequent conflicts between concurrent transactions, while optimistic concurrency control is faster when the conflicts are rare. We&#39;ll cover optimistic concurrency control in a later post.&lt;/p&gt;
&lt;h3 id=&quot;pessimistic-concurrency-control&quot;&gt;Pessimistic Concurrency Control &lt;a class=&quot;direct-link&quot; href=&quot;#pessimistic-concurrency-control&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There are two pessimistic concurrency control mechanisms highlighted in the lecture material for ensuring serializable transactions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple locking&lt;/li&gt;
&lt;li&gt;Two-phase locking&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;simple-locking&quot;&gt;Simple locking &lt;a class=&quot;direct-link&quot; href=&quot;#simple-locking&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In simple locking, each transaction must first acquire a lock for &lt;em&gt;every&lt;/em&gt; shared data object that it intends to read or write before it does any actual reading or writing. It then releases its locks only after the transaction has committed or aborted.&lt;/p&gt;
&lt;p&gt;One downside of this method is that applications that discover which objects need to be read by reading other shared data will have to lock every object that they &lt;em&gt;might&lt;/em&gt; need to read. Thus, a transaction may end up locking more data objects than needed.&lt;/p&gt;
&lt;h4 id=&quot;two-phase-locking&quot;&gt;Two-phase locking &lt;a class=&quot;direct-link&quot; href=&quot;#two-phase-locking&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Two-phase locking (or 2PL) differs from simple locking in that a transaction only acquires locks as needed. It works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each transaction acquires locks as it proceeds, i.e. the set of locks it holds grows monotonically until it needs no more. This is the &lt;em&gt;first phase&lt;/em&gt; of the process.&lt;/li&gt;
&lt;li&gt;Once the transaction commits or aborts, it releases all its locks. This is the &lt;em&gt;second phase&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
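&lt;p&gt;The two phases can be sketched as follows. This is an illustrative Python model, not a real lock manager: locks are acquired one at a time as objects are touched (growing phase), and released only at commit or abort (shrinking phase).&lt;/p&gt;

```python
import threading

# A minimal sketch of two-phase locking. The names and structure are
# illustrative; a real lock manager would also handle lock modes,
# queuing, and deadlock handling.


class Transaction:
    def __init__(self, lock_table):
        self.lock_table = lock_table  # maps object key to a threading.Lock
        self.held = []                # locks acquired so far, in order

    def read(self, key, store):
        self._acquire(key)
        return store.get(key)

    def write(self, key, value, store):
        self._acquire(key)
        store[key] = value

    def _acquire(self, key):
        lock = self.lock_table[key]
        if lock not in self.held:  # growing phase: only ever add locks
            lock.acquire()
            self.held.append(lock)

    def commit(self):
        # Shrinking phase: release everything, and only now.
        for lock in reversed(self.held):
            lock.release()
        self.held = []


store = {"x": 1, "y": 2}
locks = {"x": threading.Lock(), "y": threading.Lock()}
t1 = Transaction(locks)
t1.write("x", t1.read("y", store) + 1, store)  # locks y, then x, as needed
t1.commit()
```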
&lt;p&gt;Two-phase locking is prone to deadlocks. A scenario involving two transactions T1 and T2, as shown below, is a real possibility in this protocol.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;                                  T1      T2
                                  get(x)  get(y)
                                  get(y)  get(x)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The system must be able to detect such cycles, or specify a lock timeout after which it aborts a blocked transaction. Deadlock is an issue even for single-node databases, as long as multiple clients can access the database at the same time. &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/&quot;&gt;This&lt;/a&gt; post I wrote earlier goes into more detail about 2PL and transaction isolation levels.&lt;/p&gt;
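&lt;p&gt;One common way to detect such deadlocks is a &lt;em&gt;wait-for graph&lt;/em&gt;: an edge from T1 to T2 means T1 is waiting for a lock that T2 holds, and a cycle in the graph means deadlock. Here is a small sketch, using the scenario above as input; the representation is illustrative rather than any particular database&#39;s implementation.&lt;/p&gt;

```python
# Deadlock detection with a wait-for graph: a cycle means deadlock,
# and the database picks a victim transaction in the cycle to abort.


def find_cycle(waits_for):
    # waits_for maps a transaction to the set of transactions it waits on.
    def visit(node, path):
        if node in path:
            return path[path.index(node):]  # found a cycle
        for nxt in waits_for.get(node, set()):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        return None

    for txn in waits_for:
        cycle = visit(txn, [])
        if cycle:
            return cycle
    return None


# The scenario above: T1 holds x and waits for y; T2 holds y and waits for x.
graph = {"T1": {"T2"}, "T2": {"T1"}}
cycle = find_cycle(graph)
assert cycle is not None  # deadlock: abort one transaction in the cycle
```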
&lt;h2 id=&quot;atomic-commit&quot;&gt;Atomic Commit &lt;a class=&quot;direct-link&quot; href=&quot;#atomic-commit&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;So far, we have discussed how concurrency control methods ensure that transactions are serializable. This next challenge, however, is more peculiar to distributed transactions. As stated earlier, the outcome on the individual servers involved in a distributed transaction may vary if one or more servers fail. To guarantee the atomicity property of transactions, we must take extra care to ensure that all the servers involved come to the same decision on the transaction outcome.&lt;/p&gt;
&lt;h3 id=&quot;two-phase-commit&quot;&gt;Two-phase commit &lt;a class=&quot;direct-link&quot; href=&quot;#two-phase-commit&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Two-phase commit (or 2PC) is a protocol used to guarantee atomicity in distributed transactions. Note that the only similarity it shares with two-phase locking is the naming; the two protocols solve different problems.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/two-phase-commit.png&quot; alt=&quot;Figure 1: A successful execution of two-phase commit (2PC)&quot;&gt; &lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 1: A successful execution of two-phase commit (2PC)&lt;sup&gt;[1]&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Two-phase commit works as follows for a distributed transaction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The database adds another entity, known as the &lt;em&gt;transaction coordinator&lt;/em&gt;, to be in charge of the transaction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All the other servers involved in the transaction are called &lt;em&gt;participants&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The transaction coordinator first delegates the writes in the transaction to the participants. Each participant creates a &lt;em&gt;nested&lt;/em&gt; transaction from the original one, executes the operations which may require holding locks, and sends an acknowledgement to the coordinator.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When the coordinator receives the acknowledgement messages, it begins the first phase of the protocol. In this phase, the coordinator sends PREPARE messages to the participants. Each participant then responds to the coordinator by telling it whether it is PREPARED to commit or abort the transaction, based on the outcome of the nested transaction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If &lt;em&gt;any&lt;/em&gt; of the participants responds with an abort message, the coordinator decides to abort the whole transaction. The coordinator commits a transaction only if all the participants are ready to commit. The second phase starts when the coordinator creates a COMMITTED or ABORTED record for the overall transaction based on these conditions, and stores that outcome in its durable log. It then broadcasts that decision to the participant nodes as the outcome of the overall transaction.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that once a participant promises that it can commit the transaction, it must fulfil that promise regardless of failures. To make this possible, a participant records its prepared state in a durable log before responding to the coordinator, so that it can restore that state from the log on recovery.&lt;/p&gt;
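&lt;p&gt;The coordinator&#39;s side of the protocol can be summarized in a short sketch. The &lt;em&gt;Participant&lt;/em&gt; objects here are stand-ins: a real implementation would exchange these messages over the network and write each prepared state and decision to a durable log.&lt;/p&gt;

```python
# A sketch of two-phase commit from the coordinator's point of view.


class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.outcome = None

    def prepare(self):
        # Phase 1: vote, after (in a real system) durably logging
        # the prepared state.
        return "PREPARED" if self.can_commit else "ABORT"

    def finish(self, decision):
        # Phase 2: apply the coordinator's decision.
        self.outcome = decision


def two_phase_commit(participants):
    # Phase 1: send PREPARE messages and collect votes.
    votes = [p.prepare() for p in participants]
    # The coordinator commits only if every participant is prepared;
    # it durably logs this decision before broadcasting it.
    decision = "COMMITTED" if all(v == "PREPARED" for v in votes) else "ABORTED"
    # Phase 2: broadcast the outcome.
    for p in participants:
        p.finish(decision)
    return decision


assert two_phase_commit([Participant(True), Participant(True)]) == "COMMITTED"
assert two_phase_commit([Participant(True), Participant(False)]) == "ABORTED"
```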
&lt;h4 id=&quot;the-coordinator-is-a-bottleneck&quot;&gt;The coordinator is a bottleneck &lt;a class=&quot;direct-link&quot; href=&quot;#the-coordinator-is-a-bottleneck&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The major downside of the two-phase commit protocol is that if the coordinator fails before it can broadcast the outcome to the participants, the participants may get stuck in a waiting state. A participant that has indicated it&#39;s prepared to commit cannot decide the outcome of the transaction on its own, as another participant may be prepared to abort. Likewise, a stuck participant cannot unilaterally abort the transaction, because the coordinator might have sent a COMMIT message to another participant before it crashed.&lt;/p&gt;
&lt;p&gt;This is not ideal because the participants may hold locks on shared objects while they are stuck in the waiting state, and thus may prevent other transactions from progressing.&lt;/p&gt;
&lt;p&gt;We can improve the fault tolerance of 2PC by integrating it with a consensus algorithm, as discussed next.&lt;/p&gt;
&lt;h4 id=&quot;two-phase-commit-and-raft&quot;&gt;Two-phase commit and Raft &lt;a class=&quot;direct-link&quot; href=&quot;#two-phase-commit-and-raft&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Consensus algorithms like &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-6-7-fault-tolerance-raft/&quot;&gt;Raft&lt;/a&gt; solve a different problem from atomic commit protocols. We use Raft to get high availability by replicating the data on multiple servers, where all servers do the same thing. This differs from two-phase commit in that 2PC does not help with availability, and the participant servers each perform different operations. 2PC also requires that all the servers do their part, unlike Raft, which only needs a majority.&lt;/p&gt;
&lt;p&gt;However, we can combine the two-phase commit protocol with a consensus algorithm as shown below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/two-phase-and-raft.png&quot; alt=&quot;Figure 2: Using 2PC with a distributed consensus algorithm&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 2: Using 2PC with a distributed consensus algorithm&lt;/p&gt;
&lt;p&gt;In Figure 2, the transaction coordinator (Tc) and the participants (A and B) each form a Raft group with three replicas. We can then perform 2PC among the leaders of each Raft group. This way, the system can tolerate failures and still make progress, as Raft will automatically elect a new leader. The next lecture will be on &lt;a href=&quot;https://cloud.google.com/spanner&quot;&gt;Google Spanner&lt;/a&gt;, which combines 2PC with the Paxos algorithm.&lt;/p&gt;
&lt;p&gt;[1] By Martin Kleppmann in &lt;a href=&quot;https://dataintensive.net/&quot;&gt;Designing Data-Intensive Applications.&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Chapter 9 of &lt;a href=&quot;https://ocw.mit.edu/resources/res-6-004-principles-of-computer-system-design-an-introduction-spring-2009/online-textbook/&quot;&gt;Principles of Computer System Design: An Introduction, Part I.&lt;/a&gt; by Jerome H. Saltzer and M. Frans Kaashoek&lt;/li&gt;
&lt;li&gt;Chapters 7 and 9 of &lt;a href=&quot;http://dataintensive.net/&quot;&gt;Designing Data-Intensive Applications&lt;/a&gt; by Martin Kleppmann.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-2pc.txt&quot;&gt;Lecture 12: Distributed Transactions&lt;/a&gt; - MIT 6.824 Lecture Notes.&lt;/li&gt;
&lt;li&gt;I&#39;ve gone into more detail about 2PC in &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-2/#atomic-commit-and-two-phase-commit-(2pc)&quot;&gt;another post.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 11 - Cache Consistency, Frangipani</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-11-cache-consistency-frangipani/"/>
		<updated>2020-07-29T23:56:45-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-11-cache-consistency-frangipani/</id>
		<content type="html">&lt;p&gt;The ideal distributed file system would guarantee that all its users have coherent access to a shared set of files and be easily scalable. It would also be fault-tolerant and require minimal human administration.&lt;/p&gt;
&lt;p&gt;Frangipani is a distributed file system that approximates this ideal by providing a consistent view of shared files while maintaining a cache for each user, offering the ability to scale up by adding new Frangipani servers, being able to recover automatically from server failures, and providing easy administration.&lt;/p&gt;
&lt;p&gt;This post will focus on how Frangipani maintains cache coherence through the interaction between its two-layer structure and a distributed lock service.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#frangipani&quot;&gt;Frangipani&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#overview&quot;&gt;Overview&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#frangipani-maintains-a-write-back-cache&quot;&gt;Frangipani maintains a write-back cache&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#synchronization-and-cache-coherence&quot;&gt;Synchronization and Cache Coherence&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#frangipani-uses-multiple-reader%2Fsingle-writer-locks&quot;&gt;Frangipani uses multiple-reader/single-writer locks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#dealing-with-conflicts&quot;&gt;Dealing with conflicts&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#a-read-lock-holder-must-invalidate-its-cache-entries&quot;&gt;A read lock holder must invalidate its cache entries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#a-write-lock-holder-must-flush-its-cache-entries&quot;&gt;A write lock holder must flush its cache entries&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-locking-protocol-ensures-cache-coherence&quot;&gt;The locking protocol ensures cache coherence&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#logging&quot;&gt;Logging&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#frangipani-does-not-log-user-data&quot;&gt;Frangipani does not log user data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#recovery&quot;&gt;Recovery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#frangipani-vs-gfs&quot;&gt;Frangipani vs GFS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;frangipani&quot;&gt;Frangipani &lt;a class=&quot;direct-link&quot; href=&quot;#frangipani&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;overview&quot;&gt;Overview &lt;a class=&quot;direct-link&quot; href=&quot;#overview&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Frangipani is built on top of a distributed storage service named Petal, which provides virtual disks to its clients. Petal’s virtual disks are similar to physical disks in the way that data is written and read in blocks. It also provides the option to replicate data for high availability. Much of Frangipani’s abilities to be fault-tolerant, scalable and provide easy administration are inherited from Petal.&lt;/p&gt;
&lt;p&gt;A typical setup consists of multiple Frangipani servers running on top of a shared Petal virtual disk as shown in Figure 1 below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/frangipani.png&quot; alt=&quot;Figure 1: Frangipani layering. Several interchangeable Frangipani servers provide access to one set of files on one Petal virtual disk&quot;&gt; &lt;/p&gt;
&lt;h4 id=&quot;frangipani-maintains-a-write-back-cache&quot;&gt;Frangipani maintains a write-back cache &lt;a class=&quot;direct-link&quot; href=&quot;#frangipani-maintains-a-write-back-cache&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Frangipani was built for a use case where all the servers are under a common administration e.g. a research lab with collaborating users. In such an environment, each user’s workstation will have a Frangipani file server module sitting below the user programs.&lt;/p&gt;
&lt;p&gt;In this scenario, most of the operations will involve a user accessing their files. Frangipani makes these operations &lt;em&gt;fast&lt;/em&gt; by maintaining a &lt;a href=&quot;https://www.d.umn.edu/~gshute/arch/cache-coherence.xhtml&quot;&gt;write-back cache&lt;/a&gt; on each workstation. However, a user may occasionally want to access files written by another user, or even access their files on another workstation. The goal in these cases is that the operations are &lt;em&gt;correct.&lt;/em&gt; That is, we want every read for a file from one workstation to see the latest write to that file, despite the file being in another workstation’s cache. Herein lies the challenge of cache coherence: &lt;em&gt;how can we keep the data across multiple caches consistent?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Next up, we’ll discuss Frangipani’s approach to solving this problem.&lt;/p&gt;
&lt;h3 id=&quot;synchronization-and-cache-coherence&quot;&gt;Synchronization and Cache Coherence &lt;a class=&quot;direct-link&quot; href=&quot;#synchronization-and-cache-coherence&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;h4 id=&quot;frangipani-uses-multiple-reader%2Fsingle-writer-locks&quot;&gt;Frangipani uses multiple-reader/single-writer locks &lt;a class=&quot;direct-link&quot; href=&quot;#frangipani-uses-multiple-reader%2Fsingle-writer-locks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When a Frangipani server wants to read a file or directory, it requests a &lt;em&gt;read lock&lt;/em&gt; on the object, which enables the server to load the relevant data from disk into its cache.&lt;/p&gt;
&lt;p&gt;Similarly, a server updates a file or directory by requesting a &lt;em&gt;write lock&lt;/em&gt;, after which it can read or write the associated data from the disk and cache it.&lt;/p&gt;
&lt;p&gt;Multiple servers can hold read locks for an object, but those locks must be released before a write lock request can be granted. A Frangipani server gets a &lt;em&gt;lease&lt;/em&gt; from the lock service when a lock request is granted, and it must continually renew this lease before a specified expiration time. Otherwise, the lock server will mark it as failed and reallocate the locks.&lt;/p&gt;
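&lt;p&gt;A toy model of such a lock table with leases might look like the following. The structure and the lease duration are illustrative assumptions, not Frangipani&#39;s actual lock service.&lt;/p&gt;

```python
import time

# A sketch of a multiple-reader/single-writer lock table with leases,
# loosely modelled on the description above. Timings are made up.
LEASE_SECONDS = 30.0


class LockService:
    def __init__(self):
        self.readers = {}   # server name -> lease expiry time
        self.writer = None  # (server name, lease expiry) or None

    def _expire(self):
        # Drop leases that were not renewed in time.
        now = time.time()
        self.readers = {s: exp for s, exp in self.readers.items() if exp > now}
        if self.writer and now > self.writer[1]:
            self.writer = None

    def acquire_read(self, server):
        self._expire()
        if self.writer:
            return False  # the writer must release (and flush) first
        self.readers[server] = time.time() + LEASE_SECONDS
        return True

    def acquire_write(self, server):
        self._expire()
        if self.writer or self.readers:
            return False  # all read locks must be released first
        self.writer = (server, time.time() + LEASE_SECONDS)
        return True

    def renew(self, server):
        # A server must renew before its lease expires or lose its locks.
        if server in self.readers:
            self.readers[server] = time.time() + LEASE_SECONDS


svc = LockService()
assert svc.acquire_read("ws1")
assert svc.acquire_read("ws2")       # multiple readers are fine
assert not svc.acquire_write("ws3")  # blocked until readers release
```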
&lt;h4 id=&quot;dealing-with-conflicts&quot;&gt;Dealing with conflicts &lt;a class=&quot;direct-link&quot; href=&quot;#dealing-with-conflicts&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;h5 id=&quot;a-read-lock-holder-must-invalidate-its-cache-entries&quot;&gt;A read lock holder must invalidate its cache entries &lt;a class=&quot;direct-link&quot; href=&quot;#a-read-lock-holder-must-invalidate-its-cache-entries&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;When a conflicting lock request comes in for a file/directory, the lock service asks the current lock holder to release its lock. For a server holding a read lock, it will be asked to release its lock if a write lock request comes in. When this happens, the server must invalidate the object&#39;s entry in its cache before complying. This ensures that the server must fetch fresh data from the disk for any subsequent reads to that file or directory.&lt;/p&gt;
&lt;h5 id=&quot;a-write-lock-holder-must-flush-its-cache-entries&quot;&gt;A write lock holder must flush its cache entries &lt;a class=&quot;direct-link&quot; href=&quot;#a-write-lock-holder-must-flush-its-cache-entries&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;A server holding a write lock may be asked to release its lock or downgrade it to a read lock. When that happens, it must first flush the dirty data to disk before complying. Note that if it is only downgrading its lock, it can still keep the cached data since no other server will update it. The upshot of this is that the cached copy of a server’s disk block can differ from the on-disk version &lt;em&gt;only&lt;/em&gt; if it holds the write lock for that block.&lt;/p&gt;
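&lt;p&gt;The two revocation rules can be sketched as follows. The cache and the &lt;em&gt;disk&lt;/em&gt; dict (standing in for the Petal virtual disk) are illustrative; the point is only the invalidate-versus-flush distinction.&lt;/p&gt;

```python
# A sketch of how a Frangipani server might react when the lock service
# revokes one of its locks. Names and structure are illustrative, not
# the paper's code.


class CacheEntry:
    def __init__(self, data, dirty=False):
        self.data = data
        self.dirty = dirty


class Server:
    def __init__(self, disk):
        self.disk = disk  # stands in for the Petal virtual disk
        self.cache = {}   # object key -> CacheEntry

    def on_revoke_read_lock(self, key):
        # Invalidate: later reads must fetch fresh data from disk.
        self.cache.pop(key, None)

    def on_revoke_write_lock(self, key, downgrade=False):
        entry = self.cache.get(key)
        if entry and entry.dirty:
            self.disk[key] = entry.data  # flush dirty data first
            entry.dirty = False
        if not downgrade:
            self.cache.pop(key, None)  # full release also drops the entry


disk = {"f": "old"}
s = Server(disk)
s.cache["f"] = CacheEntry("new", dirty=True)
s.on_revoke_write_lock("f", downgrade=True)
assert disk["f"] == "new"  # the write reached disk
assert "f" in s.cache      # a downgrade keeps the (now clean) cached copy
```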
&lt;h4 id=&quot;the-locking-protocol-ensures-cache-coherence&quot;&gt;The locking protocol ensures cache coherence &lt;a class=&quot;direct-link&quot; href=&quot;#the-locking-protocol-ensures-cache-coherence&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Summarizing this section with a quote from the paper:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Frangipani’s locking protocol ensures that updates requested to the same data by different servers are serialized. A write lock that covers dirty data can change owners only after the dirty data has been written to Petal, either by the original lock holder or by a recovery demon running on its behalf.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This protocol ensures that reads in Frangipani always see the latest writes, guaranteeing cache coherence.&lt;/p&gt;
&lt;h3 id=&quot;logging&quot;&gt;Logging &lt;a class=&quot;direct-link&quot; href=&quot;#logging&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Frangipani keeps track of &lt;em&gt;metadata&lt;/em&gt; updates in a &lt;a href=&quot;https://timilearning.com/posts/data-storage-on-disk/part-one/#write-ahead-logs&quot;&gt;write-ahead log&lt;/a&gt; to simplify failure recovery and improve the performance of the system. The paper defines metadata as any on-disk data structure other than the contents of an ordinary file. This could refer to information about a directory, or pointers to the location of the files it contains.&lt;/p&gt;
&lt;p&gt;Each Frangipani server has its own private log in Petal and before making a metadata update, it creates a log record describing the changes and appends the record to its in-memory log. This log is then written to Petal before the actual metadata is modified in its permanent location.&lt;/p&gt;
&lt;p&gt;Frangipani assigns a new &lt;em&gt;version number&lt;/em&gt; to a metadata block each time the block is updated. A metadata update can span multiple blocks, and for each block that a log record updates, the record contains a description of the changes and the new version number.&lt;/p&gt;
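&lt;p&gt;A rough sketch of such a versioned log record, with invented field names (the real on-disk format differs):&lt;/p&gt;

```python
def make_log_record(update_id, block_changes, versions):
    """Build a Frangipani-style metadata log record (a sketch). For each
    block the update touches, the record stores the change plus a freshly
    bumped version number for that block."""
    entries = []
    for block_id, change in block_changes.items():
        versions[block_id] = versions.get(block_id, 0) + 1  # new version per update
        entries.append({"block": block_id, "change": change,
                        "version": versions[block_id]})
    return {"update": update_id, "blocks": entries}
```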
&lt;h4 id=&quot;frangipani-does-not-log-user-data&quot;&gt;Frangipani does not log user data &lt;a class=&quot;direct-link&quot; href=&quot;#frangipani-does-not-log-user-data&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Note that Frangipani does not log user data; only metadata is logged. This means that if a user on a workstation writes data to a file in its cache and the workstation crashes immediately afterwards, the recently written data may be lost. Ordinary Unix file systems have the same property today. Applications that need stronger recovery guarantees can call &lt;em&gt;fsync&lt;/em&gt; to flush the cache to disk as soon as a file is written.&lt;/p&gt;
&lt;p&gt;For example, if a user adds a new file with contents to a directory, Frangipani will log that a new file has been added to the directory, but it will not know the file contents until the cached data is flushed to disk. Note that if another workstation had been granted a read lock for the file before the crash happened, Frangipani’s locking protocol guarantees that the user changes from the original server must have been written to disk beforehand.&lt;/p&gt;
&lt;h3 id=&quot;recovery&quot;&gt;Recovery &lt;a class=&quot;direct-link&quot; href=&quot;#recovery&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There is a recovery daemon which helps to manage the recovery of failed servers. A failure can be detected in two ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When the lock service asks for a lock back and does not get a reply.&lt;/li&gt;
&lt;li&gt;When a client of a Frangipani server does not receive a response to a request.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When a Frangipani server crashes while holding locks, the locks that it owns cannot be released without performing the necessary recovery actions. Specifically, the crashed server’s logs must be processed and any pending updates must be written to Petal.&lt;/p&gt;
&lt;p&gt;The lock service performs recovery by asking another Frangipani server to process the crashed server’s logs and apply pending updates. The recovery server is itself granted a lock to ensure exclusive access to the crashed server’s log.&lt;/p&gt;
&lt;p&gt;Frangipani uses the version number attached to each metadata block to ensure that recovery never replays a log record that describes an update which has already been completed.&lt;/p&gt;
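&lt;p&gt;A sketch of how that version check might guard replay during recovery (all names invented; the real implementation differs):&lt;/p&gt;

```python
def recover(log, blocks, versions):
    """Replay a crashed server's log (a sketch). An entry is applied only
    when the block's current version is older than the entry's version, so
    an update that already reached the block is never redone."""
    for record in log:
        for entry in record["blocks"]:
            block_id = entry["block"]
            if versions.get(block_id, 0) >= entry["version"]:
                continue  # this update completed before the crash; skip it
            blocks[block_id] = entry["change"]
            versions[block_id] = entry["version"]
```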
&lt;h3 id=&quot;frangipani-vs-gfs&quot;&gt;Frangipani vs GFS &lt;a class=&quot;direct-link&quot; href=&quot;#frangipani-vs-gfs&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;GFS is another system which was covered earlier in a &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-3-gfs/&quot;&gt;post&lt;/a&gt; for this course. Although both are distributed file systems, a major architectural difference is that GFS has no caches, since its goal is good performance for sequential reads and writes of large files that are too big to fit in a cache. As a result, it needs no cache coherence protocol and its clients are relatively simple, unlike Frangipani workstations.&lt;/p&gt;
&lt;p&gt;Another difference is that while Frangipani presents itself as an actual file system, applications must be written explicitly to use GFS via library calls. In other words, Frangipani runs at the kernel level while GFS runs at the application level.&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Although some organizations today still store user and project files on distributed file systems, their importance has waned with the rise of laptops (which must be self-contained) and commercial cloud services. Also, the rise of web sites, big data, and cloud computing has shifted the focus of storage system development from file servers to database-like servers which provide a key/value interface.&lt;/p&gt;
&lt;p&gt;However, Frangipani still presents some interesting ideas around cache coherence, distributed crash recovery, distributed transactions, and how these all interact with each other. Note that there are some limitations in its design, such as how locks are held on entire files/directories and the possibility of redundant logging since Petal also maintains its own log.&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/thekkath-frangipani.pdf&quot;&gt;Frangipani: A Scalable Distributed File System&lt;/a&gt; - Original paper by Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee published in 1997.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-frangipani.txt&quot;&gt;Lecture 11: Frangipani&lt;/a&gt; - MIT 6.824 Lecture Notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.d.umn.edu/~gshute/arch/cache-coherence.xhtml&quot;&gt;Cache coherence&lt;/a&gt; by Gary Shute.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 10 -  Cloud Replicated DB, Aurora</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-10-aurora/"/>
		<updated>2020-07-15T09:40:36-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-10-aurora/</id>
		<content type="html">&lt;p&gt;Amazon Aurora is a distributed database service provided by AWS. Its original &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/aurora.pdf&quot;&gt;paper&lt;/a&gt; describes the considerations in building a database for the cloud and details how Aurora&#39;s architecture differs from many traditional databases today. This post will explain how traditional databases work and then highlight how Aurora provides great performance through quorum writes and by building a database around the log.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#how-databases-work-%28simplified%29&quot;&gt;How databases work (simplified)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#aurora&quot;&gt;Aurora&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#aurora-decouples-the-storage-layer&quot;&gt;Aurora decouples the storage layer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-log-is-the-database&quot;&gt;The log is the database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#aurora-uses-quorum-writes&quot;&gt;Aurora uses quorum writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-database-instance-is-replicated-too&quot;&gt;The database instance is replicated too&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#aurora-does-not-need-quorum-reads&quot;&gt;Aurora does not need quorum reads&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;how-databases-work-(simplified)&quot;&gt;How databases work (simplified) &lt;a class=&quot;direct-link&quot; href=&quot;#how-databases-work-(simplified)&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Traditional relational databases comprise many units that work together: the &lt;em&gt;transport module&lt;/em&gt; which communicates with clients and receives queries, the &lt;em&gt;query processor&lt;/em&gt; which parses a query and creates a query plan to be carried out, the &lt;em&gt;execution engine&lt;/em&gt; which collects the results of the execution of the operations in the plan, and the &lt;em&gt;storage engine.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The storage engine interacts with the execution engine and is responsible for the actual execution of the query. Databases typically make use of a two-level memory hierarchy: the faster main memory (RAM) and the slower persistent storage (disk). The storage engine helps to manage both the data in memory and on disk. The main memory is used to prevent frequent access to the disk.&lt;/p&gt;
&lt;p&gt;The database organizes its records into &lt;em&gt;pages.&lt;/em&gt; When a page in the database is about to be updated, the page is first retrieved from disk and stored in a &lt;em&gt;page cache&lt;/em&gt; in memory. The changes are then made against that cached page until it is eventually synchronized with the persistent storage. A cached page is said to be &lt;em&gt;dirty&lt;/em&gt; when it has been updated in memory and needs to be &lt;em&gt;flushed&lt;/em&gt; back to disk.&lt;/p&gt;
&lt;p&gt;Let&#39;s consider the example of a database transaction which involves updating two records: A and B. It will involve the following steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The data pages on which the records are stored are located on disk and loaded into the page cache.&lt;/li&gt;
&lt;li&gt;The database then generates redo log records from the changes that will be made to the pages. A redo log record consists of the difference between the before-image of a page and its after-image as a result of any changes made. These changes are then applied to the cached pages.&lt;/li&gt;
&lt;li&gt;When the transaction commits, these log records are durably persisted to a &lt;a href=&quot;https://timilearning.com/posts/data-storage-on-disk/part-one/#write-ahead-logs&quot;&gt;write-ahead log&lt;/a&gt; stored on disk. The modified data pages may still be kept in memory and written back to disk later. In the event of a crash that leads to the loss of the in-memory data, the write-ahead log helps with recovery by ensuring that the database can still apply the logged changes to the &lt;a href=&quot;https://timilearning.com/posts/data-storage-on-disk/part-two/&quot;&gt;on-disk structure&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;After a period, the dirty pages are written back to disk.&lt;/li&gt;
&lt;/ul&gt;
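&lt;p&gt;The steps above can be sketched as a toy model (all names are invented; real engines are far more involved):&lt;/p&gt;

```python
class MiniDB:
    """Toy illustration of the commit steps above (not a real engine)."""

    def __init__(self):
        self.disk_pages = {}   # page id to bytes on "disk"
        self.cache = {}        # the page cache in memory
        self.wal = []          # the durable write-ahead log

    def update(self, updates):
        # 1. Load the affected pages from disk into the page cache.
        for pid in updates:
            self.cache.setdefault(pid, self.disk_pages.get(pid))
        # 2. Generate redo records (before-image vs. after-image), then
        #    apply the changes to the cached pages, which become dirty.
        records = []
        for pid, new in updates.items():
            records.append({"page": pid, "before": self.cache[pid], "after": new})
            self.cache[pid] = new
        # 3. Commit: durably persist the redo records to the write-ahead log.
        self.wal.extend(records)

    def flush(self):
        # 4. After a period, write the dirty pages back to disk.
        self.disk_pages.update(self.cache)
```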
&lt;p&gt;The storage engine has a &lt;em&gt;buffer manager&lt;/em&gt; for managing the page cache. Besides the buffer manager, it is made up of other components such as the &lt;em&gt;transaction manager&lt;/em&gt; which schedules and coordinates transactions, the &lt;em&gt;lock manager&lt;/em&gt; which prevents concurrent access to shared resources that would violate data integrity, and the &lt;em&gt;log manager&lt;/em&gt; which keeps track of the write-ahead log.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I&#39;ve written another &lt;a href=&quot;https://timilearning.com/posts/data-storage-on-disk/part-one/&quot;&gt;post&lt;/a&gt; that goes into more detail about how databases work.&lt;/p&gt;
&lt;h2 id=&quot;aurora&quot;&gt;Aurora &lt;a class=&quot;direct-link&quot; href=&quot;#aurora&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;aurora-decouples-the-storage-layer&quot;&gt;Aurora decouples the storage layer &lt;a class=&quot;direct-link&quot; href=&quot;#aurora-decouples-the-storage-layer&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In a monolithic database setup, the database runs on the same physical machine as the storage volume. However, in many modern distributed databases, the storage layer is decoupled from the database layer and replicated across multiple nodes to achieve scalability and resilience. While this is helpful for better fault tolerance, the downside is that communication between the database layer and the storage layer now happens via the network and the bottleneck lies there. Each write to the database could involve multiple write operations at the database layer, in what&#39;s known as &lt;a href=&quot;https://timilearning.com/posts/data-storage-on-disk/part-two/#write-amplification-in-b-trees&quot;&gt;write amplification&lt;/a&gt;, and all the communication between the database layer and the storage layer will happen over the network.&lt;/p&gt;
&lt;p&gt;For example, the figure below illustrates write amplification. The setup shown is a synchronous mirrored MySQL configuration which achieves high availability across data centres. Each &lt;a href=&quot;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html&quot;&gt;availability zone (AZ)&lt;/a&gt; has a MySQL instance with networked storage on &lt;a href=&quot;https://aws.amazon.com/ebs/&quot;&gt;Amazon Elastic Block Store (EBS)&lt;/a&gt;, with the primary instance being in AZ1 and the standby instance in AZ2. There is also a primary EBS volume which is synchronized with the standby EBS volume using &lt;a href=&quot;https://en.wikipedia.org/wiki/Disk_mirroring&quot;&gt;software mirroring&lt;/a&gt;.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/aurora-mirrored-setup.png&quot; alt=&quot;Network IO in mirrored MySQL&quot;&gt; &lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 1 - Network IO in mirrored MySQL&lt;/p&gt;
&lt;p&gt;From the paper:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Figure 1 shows the various types of data that the engine needs to write: the redo log, the binary (statement) log that is archived to Amazon Simple Storage Service (S3) in order to support point-in-time restores, the modified data pages, a second temporary write of the data page (double-write) to prevent torn pages, and finally the metadata files.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This model shown in the setup is undesirable for two major reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;There is a high network load involved in moving the data pages around from the database instance to the storage volumes.&lt;/li&gt;
&lt;li&gt;All four EBS volumes must respond for a write to be complete. This means that the process can be slowed down by even a single slow or faulty volume, which makes it less fault tolerant.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Aurora tackles these problems by offloading the processing of the redo log to the storage engine and through quorum writes. The next few sections will go into more detail about these optimizations.&lt;/p&gt;
&lt;h3 id=&quot;the-log-is-the-database&quot;&gt;The log is the database &lt;a class=&quot;direct-link&quot; href=&quot;#the-log-is-the-database&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;As described above, a traditional database generates a redo log record when it modifies a data page in a transaction. The log applicator then applies this log record to the in-memory before-image of the page to produce the after-image. This log record must be persisted on disk for the transaction to be committed, but the dirty data page can be written back to disk at a later time.&lt;/p&gt;
&lt;p&gt;Aurora reworks this process by having the log applicator at the storage layer in addition to the database layer. This way, no pages are ever written from the database layer to the storage layer. Redo log records are the only writes that ever cross the network. These records are much smaller than data pages and hence reduce the network load. The log applicator generates any relevant data pages at the storage tier.&lt;/p&gt;
&lt;p&gt;Note that the log applicator is still present at the database layer. This way, we can still modify cached data pages based on the redo log records and read up-to-date values from them. The difference now is that those dirty pages are not written back to the storage layer; instead, only the log records are written back. There is a caveat on what redo records can be applied by the log applicator in the database layer, and that will be discussed in the next section.&lt;/p&gt;
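&lt;p&gt;A minimal sketch of a storage node that materializes pages from log records alone (field names are invented, and the real applicator is far more sophisticated):&lt;/p&gt;

```python
class StorageNode:
    """Sketch of an Aurora-style storage node: only redo log records arrive
    over the network, and the node's own log applicator materializes the
    data pages locally."""

    def __init__(self):
        self.pages = {}

    def receive(self, record):
        # The same applicator logic runs here as at the database tier: the
        # record carries enough to advance a page to its after-image.
        self.pages[record["page"]] = record["after"]
        return "ack"
```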
&lt;h3 id=&quot;aurora-uses-quorum-writes&quot;&gt;Aurora uses quorum writes &lt;a class=&quot;direct-link&quot; href=&quot;#aurora-uses-quorum-writes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Aurora &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-6/&quot;&gt;partitions&lt;/a&gt; the storage volume into fixed-size 10 GB &lt;em&gt;segments&lt;/em&gt;. Each segment is replicated six ways into a Protection Group (PG): six 10 GB replicas organized across three availability zones, with two in each availability zone. Each write must achieve a &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-5/#quorums-for-reading-and-writing&quot;&gt;quorum&lt;/a&gt; of votes from 4 out of 6 segments before it is committed. By doing this, Aurora can survive the failure of an availability zone or any other two nodes without losing write availability.&lt;/p&gt;
&lt;p&gt;The database layer generates the fully ordered redo log records and delivers them to all six replicas of the destination segment. However, it only needs to wait for acknowledgement from 4 out of the 6 replicas before the log records are considered durable. Each replica can apply its redo records using its log applicator.&lt;/p&gt;
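&lt;p&gt;The 4-of-6 write path can be sketched as follows (the names and the acknowledgment protocol here are invented for illustration):&lt;/p&gt;

```python
def quorum_write(replicas, record, write_quorum=4):
    """Sketch of an Aurora-style quorum write: deliver the redo record to
    all six replicas of a segment, but treat it as durable once a write
    quorum (4 of 6) has acknowledged."""
    acks = 0
    for send in replicas:          # one send callable per replica
        if send(record) == "ack":
            acks += 1
    return acks >= write_quorum    # durable despite up to two failed replicas
```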
&lt;p&gt;The paper&#39;s authors ran an experiment to measure the network I/O based on these optimizations and compared it with the setup in Figure 1. The results are shown in the table below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/aurora-network-io.png&quot; alt=&quot;Network IOs for Aurora vs MySQL&quot;&gt; &lt;/p&gt;
&lt;p&gt;Aurora performed significantly better than the mirrored MySQL setup. From the paper:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Over the 30-minute period, Aurora was able to sustain 35 times more transactions than mirrored MySQL. The number of I/Os per transaction on the database node in Aurora was 7.7 times fewer than in mirrored MySQL despite amplifying writes six times with Aurora and not counting the chained replication within EBS nor the cross-AZ writes in MySQL. Each storage node sees unamplified writes, since it is only one of the six copies, resulting in 46 times fewer I/Os requiring processing at this tier. The savings we obtain by writing less data to the network allow us to aggressively replicate data for durability and availability and issue requests in parallel to minimize the impact of jitter.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3 id=&quot;the-database-instance-is-replicated-too&quot;&gt;The database instance is replicated too &lt;a class=&quot;direct-link&quot; href=&quot;#the-database-instance-is-replicated-too&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In Aurora, the database tier can have up to 15 read replicas and one write replica. In addition to the storage nodes, the log stream generated by the writer is also sent to the read replicas. The writer does not wait for an acknowledgement from the read replicas before committing a write; it only needs a quorum from the storage nodes.&lt;/p&gt;
&lt;p&gt;Each read replica consumes the log stream and uses its log applicator to modify the pages in its cache based on the log records. By doing this, the replica can serve pages from its buffer cache and will only make a storage IO request if the requested page is not in its cache.&lt;/p&gt;
&lt;h3 id=&quot;aurora-does-not-need-quorum-reads&quot;&gt;Aurora does not need quorum reads &lt;a class=&quot;direct-link&quot; href=&quot;#aurora-does-not-need-quorum-reads&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In Aurora, the database layer directly feeds log records to the storage nodes and keeps track of the progress of each segment in its runtime state. Therefore, under normal circumstances, the database layer can issue a read request directly to the segment which has the most up-to-date data without needing to establish a read quorum.&lt;/p&gt;
&lt;p&gt;However, after a crash, the database layer needs to reestablish its runtime state through a read quorum of the segments for each protection group.&lt;/p&gt;
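&lt;p&gt;A sketch of such a quorum-free read, assuming the database tier tracks each segment&#39;s applied-log progress (all names here are invented):&lt;/p&gt;

```python
class Segment:
    """A segment replica plus the log prefix it has applied so far."""
    def __init__(self, lsn, pages):
        self.lsn = lsn       # highest log sequence number applied
        self.pages = pages

def read_page(segments, page_id):
    """The database tier already knows each segment's progress, so a read
    goes straight to the most up-to-date replica; no read quorum is needed
    in the common case."""
    best = max(segments, key=lambda seg: seg.lsn)
    return best.pages.get(page_id)
```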
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;According to this &lt;a href=&quot;https://www.allthingsdistributed.com/2019/03/Amazon-Aurora-design-cloud-native-relational-database.html&quot;&gt;post&lt;/a&gt;, Aurora was the fastest-growing service in AWS history as of March 2019. Its architecture has enabled it to provide performance and availability comparable to other commercial-grade databases at a cheaper cost. The paper goes into more detail about how the read and write operations work across the segments and the recovery process. As an aside, this has been my most difficult paper to read so far, and it&#39;s one where the &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/video/10.html&quot;&gt;lecture video&lt;/a&gt; (and rereading the paper multiple times!) really helped to clarify stuff.&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/aurora.pdf&quot;&gt;Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases&lt;/a&gt; - The original paper.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-aurora.txt&quot;&gt;Lecture 10: Database logging, quorums, Amazon Aurora&lt;/a&gt; - MIT 6.824 lecture notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.databass.dev/&quot;&gt;Database Internals&lt;/a&gt; by Alex Petrov.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.allthingsdistributed.com/2019/03/Amazon-Aurora-design-cloud-native-relational-database.html&quot;&gt;Amazon Aurora ascendant: How we designed a cloud-native relational database&lt;/a&gt; by Werner Vogels.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://blog.acolyer.org/2019/03/25/amazon-aurora-design-considerations-for-high-throughput-cloud-native-relational-databases/&quot;&gt;Amazon Aurora: Design considerations for high throughput cloud-native relational databases&lt;/a&gt; from The Morning Paper.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 9 - CRAQ</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-9-craq/"/>
		<updated>2020-07-04T14:52:45-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-9-craq/</id>
		<content type="html">&lt;p&gt;Many distributed systems today &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-1/#the-cost-of-linearizability&quot;&gt;sacrifice stronger consistency guarantees&lt;/a&gt; for the sake of greater availability and higher throughput. &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/craq.pdf&quot;&gt;CRAQ&lt;/a&gt;, which stands for Chain Replication with Apportioned Queries, is a system designed to challenge this tradeoff. CRAQ&#39;s approach differs from existing replication techniques we have seen so far, like in &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-6-7-fault-tolerance-raft&quot;&gt;Raft&lt;/a&gt;. It improves on the original form of Chain Replication.&lt;/p&gt;
&lt;p&gt;CRAQ is a distributed &lt;a href=&quot;https://www.ibm.com/cloud/learn/object-storage&quot;&gt;object-storage&lt;/a&gt; system that maintains strong consistency while still providing a high read throughput. Object-storage systems are better suited for applications that need flat namespaces, such as a key-value store. These are unlike file-based systems, which store data in a hierarchical directory structure.&lt;/p&gt;
&lt;p&gt;This post will start by describing Chain Replication, before presenting how CRAQ improves on it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#chain-replication&quot;&gt;Chain Replication&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#chain-replication-achieves-strong-consistency&quot;&gt;Chain replication achieves strong consistency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-operations-are-cheaper&quot;&gt;Write operations are cheaper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-tail-is-a-bottleneck&quot;&gt;The tail is a bottleneck&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#craq&quot;&gt;CRAQ&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#consistency-models-on-craq&quot;&gt;Consistency models on CRAQ&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#craq-needs-a-configuration-manager&quot;&gt;CRAQ needs a configuration manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#one-slow-node-can-weaken-the-chain&quot;&gt;One slow node can weaken the chain&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;chain-replication&quot;&gt;Chain Replication &lt;a class=&quot;direct-link&quot; href=&quot;#chain-replication&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Chain Replication is an approach to replicating data across multiple nodes that involves arranging the nodes in a &lt;em&gt;chain&lt;/em&gt; of a defined length &lt;em&gt;C&lt;/em&gt;. All the write operations from clients go to the &lt;em&gt;head&lt;/em&gt; of the chain, which passes them down to the next node in the chain. When a node receives a write operation, it applies the write and passes it down to the next node in the chain until the write reaches the &lt;em&gt;tail&lt;/em&gt; node.&lt;/p&gt;
&lt;p&gt;The tail node handles all &lt;em&gt;read&lt;/em&gt; operations. This is because a write will not reach the tail until all the other replicas in the chain have applied it. The write is marked as &lt;em&gt;committed&lt;/em&gt; when it reaches the tail. Therefore, a read will only return committed values.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/chain-replication.png&quot; alt=&quot;Figure 1: All reads in Chain Replication must be handled by the tail node, while all writes propagate down the chain from the head.&quot;&gt; &lt;/p&gt;
&lt;p&gt;Figure 1 illustrates a sample chain of length four. As shown by the dashed lines in the figure, the tail sends an acknowledgment back to the head when it commits a write.&lt;/p&gt;
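&lt;p&gt;The write and read paths above can be sketched as a toy model (not the paper&#39;s implementation; each node is just a dictionary here):&lt;/p&gt;

```python
def chain_write(chain, key, value):
    """Sketch of a chain-replication write: it enters at the head, each node
    applies it and forwards it to its successor, and it is committed when it
    reaches the tail, which acknowledges back up the chain."""
    for node in chain:        # chain[0] is the head, chain[-1] is the tail
        node[key] = value     # apply locally, then pass to the successor
    return "committed"

def chain_read(chain, key):
    # All reads go to the tail, which only ever holds committed writes.
    return chain[-1].get(key)
```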
&lt;h4 id=&quot;chain-replication-achieves-strong-consistency&quot;&gt;Chain replication achieves strong consistency &lt;a class=&quot;direct-link&quot; href=&quot;#chain-replication-achieves-strong-consistency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Since all reads go to the tail and the tail stores all the committed write operations, the tail can apply a &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-1/#the-causal-order-is-not-a-total-order&quot;&gt;total ordering&lt;/a&gt; over all the operations. There is no possibility of a client seeing stale data. Concurrent reads to the tail will always return the most up-to-date value.&lt;/p&gt;
&lt;h4 id=&quot;write-operations-are-cheaper&quot;&gt;Write operations are cheaper &lt;a class=&quot;direct-link&quot; href=&quot;#write-operations-are-cheaper&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Another advantage of chain replication is that the cost of writes is spread equally over all nodes. Unlike in primary/backup replication where the primary node transmits data to all its backups, each node in chain replication transmits only to its successor in the chain. The simulation results by the paper&#39;s authors showed that chain replication achieved competitive or superior write throughput when compared with primary/backup replication.&lt;/p&gt;
&lt;h4 id=&quot;the-tail-is-a-bottleneck&quot;&gt;The tail is a bottleneck &lt;a class=&quot;direct-link&quot; href=&quot;#the-tail-is-a-bottleneck&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The major downside of chain replication is that since all the reads must go to the tail, read throughput cannot scale linearly with chain size. This is a trade-off that this approach makes to guarantee strong consistency. If clients could read from intermediate nodes, concurrent reads of the same object at different nodes might observe different values while a write is still propagating down the chain.&lt;/p&gt;
&lt;p&gt;CRAQ, which we will discuss next, helps to address this downside.&lt;/p&gt;
&lt;h3 id=&quot;craq&quot;&gt;CRAQ &lt;a class=&quot;direct-link&quot; href=&quot;#craq&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;CRAQ (Chain Replication with Apportioned Queries) is a modification of chain replication that increases the read throughput by allowing any node in the chain to handle read requests, while still guaranteeing strong consistency. CRAQ works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Each node in the chain can store multiple versions of an object: one &lt;em&gt;clean&lt;/em&gt; version and a &lt;em&gt;dirty&lt;/em&gt; version per recent write.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For write operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Clients send writes to the head.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;As the write passes through a replica, the replica creates a new dirty version for that object.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The tail creates a clean version for the object when it receives the write and sends an &lt;em&gt;acknowledgment&lt;/em&gt; back along the chain.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When a node receives an acknowledgment for an object version, it marks the latest object version as clean and deletes all previous versions for the object.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For reads from non-tail nodes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If the latest version that a node has for an object is clean, it replies with that version.&lt;/li&gt;
&lt;li&gt;Otherwise, it asks the tail for the last committed version number for that object (known as a &lt;em&gt;version query&lt;/em&gt;) and returns that version of the object.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If a node returns the most recent clean version of an object without asking the tail first, it may violate strong consistency, as the tail may have exposed a newer clean version to another reader.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/craq-reads.png&quot; alt=&quot;Figure 2: Reads to clean objects in CRAQ can be completely handled by any node in the system.&quot;&gt; &lt;/p&gt;
&lt;p&gt;In Figure 2, we see a CRAQ chain in the starting clean state. All the nodes will return the same value for any read request since they store an identical copy of the object. The nodes will remain in a clean state until they receive a write operation.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/craq-dirty-reads.png&quot; alt=&quot;Figure 3: Reads to dirty objects in CRAQ can be received by any node, but require small version requests (dotted blue line) to the chain tail to properly serialize operations.&quot;&gt; &lt;/p&gt;
&lt;p&gt;Figure 3 illustrates a dirty read situation where the successor node to the head makes a version query to the tail for its latest version number. The write request received at the head is still in propagation when the dirty read for key &lt;em&gt;K&lt;/em&gt; comes in, which is why the node has multiple versions of the object.&lt;br&gt;
The node then makes a version query to the tail, which returns V1 since that is its latest committed value. As a result, the dirty node returns the object value associated with the version number it gets from the tail. If a clean replica had received the read request, it would have returned its value immediately with no version query.&lt;/p&gt;
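&lt;p&gt;The read path described above can be sketched with a small in-memory model. This is purely illustrative; the &lt;em&gt;Node&lt;/em&gt; and &lt;em&gt;Tail&lt;/em&gt; classes and their method names are my own, not from the paper:&lt;/p&gt;

```python
class Tail:
    """Simulates the chain tail, which knows the last committed versions."""

    def __init__(self):
        self.committed = {}  # key: last committed version number

    def version_query(self, key):
        return self.committed[key]


class Node:
    """Simulates a non-tail node that may hold several versions of a key."""

    def __init__(self, tail):
        self.tail = tail
        self.versions = {}  # key: list of (version, value, is_clean), newest last

    def read(self, key):
        latest_version, latest_value, clean = self.versions[key][-1]
        if clean:
            # Clean read: reply immediately, no round trip to the tail.
            return latest_value
        # Dirty read: ask the tail which version is committed.
        committed = self.tail.version_query(key)
        for version, value, _ in self.versions[key]:
            if version == committed:
                return value


tail = Tail()
tail.committed["K"] = 1  # V1 is the tail's last committed version

node = Node(tail)
# The node holds clean V1 and dirty V2 (the write is still propagating).
node.versions["K"] = [(1, "old", True), (2, "new", False)]

print(node.read("K"))  # prints old: the value at the committed version V1
```

&lt;p&gt;If the node&#39;s latest version had been clean, the read would have returned immediately without the version query.&lt;/p&gt;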
&lt;p&gt;CRAQ offers throughput improvements over the standard chain replication in two different scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read-Mostly Workloads&lt;/strong&gt;: Here, most of the reads to the &lt;em&gt;C&lt;/em&gt;-1 non-tail nodes will be clean reads, and so the throughput can scale linearly with the chain size &lt;em&gt;C&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write-Heavy Workloads:&lt;/strong&gt; Here, most read requests to non-tail nodes are dirty and require version queries to the tail. However, these version queries are more lightweight than reading full objects from the tail, which allows the tail to process them at a higher rate.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The performance evaluation of CRAQ described in the paper showed that its read throughput is higher than in chain replication under these workloads.&lt;/p&gt;
&lt;h4 id=&quot;consistency-models-on-craq&quot;&gt;Consistency models on CRAQ &lt;a class=&quot;direct-link&quot; href=&quot;#consistency-models-on-craq&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;CRAQ supports the varying needs of applications by providing three consistency models:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Strong Consistency:&lt;/strong&gt; This works as described above. The guarantee is that all object reads will return the last committed write.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eventual Consistency&lt;/strong&gt;: For applications that may not always need the latest version of an object, this allows an intermediate node to return the newest object version it knows about without contacting the tail. This means that a subsequent read to a different node for the same object may return an older object version than the one previously returned.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eventual Consistency with Maximum-Bounded Inconsistency:&lt;/strong&gt; This allows read requests to a node to return the newest object version it is aware of, but only to a certain point. We can set a limit based on either the time (relative to the local clock of a node) or an absolute version number. The advantage over the standard eventual consistency is that it guarantees that the value of a read operation has a maximum inconsistency period.&lt;/li&gt;
&lt;/ul&gt;
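&lt;p&gt;The maximum-bounded variant can be sketched as follows. This is a hypothetical illustration of a version-based bound; &lt;em&gt;bounded_read&lt;/em&gt; and &lt;em&gt;max_lag&lt;/em&gt; are names I made up:&lt;/p&gt;

```python
def bounded_read(local_versions, tail_committed_version, max_lag):
    """Serve the newest local version only if it lags the tail's last
    committed version by at most max_lag versions."""
    local_version, local_value = local_versions[-1]  # newest known locally
    if tail_committed_version - local_version > max_lag:
        # Staleness bound exceeded: the node must fall back to a
        # version query to the tail (the strongly consistent path).
        return None
    return local_value


# Within the bound: the possibly stale local value is served directly.
print(bounded_read([(4, "v4")], tail_committed_version=5, max_lag=2))  # v4
# Beyond the bound: None signals that a version query is required.
print(bounded_read([(1, "v1")], tail_committed_version=5, max_lag=2))  # None
```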
&lt;h4 id=&quot;craq-needs-a-configuration-manager&quot;&gt;CRAQ needs a configuration manager &lt;a class=&quot;direct-link&quot; href=&quot;#craq-needs-a-configuration-manager&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Unlike Raft, the CRAQ protocol cannot prevent a split-brain by itself. It is only concerned with data replication and does not handle things like leader (or head) election in the event of partitions. To address this, CRAQ is usually coupled with a configuration manager like &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-8-zookeeper/&quot;&gt;ZooKeeper&lt;/a&gt; to deal with managing the nodes that make up a chain and handling leader election for the chain.&lt;/p&gt;
&lt;p&gt;To recover from failure, each node in a chain keeps track of its predecessor and successor, as well as the chain head and tail. When a head fails, its immediate successor becomes the new head. Similarly, the tail&#39;s immediate predecessor takes over when the tail fails. Intermediate nodes can also be replaced by adding a new node between two nodes like in a doubly-linked list.&lt;/p&gt;
&lt;p&gt;The configuration manager decides which nodes make up a chain and chooses the chain&#39;s head and tail.&lt;/p&gt;
&lt;h4 id=&quot;one-slow-node-can-weaken-the-chain&quot;&gt;One slow node can weaken the chain &lt;a class=&quot;direct-link&quot; href=&quot;#one-slow-node-can-weaken-the-chain&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The major downside to CRAQ (and the standard chain replication) is that it requires all the nodes in the chain to take part before any write can commit. This is unlike quorum systems like &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-6-7-fault-tolerance-raft/#how-do-we-choose-the-right-election-timeout&quot;&gt;Raft&lt;/a&gt; and &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-8-zookeeper/&quot;&gt;ZooKeeper&lt;/a&gt; that only need a majority of the nodes to participate.&lt;/p&gt;
&lt;p&gt;The consequence of this is that CRAQ by itself is less fault-tolerant than Raft and ZooKeeper, as the system&#39;s throughput can be severely degraded by even a single slow node.&lt;/p&gt;
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;CRAQ is a straightforward approach to replication with minimal chit-chat compared to a system like Raft, with the downside that it&#39;s not very fault-tolerant, since it needs every node in a chain to acknowledge writes. It will be interesting to explore a quorum-based approach to chain replication.&lt;/p&gt;
&lt;h3 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/craq.pdf&quot;&gt;Object Storage on CRAQ&lt;/a&gt; - Original paper by Jeff Terrace and Michael J. Freedman.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-craq.txt&quot;&gt;Lecture 9: Chain Replication, CRAQ&lt;/a&gt; - MIT 6.824 lecture notes.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 8 - ZooKeeper</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-8-zookeeper/"/>
		<updated>2020-06-24T20:29:01-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-8-zookeeper/</id>
		<content type="html">&lt;p&gt;This week&#39;s lecture was on &lt;a href=&quot;https://zookeeper.apache.org/&quot;&gt;ZooKeeper&lt;/a&gt;, with the original &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/zookeeper.pdf&quot;&gt;paper&lt;/a&gt; being used as a case study. The paper sheds light on the following questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Can the coordination of distributed systems be handled by a stand-alone general-purpose service? If so, what should the API of that service look like?&lt;/li&gt;
&lt;li&gt;Can we improve the performance of a system by N times if we add N times as many replica servers? That is, can performance scale linearly as we add more servers?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This post will start by addressing the second question and discussing how ZooKeeper&#39;s performance can scale linearly through the guarantees it provides. In a later section, we&#39;ll discuss how ZooKeeper&#39;s API allows it to act as a stand-alone coordination service for distributed systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#zookeeper&quot;&gt;ZooKeeper&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#overview&quot;&gt;Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#zookeeper-can-scale-linearly&quot;&gt;ZooKeeper can scale linearly&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#writes-to-zookeeper-are-linearizable&quot;&gt;Writes to ZooKeeper are linearizable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#client-requests-are-fifo&quot;&gt;Client Requests are FIFO&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#zookeeper-as-a-service&quot;&gt;ZooKeeper as a service&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#data-model&quot;&gt;Data Model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#znodes-map-to-client-abstractions&quot;&gt;Znodes map to client abstractions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#znodes-can-also-store-specific-metadata&quot;&gt;Znodes can also store specific metadata&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#client-api&quot;&gt;Client API&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#implementing-primitives-using-zookeeper&quot;&gt;Implementing primitives using ZooKeeper&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#using-zookeeper-for-configuration-management&quot;&gt;Using ZooKeeper for Configuration Management&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#using-zookeeper-for-rendezvous&quot;&gt;Using ZooKeeper for Rendezvous&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#using-zookeeper-for-group-membership&quot;&gt;Using ZooKeeper for Group Membership&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&quot;zookeeper&quot;&gt;ZooKeeper &lt;a class=&quot;direct-link&quot; href=&quot;#zookeeper&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;h2 id=&quot;overview&quot;&gt;Overview &lt;a class=&quot;direct-link&quot; href=&quot;#overview&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;ZooKeeper is a service for coordinating the processes of distributed applications. Coordination can be in the form of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Group membership&lt;/strong&gt;: Ensuring that members of a group know about all the other members of that group. The set of replicas involved in a &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-6-7-fault-tolerance-raft/#how-does-log-replication-work&quot;&gt;Raft&lt;/a&gt; log entry replication form a group, for example.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuration&lt;/strong&gt;: Keeping track of the operational parameters needed by the nodes in a cluster. Configuration items could include database server URLs for different environments, security settings, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Leader Election:&lt;/strong&gt; Like Raft, ZooKeeper can also be used to coordinate the election of a leader among a set of nodes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;zookeeper-can-scale-linearly&quot;&gt;ZooKeeper can scale linearly &lt;a class=&quot;direct-link&quot; href=&quot;#zookeeper-can-scale-linearly&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Let&#39;s consider the distributed key-value store shown below which makes use of a Raft module in each replica. As discussed in the previous lecture, all the writes in Raft must go through a leader. To guarantee a &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-1/#linearizability&quot;&gt;linearizable&lt;/a&gt; history, all reads must go through the leader as well. One reason why reads cannot be sent to followers is that a follower may not be part of the majority that has acknowledged the latest write, and so may return a stale value, which violates linearizability.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/resized-raft-application.png&quot; alt=&quot;Figure 1: Raft Application&quot;&gt; &lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 1 - Sample Key-Value store using Raft for replication.&lt;/p&gt;
&lt;p&gt;Going back to the second question asked at the beginning of this post:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If we added 2x more replicas to this setup, is there a chance that we could get 2x better performance of reads and writes?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The simple answer is: &lt;em&gt;it depends.&lt;/em&gt; In a Raft-based system like the one in Figure 1, adding more servers will likely degrade the system&#39;s performance. This is because all the reads must still go through the leader, and the leader now has to replicate every write to more servers.&lt;/p&gt;
&lt;p&gt;ZooKeeper, on the other hand, allows us to scale the performance of our system linearly by adding more servers. It does this by relaxing the definition of correctness and providing weaker guarantees for clients. Reads can be served from any replica but writes are still sent to a leader. While this has the downside that reads may return stale data, it greatly improves the performance of reads in the system. ZooKeeper is a system designed for read-heavy workloads, and so the trade-off that leads to better read performance is worth it.&lt;/p&gt;
&lt;p&gt;Next, we&#39;ll discuss the two basic ordering guarantees provided by ZooKeeper that make it suitable as a distributed coordination system.&lt;/p&gt;
&lt;h3 id=&quot;writes-to-zookeeper-are-linearizable&quot;&gt;Writes to ZooKeeper are linearizable &lt;a class=&quot;direct-link&quot; href=&quot;#writes-to-zookeeper-are-linearizable&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;ZooKeeper guarantees linearizable writes, stated in the paper as:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All requests that update the state of ZooKeeper are serializable and respect precedence.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This means that if Client A writes a value for key X, any subsequent update to that key by another client will be ordered after the write by Client A. The leader chooses an order for the writes, and that order is maintained on all the followers.&lt;/p&gt;
&lt;p&gt;Note that reads, unlike writes, are not linearizable, which means that clients can read stale values for a key. For example, if client A updates the value for key X on one server, a read of that key by client B on another server may still return the old value. This is because ZooKeeper&#39;s freshness guarantee applies only to writes.&lt;/p&gt;
&lt;h3 id=&quot;client-requests-are-fifo&quot;&gt;Client Requests are FIFO &lt;a class=&quot;direct-link&quot; href=&quot;#client-requests-are-fifo&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A client is any user of the ZooKeeper service. The guarantee for clients is stated as:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All requests from a given client are executed in the order that they were sent by the client.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In other words, each client can specify an order for its operations (reads and writes), and that order will be maintained by ZooKeeper when executing.&lt;/p&gt;
&lt;p&gt;These guarantees combined can be used to implement many useful distributed system primitives despite the weaker consistency guarantee. &lt;em&gt;Note&lt;/em&gt; that ZooKeeper does provide optional support for linearizable reads which comes with a performance cost.&lt;/p&gt;
&lt;h2 id=&quot;zookeeper-as-a-service&quot;&gt;ZooKeeper as a service &lt;a class=&quot;direct-link&quot; href=&quot;#zookeeper-as-a-service&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;ZooKeeper is a good example of how the coordination of distributed systems can be handled by a stand-alone service. It does this by exposing an API that application developers can use to implement specific primitives. Some examples of this are shown in a &lt;a href=&quot;#implementing-primitives-using-zookeeper&quot;&gt;later section&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The ZooKeeper service is made up of a cluster of nodes that use replication for better fault tolerance and performance. Clients communicate with ZooKeeper through a client API contained in the client library. The client library also handles the network connections between ZooKeeper servers and the client. Some systems covered in the &lt;a href=&quot;https://timilearning.com/tags/mit-6.824/&quot;&gt;previous lectures&lt;/a&gt; where ZooKeeper could be used are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-4-primary-backup-replication/#open-question-i-have&quot;&gt;VMware FT&#39;s Test-and-Set server&lt;/a&gt;: The system uses a test-and-set server to prevent split-brain by ensuring that the operation can succeed for only one of the replicas in the event of a network partition. This test-and-set server needs to be fault-tolerant. ZooKeeper is a fault-tolerant service that can be used to implement primitives like test-and-set operations.
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-3-gfs&quot;&gt;GFS&lt;/a&gt;: GFS (pre-Colossus) made use of a single master to keep track of the metadata related to the chunks in the system. ZooKeeper could have played this role in the system and maybe even improved performance since all the replicas of the master would have been able to serve reads.
&lt;/li&gt;
&lt;li&gt;In &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-1-mapreduce/&quot;&gt;MapReduce&lt;/a&gt;, ZooKeeper can be used to keep track of information like who the current master is, the list of workers, what jobs are assigned to what workers, the status of tasks, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;ZooKeeper uses a leader-based atomic broadcast protocol called &lt;a href=&quot;https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs.+Paxos&quot;&gt;Zab&lt;/a&gt; to guarantee that writes are linearizable in the system. You can think of Zab as a consensus protocol similar to Raft or Paxos.&lt;/p&gt;
&lt;h3 id=&quot;data-model&quot;&gt;Data Model &lt;a class=&quot;direct-link&quot; href=&quot;#data-model&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;ZooKeeper provides a client API to manipulate a set of &lt;a href=&quot;https://www.baeldung.com/lock-free-programming#3-wait-free&quot;&gt;wait-free&lt;/a&gt; data objects known as &lt;em&gt;znodes&lt;/em&gt;. Znodes are organized in a hierarchical form similar to file systems, and we can refer to a given znode using the standard UNIX notation for file systems. For example, we can use &lt;em&gt;/A/B/C&lt;/em&gt; to refer to znode C which has znode B as its parent, where B has znode A as its parent.&lt;/p&gt;
&lt;p&gt;A client can create two types of znodes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Regular:&lt;/strong&gt; Regular znodes are created and deleted explicitly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ephemeral&lt;/strong&gt;: Ephemeral znodes can either be deleted explicitly or are automatically removed by the system when the session that created them is terminated.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition, a client can set a &lt;em&gt;sequential&lt;/em&gt; flag when creating a new znode. When this flag is set, the znode&#39;s name is appended with the value of a monotonically increasing counter. For example, if &lt;em&gt;z&lt;/em&gt; is the new znode and &lt;em&gt;p&lt;/em&gt; is the parent znode&#39;s name, then the sequence value of &lt;em&gt;z&lt;/em&gt; will be greater than that of any other sequential child znode of &lt;em&gt;p&lt;/em&gt; created before &lt;em&gt;z&lt;/em&gt;.&lt;/p&gt;
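&lt;p&gt;As a rough sketch, the sequential flag behaves as if the parent kept a counter that is appended to each new child&#39;s name (the zero-padded formatting below is only my illustration):&lt;/p&gt;

```python
import itertools

counter = itertools.count()  # in reality, one counter per parent znode

def sequential_name(parent, prefix):
    # Append a monotonically increasing, zero-padded sequence number.
    return "{}/{}{:010d}".format(parent, prefix, next(counter))

print(sequential_name("/app1/locks", "lock-"))  # /app1/locks/lock-0000000000
print(sequential_name("/app1/locks", "lock-"))  # /app1/locks/lock-0000000001
```

&lt;p&gt;Because later children always receive larger numbers, sorting the children by name recovers their creation order.&lt;/p&gt;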
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/zookeeper-hierarchy.png&quot; alt=&quot;Figure 2: Illustration of ZooKeeper hierarchical name space&quot;&gt; &lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 2: Illustration of ZooKeeper&#39;s hierarchical name space.&lt;/p&gt;
&lt;h4 id=&quot;znodes-map-to-client-abstractions&quot;&gt;Znodes map to client abstractions &lt;a class=&quot;direct-link&quot; href=&quot;#znodes-map-to-client-abstractions&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Znodes of either type can store data, but only regular znodes can have children. Znodes are not designed for general data storage but are instead used to represent abstractions in a client&#39;s application. For example, the figure above has two subtrees for Application 1 and Application 2. Application 1 also has a subtree that implements a group membership protocol. The client processes &lt;em&gt;p&lt;sub&gt;1&lt;/sub&gt;-p&lt;sub&gt;n&lt;/sub&gt;&lt;/em&gt; each create a znode &lt;em&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt; under &lt;em&gt;/app1.&lt;/em&gt; In this example, the znodes represent the processes of an application, and a process can discover its group members by reading the &lt;em&gt;/app1&lt;/em&gt; subtree.&lt;/p&gt;
&lt;h4 id=&quot;znodes-can-also-store-specific-metadata&quot;&gt;Znodes can also store specific metadata &lt;a class=&quot;direct-link&quot; href=&quot;#znodes-can-also-store-specific-metadata&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Note that although znodes are not designed for general data storage, they allow clients to store specific metadata or configuration. For example, it is useful for a new server in a leader-based system to learn about which other server is the current leader. To achieve this, the current leader can be configured to write this information in a known znode space. Any new servers can then read from that znode space.&lt;/p&gt;
&lt;p&gt;There is also some metadata associated with znodes by default like timestamps and version counters, with which clients can execute conditional updates based on the version of the znode. This will be explained further in the next section.&lt;/p&gt;
&lt;h3 id=&quot;client-api&quot;&gt;Client API &lt;a class=&quot;direct-link&quot; href=&quot;#client-api&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The ZooKeeper client API exposes a number of methods. Here are a few of them:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;create(path, data, flags)&lt;/strong&gt;: Creates a znode with pathname &lt;em&gt;path&lt;/em&gt;, stores &lt;em&gt;data[]&lt;/em&gt; in it, and&lt;br&gt;
returns the name of the new znode. &lt;em&gt;flags&lt;/em&gt; enables a client to select the type of znode: regular, ephemeral, and set the sequential flag;&lt;br&gt;
&lt;strong&gt;delete(path, version)&lt;/strong&gt;: Deletes the znode path if that znode is at the expected version;&lt;br&gt;
&lt;strong&gt;exists(path, watch)&lt;/strong&gt;: Returns true if the znode with path name path exists, and returns false otherwise. The &lt;em&gt;watch&lt;/em&gt; flag enables a client to set a watch on the znode;&lt;br&gt;
&lt;strong&gt;getData(path, watch)&lt;/strong&gt;: Returns the data and meta-data, such as version information, associated with the znode. The &lt;em&gt;watch&lt;/em&gt; flag works in the same way as it does for exists(), except that ZooKeeper does not set the watch if the znode does not exist;&lt;br&gt;
&lt;strong&gt;setData(path, data, version)&lt;/strong&gt;: Writes &lt;em&gt;data[]&lt;/em&gt; to znode path if the version number is&lt;br&gt;
the current version of the znode;&lt;br&gt;
&lt;strong&gt;getChildren(path, watch):&lt;/strong&gt; Returns the set of names of the children of a znode;&lt;br&gt;
&lt;strong&gt;sync(path)&lt;/strong&gt;: Waits for all updates pending at the start of the operation to propagate to the server that the client is connected to.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note the following about the client API:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When clients connect to ZooKeeper, they establish a &lt;em&gt;session.&lt;/em&gt; It is through this session that ZooKeeper can identify clients in fulfilling the FIFO order guarantee.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;sync&lt;/em&gt; method can be used to ensure that the read for a znode is linearizable, though it comes at a performance cost. It forces a server to apply all its pending write requests before processing a read.&lt;/li&gt;
&lt;li&gt;All the methods have both a synchronous and asynchronous version available through the API.&lt;/li&gt;
&lt;li&gt;The update methods (&lt;em&gt;delete&lt;/em&gt; and &lt;em&gt;setData&lt;/em&gt;) take an expected version number. If this differs from the actual version number of the znode, the operation will fail.&lt;/li&gt;
&lt;li&gt;When the &lt;em&gt;watch&lt;/em&gt; parameter in the read methods (&lt;em&gt;getData&lt;/em&gt; and &lt;em&gt;getChildren&lt;/em&gt;) is set, the operation will complete as normal except that the server promises that it will notify the client when the returned information changes. This is another optimization made in ZooKeeper to prevent a client from continuously having to poll for the latest information.&lt;/li&gt;
&lt;/ul&gt;
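&lt;p&gt;The version-conditioned &lt;em&gt;setData&lt;/em&gt; is what makes primitives like test-and-set possible. The sketch below uses an in-memory stand-in (&lt;em&gt;FakeZk&lt;/em&gt; is hypothetical; a real application would go through a ZooKeeper client library):&lt;/p&gt;

```python
class FakeZk:
    """In-memory stand-in for ZooKeeper's conditional updates."""

    def __init__(self):
        self.znodes = {}  # path: (data, version)

    def create(self, path, data):
        self.znodes[path] = (data, 0)

    def get_data(self, path):
        return self.znodes[path]  # (data, version)

    def set_data(self, path, data, version):
        _, current = self.znodes[path]
        if version != current:
            return False  # stale expected version: the update is rejected
        self.znodes[path] = (data, current + 1)
        return True


zk = FakeZk()
zk.create("/app/lock", b"free")

_, version = zk.get_data("/app/lock")
# Two clients race to acquire the lock with the same expected version:
print(zk.set_data("/app/lock", b"client-a", version))  # True: A wins
print(zk.set_data("/app/lock", b"client-b", version))  # False: B must retry
```

&lt;p&gt;Only one of the two racing updates can succeed, which is exactly the test-and-set behavior needed to prevent split-brain.&lt;/p&gt;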
&lt;h2 id=&quot;implementing-primitives-using-zookeeper&quot;&gt;Implementing primitives using ZooKeeper &lt;a class=&quot;direct-link&quot; href=&quot;#implementing-primitives-using-zookeeper&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id=&quot;using-zookeeper-for-configuration-management&quot;&gt;Using ZooKeeper for Configuration Management &lt;a class=&quot;direct-link&quot; href=&quot;#using-zookeeper-for-configuration-management&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;To implement dynamic configuration management with ZooKeeper, we can store the configuration in a znode, &lt;em&gt;z&lt;sub&gt;c&lt;/sub&gt;&lt;/em&gt;. Each process is started with the full pathname of &lt;em&gt;z&lt;sub&gt;c&lt;/sub&gt;&lt;/em&gt;. A process can get its required configuration by reading &lt;em&gt;z&lt;sub&gt;c&lt;/sub&gt;&lt;/em&gt; with the watch flag set to true. It will get notified whenever the configuration changes, and can then read the new configuration with the watch flag set to true again.&lt;/p&gt;
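&lt;p&gt;The watch-and-reread loop can be modeled with a small in-memory stand-in (&lt;em&gt;WatchableStore&lt;/em&gt; is hypothetical). The key detail it mirrors is that watches are one-shot, so each read re-registers the watch:&lt;/p&gt;

```python
class WatchableStore:
    """In-memory stand-in for a store with one-shot watches."""

    def __init__(self):
        self.data = {}
        self.watchers = {}  # path: list of callbacks

    def get_data(self, path, watch=None):
        if watch is not None:
            self.watchers.setdefault(path, []).append(watch)
        return self.data[path]

    def set_data(self, path, value):
        self.data[path] = value
        # Watches are one-shot: notify once and clear.
        for callback in self.watchers.pop(path, []):
            callback(path)


store = WatchableStore()
store.data["/config"] = "v1"

seen = []

def on_change(path):
    # Re-read with a fresh watch so later changes are also observed.
    seen.append(store.get_data(path, watch=on_change))

config = store.get_data("/config", watch=on_change)
store.set_data("/config", "v2")
store.set_data("/config", "v3")
print(config, seen)  # v1 ['v2', 'v3']
```

&lt;p&gt;Because the callback re-reads with a fresh watch, no update after the initial read goes unnoticed, though in a real system intermediate values can be skipped if several changes land between the notification and the re-read.&lt;/p&gt;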
&lt;h3 id=&quot;using-zookeeper-for-rendezvous&quot;&gt;Using ZooKeeper for Rendezvous &lt;a class=&quot;direct-link&quot; href=&quot;#using-zookeeper-for-rendezvous&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Let&#39;s consider a scenario where a client wants to start a master process and many worker processes, with the job of starting the processes being handled by a scheduler. The worker processes will need information about the address and port of the master to connect to it. However, because a scheduler starts the processes, the client will not know the master&#39;s port and address ahead of time for it to give to the workers.&lt;/p&gt;
&lt;p&gt;This can be handled by the client creating a &lt;a href=&quot;https://cs.stackexchange.com/a/105332&quot;&gt;rendezvous&lt;/a&gt; znode, &lt;em&gt;z&lt;sub&gt;r&lt;/sub&gt;&lt;/em&gt;, and passing the full pathname of &lt;em&gt;z&lt;sub&gt;r&lt;/sub&gt;&lt;/em&gt; as a startup parameter to both the master and worker processes. When the master starts up, it can fill in &lt;em&gt;z&lt;sub&gt;r&lt;/sub&gt;&lt;/em&gt; with information about its address and port. The workers can read from the znode when they start up, with the watch flag set to true. This way, workers will get notified when &lt;em&gt;z&lt;sub&gt;r&lt;/sub&gt;&lt;/em&gt; is updated and can use the information there.&lt;/p&gt;
&lt;h3 id=&quot;using-zookeeper-for-group-membership&quot;&gt;Using ZooKeeper for Group Membership &lt;a class=&quot;direct-link&quot; href=&quot;#using-zookeeper-for-group-membership&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;As described in an &lt;a href=&quot;#data-model&quot;&gt;earlier section&lt;/a&gt;, we can use ZooKeeper to implement group membership by creating a znode &lt;em&gt;z&lt;sub&gt;g&lt;/sub&gt;&lt;/em&gt; to represent the group. Any process that is a member of the group can create an ephemeral child znode with a unique name under &lt;em&gt;z&lt;sub&gt;g&lt;/sub&gt;&lt;/em&gt;. The znode representing a process will be removed automatically when the process fails or ends.&lt;/p&gt;
&lt;p&gt;A process can obtain information about the other members of its group by listing the children under &lt;em&gt;z&lt;sub&gt;g&lt;/sub&gt;&lt;/em&gt;. It can then monitor changes in group membership by setting the watch flag to true.&lt;/p&gt;
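&lt;p&gt;A minimal in-memory sketch of this group membership pattern (&lt;em&gt;MiniZk&lt;/em&gt; and its session model are simplifications I made up):&lt;/p&gt;

```python
class MiniZk:
    """In-memory stand-in: znodes created by a session are ephemeral and
    disappear when that session ends."""

    def __init__(self):
        self.znodes = {}  # path: owning session id (None for regular znodes)

    def create(self, path, session=None):
        self.znodes[path] = session  # ephemeral if session is not None

    def get_children(self, parent):
        prefix = parent + "/"
        return sorted(p for p in self.znodes if p.startswith(prefix))

    def close_session(self, session):
        # Ephemeral znodes owned by the session are removed automatically.
        self.znodes = {p: s for p, s in self.znodes.items() if s != session}


zk = MiniZk()
zk.create("/group")  # regular znode representing the group
zk.create("/group/p1", session="s1")
zk.create("/group/p2", session="s2")
print(zk.get_children("/group"))  # ['/group/p1', '/group/p2']

zk.close_session("s1")  # process p1 fails or exits
print(zk.get_children("/group"))  # ['/group/p2']
```

&lt;p&gt;Listing the group&#39;s children is all a process needs to do to see the current membership, and a watch on the group znode turns that into a live view.&lt;/p&gt;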
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;The design of ZooKeeper is another great example of tailoring a system for a specific use case; in this example, strong consistency was relaxed to improve the performance of reads in read-mostly workloads. The results in the paper show that the throughput of ZooKeeper can scale linearly. ZooKeeper is also used in many distributed systems today, and you can find a list of some of those &lt;a href=&quot;https://zookeeper.apache.org/doc/r3.6.1/recipes.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This post skipped a few details about the implementation of ZooKeeper, such as how replication works, the atomic broadcast protocol, and how requests from clients are handled. The section below contains links that can help if you&#39;re interested in learning more.&lt;/p&gt;
&lt;h1 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/zookeeper.pdf&quot;&gt;ZooKeeper: Wait-free coordination for Internet-scale systems&lt;/a&gt; - Original paper.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://zookeeper.apache.org/doc/r3.6.1/recipes.html&quot;&gt;ZooKeeper recipes&lt;/a&gt; - Further examples of how ZooKeeper can be used.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://sysgears.com/articles/managing-configuration-of-distributed-system-with-apache-zookeeper/&quot;&gt;Managing configuration of a distributed system with Apache ZooKeeper&lt;/a&gt; by Oleg Yermolaiev.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum&quot;&gt;KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum&lt;/a&gt; - Interesting discussion on the plan to replace ZooKeeper with Raft in Apache Kafka.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://jepsen.io/consistency/models/linearizable&quot;&gt;Linearizability&lt;/a&gt; - Jepsen post on Linearizability.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lectures 6 &amp; 7 - Fault Tolerance(Raft)</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-6-7-fault-tolerance-raft/"/>
		<updated>2020-05-30T15:54:54-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-6-7-fault-tolerance-raft/</id>
		<content type="html">&lt;p&gt;One common pattern in the previous systems we have discussed like &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-1-mapreduce/&quot;&gt;MapReduce&lt;/a&gt;, &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-3-gfs/&quot;&gt;GFS&lt;/a&gt;, and &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-4-primary-backup-replication/&quot;&gt;VMware FT&lt;/a&gt; is that they all rely on a single entity to make the key decisions. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;MapReduce has a single master node responsible for organizing the computation among the workers.&lt;/li&gt;
&lt;li&gt;GFS has a master responsible for picking the primary replica for a chunkserver.&lt;/li&gt;
&lt;li&gt;VMware FT uses an atomic test-and-set operation on a single shared disk to choose a new leader and prevent split-brain.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While this makes it easier for the system to make decisions, the downside of this approach is that the entity is now a single point of failure. If the entity is down, the system may not be able to make any progress without manual intervention.&lt;/p&gt;
&lt;p&gt;Ideally, we would want a more fault-tolerant system that can withstand the loss of at least one node, even if that node is the one making critical decisions at the time. Such a system would need a mechanism for the other nodes in the cluster to automatically &lt;em&gt;agree&lt;/em&gt; on which one of them should take over as the next leader.&lt;/p&gt;
&lt;p&gt;It turns out that getting all the nodes in a cluster to agree on a decision is a hard problem in distributed systems. The main difficulty lies in the fact that it is impossible to distinguish between a node that has crashed and one that is unreachable because of a network fault: both problems have the same symptom, namely that no response is received from the node. If a node is wrongly declared dead, it may still be able to receive and execute requests from clients, which may lead to inconsistencies between the nodes. The other nodes may need to decide on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What node should take over as the leader/master in the cluster when there&#39;s a failure; if two or more nodes think that they are the master, we could end up with a split-brain situation.&lt;/li&gt;
&lt;li&gt;What client requests to execute; if two nodes decide differently, a client could see inconsistent results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This problem of getting multiple nodes to agree has led to the development of &lt;em&gt;consensus&lt;/em&gt; algorithms. Distributed consensus is the ability for components in a distributed system to reach agreement even in the presence of failures and an unreliable network. &lt;a href=&quot;http://paxos.systems/index.html&quot;&gt;Paxos&lt;/a&gt; and &lt;a href=&quot;https://raft.github.io/&quot;&gt;Raft&lt;/a&gt; are two of the most popular consensus algorithms today. This post will focus on Raft and you can find its original paper &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#raft-paper-summary&quot;&gt;Raft Paper Summary&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#replicated-state-machines&quot;&gt;Replicated State Machines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#raft-overview&quot;&gt;Raft Overview&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#why-use-logs%3F&quot;&gt;Why Use Logs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#server-states&quot;&gt;Server States&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#raft-uses-terms&quot;&gt;Raft Uses Terms&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#how-is-a-leader-elected%3F&quot;&gt;How is a leader elected?&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#election-timeouts-are-randomized&quot;&gt;Election Timeouts Are Randomized&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#any-two-majorities-must-overlap-in-at-least-one-server&quot;&gt;Any Two Majorities Must Overlap in At Least One Server&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#how-does-log-replication-work%3F&quot;&gt;How does log replication work?&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#what-happens-if-the-leader-crashes%3F&quot;&gt;What happens if the leader crashes?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#why-is-it-safe-for-a-new-leader-to-overwrite-its-follower%27s-logs%3F&quot;&gt;Why is it safe for a new leader to overwrite its follower&#39;s logs?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#what-happens-when-a-follower-or-candidate-crashes%3F&quot;&gt;What happens when a follower or candidate crashes?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#how-do-we-choose-the-right-election-timeout%3F&quot;&gt;How do we choose the right election timeout?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#how-does-raft-tame-its-logs%3F&quot;&gt;How does Raft tame its logs?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;raft-paper-summary&quot;&gt;Raft Paper Summary &lt;a class=&quot;direct-link&quot; href=&quot;#raft-paper-summary&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Before Raft, Paxos had almost solely dominated the landscape of consensus algorithms. The problem with Paxos, however, is that it is difficult to understand. This difficulty motivated the creation of Raft. The authors wanted to develop a consensus algorithm that was not only practical, but also understandable. Their primary goal in the creation of the algorithm was &lt;em&gt;understandability.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One of the approaches the authors took to create an understandable algorithm was to decompose the consensus problem into separate parts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;leader election&lt;/li&gt;
&lt;li&gt;log replication, and&lt;/li&gt;
&lt;li&gt;safety.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These parts will be discussed separately below.&lt;/p&gt;
&lt;h3 id=&quot;replicated-state-machines&quot;&gt;Replicated State Machines &lt;a class=&quot;direct-link&quot; href=&quot;#replicated-state-machines&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The replicated state machine approach to replication has been discussed in an &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-4-primary-backup-replication/#replication-approaches&quot;&gt;earlier post&lt;/a&gt;, but to recap, the idea is that if two state machines are fed the same input operations in the same order, their outputs will be the same.&lt;/p&gt;
&lt;p&gt;A replicated log is typically used to implement replicated state machines. Each server maintains its log, which contains a series of commands that its state machine must execute in order. These logs must be kept consistent across all the servers. Consistency here means that all the logs must have the same commands in the same order.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/replicated-state-machine.png&quot; alt=&quot;Figure 1: Replicated state machine architecture. &quot;&gt; &lt;/p&gt;
&lt;p&gt;The job of a consensus algorithm is to keep this replicated log consistent. Each server has a consensus module for managing its log. The consensus module on a server is responsible for adding client commands to the log and communicating with the consensus modules on the other servers to ensure that their logs &lt;em&gt;eventually&lt;/em&gt; contain the same commands in the same order.&lt;/p&gt;
&lt;p&gt;Practical consensus algorithms must not violate the following properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Safety:&lt;/em&gt; They must return a correct result under all &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8/#byzantine-faults&quot;&gt;non-Byzantine&lt;/a&gt; conditions, including packet losses, network delays, and partitions. Any value decided on by a server must have been proposed by some server; it is not enough for a server to simply always return &#39;null&#39;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Availability:&lt;/em&gt; Consensus algorithms must be fully functional, provided that the majority of the servers are operational and can communicate with clients and each other. A cluster of seven servers can tolerate the failure of any three servers. A minority of slow or failed servers should not impact the overall performance of the system.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Timing independence:&lt;/em&gt; They must not depend on &lt;em&gt;timing&lt;/em&gt; to ensure that logs are consistent. &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8/#unreliable-clocks&quot;&gt;Clocks are unreliable&lt;/a&gt;, and consensus algorithms must not rely on them to determine the right order of events.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;raft-overview&quot;&gt;Raft Overview &lt;a class=&quot;direct-link&quot; href=&quot;#raft-overview&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Raft works by electing a single leader among the servers, which manages log replication. The leader accepts client requests and decides where log entries should be placed, without having to consult other servers. When the current leader fails, Raft includes a protocol for electing a new leader to take over.&lt;/p&gt;
&lt;p&gt;As stated earlier, Raft breaks down consensus into three independent subproblems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Leader election&lt;/strong&gt;: A new leader must be chosen when the old one fails.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Log Replication:&lt;/strong&gt; The leader accepts log entries from clients and is responsible for replicating them to the other servers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Safety:&lt;/strong&gt; If a server has applied a log entry at a given index to its state machine, no other server may apply a different entry at that index.&lt;/li&gt;
&lt;/ul&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/resized-raft-application.png&quot; alt=&quot;Figure 2: Raft Application&quot;&gt; &lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Figure 2: An example of an application where Raft could be used is a key-value database, as shown above.  Client requests are converted to log entries on a leader, which are then replicated on the other servers by the Raft module. The state machine converts those log entries into records in the key-value store. &lt;/p&gt;
&lt;h4 id=&quot;why-use-logs%3F&quot;&gt;Why Use Logs? &lt;a class=&quot;direct-link&quot; href=&quot;#why-use-logs%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A log is an append-only sequence of records used as a storage abstraction in many distributed systems today. Some of the benefits of using a log include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Ordering&lt;/em&gt;: The log assigns an order to all its records. This helps the replicas agree on a single execution order of commands.&lt;/li&gt;
&lt;li&gt;A log stores tentative commands until they are committed.&lt;/li&gt;
&lt;li&gt;A log also keeps a persistent state of all the commands that have been executed by the state machine. By doing this, the current state of the application can be recreated at any time by replaying the log commands.&lt;/li&gt;
&lt;/ul&gt;
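&lt;p&gt;As a rough sketch of the ideas above (the type and function names here are illustrative, not an actual Raft implementation), a Raft-style log can be modelled as an append-only slice of entries, each tagged with the term in which it was created:&lt;/p&gt;

```go
package main

import "fmt"

// Entry is one record in the replicated log: the term in which it was
// created, plus an opaque command for the state machine.
type Entry struct {
	Term    int
	Command string
}

// appendEntry adds a command to the log under the given term and returns
// the new log along with the entry's 1-based index, as in the paper.
func appendEntry(log []Entry, term int, cmd string) ([]Entry, int) {
	log = append(log, Entry{Term: term, Command: cmd})
	return log, len(log)
}

// replay rebuilds application state by re-executing every command in
// order, which is how a restarted state machine recovers its state.
func replay(log []Entry) []string {
	var state []string
	for _, e := range log {
		state = append(state, e.Command)
	}
	return state
}

func main() {
	var log []Entry
	log, _ = appendEntry(log, 1, "set x=1")
	log, idx := appendEntry(log, 1, "set y=2")
	fmt.Println(idx, replay(log))
}
```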
&lt;h4 id=&quot;server-states&quot;&gt;Server States &lt;a class=&quot;direct-link&quot; href=&quot;#server-states&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A server is always in one of three states: &lt;em&gt;leader, follower,&lt;/em&gt; or &lt;em&gt;candidate.&lt;/em&gt; A leader receives requests from clients and communicates them with the other servers. A follower is passive; it only receives log entries from the leader and votes in elections. Any requests from a client to a follower will be redirected to the leader. The candidate state is used for leader elections.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/raft-states.png&quot; alt=&quot;Figure 3: Raft server states&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 3 - &quot;Server states. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail.&quot;&lt;/p&gt;
&lt;h4 id=&quot;raft-uses-terms&quot;&gt;Raft Uses Terms &lt;a class=&quot;direct-link&quot; href=&quot;#raft-uses-terms&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Time in Raft is divided into &lt;em&gt;terms&lt;/em&gt;, which can be of arbitrary length. Each term begins with an election and has at most one leader. Terms act as a logical clock in the system. Some other key things to note about terms are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Terms are labelled with consecutive integers.&lt;/li&gt;
&lt;li&gt;Each server stores its current term number and whenever servers communicate, they exchange term numbers.&lt;/li&gt;
&lt;li&gt;If a server&#39;s current term is smaller than another server&#39;s, it updates its own term to the larger value.&lt;/li&gt;
&lt;li&gt;If a &lt;em&gt;candidate&lt;/em&gt; or &lt;em&gt;leader&lt;/em&gt; detects that its term is smaller than another server&#39;s, it reverts to &lt;em&gt;follower&lt;/em&gt; state.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The servers in Raft communicate through &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-2-rpc-and-threads/#remote-procedure-call-rpc&quot;&gt;RPCs&lt;/a&gt;. There are two main RPC methods involved here:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;AppendEntries RPC&lt;/strong&gt;: This method is invoked by a leader to replicate its log entries to the other servers. It includes the leader&#39;s term, the new log entries, and identifiers for where to place the entries, among other things.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RequestVote RPC&lt;/strong&gt;: This is invoked by candidates to gain votes from the other servers. It includes the candidate&#39;s term, identifier, and some other items which will be discussed later.&lt;/li&gt;
&lt;/ol&gt;
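&lt;p&gt;As a sketch of what these two RPCs carry, the argument structs below mirror the fields described in the paper (the Go names and types are illustrative):&lt;/p&gt;

```go
package main

import "fmt"

// AppendEntriesArgs mirrors the arguments described in the paper: the
// leader's term, where the new entries should go, and the leader's
// commit index. Empty Entries makes the RPC a heartbeat.
type AppendEntriesArgs struct {
	Term         int      // leader's term
	LeaderID     int      // so followers can redirect clients
	PrevLogIndex int      // index of the entry immediately preceding the new ones
	PrevLogTerm  int      // term of that entry, used for the consistency check
	Entries      []string // new log entries; empty for heartbeats
	LeaderCommit int      // highest index the leader knows to be committed
}

// RequestVoteArgs carries the candidate's term and identifier, plus
// information about its log that voters use (discussed later in the post).
type RequestVoteArgs struct {
	Term         int
	CandidateID  int
	LastLogIndex int
	LastLogTerm  int
}

func main() {
	heartbeat := AppendEntriesArgs{Term: 2, LeaderID: 1}
	fmt.Println(len(heartbeat.Entries) == 0)
}
```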
&lt;h3 id=&quot;how-is-a-leader-elected%3F&quot;&gt;How is a leader elected? &lt;a class=&quot;direct-link&quot; href=&quot;#how-is-a-leader-elected%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A leader sends periodic &lt;em&gt;heartbeat&lt;/em&gt; messages to its followers to maintain its authority. These heartbeat messages are in the form of AppendEntries RPCs which contain no log entry. If a follower has not received any communication from the leader within a specified &lt;em&gt;election timeout,&lt;/em&gt; it will transition to candidate state. The follower does this because it assumes there is no leader in the cluster at present, and so it begins an election to choose a new one.&lt;/p&gt;
&lt;p&gt;The first step that a follower takes after becoming a candidate is to increase its term number. After doing that, it votes for itself and sends an RPC (including its term number) to all the other servers in parallel to request votes from them. A candidate will remain in its state until any of the following conditions is met:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It wins the election.&lt;/li&gt;
&lt;li&gt;Another server establishes itself as the leader.&lt;/li&gt;
&lt;li&gt;A period of time goes by with no winner of the election.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For a candidate to win an election, it must have received votes from the &lt;em&gt;majority&lt;/em&gt; of the servers in the cluster. Note that the majority is out of all the servers in the cluster, not just the live ones. A server can vote for at most one candidate in a given term, and the decision is made on a first-come-first-served basis. The winning candidate in a term then becomes a leader and immediately sends out &lt;em&gt;heartbeat&lt;/em&gt; messages to the other servers to establish its authority. Note that there is another restriction on how servers can vote which will be discussed later.&lt;/p&gt;
&lt;p&gt;It is also possible that while waiting for votes, a candidate receives an AppendEntries RPC from another server that claims to be the leader. If the supposed leader&#39;s term included in the RPC is greater than or equal to the candidate&#39;s term, the candidate will accept that the leader is legitimate and then transition back to follower state. Otherwise, if the leader has a smaller term than the candidate&#39;s term, the candidate will reject the RPC and continue in its state.&lt;/p&gt;
&lt;p&gt;Lastly, it&#39;s possible that a candidate neither wins nor loses an election. This can happen if there are many candidates at the same time, and the votes get split equally among them. If this happens, the candidates will time out, increase their terms, and then begin another round of the election process. Raft takes an extra measure to prevent &lt;em&gt;split votes&lt;/em&gt; from happening indefinitely, which will be discussed next.&lt;/p&gt;
&lt;h4 id=&quot;election-timeouts-are-randomized&quot;&gt;Election Timeouts Are Randomized &lt;a class=&quot;direct-link&quot; href=&quot;#election-timeouts-are-randomized&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Raft takes measures to prevent split votes in the first place by ensuring that the election timeout of each server is randomly chosen from a fixed interval. This way, the probability that two or more servers time out at the same time and become candidates is reduced. A single server can then win an election and send heartbeat messages to the other servers before their election timeout expires.&lt;/p&gt;
&lt;p&gt;Randomized timeouts are also used to handle split votes. Each candidate&#39;s election timeout is restarted at the start of an election, and the timeout must elapse before it can start another election. This reduces the probability of another split vote in the new election.&lt;/p&gt;
&lt;h4 id=&quot;any-two-majorities-must-overlap-in-at-least-one-server&quot;&gt;Any Two Majorities Must Overlap in At Least One Server &lt;a class=&quot;direct-link&quot; href=&quot;#any-two-majorities-must-overlap-in-at-least-one-server&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;All of Raft&#39;s key decisions rely on getting some confirmation from a majority of the servers in the cluster. The insight here is that if there is a partition in the cluster i.e., one or more nodes cannot communicate with the other set of nodes, at most one of the partitions can have the majority. In addition, any subsequent majorities &lt;em&gt;must&lt;/em&gt; overlap with the previous ones in at least one server. This is how Raft is able to ensure that a term has at most one leader and prevent a split-brain situation. If we have candidates from separate partitions that cannot communicate, only one of them will be able to gain a majority of votes. There is no chance of two candidates in separate partitions gaining a majority for the same term.&lt;/p&gt;
&lt;p&gt;A Raft cluster is typically made up of an odd number of servers, which improves fault tolerance. For example, a cluster of 4 nodes needs 3 nodes to reach a majority and can therefore tolerate only one failure, while a cluster of 5 nodes also needs just 3 and can tolerate two failures; the extra even-numbered server adds no fault tolerance. More generally, a cluster of &lt;em&gt;2f + 1&lt;/em&gt; servers can tolerate &lt;em&gt;f&lt;/em&gt; failed servers.&lt;/p&gt;
&lt;p&gt;Systems that rely on the overlap of majority set of servers for operation are referred to as &lt;em&gt;quorum&lt;/em&gt; systems.&lt;/p&gt;
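&lt;p&gt;The quorum arithmetic above is simple enough to sketch directly (the function names are mine):&lt;/p&gt;

```go
package main

import "fmt"

// majority returns the quorum size for a cluster of n servers;
// tolerable returns how many failures that cluster can survive.
// A cluster of 2f+1 servers tolerates f failures.
func majority(n int) int  { return n/2 + 1 }
func tolerable(n int) int { return (n - 1) / 2 }

func main() {
	// 4 and 5 servers both need a quorum of 3, but the 5-server
	// cluster survives one more failure, so odd sizes are preferred.
	fmt.Println(majority(4), tolerable(4)) // 3 1
	fmt.Println(majority(5), tolerable(5)) // 3 2
}
```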
&lt;h3 id=&quot;how-does-log-replication-work%3F&quot;&gt;How does log replication work? &lt;a class=&quot;direct-link&quot; href=&quot;#how-does-log-replication-work%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A leader is responsible for receiving client requests once it has been elected. These client requests contain commands that must be replicated to the servers in the cluster.&lt;/p&gt;
&lt;p&gt;When a request comes in, the leader creates a new log entry containing that command and appends the entry to its log. It then sends AppendEntries RPCs in parallel to its followers. An entry is only considered &lt;em&gt;committed&lt;/em&gt; when it has been safely replicated on a majority of the servers. If a follower crashes, or a network fault drops packets to it, the leader will keep retrying the request indefinitely until the follower receives the log entry; all followers &lt;em&gt;must&lt;/em&gt; eventually store committed log entries. Once an entry is committed, it can be executed by the state machine.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/raft-log.png&quot; alt=&quot;Figure 4: Raft Log&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 4 - &quot;Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines&quot;&lt;/p&gt;
&lt;p&gt;The protocol for log replication in Raft is described as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each log entry contains a command as well as the term number when the entry was received by the leader. An entry is also identified by an index, which is its position in the log.&lt;/li&gt;
&lt;li&gt;A leader keeps track of the highest entry that it has committed and includes that index in its future AppendEntries RPCs. Note that when the leader commits the entry at an index, that also commits all the preceding entries in its log. Followers apply entries to their state machines once they are notified that the entries have been committed by the leader.&lt;/li&gt;
&lt;li&gt;To maintain consistency in the logs, Raft ensures that the following properties are met:
&lt;ul&gt;
&lt;li&gt;If two entries in different logs have the same index and term, then they store the same command.&lt;/li&gt;
&lt;li&gt;If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;These properties constitute what is known as the &lt;em&gt;Log Matching&lt;/em&gt; property. The first property is guaranteed by the fact that a leader will only create one entry at a particular log index with a given term, and the position of a log entry will never change.&lt;/li&gt;
&lt;li&gt;Raft performs a consistency check with the AppendEntries RPC to guarantee the second property above. The leader always includes the index and term of the log entry that immediately precedes the new entries when it sends AppendEntries RPCs to its followers. When a follower receives the RPC, it refuses the new entries if its own log does not contain an entry with that index and term. If the entries are not refused, it means that the follower&#39;s log is identical to the leader&#39;s log up through the new entries.&lt;/li&gt;
&lt;/ul&gt;
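&lt;p&gt;The AppendEntries consistency check described above can be sketched as follows (an illustrative sketch using 1-based indices, as in the paper):&lt;/p&gt;

```go
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// consistencyCheck is the follower's side of the check: it accepts new
// entries only if its log contains an entry at prevLogIndex (1-based)
// whose term matches prevLogTerm.
func consistencyCheck(log []Entry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // the new entries start at the very beginning of the log
	}
	if prevLogIndex > len(log) {
		return false // the follower is missing entries
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	log := []Entry{{Term: 1, Command: "a"}, {Term: 2, Command: "b"}}
	fmt.Println(consistencyCheck(log, 2, 2)) // terms match: accept
	fmt.Println(consistencyCheck(log, 2, 1)) // term conflict: refuse
}
```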
&lt;h4 id=&quot;what-happens-if-the-leader-crashes%3F&quot;&gt;What happens if the leader crashes? &lt;a class=&quot;direct-link&quot; href=&quot;#what-happens-if-the-leader-crashes%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;During normal operations, the protocol described above helps the leader and the followers&#39; logs to remain consistent. However, leader crashes can result in inconsistencies. These inconsistencies can then be compounded by subsequent leader and follower crashes.&lt;/p&gt;
&lt;p&gt;The figure below describes how inconsistencies between the logs can play out.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/raft-inconsistencies.png&quot; alt=&quot;Figure 5: Possible inconsistency scenarios&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 5 - Possible inconsistencies in the logs of Raft servers.&lt;/p&gt;
&lt;p&gt;It is possible that any of the scenarios (a-f) in Figure 5 could happen in follower logs when the leader at the top is elected. Each box represents a log entry and the number in a box is the term of the entry. In this figure, we see that a follower may be missing entries (as in a and b), may have extra uncommitted entries (c-d), or both scenarios (e-f). Quoting the paper:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Raft handles inconsistencies by forcing the followers&#39; logs to duplicate the leader&#39;s, meaning that conflicting entries in a follower&#39;s log can be overwritten by the entries in the leader&#39;s log. To make a follower&#39;s log consistent with its own, the leader first finds the latest index at which their logs are identical, deletes all the entries in the follower&#39;s log after that index, and then sends the follower the entries in its own log that come after that point.&lt;/p&gt;
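&lt;p&gt;A sketch of that repair step, assuming the leader has already found the latest index at which the two logs agree (names here are illustrative):&lt;/p&gt;

```go
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// forceMatch makes a follower's log duplicate the leader's: every entry
// after the last agreed index (1-based) is deleted and replaced with the
// entries the leader sends for that point onward.
func forceMatch(follower []Entry, agreedIndex int, fromLeader []Entry) []Entry {
	follower = follower[:agreedIndex] // delete conflicting entries
	return append(follower, fromLeader...)
}

func main() {
	// The follower diverged after index 1 with uncommitted term-2 entries.
	follower := []Entry{{1, "a"}, {2, "x"}, {2, "y"}}
	leader := []Entry{{3, "b"}, {3, "c"}} // the leader's entries after index 1
	fmt.Println(forceMatch(follower, 1, leader))
}
```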
&lt;h3 id=&quot;why-is-it-safe-for-a-new-leader-to-overwrite-its-follower&#39;s-logs%3F&quot;&gt;Why is it safe for a new leader to overwrite its follower&#39;s logs? &lt;a class=&quot;direct-link&quot; href=&quot;#why-is-it-safe-for-a-new-leader-to-overwrite-its-follower&#39;s-logs%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Raft places a restriction on what servers can be elected as a leader with the &lt;em&gt;Leader Completeness Property&lt;/em&gt;, which states that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The leader for any given term must contain all the entries committed in previous terms.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It is able to enforce this because the RequestVote RPC contains information about the candidate&#39;s log, and a voter will deny a candidate its vote if the voter&#39;s own log is more up-to-date than the candidate&#39;s. Since a candidate must get votes from a majority, and any committed entry is stored on a majority, at least one voter in any majority must have the latest committed entries. To compare two logs, Raft looks at the last entry in each: the log whose last entry has the higher term is more up-to-date, and if the last entries have the same term, the longer log is the more up-to-date one.&lt;/p&gt;
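&lt;p&gt;The &amp;quot;more up-to-date&amp;quot; comparison can be sketched as a pure function of each log&#39;s last entry (the name is illustrative):&lt;/p&gt;

```go
package main

import "fmt"

// atLeastAsUpToDate reports whether log A is at least as up-to-date as
// log B, using only each log's last entry: the higher last term wins;
// equal terms fall back to log length.
func atLeastAsUpToDate(lastTermA, lastIndexA, lastTermB, lastIndexB int) bool {
	if lastTermA != lastTermB {
		return lastTermA > lastTermB
	}
	return lastIndexA >= lastIndexB
}

func main() {
	// A voter whose last entry is (term 3, index 5) is more up-to-date
	// than a candidate at (term 2, index 9), despite the shorter log,
	// so it would deny that candidate its vote.
	fmt.Println(atLeastAsUpToDate(3, 5, 2, 9))
}
```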
&lt;p&gt;By this restriction, it is safe for a leader to overwrite a follower&#39;s log since the leader will &lt;em&gt;always&lt;/em&gt; have the latest committed entries. Any uncommitted entries can be safely discarded because there is no expectation on the client&#39;s side that their request has been executed. Only committed entries guarantee that. Therefore, the server can return an error message to the client, telling it to retry the requests for the uncommitted entries.&lt;/p&gt;
&lt;h3 id=&quot;what-happens-when-a-follower-or-candidate-crashes%3F&quot;&gt;What happens when a follower or candidate crashes? &lt;a class=&quot;direct-link&quot; href=&quot;#what-happens-when-a-follower-or-candidate-crashes%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When a follower crashes, any RPCs sent to it will fail. These requests are retried indefinitely until the server restarts and the RPC completes. It is safe for the AppendEntries and RequestVote RPCs to be retried because they are &lt;a href=&quot;https://stackoverflow.com/a/1077421/5430313&quot;&gt;idempotent&lt;/a&gt;. For example, a follower can ignore an AppendEntries RPC if the request contains entries that are already present in its logs. Candidate crashes are handled in the same way.&lt;/p&gt;
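&lt;p&gt;A sketch of why retried AppendEntries calls are safe: entries already present at their index are simply skipped. This is deliberately simplified; a real follower also checks terms and truncates on conflict, as described earlier.&lt;/p&gt;

```go
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// appendFrom shows why retried AppendEntries RPCs are idempotent:
// entries that are already present at their 1-based index are skipped,
// so applying the same RPC twice leaves the log unchanged.
func appendFrom(log []Entry, startIndex int, entries []Entry) []Entry {
	for i, e := range entries {
		if startIndex+i > len(log) {
			log = append(log, e) // genuinely new entry
		}
		// otherwise this entry is already stored; ignore the duplicate
	}
	return log
}

func main() {
	log := []Entry{{1, "a"}}
	log = appendFrom(log, 2, []Entry{{1, "b"}})
	log = appendFrom(log, 2, []Entry{{1, "b"}}) // retried RPC: no effect
	fmt.Println(len(log))
}
```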
&lt;h3 id=&quot;how-do-we-choose-the-right-election-timeout%3F&quot;&gt;How do we choose the right election timeout? &lt;a class=&quot;direct-link&quot; href=&quot;#how-do-we-choose-the-right-election-timeout%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Although Raft does not depend on timing to ensure the correctness of the logs, timing is important in ensuring that client requests are responded to swiftly. If the election timeout is too long, the servers might be without a leader for some time, which will delay the time it takes to respond to a client request.&lt;/p&gt;
&lt;p&gt;The timing requirement that a Raft system should maintain is stated in the paper as:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;broadcastTime &amp;lt;&amp;lt; electionTimeout &amp;lt;&amp;lt; MTBF&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;where &lt;em&gt;broadcast time&lt;/em&gt; is the average time taken by a server to send RPCs to all the servers in parallel and receive responses from them, and &lt;em&gt;MTBF&lt;/em&gt; stands for the Mean Time Between Failures for a single server.&lt;/p&gt;
&lt;p&gt;Selecting an election timeout that is significantly greater than the broadcast time ensures that leaders can reliably send the heartbeat messages needed to prevent unnecessary elections.&lt;/p&gt;
&lt;p&gt;In addition, the election timeout should be orders of magnitude less than the average time between server failures. When a leader fails, the cluster is without a leader for roughly the election timeout before a new election begins; keeping the timeout small relative to the MTBF means the system is unavailable for only a small fraction of the time.&lt;/p&gt;
&lt;h3 id=&quot;how-does-raft-tame-its-logs%3F&quot;&gt;How does Raft tame its logs? &lt;a class=&quot;direct-link&quot; href=&quot;#how-does-raft-tame-its-logs%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;To prevent the log from growing indefinitely, which can increase the time it takes for the state machine to replay a log when it restarts, Raft uses &lt;em&gt;snapshotting&lt;/em&gt; for log compaction. A snapshot of the current application&#39;s state is written to durable storage, and all the log entries up to the point of the snapshot are deleted from the log. Each server takes its snapshots independently, and snapshots are taken when the log reaches a fixed size in bytes.&lt;/p&gt;
&lt;p&gt;A snapshot also contains metadata such as the &lt;em&gt;last included index,&lt;/em&gt; which is the index of the last entry in the log being replaced by the snapshot, and the &lt;em&gt;last included term.&lt;/em&gt; These metadata are kept because of the AppendEntries consistency check for the first log entry after the snapshot, which needs a previous log entry and term.&lt;/p&gt;
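&lt;p&gt;A sketch of the snapshot metadata and the compaction step described above (the names are illustrative, and a real server would also persist the snapshot durably before trimming the log):&lt;/p&gt;

```go
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// Snapshot holds the compacted application state plus the metadata
// needed for the AppendEntries consistency check on the first log entry
// after the snapshot.
type Snapshot struct {
	LastIncludedIndex int    // index of the last entry replaced by the snapshot
	LastIncludedTerm  int    // term of that entry
	State             []byte // serialized application state
}

// compact discards every entry up to and including lastIndex (1-based),
// returning the snapshot metadata and the trimmed log.
func compact(log []Entry, lastIndex int, state []byte) (Snapshot, []Entry) {
	snap := Snapshot{
		LastIncludedIndex: lastIndex,
		LastIncludedTerm:  log[lastIndex-1].Term,
		State:             state,
	}
	return snap, log[lastIndex:]
}

func main() {
	log := []Entry{{1, "a"}, {1, "b"}, {2, "c"}}
	snap, rest := compact(log, 2, []byte("app state"))
	fmt.Println(snap.LastIncludedIndex, snap.LastIncludedTerm, len(rest))
}
```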
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;A few details have been omitted from this post, such as how Raft manages changes to cluster membership and how client interactions are handled. I do hope, however, that this post has helped you build some intuition for a consensus algorithm like Raft if you were not previously familiar with one. The further reading section contains links to resources with more detail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The study group will be going on a break for two weeks, until 15 June, so there will be a longer delay before the next lecture summary.&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf&quot;&gt;In Search of an Understandable Consensus Algorithm (Extended Version)&lt;/a&gt; by Diego Ongaro and John Ousterhout. The original Raft paper.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://raft.github.io/&quot;&gt;Raft Consensus Algorithm&lt;/a&gt; - Site with more Raft resources.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://thesecretlivesofdata.com/raft/&quot;&gt;The Secret Lives of Data&lt;/a&gt; - Really cool Raft visualization.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-raft.txt&quot;&gt;Lecture 6: Raft (1)&lt;/a&gt; and &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-raft2.txt&quot;&gt;Lecture 7: Raft (2)&lt;/a&gt; - MIT 6.824 Lecture Notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying&quot;&gt;The Log: What every software engineer should know about real-time data&#39;s unifying abstraction&lt;/a&gt; by Jay Kreps.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Last updated on 15-02-2025 to reduce clutter.&lt;/em&gt;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 5 -  Go, Threads, and Raft</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-5-go-threads-and-raft/"/>
		<updated>2020-05-19T18:50:01-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-5-go-threads-and-raft/</id>
		<content type="html">&lt;p&gt;Although &#39;Raft&#39; is mentioned in the title, a better title for this post is &#39;Concurrency in Go&#39;, as Raft will not be discussed until the next post. Also, unlike other posts on this blog, this one will also feature code samples! Examples of good and bad Go code will be shown for building concurrent applications.&lt;/p&gt;
&lt;p&gt;Note that although the examples below are in Go, these concepts apply more generally to other programming languages.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#mutexes&quot;&gt;Mutexes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#condition-variables&quot;&gt;Condition Variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#channels&quot;&gt;Channels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;mutexes&quot;&gt;Mutexes &lt;a class=&quot;direct-link&quot; href=&quot;#mutexes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Goroutines are lightweight threads in Go. Every Go program has at least one goroutine running, which is the main one from which other goroutines are started. Goroutines execute their functions asynchronously.&lt;/p&gt;
&lt;p&gt;In the block below, goroutines are started on line 4 using the &lt;em&gt;go&lt;/em&gt; keyword. The for loop on line 3 starts 1000 goroutines, each of which increments the counter variable declared on line 2.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://timilearning.com/uploads/basic-no-mutex.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Can you spot the bug?&lt;/p&gt;
&lt;p&gt;Now, if you wrote a function like the one above, chances are you would expect the final value of the counter to be 1000. However, when this program is executed, the counter is highly unlikely to reach 1000.&lt;/p&gt;
&lt;p&gt;This is because the counter variable is shared by all the goroutines, and we are not protecting access to it. For example, if the current value of the counter is 5 and two goroutines try to increment it at the same time, both may read the value 5 before either writes back, so the counter may end up at 6 rather than the 7 we expect after two increments. This is a &lt;em&gt;race condition,&lt;/em&gt; and a program like this must take care to prevent it.&lt;/p&gt;
&lt;p&gt;Go has &lt;em&gt;mutexes&lt;/em&gt;, which help prevent race conditions when used properly. Mutexes can be used to protect &lt;em&gt;critical sections&lt;/em&gt; in your code. Critical sections are those blocks of code which should only be accessed by one thread at a time. Those blocks will typically involve accessing shared variables, and protecting them is essential in ensuring that your program runs predictably.&lt;/p&gt;
&lt;p&gt;In the block below, the variable &lt;em&gt;mu&lt;/em&gt; declared on line 3 represents the mutex, and we wrap a lock around the shared counter variable on lines 6-8.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://timilearning.com/uploads/basic-mutex.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;When this program is executed, the final value of the counter will be 1000, as expected.&lt;/p&gt;
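&lt;p&gt;Since the samples above are images, here is a runnable sketch of the mutex-protected counter. The structure follows the course sample, but the helper function and variable names are my own, and I use a &lt;em&gt;sync.WaitGroup&lt;/em&gt; to wait for all the goroutines to finish before reading the final value:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// count launches n goroutines that each increment a shared counter,
// protecting the increment with a mutex. (Sketch, not the exact
// course code.)
func count(n int) int {
	var (
		counter int
		mu      sync.Mutex
		wg      sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock() // enter the critical section
			counter++
			mu.Unlock()
		}()
	}
	wg.Wait() // wait for every goroutine to finish
	return counter
}

func main() {
	fmt.Println(count(1000)) // reliably prints 1000
}
```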
&lt;p&gt;More generally, &lt;em&gt;use locks when you want to protect invariants in a concurrent application&lt;/em&gt;, i.e. properties of your application which should always be true. To explain this, let&#39;s look at this example below in which we perform operations for two clients of a bank: Seyi and Yinka.&lt;/p&gt;
&lt;p&gt;Seyi and Yinka both start the operation with 10,000 in each of their accounts, leading to a total of 20,000 in the bank. In the first goroutine declared on line 8, Seyi transfers money from her account to Yinka&#39;s. In the second goroutine on line 19, Yinka transfers money from his account to Seyi&#39;s.&lt;/p&gt;
&lt;p&gt;An author of a program like this will likely expect the sum of the amount in each account to always be equal to 20,000—since any amount removed from one account is expected to be transferred to the other immediately.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://timilearning.com/uploads/bank-per-item.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Can you spot the bug?&lt;/p&gt;
&lt;p&gt;When this program is run, the code on line 33 will likely be executed multiple times, highlighting that there are multiple periods during which the invariant (that the sum of the accounts&#39; values should equal the initial total) is observed to be violated.&lt;/p&gt;
&lt;p&gt;The problem in this example is that although we have placed locks on each shared variable, locks by themselves are not enough to enforce the invariants in our program. If a section of code must be executed atomically (i.e. all the operations in that section must either execute together or not execute at all), then a lock should wrap around that section.&lt;/p&gt;
&lt;p&gt;A modified version of the bank example is shown below. In this example, we see on lines 10-13 and lines 18-21 that locks are wrapped around the critical sections, which will help protect the invariants we are interested in.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://timilearning.com/uploads/bank-proper.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
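&lt;p&gt;As a text version of the corrected bank example (a sketch with my own function names, not the exact course code), holding a single lock across both balance updates means no other goroutine can ever observe the money &quot;in flight&quot;:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// runTransfers performs n transfers in each direction between two
// accounts. One lock guards both balance updates, so the invariant
// seyi + yinka == 20000 always holds when the lock is free.
func runTransfers(n int) (int, int) {
	var mu sync.Mutex
	seyi, yinka := 10000, 10000

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			mu.Lock()
			seyi-- // both updates happen atomically,
			yinka++ // under the same lock
			mu.Unlock()
		}()
		go func() {
			defer wg.Done()
			mu.Lock()
			yinka--
			seyi++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return seyi, yinka
}

func main() {
	s, y := runTransfers(1000)
	fmt.Println(s + y) // always 20000
}
```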
&lt;h3 id=&quot;condition-variables&quot;&gt;Condition Variables &lt;a class=&quot;direct-link&quot; href=&quot;#condition-variables&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;To introduce condition variables, let&#39;s consider this example below.&lt;/p&gt;
&lt;p&gt;In this program, we have a for loop on line 6 which starts goroutines from the main goroutine and tries to gain votes from them on line 8. If it wins a vote, the shared &lt;em&gt;count&lt;/em&gt; variable is incremented. After each vote is cast, the &lt;em&gt;finished&lt;/em&gt; variable is also incremented.&lt;/p&gt;
&lt;p&gt;On line 18, we have a for loop which will prevent the main goroutine from progressing until it has either won at least 5 votes or all 10 goroutines have cast their vote.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://timilearning.com/uploads/no-cond-var.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;Note that we have placed locks to protect our invariants, and this program will indeed execute as we expect. On closer observation, however, the for loop starting on line 18 is not the best use of the CPU&#39;s resources. In each iteration, it obtains the lock, checks whether the condition is met, and then releases the lock. The problem is that nothing bounds the number of iterations the loop runs before the &lt;em&gt;break&lt;/em&gt; condition on line 20 is met. This means that there could be many unnecessary loop iterations, each of which acquires access to the shared variables.&lt;/p&gt;
&lt;p&gt;What if there was a way to run an iteration of the loop &lt;em&gt;only&lt;/em&gt; when it is possible for the break condition to have been met? In the example above, the condition can only be met when the requestVote() call returns. Thanks to &lt;em&gt;Condition Variables,&lt;/em&gt; we can delay the execution of the loop until it is possible for the break condition to have been met.&lt;/p&gt;
&lt;p&gt;This next example illustrates that.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://timilearning.com/uploads/cond-var.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;In this example, we have initialized a condition variable on line 5 and passed it the address of the mutex. When the for loop on line 21 is reached and the condition isn&#39;t met, we use the &lt;em&gt;cond.Wait()&lt;/em&gt; method to pause the execution of the main goroutine. A key thing to note is that &lt;em&gt;cond.Wait() releases the lock being held on mu by the current Goroutine&lt;/em&gt;. This execution pause and lock release will happen until the &lt;em&gt;cond.Broadcast()&lt;/em&gt; method is executed on line 16.&lt;/p&gt;
&lt;p&gt;Any newly cast vote may cause the break condition to be true, and so we use the cond.Broadcast() method to inform all the threads that are waiting on that condition to resume their execution, and try to obtain the lock on the mutex again.&lt;/p&gt;
&lt;p&gt;This is a more efficient use of CPU resources than looping continuously without any knowledge of whether the break condition could have been met or not.&lt;/p&gt;
&lt;p&gt;The general format when using condition variables is as shown in the pseudocode below.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://timilearning.com/uploads/cond-var-structure.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
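&lt;p&gt;Putting that pattern together, here is a sketch of the vote-counting example using a condition variable. The shape follows the course sample, but requestVote() is stubbed out with a coin flip and the helper names are my own:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// requestVote stands in for a vote RPC; here it just returns a
// random yes/no answer.
func requestVote() bool { return rand.Intn(2) == 0 }

// runElection asks 10 peers for votes and, instead of spinning,
// waits on a condition variable until the outcome can be decided.
// It returns (yes votes, votes cast) at the moment it stops waiting.
func runElection() (int, int) {
	var mu sync.Mutex
	cond := sync.NewCond(&mu)
	count, finished := 0, 0

	for i := 0; i < 10; i++ {
		go func() {
			vote := requestVote()
			mu.Lock()
			if vote {
				count++
			}
			finished++
			cond.Broadcast() // a vote was cast: wake the waiter to re-check
			mu.Unlock()
		}()
	}

	mu.Lock()
	defer mu.Unlock()
	for count < 5 && finished != 10 {
		cond.Wait() // atomically releases mu and pauses this goroutine
	}
	return count, finished
}

func main() {
	count, _ := runElection()
	if count >= 5 {
		fmt.Println("won the election")
	} else {
		fmt.Println("lost the election")
	}
}
```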
&lt;h3 id=&quot;channels&quot;&gt;Channels &lt;a class=&quot;direct-link&quot; href=&quot;#channels&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A &lt;a href=&quot;https://tour.golang.org/concurrency/2&quot;&gt;Channel&lt;/a&gt; is a primitive for communicating between goroutines. The general idea is that one goroutine sends a &lt;em&gt;message&lt;/em&gt; on the channel, and that message is received from that channel on a separate goroutine.&lt;/p&gt;
&lt;p&gt;An example is shown below.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://timilearning.com/uploads/channel.png&quot; alt=&quot;&quot;&gt;&lt;/p&gt;
&lt;p&gt;The channel operator is used on lines 5 and 8 to communicate between the goroutines.&lt;/p&gt;
&lt;p&gt;Some key things to note about channels in their default usage are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Channels have no internal capacity, not even a capacity of one. The implication of this is that for every send operation on a channel, there must be a corresponding receive operation, and vice versa.&lt;/li&gt;
&lt;li&gt;Every send operation will block its thread until a receive happens, and vice versa.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Go has buffered channels which may have some internal capacity that corresponds to a user-defined value. With buffered channels, sends are non-blocking until the buffer is full, and receive operations are non-blocking until the buffer is empty.&lt;/p&gt;
&lt;p&gt;A major use of channels is for implementing producer/consumer queues.&lt;/p&gt;
&lt;p&gt;If you&#39;re a Java developer, the &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/BlockingQueue.html&quot;&gt;BlockingQueue&lt;/a&gt; interface is similar to Go channels, with the &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ArrayBlockingQueue.html&quot;&gt;ArrayBlockingQueue&lt;/a&gt; corresponding to a buffered channel and the &lt;a href=&quot;https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/SynchronousQueue.html&quot;&gt;SynchronousQueue&lt;/a&gt; to an unbuffered one.&lt;/p&gt;
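&lt;p&gt;A minimal producer/consumer sketch using a buffered channel (my own example, not from the course material):&lt;/p&gt;

```go
package main

import "fmt"

// consume drains a producer/consumer queue built from a buffered
// channel: the producer blocks only when the buffer is full, and
// the consumer blocks only when it is empty.
func consume() int {
	jobs := make(chan int, 3) // buffered channel, capacity 3

	go func() {
		for i := 1; i <= 5; i++ {
			jobs <- i // non-blocking until the buffer fills up
		}
		close(jobs) // tell the consumer no more values are coming
	}()

	sum := 0
	for j := range jobs { // receives until the channel is closed
		sum += j
	}
	return sum
}

func main() {
	fmt.Println(consume()) // 15
}
```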
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We have gone over some Go constructs that will help when building concurrent applications. I&#39;ve had to make use of some of these in working out the &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/labs/lab-mr.html&quot;&gt;labs&lt;/a&gt;. If you&#39;re interested in learning more about concurrency in Go, I would highly recommend checking out &lt;a href=&quot;https://golang.org/ref/mem&quot;&gt;this&lt;/a&gt; page on the official Go site.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The code samples are lifted from the course material and can be downloaded from &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/go-concurrency.tar.gz&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://golang.org/ref/mem&quot;&gt;The Go Memory Model&lt;/a&gt; - Official Documentation&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-go-concurrency.txt&quot;&gt;Go Concurrency&lt;/a&gt; - MIT 6.824 Lecture Notes&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 4 - Primary/Backup Replication</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-4-primary-backup-replication/"/>
		<updated>2020-05-09T19:11:43-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-4-primary-backup-replication/</id>
		<content type="html">&lt;p&gt;This lecture&#39;s material was on the subject of replication as a means of achieving fault tolerance in a distributed system. The &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/vm-ft.pdf&quot;&gt;VMware FT paper&lt;/a&gt; was used as a case study on how replication can be implemented.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#vmware-ft-paper-summary&quot;&gt;VMware FT Paper Summary&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#glossary&quot;&gt;Glossary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#overview&quot;&gt;Overview&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#deterministic-replay&quot;&gt;Deterministic Replay&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#ft-protocol&quot;&gt;FT Protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#detecting-and-handling-failures&quot;&gt;Detecting and Handling Failures&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#practical-implementation-of-ft&quot;&gt;Practical Implementation of FT&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#starting-and-restarting-vms&quot;&gt;Starting and Restarting VMs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#managing-the-logging-channel&quot;&gt;Managing the Logging Channel&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#implementation-issues-for-disk-io&quot;&gt;Implementation Issues for Disk IO&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#issue-%231&quot;&gt;Issue #1&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#solution&quot;&gt;Solution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#issue-%232&quot;&gt;Issue #2&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#solution-2&quot;&gt;Solution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#issue-%233&quot;&gt;Issue #3&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#solution-3&quot;&gt;Solution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#design-alternatives&quot;&gt;Design Alternatives&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#open-question-i-have&quot;&gt;Open Question I Have&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#update&quot;&gt;Update&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;replication-approaches&quot;&gt;Replication Approaches &lt;a class=&quot;direct-link&quot; href=&quot;#replication-approaches&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There are two main ways in which replication can be implemented in a cluster:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;State Transfer&lt;/strong&gt;: In this mode, the primary replica executes all the relevant operations and regularly ships its state to the backup(s). This state could include CPU changes, memory and I/O device changes, etc, and this approach to replication often requires a lot of bandwidth.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replicated State Machine&lt;/strong&gt;: The idea here is that &lt;em&gt;if two state machines are fed the same input operations in the same order, their outputs will be the same&lt;/em&gt;. The caveat is that these input operations should be deterministic. VMware FT uses this approach for replication and does work to ensure that non-deterministic operations are applied in a deterministic way. This will be explained further in a later section.&lt;/li&gt;
&lt;/ul&gt;
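&lt;p&gt;The replicated state machine idea can be sketched in a few lines of Go (a toy example of my own, not from the paper): two replicas fed the same deterministic operations in the same order end up in the same state.&lt;/p&gt;

```go
package main

import "fmt"

// apply is a deterministic transition: the next state depends only
// on the current state and the input operation.
func apply(state, op int) int { return state + op }

// replay feeds an ordered log of operations to a fresh state machine.
func replay(ops []int) int {
	state := 0
	for _, op := range ops {
		state = apply(state, op)
	}
	return state
}

func main() {
	ops := []int{5, -2, 7, 1} // the same input log for both replicas
	primary := replay(ops)
	backup := replay(ops)
	fmt.Println(primary == backup) // true: identical final states
}
```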
&lt;h3 id=&quot;replication-challenges&quot;&gt;Replication Challenges &lt;a class=&quot;direct-link&quot; href=&quot;#replication-challenges&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Implementing replication is challenging, and some of those challenges come down to answering the following questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What state do we replicate?&lt;/li&gt;
&lt;li&gt;Should the primary machine always wait for the backup to apply the replicated state?&lt;/li&gt;
&lt;li&gt;When should the backup take over and become the primary?&lt;/li&gt;
&lt;li&gt;How do we ensure that a replacement backup gets in sync with the primary?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As for what state to replicate, the options include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Application-Level State&lt;/strong&gt;: An example of this is replicating a database&#39;s tables. &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-3-gfs/&quot;&gt;GFS&lt;/a&gt; replicates application-level state. In this case, the primary will only send high level operations like database queries to its backups.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Machine-Level State:&lt;/strong&gt; This involves replicating things like the contents of the RAM and registers, machine events like interrupts and DMA, etc. The primary also forwards machine events (interrupts, DMA) to the backup. VMware FT replicates machine-level state. It&#39;s able to do this because it has full control of the execution of both the primary and backup, which makes it easier to deal with non-deterministic events. I&#39;ll go into more detail about this later.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For all the benefits of replication, it&#39;s important to note that replication is not able to protect against all kinds of failures. There are different failure modes in distributed systems and the failures dealt with in the paper&#39;s implementation are &lt;em&gt;fail-stop failures.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Fail-stop failures are noticeable failures which cause a machine to stop executing. Here, it is assumed that the machine will not compute any bad results in the process of failing. Replication can help to provide fault tolerance for these kinds of failures. On the other hand, replication may be less helpful when dealing with hardware defects, software bugs or human configuration errors. These may sometimes be detected using checksums, for example.&lt;/p&gt;
&lt;h2 id=&quot;vmware-ft-paper-summary&quot;&gt;VMware FT Paper Summary &lt;a class=&quot;direct-link&quot; href=&quot;#vmware-ft-paper-summary&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;h4 id=&quot;glossary&quot;&gt;Glossary &lt;a class=&quot;direct-link&quot; href=&quot;#glossary&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Guest OS&lt;/strong&gt;: A Guest OS is the operating system installed on a virtual machine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hypervisor:&lt;/strong&gt; A Hypervisor (also known as a Virtual Machine Monitor) is the software which creates, runs and manages virtual machines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host OS:&lt;/strong&gt; The Host OS is the operating system installed on the physical machine on which the hypervisor runs. (Note: There are some hypervisors which interact directly with the hardware in place of a Host OS, but those are not the focus of this topic.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Virtual Machine:&lt;/strong&gt; A virtual machine is an emulation of a physical computer system. Like physical machines, virtual machines are able to run an operating system and execute its applications as normal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VMware vSphere FT:&lt;/strong&gt; This is an implementation of fault tolerance on VMware&#39;s cloud computing virtualization platform.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;overview&quot;&gt;Overview &lt;a class=&quot;direct-link&quot; href=&quot;#overview&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/basic-ft.png&quot; alt=&quot;Figure 1: Basic FT Configuration. Illustrates the interaction between the Primary VM and Backup VM via a Logging channel&quot;&gt; &lt;/p&gt;
&lt;p&gt;Figure 1 illustrates the FT configuration which involves a primary VM and its backup VM. The backup VM is located on a different physical server from the primary and is kept in sync with the execution of the primary. The VMs are said to be in &lt;em&gt;virtual lock-step.&lt;/em&gt; They also have their virtual disks located on shared storage, which means that the disks are accessible by either VM.&lt;/p&gt;
&lt;p&gt;All the inputs (e.g. network, mouse, keyboard, etc.) go to the primary VM. These inputs are then forwarded to the backup VM via a network connection known as the &lt;em&gt;logging channel.&lt;/em&gt; For non-deterministic input operations, additional information is also sent to ensure that the backup VM executes them in a deterministic way. Both VMs execute these operations but only the outputs produced by the primary VM are visible to the client. The outputs of the backup VM get dropped by the hypervisor.&lt;/p&gt;
&lt;h4 id=&quot;deterministic-replay&quot;&gt;Deterministic Replay &lt;a class=&quot;direct-link&quot; href=&quot;#deterministic-replay&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;As mentioned earlier, for state machine replication to work, there needs to be a way to deal with the presence of non-deterministic instructions. A virtual machine can have non-deterministic events like virtual interrupts, and non-deterministic operations like reading the current clock cycle counter of the processor. These instructions may yield different results even if the primary and its backup have the same state. Therefore, it is important for them to be handled carefully to ensure consistency between the two machines.&lt;/p&gt;
&lt;p&gt;The occurrence of this non-determinism presents three challenges when replicating the execution of a VM:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;All the input and non-determinism on the primary VM must be correctly captured to ensure that it can be executed deterministically on the backup VM.&lt;/li&gt;
&lt;li&gt;The inputs and non-determinism must be correctly applied to the backup VM.&lt;/li&gt;
&lt;li&gt;They must be applied in a manner that does not degrade the performance of the system.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To deal with these challenges, VMware has a &lt;a href=&quot;http://www-mount.ece.umn.edu/~jjyi/MoBS/2007/program/01C-Xu.pdf&quot;&gt;&lt;em&gt;deterministic replay functionality&lt;/em&gt;&lt;/a&gt; that is able to capture all the inputs to a primary VM as well as possible non-determinism, and replay these events in their exact order of execution on a backup VM.&lt;/p&gt;
&lt;h4 id=&quot;ft-protocol&quot;&gt;FT Protocol &lt;a class=&quot;direct-link&quot; href=&quot;#ft-protocol&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;We have seen so far that VMware FT uses deterministic replay to produce the relevant log entries and ensure that the backup VM executes identically to the primary VM. However, to keep the backup VM consistent in a way that makes it indistinguishable from the primary to a client, VMware FT has the following requirement:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Output Requirement:&lt;/strong&gt; If the backup VM ever takes over after a failure of the primary, the backup VM will continue executing in a way that is entirely consistent with all outputs that the primary VM has sent to the external world.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The rule set in place to achieve the output requirement is the Output Rule stated as follows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Output Rule:&lt;/strong&gt; The primary VM may not send an output to the external world, until the backup VM has received and acknowledged the log entry associated with the operation producing the output.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The idea is that as long as the backup VM has received all the log entries (including for any output-producing operations), then it will be able to replay up to the state last seen by the client if the primary VM crashes.&lt;/p&gt;
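&lt;p&gt;As a rough analogy in Go (my own simplification, not VMware&#39;s implementation), an unbuffered channel captures the shape of the Output Rule: the send of the log entry completes only once the backup has received it, and only then is the output released to the outside world.&lt;/p&gt;

```go
package main

import "fmt"

// deliver models the Output Rule: the primary ships the log entry for
// an output-producing operation over the (unbuffered) logging channel,
// and only once the backup has received it -- which is when the send
// completes -- does the primary release the output.
func deliver(entry string, logging chan string) string {
	logging <- entry // blocks until the backup receives the entry
	return entry     // safe: the backup can now replay this output
}

func main() {
	logging := make(chan string) // unbuffered: send completes on receive
	go func() {
		<-logging // the backup consuming its log buffer
	}()
	fmt.Println("output to client:", deliver("write x=1", logging))
}
```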
&lt;p&gt;This rule is illustrated below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/ft-protocol-small.png&quot; alt=&quot;Figure 2: Illustration of the Output Rule&quot;&gt; &lt;/p&gt;
&lt;p&gt;As shown in Figure 2, the output to the external world is delayed until the primary VM has received an acknowledgment from the backup VM that it has received the associated log entry for the output operations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; VMware FT does not guarantee that all outputs are produced exactly once during a failover situation. In a scenario where the primary crashes &lt;em&gt;after&lt;/em&gt; the backup has received the log entry for the latest output operation, the backup cannot tell if the primary crashed immediately before or after sending its latest output. Therefore, the backup may re-execute an output operation. The good news is that the VMware setup can rely on its network infrastructure to detect duplicate packets and prevent them from being retransmitted to the client.&lt;/p&gt;
&lt;h4 id=&quot;detecting-and-handling-failures&quot;&gt;Detecting and Handling Failures &lt;a class=&quot;direct-link&quot; href=&quot;#detecting-and-handling-failures&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Heartbeat operations are used in combination with monitoring the traffic on the logging channel to detect failures of either VM. The system can rely on the traffic on the logging channel because it logs timer interrupts (which occur regularly on a guest OS), and so when the traffic slows down, it can safely assume that the guest OS is not functioning.&lt;/p&gt;
&lt;p&gt;If the backup fails, the primary stops sending entries on the logging channel and just keeps executing as normal. If you&#39;re wondering how the backup will be able to catch up later, VMware has a tool called &lt;a href=&quot;https://www.vmware.com/pdf/vmotion_datasheet.pdf&quot;&gt;VMotion&lt;/a&gt; that is able to clone a VM with minimal interruption to the execution of the VM. I&#39;ll touch more on that later.&lt;/p&gt;
&lt;p&gt;If the primary fails, the backup VM must first replay its execution until it has consumed the last log entry. After that point, the backup will take over as the primary and can now start producing output to the external world.&lt;/p&gt;
&lt;p&gt;To avoid a split-brain situation, VMware uses an atomic test-and-set operation on the shared storage to ensure that only one VM can be the primary at a time. This matters if the primary and backup lose network communication with each other and both attempt to go live: the operation will succeed for only one of the machines.&lt;/p&gt;
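&lt;p&gt;The go-live decision can be modelled with an atomic compare-and-swap (a toy sketch of my own; the real mechanism is a test-and-set on shared storage, not in-process memory):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// electPrimary has every candidate VM attempt an atomic test-and-set
// on a flag standing in for the shared storage. Exactly one
// compare-and-swap can succeed, so exactly one VM goes live.
func electPrimary(candidates []string) []string {
	var flag int32 // 0 = no VM has gone live yet
	var mu sync.Mutex
	var winners []string

	var wg sync.WaitGroup
	for _, vm := range candidates {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if atomic.CompareAndSwapInt32(&flag, 0, 1) {
				mu.Lock()
				winners = append(winners, name)
				mu.Unlock()
			}
		}(vm)
	}
	wg.Wait()
	return winners
}

func main() {
	fmt.Println(electPrimary([]string{"vm-a", "vm-b"})) // exactly one winner
}
```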
&lt;p&gt;VMware FT is able to restore redundancy when either of the VMs fails by automatically starting a new backup VM on another host.&lt;/p&gt;
&lt;h3 id=&quot;practical-implementation-of-ft&quot;&gt;Practical Implementation of FT &lt;a class=&quot;direct-link&quot; href=&quot;#practical-implementation-of-ft&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;h4 id=&quot;starting-and-restarting-vms&quot;&gt;Starting and Restarting VMs &lt;a class=&quot;direct-link&quot; href=&quot;#starting-and-restarting-vms&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A challenge in building a system like this is in figuring out how to start up a backup VM in the same state as the primary VM while the primary is running. To address this, VMware modified an existing tool called VMware VMotion. In its original form, VMware VMotion allows for a running VM to be migrated from one server to another with minimal disruption. However, for fault tolerance purposes, the tool was reworked as FT VMotion to allow for the &lt;em&gt;cloning&lt;/em&gt; of a VM to a remote host. This cloning operation interrupts the execution of the primary by less than a second.&lt;/p&gt;
&lt;p&gt;A logging channel is set up automatically by FT VMotion between the source VM and the destination, with the source entering logging mode as the primary, and the destination entering replay mode as a backup.&lt;/p&gt;
&lt;h4 id=&quot;managing-the-logging-channel&quot;&gt;Managing the Logging Channel &lt;a class=&quot;direct-link&quot; href=&quot;#managing-the-logging-channel&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/logging-channel-small.png&quot; alt=&quot;Figure 3: FT Logging Buffers and Channel&quot;&gt; &lt;/p&gt;
&lt;p&gt;Figure 3 above illustrates the flow of log entries, from their production at the primary VM to their consumption at the backup VM.&lt;/p&gt;
&lt;p&gt;If the log buffer of the backup VM is empty, the VM will stop its execution until the next log entry arrives. This pause can be tolerated since the backup VM is not interacting with clients.&lt;/p&gt;
&lt;p&gt;On the other hand, the primary VM will stop executing when its log buffer is full, and the execution is paused until the log entries are flushed out. However, this pause can affect clients and so work must be done to minimize the possibility of the log buffer getting filled up.&lt;/p&gt;
&lt;p&gt;The log buffer of the primary may fill up when the backup VM is executing too slowly and thus consuming its log entries too slowly. This can happen when the host machine of the backup VM is overloaded with other VMs, and so the backup is unable to get enough CPU and memory resources to execute normally.&lt;/p&gt;
&lt;p&gt;Apart from avoiding the pauses due to a full log buffer in the primary VM, another reason we do not want the backup to execute slowly is to reduce the time that it takes to &amp;quot;go live&amp;quot;: during the failover process, if the backup is lagging far behind the latest log entry, it will take longer for it to take over as the primary, and this delay will be noticeable to clients.&lt;/p&gt;
&lt;p&gt;Interestingly, a mechanism has been implemented in VMware FT to slow down the execution of the primary VM when the backup VM is starting to get far behind (more than 1 second behind according to the paper). The execution lag between the primary and backup VM is typically less than 100 milliseconds, and so it makes sense that a lag of 1 second will raise an alarm. The primary VM is slowed down by giving it a smaller amount of CPU resources to use.&lt;/p&gt;
&lt;h4 id=&quot;implementation-issues-for-disk-io&quot;&gt;Implementation Issues for Disk IO &lt;a class=&quot;direct-link&quot; href=&quot;#implementation-issues-for-disk-io&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;This section details some challenges related to disk IO in building a system like this, and how VMware FT addresses these challenges.&lt;/p&gt;
&lt;h5 id=&quot;issue-%231&quot;&gt;Issue #1 &lt;a class=&quot;direct-link&quot; href=&quot;#issue-%231&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Simultaneous disk operations to the same location can lead to non-determinism, since these operations can execute in parallel and are non-blocking.&lt;/p&gt;
&lt;h6 id=&quot;solution&quot;&gt;Solution &lt;a class=&quot;direct-link&quot; href=&quot;#solution&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h6&gt;
&lt;p&gt;Their implementation is able to detect such IO races and force racing disk operations to execute sequentially.&lt;/p&gt;
&lt;h5 id=&quot;issue-%232&quot;&gt;Issue #2 &lt;a class=&quot;direct-link&quot; href=&quot;#issue-%232&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;What happens to disk IO operations that were issued on the primary but had not completed when it failed and the backup took over? The challenge is that the backup cannot know whether those operations completed successfully.&lt;/p&gt;
&lt;h6 id=&quot;solution-2&quot;&gt;Solution &lt;a class=&quot;direct-link&quot; href=&quot;#solution-2&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h6&gt;
&lt;p&gt;Those operations are simply reissued.&lt;/p&gt;
&lt;h5 id=&quot;issue-%233&quot;&gt;Issue #3 &lt;a class=&quot;direct-link&quot; href=&quot;#issue-%233&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;If a network packet or requested disk block arrives and needs to be copied into the main memory of the primary VM while an application running in the VM is also reading from the same memory, that application may or may not see the new data coming in. This non-determinism is a race condition.&lt;/p&gt;
&lt;h6 id=&quot;solution-3&quot;&gt;Solution &lt;a class=&quot;direct-link&quot; href=&quot;#solution-3&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h6&gt;
&lt;p&gt;They avoid this race by using &lt;em&gt;bounce buffers&lt;/em&gt; to temporarily hold any incoming packet or disk block in the primary VM. When the data is copied to the buffer, the primary is interrupted and the bounce buffer is copied into the primary&#39;s memory, after which it can resume its execution. This interrupt is logged as an entry in the logging channel, ensuring that both the primary and backup will execute the operations at the same time.&lt;/p&gt;
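&lt;p&gt;A hedged sketch of the bounce-buffer idea: incoming data is staged in a temporary buffer, the copy into guest memory happens at a logged interrupt, and the backup replays that log entry at the same execution point. All class and function names here are illustrative, not VMware FT's actual interfaces.&lt;/p&gt;

```python
# Toy model of a bounce buffer: the incoming packet is staged, an interrupt
# is recorded on the logging channel, and only then is the data copied into
# guest memory, so no guest read can race with the copy.

class LoggingChannel:
    def __init__(self):
        self.entries = []

    def log(self, entry):
        self.entries.append(entry)

def deliver_packet(guest_memory, addr, data, channel):
    bounce = bytes(data)                          # stage in a bounce buffer
    channel.log(("io-interrupt", addr, bounce))   # backup replays this entry
    # The copy happens while the VM is interrupted at a logged point.
    guest_memory[addr:addr + len(bounce)] = bounce

mem = bytearray(16)
chan = LoggingChannel()
deliver_packet(mem, 4, b"abcd", chan)
```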
&lt;h3 id=&quot;design-alternatives&quot;&gt;Design Alternatives &lt;a class=&quot;direct-link&quot; href=&quot;#design-alternatives&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This section compares some alternative design choices that were explored before the implementation of VMware FT and the tradeoffs they settled for.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Shared vs Non-shared Disk:&lt;/strong&gt; VMware FT uses shared storage that is accessible to both the primary and backup VMs. In an alternative design, the VMs could have separate virtual disks that they write to independently. This design could be used where shared storage is not accessible to both VMs or is too expensive. A disadvantage of the non-shared design is that extra work is needed to keep not just the VMs&#39; running states in sync, but also their disk states.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Executing Disk Reads on the Backup VM:&lt;/strong&gt; In the current implementation, the results of disk reads from the primary VM are sent to the backup via the logging channel. In an alternative design, the backup VM could just perform its disk reads directly. This approach can help to reduce the logging channel traffic when a lot of disk reads are involved in a workload. However, two main challenges with this approach are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It may slow down the execution of the backup VM. Remember that we need the backup VM to execute quickly to speed up the failover process.&lt;/li&gt;
&lt;li&gt;What if the read succeeds on the primary but fails on the backup (&lt;em&gt;and vice versa&lt;/em&gt;)?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Performance evaluation by VMware showed that executing disk reads on the backup slowed down throughput by 1-4%, but also reduced the logging bandwidth.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This was an interesting lecture for me because my knowledge of replication was limited to data replication (application-level state), and so getting exposed to a different approach has been a good experience.&lt;/p&gt;
&lt;p&gt;One limitation of this system is that it only supports uniprocessor executions, as the non-determinism involved in multi-core parallelism would make it even more challenging to implement. VMware has since extended the system described in the paper with an &lt;a href=&quot;http://www.wooditwork.com/2014/08/26/whats-new-vsphere-6-0-fault-tolerance/&quot;&gt;implementation&lt;/a&gt; that does support multi-core parallelism. I do not know the details of that system yet and can&#39;t say much about what has changed.&lt;/p&gt;
&lt;h3 id=&quot;open-question-i-have&quot;&gt;Open Question I Have &lt;a class=&quot;direct-link&quot; href=&quot;#open-question-i-have&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In the &lt;a href=&quot;#detecting-and-handling-failures&quot;&gt;section&lt;/a&gt; on detecting and handling failures, it was mentioned that an atomic test-and-set operation is used to prevent split-brain scenarios. What is unclear to me is how those operations work.&lt;/p&gt;
&lt;p&gt;According to the &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/vm-ft-faq.txt&quot;&gt;FAQ&lt;/a&gt; on the course website for this lecture, the pseudocode of such an operation looks like:&lt;/p&gt;
&lt;pre class=&quot;language-python&quot;&gt;&lt;code class=&quot;language-python&quot;&gt;test-and-set() {
    acquire_lock()
    if flag == true:
        release_lock()
        return false
    else:
        flag = true
        release_lock()
        return true
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What is unclear to me is when the flag is set to false; i.e. if the operation succeeds for a VM, does that VM set the flag to false when it is about to fail? I&#39;m assuming it does, but would like some confirmation on this.&lt;/p&gt;
&lt;h4 id=&quot;update&quot;&gt;Update &lt;a class=&quot;direct-link&quot; href=&quot;#update&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;I sent an email about the above question to the course staff and got this response from Prof. Morris on how the ATS operation works:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The flag is ordinarily false, when both primary and backup are alive.&lt;/p&gt;
&lt;p&gt;If the primary and backup stop being able to talk to each other, the one&lt;br&gt;
that is alive will use test-and-set to set the flag and acquire the&lt;br&gt;
right to &amp;quot;go live&amp;quot;. If both are alive but can&#39;t talk to each other, both&lt;br&gt;
may call test-and-set() but only one will get a &amp;quot;true&amp;quot; return value, so&lt;br&gt;
only one will go live -- thus avoiding split brain.&lt;/p&gt;
&lt;p&gt;The paper doesn&#39;t say when the flag is cleared to false. Perhaps a&lt;br&gt;
human administrator does it after restarting the failed server and&lt;br&gt;
ensuring that it&#39;s up to date. Perhaps the server that &amp;quot;went live&amp;quot;&lt;br&gt;
clears the flag after it sees that the other server is alive again&lt;br&gt;
and is up to date.&lt;/p&gt;
&lt;/blockquote&gt;
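&lt;p&gt;Prof. Morris&#39;s explanation can be made concrete with a toy simulation: if both servers race to go live, whichever one calls test-and-set first wins, and the second caller gets &lt;code&gt;false&lt;/code&gt;. The &lt;code&gt;AtomicFlag&lt;/code&gt; class below is purely illustrative; the real operation runs on the shared storage server, not in the VMs themselves.&lt;/p&gt;

```python
# Toy in-process version of the test-and-set flag from the pseudocode above:
# only the first caller observes flag == False and is allowed to "go live".
import threading

class AtomicFlag:
    def __init__(self):
        self._flag = False
        self._lock = threading.Lock()

    def test_and_set(self):
        with self._lock:
            if self._flag:
                return False      # another server already went live
            self._flag = True
            return True

flag = AtomicFlag()
# Primary and backup both try to go live after a network partition:
results = [flag.test_and_set(), flag.test_and_set()]
```

Only one of the two calls returns `True`, which is exactly the property that prevents split brain.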
&lt;h3 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-vm-ft.txt&quot;&gt;Lecture 4: Primary/Backup Replication&lt;/a&gt; - MIT 6.824 Lecture Notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://alvaro-videla.com/2013/12/failure-modes-in-distributed-systems.html&quot;&gt;Failure Modes In Distributed Systems&lt;/a&gt; by Alvaro Videla&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Last updated on 12-05-20.&lt;/em&gt;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 3 - GFS</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-3-gfs/"/>
		<updated>2020-05-02T00:03:19-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-3-gfs/</id>
		<content type="html">&lt;p&gt;The &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/gfs.pdf&quot;&gt;Google File System paper&lt;/a&gt; is relevant to this course because GFS is an example of &lt;em&gt;distributed storage&lt;/em&gt;, which is a key abstraction in building distributed systems. Many distributed systems are either distributed storage systems or systems built on top of distributed storage.&lt;/p&gt;
&lt;p&gt;Building distributed storage is a hard problem for a couple of reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;These systems are built to achieve &lt;em&gt;high performance&lt;/em&gt; when the volume of data is too large for a single machine, which leads us to &lt;em&gt;shard (split)&lt;/em&gt; the data across multiple servers.&lt;/li&gt;
&lt;li&gt;Because multiple servers are involved, there will likely be more &lt;em&gt;faults&lt;/em&gt; in the system.&lt;/li&gt;
&lt;li&gt;To improve fault tolerance, the data is usually &lt;em&gt;replicated&lt;/em&gt; across multiple machines.&lt;/li&gt;
&lt;li&gt;Replication of data leads to potential &lt;em&gt;inconsistencies&lt;/em&gt; in the data. A client could read from a stale replica, for example.&lt;/li&gt;
&lt;li&gt;The protocols for better consistency often lead to a &lt;em&gt;lower performance,&lt;/em&gt; which is the opposite of what we want.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This cycle, which leads back to performance, highlights the challenges in building distributed systems. The GFS paper touches on these topics and discusses the trade-offs that were made to yield good performance in a production-ready distributed storage system.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#paper-summary&quot;&gt;Paper Summary&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#design-overview&quot;&gt;Design Overview&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#single-master&quot;&gt;Single Master&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#chunk-size&quot;&gt;Chunk Size&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#metadata&quot;&gt;Metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#consistency-model&quot;&gt;Consistency Model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#record-appends&quot;&gt;Record Appends&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#system-interactions&quot;&gt;System Interactions&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#writes&quot;&gt;Writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#atomic-record-appends&quot;&gt;Atomic Record Appends&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#fault-tolerance&quot;&gt;Fault Tolerance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#data-integrity&quot;&gt;Data Integrity&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;paper-summary&quot;&gt;Paper Summary &lt;a class=&quot;direct-link&quot; href=&quot;#paper-summary&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;GFS was built at a time when Google needed a storage system to meet its data-processing demands. The goal was to achieve good performance while being:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Global:&lt;/em&gt; Not tailored for just one application, but available to many Google applications.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Fault Tolerant:&lt;/em&gt; Designed to account for component failures by default.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Scalable:&lt;/em&gt; It could expand to meet increasing storage needs by adding extra servers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Also note that it was tailored for a workload that largely consisted of sequential access to huge files (read or append). It was not built to optimize for low-latency requests; rather, it was meant for batch workloads which often read a file sequentially, as in &lt;a href=&quot;https://timilearning.com/posts/mit-6.824/lecture-1-mapreduce/&quot;&gt;MapReduce&lt;/a&gt; jobs.&lt;/p&gt;
&lt;h3 id=&quot;design-overview&quot;&gt;Design Overview &lt;a class=&quot;direct-link&quot; href=&quot;#design-overview&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A GFS Cluster is made up of a &lt;em&gt;single master&lt;/em&gt; and multiple &lt;em&gt;chunkservers&lt;/em&gt;, and is accessed by multiple clients, as shown in the figure below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/gfs-architecture.PNG&quot; alt=&quot;GFS Architecture&quot;&gt; &lt;/p&gt;
&lt;p&gt;Breakdown of the architecture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A client and a chunkserver may reside on the same machine, provided it has enough resources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A file stored in GFS is broken into &lt;em&gt;chunks.&lt;/em&gt; A chunk is identified by an immutable and globally unique &lt;em&gt;chunk handle.&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A chunk can be replicated across multiple chunkservers. It is configured to be replicated across three chunkservers by default.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The master keeps track of all the filesystem metadata. It knows how a file is split into chunks, and keeps track of what chunkservers hold each chunk.&lt;/p&gt;
&lt;p&gt;It is also responsible for garbage collection of orphaned chunks (when a file is deleted), and the migration of chunks between chunk servers for rebalancing load.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The master communicates with each chunkserver via &lt;em&gt;Heartbeat&lt;/em&gt; operations to pass instructions to it and collects its state.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There is a &lt;em&gt;GFS Client library&lt;/em&gt; that is linked into each application using GFS. The library handles communication with the master and chunkservers to read and write data on behalf of the application.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The chunkservers do not perform any form of caching but instead rely on the Linux buffer cache which keeps frequently accessed data in memory.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An interesting design choice made in this system is the decoupling of the data flow from the control flow. The GFS client only communicates with the master for metadata operations, but all data-bearing communications (reads and writes) go directly to the chunkservers. I&#39;ll explain how that works next.&lt;/p&gt;
&lt;h4 id=&quot;single-master&quot;&gt;Single Master &lt;a class=&quot;direct-link&quot; href=&quot;#single-master&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;As noted earlier, there is a single master in a GFS cluster which clients only interact with to retrieve metadata. This section highlights the role of the master in decoupling the data flow from the control flow.&lt;/p&gt;
&lt;p&gt;To read the data for a file:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The client first communicates with the master, sending it a request containing the file name and the &lt;em&gt;chunk index.&lt;/em&gt; The client derives the chunk index from the byte offset the application wants to read from, using the fixed chunk size.
&lt;/li&gt;
&lt;li&gt;The master replies with the corresponding chunk handle and the location of its replicas.
&lt;/li&gt;
&lt;li&gt;The client caches this information using the file name and chunk index as the key.
&lt;/li&gt;
&lt;li&gt;The client then sends a request to one of the replicas returned by the master, usually the one closest to it, specifying the chunk handle and the byte range for the requested data.
&lt;/li&gt;
&lt;li&gt;By caching the information from the master, further reads to the same chunk do not require any more client-master interactions until the cached information expires or the file is reopened.&lt;/li&gt;
&lt;/ul&gt;
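&lt;p&gt;The read path above can be sketched as follows. The class names, the &lt;code&gt;lookup&lt;/code&gt; RPC, and the data layout are stand-ins for illustration, not the real GFS client library.&lt;/p&gt;

```python
# Sketch of the GFS read path: one metadata round-trip to the master
# (cached afterwards), then data flows directly from a chunkserver.
CHUNK_SIZE = 64 * 2**20  # 64 MB

class Master:
    """Toy master holding (filename, chunk_index) -> (handle, replicas)."""
    def __init__(self, table):
        self.table = table

    def lookup(self, filename, chunk_index):
        return self.table[(filename, chunk_index)]

class Client:
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers
        self.cache = {}  # (filename, chunk_index) -> (handle, replicas)

    def read(self, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE  # derived from the byte offset
        key = (filename, chunk_index)
        if key not in self.cache:
            # Metadata round-trip to the master, cached for later reads.
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]
        # Data is read directly from a replica, not through the master.
        server = self.chunkservers[replicas[0]]
        start = offset % CHUNK_SIZE
        return server[handle][start:start + length]

master = Master({("f", 0): ("h1", ["cs1"])})
servers = {"cs1": {"h1": b"hello world"}}
client = Client(master, servers)
data = client.read("f", 6, 5)
```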
&lt;h4 id=&quot;chunk-size&quot;&gt;Chunk Size &lt;a class=&quot;direct-link&quot; href=&quot;#chunk-size&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In typical Linux filesystems, a file is split into blocks that usually range from 0.5 to 64 kilobytes in size, with the default on most filesystems being 4 kilobytes.&lt;/p&gt;
&lt;p&gt;A block size is the unit of work for the file system, which means reading or writing any files is done in multiples of that block size.&lt;/p&gt;
&lt;p&gt;In GFS, chunks are analogous to blocks, except that chunks are of a much larger size (64 MB). Having a large chunk size offers several advantages in this system:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The client will not need to interact with the master as much, since reads and writes on the same chunk will require only one initial request to the master to get the chunk location information, and more data will fit on a single chunk.
&lt;/li&gt;
&lt;li&gt;With large chunk sizes, a client is more likely to perform many operations on a given chunk, so we can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time.
&lt;/li&gt;
&lt;li&gt;It can also reduce the size of the metadata stored on the master, since the master keeps track of fewer chunks.&lt;/li&gt;
&lt;/ul&gt;
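&lt;p&gt;A quick back-of-the-envelope calculation shows why the metadata advantage matters. The 10 TB workload is an arbitrary figure chosen for illustration:&lt;/p&gt;

```python
# How many units must the master track for a 10 TB workload, at GFS's
# 64 MB chunk size versus a typical 4 KB filesystem block size?
FILE_SIZE = 10 * 2**40   # 10 TB (illustrative)
GFS_CHUNK = 64 * 2**20   # 64 MB
FS_BLOCK = 4 * 2**10     # 4 KB

chunks_gfs = FILE_SIZE // GFS_CHUNK   # 163,840 chunks
blocks_4k = FILE_SIZE // FS_BLOCK     # about 2.7 billion blocks
```

At 64 bytes of metadata per unit, that is roughly 10 MB of master state for GFS chunks versus over 160 GB for 4 KB blocks.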
&lt;p&gt;Google uses lazy space allocation to avoid wasting space due to internal fragmentation. Internal fragmentation means having unused portions of the 64 MB chunk. For example, if we allocate a 64 MB chunk and only fill up 10 MB, that&#39;s a lot of unused space.&lt;/p&gt;
&lt;p&gt;According to &lt;a href=&quot;https://stackoverflow.com/a/21738229/5430313&quot;&gt;this&lt;/a&gt; Stack Overflow answer,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Lazy space allocation means that the physical allocation of space is delayed as long as possible, until data at the size of the chunk size is accumulated.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From the rest of that answer, I think what this means is that the decision to allocate a new chunk is based &lt;em&gt;solely&lt;/em&gt; on the data available, as opposed to using another &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-6/#partitioning-of-key-value-data&quot;&gt;partitioning scheme&lt;/a&gt; to allocate data to chunks.&lt;/p&gt;
&lt;p&gt;This does not mean the chunks will always be filled up. A chunk which contains the file region for the end of a file will typically only be partially filled up.&lt;/p&gt;
&lt;h4 id=&quot;metadata&quot;&gt;Metadata &lt;a class=&quot;direct-link&quot; href=&quot;#metadata&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The master stores three types of metadata in memory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;File and chunk namespaces (i.e. directory hierarchy.)
&lt;/li&gt;
&lt;li&gt;The mapping from files to chunks.
&lt;/li&gt;
&lt;li&gt;The location of each chunk&#39;s replica.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first two types listed are also persisted on the master&#39;s local disk. The third is not persisted; instead, the master asks each chunkserver about its chunks at master startup and when a chunkserver joins the cluster.&lt;/p&gt;
&lt;p&gt;By having the chunkserver as the ultimate source of truth of each chunk&#39;s location, GFS eliminates some of the challenges of keeping the master and chunkservers in sync regularly.&lt;/p&gt;
&lt;p&gt;The master keeps an operation log, where it stores the namespace and file-to-chunk mappings on local disk. It replicates this operation log on several machines, and GFS does not make changes to the metadata visible to clients until they have been persisted on all replicas.&lt;/p&gt;
&lt;p&gt;After startup, the master can restore its file system state by replaying the operation log. It keeps this log small to minimize the startup time by periodically checkpointing it.&lt;/p&gt;
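&lt;p&gt;Recovery from the operation log can be sketched as a replay from the last checkpoint. The record formats below are invented for illustration; the real log stores namespace and chunk-mapping mutations:&lt;/p&gt;

```python
# Sketch of restoring master state: start from the latest checkpoint of
# file-to-chunk mappings, then replay the log entries recorded after it.
def restore_state(checkpoint, log):
    """Rebuild file-to-chunk mappings from a checkpoint plus a log suffix."""
    state = dict(checkpoint)  # start from the periodic checkpoint
    for op, filename, chunks in log:
        if op == "create":
            state[filename] = list(chunks)
        elif op == "append-chunk":
            state[filename].extend(chunks)
        elif op == "delete":
            state.pop(filename, None)
    return state

state = restore_state(
    {"/a": ["c1"]},                                  # checkpointed mapping
    [("append-chunk", "/a", ["c2"]),                 # replayed log entries
     ("create", "/b", ["c3"])],
)
```

Checkpointing keeps the replayed suffix short, which is what keeps master startup fast.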
&lt;h4 id=&quot;consistency-model&quot;&gt;Consistency Model &lt;a class=&quot;direct-link&quot; href=&quot;#consistency-model&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The consistency guarantee for GFS is relaxed. It does not guarantee that all the replicas of a chunk are &lt;em&gt;byte-wise identical.&lt;/em&gt; What it does guarantee is that every piece of data stored will be written &lt;em&gt;at least once&lt;/em&gt; on each replica. This means that a replica may contain duplicates, and it is up to the application to deal with such anomalies.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/gfs-consistency.PNG&quot; alt=&quot;File Region State After Mutation&quot;&gt; &lt;/p&gt;
&lt;p&gt;From Table 1 above:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A file region is &lt;em&gt;consistent&lt;/em&gt; when all the clients will see the same data for it, regardless of which replica they read from.&lt;/li&gt;
&lt;li&gt;After a file data mutation, a region is &lt;em&gt;defined&lt;/em&gt; if it is consistent and all the clients will see the effect of the mutation in its entirety.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The data mutations here may be &lt;em&gt;writes&lt;/em&gt; or &lt;em&gt;record appends.&lt;/em&gt; A write occurs when data is written at a file offset specified by the application.&lt;/p&gt;
&lt;h5 id=&quot;record-appends&quot;&gt;Record Appends &lt;a class=&quot;direct-link&quot; href=&quot;#record-appends&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Record appends cause data to be written &lt;em&gt;atomically at least once&lt;/em&gt; even in the presence of concurrent mutations, but at an offset chosen by GFS.&lt;/p&gt;
&lt;p&gt;If a record append succeeds on some replicas and fails on others, those successful appends are not rolled back. This means that if the client retries the operation, the successful replicas may have duplicates for that record.&lt;/p&gt;
&lt;p&gt;Retrying the record append at a new file offset could mean that the offset chosen for the initial failed append operation is now blank in the file regions of the failed replicas; that is, if the region has not been modified before the retry. This blank region is known as &lt;em&gt;padding&lt;/em&gt;, and the existence of padding and duplicates in replicas is what makes them inconsistent.&lt;/p&gt;
&lt;p&gt;Applications that use GFS are left with the responsibility of dealing with these inconsistent file regions. These applications can include a unique ID with each record to filter out duplicates, and use &lt;a href=&quot;https://www.howtogeek.com/363735/what-is-a-checksum-and-why-should-you-care/&quot;&gt;checksums&lt;/a&gt; to detect and discard extra padding.&lt;/p&gt;
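&lt;p&gt;The duplicate-filtering half of that strategy might look like the sketch below. The record format (an ID paired with a payload) is a made-up example of an application-level convention, not part of GFS itself:&lt;/p&gt;

```python
# Sketch of an application-level reader coping with at-least-once record
# appends: each record carries a unique ID, and IDs already seen are dropped.
def dedup_records(records):
    seen = set()
    out = []
    for record_id, payload in records:
        if record_id in seen:
            continue              # duplicate left behind by a retried append
        seen.add(record_id)
        out.append(payload)
    return out

# Record 2's append was retried, so it appears twice in the chunk:
stream = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
deduped = dedup_records(stream)
```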
&lt;p&gt;There is also the possibility of a client reading from a stale replica. Each chunk replica is given a version number that gets increased for each successful mutation. If the chunkserver hosting a chunk replica is down during a mutation, the chunk replica will become stale and will have an older version number. Stale replicas are not given to clients when they ask the master for the location of a chunk, and they are not involved in mutations either.&lt;/p&gt;
&lt;p&gt;Despite this, because a client caches the location of a chunk, it may read from a stale replica before the information is refreshed. The impact of this is low because most operations to a chunk are append-only. This means that a stale replica usually returns a premature end of chunk, rather than outdated data for a value.&lt;/p&gt;
&lt;h3 id=&quot;system-interactions&quot;&gt;System Interactions &lt;a class=&quot;direct-link&quot; href=&quot;#system-interactions&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This section describes in more detail how the client, master and chunkservers interact to implement data mutations and atomic record appends.&lt;/p&gt;
&lt;h4 id=&quot;writes&quot;&gt;Writes &lt;a class=&quot;direct-link&quot; href=&quot;#writes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When the master receives a modification operation for a particular chunk, the following happen:&lt;/p&gt;
&lt;p&gt;a) The master finds the chunkservers which hold that chunk and grants a &lt;em&gt;chunk lease&lt;/em&gt; to one of them.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The server with the lease is called the &lt;em&gt;primary,&lt;/em&gt; while the others are &lt;em&gt;secondaries.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;The primary determines the order in which mutations are applied to all the replicas.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;b) After the lease expires (typically after 60 seconds), the master is free to grant primary status to a different server for that chunk.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The master may sometimes try to revoke a lease before it expires (e.g. to prevent the mutation of a file while it is being renamed).&lt;/li&gt;
&lt;li&gt;The primary may also request an indefinite extension of the lease as long as the chunk is still being modified.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;c) The master may lose communication with a primary while the mutation is still happening. If this happens, the master can safely grant a new lease to another replica once the old lease has expired.&lt;/p&gt;
&lt;p&gt;Let&#39;s look at Figure 2 which illustrates the control flow of a write operation.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/gfs-write-control.PNG&quot; alt=&quot;Write Control and Data Flow&quot;&gt; &lt;/p&gt;
&lt;p&gt;The numbered steps below correspond to each number in the diagram.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The client asks the master which chunkserver holds the current lease for the chunk, as well as the locations of the other replicas.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;The master grants a new lease to a replica (if none exist), increases the chunk version number, and tells all replicas to do the same after the mutation has been applied. It then replies to the client. After this, &lt;em&gt;the client no longer has to talk to the master.&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;The client pushes the data to all the chunkservers, not necessarily to the primary first. The servers will initially store this data in an internal LRU buffer cache until the data is used.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;Once the client receives the acknowledgement that this data has been pushed successfully, it sends the write request to the primary chunkserver. The primary decides what serial order to apply the mutations in and applies them to the chunk.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;After applying the mutations, the primary forwards the write request and the serial number order to all the secondaries for them to apply in the same order.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol start=&quot;6&quot;&gt;
&lt;li&gt;All secondaries reply to the primary once they have completed the operation.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol start=&quot;7&quot;&gt;
&lt;li&gt;The primary replies to the client, indicating whether the operation was a success or an error. Note:
&lt;ul&gt;
&lt;li&gt;If the write succeeds at the primary but fails at any of the secondaries, we&#39;ll have an inconsistent state and an error is returned to the client.&lt;/li&gt;
&lt;li&gt;The client can retry steps 3 through 7.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
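&lt;p&gt;Steps 3 through 7 can be condensed into a sketch. All the names below are illustrative, and error handling is omitted except for the primary-succeeded/secondary-failed case noted above:&lt;/p&gt;

```python
# Condensed sketch of the write path in Figure 2: data is pushed to every
# replica first, then the primary picks a serial order that the secondaries
# follow, and any secondary failure surfaces as an error to the client.

class Replica:
    def __init__(self):
        self.cache = []    # LRU buffer cache stand-in for pushed data
        self.applied = []  # mutations applied, in serial order

    def buffer(self, data):
        self.cache.append(data)

    def apply_next(self, data):
        """Primary only: assign the next serial number and apply."""
        self.applied.append(data)
        return len(self.applied)

    def apply(self, serial, data):
        """Secondary: apply the mutation in the primary's serial order."""
        self.applied.append(data)
        return True

def write(client_data, primary, secondaries):
    # Step 3: push data to all replicas' buffer caches.
    for server in [primary] + secondaries:
        server.buffer(client_data)
    # Step 4: the primary assigns a serial number and applies the mutation.
    serial = primary.apply_next(client_data)
    # Steps 5-6: secondaries apply in the same order and reply.
    ok = all(s.apply(serial, client_data) for s in secondaries)
    # Step 7: a secondary failure leaves replicas inconsistent; the client
    # sees an error and may retry from step 3.
    return "success" if ok else "error: retry from step 3"

primary, s1, s2 = Replica(), Replica(), Replica()
result = write("record-1", primary, [s1, s2])
```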
&lt;h4 id=&quot;atomic-record-appends&quot;&gt;Atomic Record Appends &lt;a class=&quot;direct-link&quot; href=&quot;#atomic-record-appends&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The system interactions for record appends are largely the same as discussed for writes, with the following exceptions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In step 4, the primary first checks whether appending the record to the current chunk would exceed the maximum size of 64MB. If so, the primary pads the chunk, notifies the secondaries to do the same, and then tells the client to retry the request on the next chunk.
&lt;/li&gt;
&lt;li&gt;If the record append fails on any of the replicas, the client must retry the operation. As discussed in the Consistency section, this means that replicas of the same chunk may contain duplicates.
&lt;/li&gt;
&lt;li&gt;A record append is successful only when the data has been written &lt;em&gt;at the same offset&lt;/em&gt; on all the replicas of a chunk.&lt;/li&gt;
&lt;/ul&gt;
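&lt;p&gt;The primary&#39;s boundary check in step 4 can be sketched as a small function. The names and return convention are illustrative:&lt;/p&gt;

```python
# Sketch of the primary's boundary check for a record append: if the record
# would overflow the 64 MB chunk, the remainder is padded and the client is
# told to retry on the next chunk.
CHUNK_SIZE = 64 * 2**20

def try_append(chunk_used, record_len):
    """Return (action, new_bytes_used) for an append on the current chunk."""
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the remainder; the client must retry on a fresh chunk.
        return ("pad-and-retry-next-chunk", CHUNK_SIZE)
    return ("appended", chunk_used + record_len)
```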
&lt;h3 id=&quot;fault-tolerance&quot;&gt;Fault Tolerance &lt;a class=&quot;direct-link&quot; href=&quot;#fault-tolerance&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Fault Tolerance is achieved in GFS by implementing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Fast Recovery -&lt;/em&gt; The master and the chunkservers are designed to restore their state and start in a matter of seconds.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Chunk Replication:&lt;/em&gt; Each chunk is replicated on multiple chunkservers on different racks. This ensures that some replicas are still available even if a rack is destroyed. The master is able to clone existing replicas as needed when chunkservers go offline or a replica is detected as stale or corrupted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Master Replication:&lt;/em&gt; The master is also replicated for reliability. A state mutation is considered committed only when the operation log has been flushed to disk on all master replicas. When the master fails, it can restart almost immediately.&lt;/p&gt;
&lt;p&gt;In addition, there are &lt;em&gt;shadow masters&lt;/em&gt; which provide read-only access to the filesystem when the primary master is down. There may be a lag in replicating data from the primary master to its shadows, but these shadow masters help to improve availability.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;data-integrity&quot;&gt;Data Integrity &lt;a class=&quot;direct-link&quot; href=&quot;#data-integrity&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Checksumming is used by each chunkserver to detect the corruption of stored data.&lt;/p&gt;
&lt;p&gt;From the course website &lt;sup&gt;[1]&lt;/sup&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A checksum algorithm takes a block of bytes as input and returns a single number that&#39;s a function of all the input bytes. For example, a simple checksum might be the sum of all the bytes in the input (mod some big number). GFS stores the checksum of each chunk as well as the chunk.&lt;/p&gt;
&lt;p&gt;When a chunkserver writes a chunk on its disk, it first computes the checksum of the new chunk, and saves the checksum on disk as well as the chunk. When a chunkserver reads a chunk from disk, it also reads the previously-saved checksum, re-computes a checksum from the chunk read from disk, and checks that the two checksums match.&lt;/p&gt;
&lt;p&gt;If the data was corrupted by the disk, the checksums won&#39;t match, and the chunkserver will know to return an error. Separately, some GFS applications stored their own checksums, over application-defined records, inside GFS files, to distinguish between correct records and padding. CRC32 is an example of a checksum algorithm.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;[1] &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/gfs-faq.txt&quot;&gt; GFS FAQ&lt;/a&gt; - Lecture Notes from MIT 6.824&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This week&#39;s material introduced some interesting ideas in the design of a distributed storage system. These include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The decoupling of data flow from control flow.&lt;/li&gt;
&lt;li&gt;Using large chunk sizes to reduce overhead.&lt;/li&gt;
&lt;li&gt;The sequencing of writes through a primary replica.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, having a single master eventually became less than ideal for Google&#39;s use case. As the number of stored files grew, it became harder to fit all the metadata for those files in the master&#39;s memory. In addition, the number of clients also grew, putting too much CPU load on the single master.&lt;/p&gt;
&lt;p&gt;Another challenge with GFS at Google was that the weak consistency model meant applications had to be designed to cope with those limitations. These limitations led to the creation of &lt;a href=&quot;https://www.systutorials.com/colossus-successor-to-google-file-system-gfs/&quot;&gt;Colossus&lt;/a&gt; as a successor to GFS.&lt;/p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-gfs.txt&quot;&gt;Lecture 3: GFS&lt;/a&gt; - MIT 6.824 Lecture Notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://cs.stanford.edu/~matei/courses/2015/6.S897/slides/gfs.pdf&quot;&gt;The Google File System&lt;/a&gt; by Firas Abuzaid (Stanford Lecture Notes)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www0.cs.ucl.ac.uk/staff/B.Karp/gz03/f2014/lectures/gz03-lecture9-GFS.pdf&quot;&gt;GFS: The Google File System&lt;/a&gt; by Brad Karp (UCL Lecture Notes)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://queue.acm.org/detail.cfm?id=1594206&quot;&gt;GFS: Evolution on Fast-forward&lt;/a&gt; - 2009 Interview with a Google engineer about the origin and evolution of the Google File System.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 2 - RPC and Threads</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-2-rpc-and-threads/"/>
		<updated>2020-04-25T11:49:39-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-2-rpc-and-threads/</id>
		<content type="html">&lt;p&gt;This course is based on the &lt;a href=&quot;https://golang.org/&quot;&gt;Go programming language&lt;/a&gt;, and this post will introduce some features in Go that make it well suited for building concurrent and distributed applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#threads&quot;&gt;Threads&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#why-use-threads%3F&quot;&gt;Why use threads?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#what-if-we-can%27t-have-multiple-threads%3F&quot;&gt;What if we can&#39;t have multiple threads?&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#downsides-of-event-driven-programming&quot;&gt;Downsides of Event-Driven Programming&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#threading-challenges&quot;&gt;Threading Challenges&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#remote-procedure-call-%28rpc%29&quot;&gt;Remote Procedure Call (RPC)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#dealing-with-failures&quot;&gt;Dealing with failures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#rpc-semantics&quot;&gt;RPC Semantics&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#go-rpc&quot;&gt;Go RPC&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;threads&quot;&gt;Threads &lt;a class=&quot;direct-link&quot; href=&quot;#threads&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Threads are the unit of execution on a processor. When a program is run on your computer, that starts up a &lt;em&gt;process.&lt;/em&gt; That process can then be made up of one or more threads which execute different tasks. Some key things to note about threads are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All the threads in a process share memory. They all have access to global variables.&lt;/li&gt;
&lt;li&gt;Each thread keeps its own stack, program counter, and registers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;why-use-threads%3F&quot;&gt;Why use threads? &lt;a class=&quot;direct-link&quot; href=&quot;#why-use-threads%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Threads enable &lt;em&gt;concurrency&lt;/em&gt;, which is important in distributed systems. Concurrency allows us to schedule multiple tasks on a single processor. These tasks are run in an interleaved manner and essentially share CPU time between themselves. For example: with I/O concurrency, instead of waiting for an I/O operation to complete before continuing execution (thereby rendering the CPU idle), threads allow us to perform other tasks while we wait.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Parallelism&lt;/em&gt;: We can perform multiple tasks in parallel on several cores. Unlike with just concurrency, where only one task is making progress at a time (depending on which has its share of CPU time at that instant), parallelism allows multiple tasks to make progress at the same time since they are executing on different CPU cores.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Convenience&lt;/em&gt;: Threads provide a convenient way to execute short-lived tasks in the background e.g. a master node continuously polling a worker to check if it&#39;s alive.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Go has &lt;a href=&quot;https://tour.golang.org/concurrency/1&quot;&gt;&lt;em&gt;Goroutines&lt;/em&gt;&lt;/a&gt;, which are lightweight threads for managing concurrency.&lt;/p&gt;
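&lt;p&gt;A minimal sketch of goroutines in action: each task runs concurrently instead of sequentially, and a &lt;code&gt;sync.WaitGroup&lt;/code&gt; blocks until all of them finish. &lt;code&gt;FetchAll&lt;/code&gt; and its tasks are made up for illustration:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// FetchAll runs one goroutine per task instead of handling them one
// after another; a WaitGroup blocks until every goroutine is done.
func FetchAll(tasks []string) []string {
	results := make([]string, len(tasks))
	var wg sync.WaitGroup
	for i, t := range tasks {
		wg.Add(1)
		go func(i int, t string) { // a goroutine: a lightweight thread
			defer wg.Done()
			results[i] = "done: " + t // stand-in for a slow I/O call
		}(i, t)
	}
	wg.Wait() // wait for all goroutines to finish
	return results
}

func main() {
	fmt.Println(FetchAll([]string{"a", "b", "c"}))
}
```

Each goroutine writes to its own slot in the slice, so no synchronization beyond the WaitGroup is needed here.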
&lt;h4 id=&quot;what-if-we-can&#39;t-have-multiple-threads%3F&quot;&gt;What if we can&#39;t have multiple threads? &lt;a class=&quot;direct-link&quot; href=&quot;#what-if-we-can&#39;t-have-multiple-threads%3F&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;There&#39;s a concept of &lt;em&gt;Event-Driven Programming&lt;/em&gt; where a process has only a single thread, which listens for events and executes user-specified functions when those events occur. This concept is used by &lt;a href=&quot;https://nodesource.com/blog/understanding-the-nodejs-event-loop/&quot;&gt;Node.js&lt;/a&gt; and the thread is known as the &lt;em&gt;event loop.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The key thing here is that although the application appears to run on a single thread from the programmer&#39;s perspective, the runtime internally uses multiple threads to handle tasks. The main difference is that the programmer does not have to deal with these internal threads and the challenges of coordination between them. All the programmer has to do is specify &lt;em&gt;callback&lt;/em&gt; functions to be executed on the main thread when those background tasks have completed.&lt;/p&gt;
&lt;p&gt;When the single thread receives an event (like a button click or a task completion), it pauses its current task, executes the callback function for the event, and then returns to the paused job.&lt;/p&gt;
&lt;h5 id=&quot;downsides-of-event-driven-programming&quot;&gt;Downsides of Event-Driven Programming &lt;a class=&quot;direct-link&quot; href=&quot;#downsides-of-event-driven-programming&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;You will need additional coordination between processes to gain the benefits of parallelism on a multi-core system. With Node.js, you can fire up child processes to be run on each CPU, but you need to handle coordination between those processes.&lt;/li&gt;
&lt;li&gt;It&#39;s harder to implement this pattern &lt;em&gt;(though this is subjective, of course)&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;threading-challenges&quot;&gt;Threading Challenges &lt;a class=&quot;direct-link&quot; href=&quot;#threading-challenges&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Deadlocks&lt;/em&gt;: These happen when two or more threads are waiting on each other in such a way that neither can progress.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Accessing shared data&lt;/em&gt;: What happens if two threads do n = n + 1 at the same time? Or one thread reads a value while another one increments it? This is known as a race condition. Using Go&#39;s &lt;a href=&quot;https://tour.golang.org/concurrency/9&quot;&gt;sync.Mutex&lt;/a&gt; to add locks around the shared data is one way to solve this problem. An alternative is to avoid sharing mutable data altogether. Go also has a built-in &lt;a href=&quot;https://golang.org/doc/articles/race_detector.html&quot;&gt;data race detector&lt;/a&gt; for finding these bugs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Coordination between threads&lt;/em&gt;: If one thread is producing data while another is consuming that data, it raises questions like &amp;quot;How can the consumer wait for data to be produced, and release the CPU while waiting?&amp;quot; and &amp;quot;How can the producer then wake up the consumer?&amp;quot;&lt;/p&gt;
&lt;p&gt;Go has &lt;a href=&quot;https://tour.golang.org/concurrency/2&quot;&gt;channels&lt;/a&gt; and &lt;a href=&quot;https://gobyexample.com/waitgroups&quot;&gt;WaitGroups&lt;/a&gt; for coordinating communication between threads.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;remote-procedure-call-(rpc)&quot;&gt;Remote Procedure Call (RPC) &lt;a class=&quot;direct-link&quot; href=&quot;#remote-procedure-call-(rpc)&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;RPC is a means of client/server communication between processes on the same machine or different machines. Here, the client executes a procedure (function/method) on a remote service as if it were a local procedure call.&lt;/p&gt;
&lt;p&gt;The steps that take place during an RPC are as follows &lt;sup&gt;[1]&lt;/sup&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A client invokes a &lt;em&gt;client&lt;/em&gt; &lt;em&gt;stub&lt;/em&gt; procedure, passing parameters in the usual way. The client stub resides within the client&#39;s own address space.&lt;/li&gt;
&lt;li&gt;The client stub &lt;em&gt;marshalls&lt;/em&gt; the parameters into a message. Marshalling includes converting the representation of the parameters into a standard format, and copying each parameter into the message.&lt;/li&gt;
&lt;li&gt;The client stub passes the message to the transport layer, which sends it to the remote server machine.&lt;/li&gt;
&lt;li&gt;On the server, the transport layer passes the message to a &lt;em&gt;server&lt;/em&gt; &lt;em&gt;stub&lt;/em&gt;, which &lt;em&gt;demarshalls&lt;/em&gt; the parameters and calls the desired server routine using the regular procedure call mechanism.&lt;/li&gt;
&lt;li&gt;When the server procedure completes, it returns to the server stub (e.g., via a normal procedure call return), which marshalls the return values into a message. The server stub then hands the message to the transport layer.&lt;/li&gt;
&lt;li&gt;The transport layer sends the result message back to the client transport layer, which hands the message back to the client stub.&lt;/li&gt;
&lt;li&gt;The client stub demarshalls the return parameters and execution returns to the caller.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The main benefit of this is that it simplifies the process of writing distributed applications since RPC hides all the network code into stub functions. Programmers don&#39;t have to worry about details like data conversion and parsing, and opening and closing a connection.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The client knows what server to talk to through &lt;em&gt;binding.&lt;/em&gt; Go has an RPC library to ease this communication between processes. In Go&#39;s RPC library, the server name and port are passed as arguments to a method when setting up the connection.&lt;/p&gt;
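&lt;p&gt;Here is a small end-to-end sketch with Go&#39;s &lt;code&gt;net/rpc&lt;/code&gt; package, where the server address and port are passed to &lt;code&gt;rpc.Dial&lt;/code&gt; when setting up the connection. The &lt;code&gt;Arith&lt;/code&gt; service is made up for illustration:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Args holds the parameters marshalled into the request message.
type Args struct{ A, B int }

// Arith is a service whose exported methods can be called remotely.
type Arith struct{}

// Multiply follows net/rpc's required signature: (args, reply) error.
func (a *Arith) Multiply(args *Args, reply *int) error {
	*reply = args.A * args.B
	return nil
}

func main() {
	srv := rpc.NewServer()
	srv.Register(new(Arith))

	// Listen on a free local port; the client "binds" to this address.
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go srv.Accept(l)

	// The server address and port are passed when setting up the connection.
	client, err := rpc.Dial("tcp", l.Addr().String())
	if err != nil {
		panic(err)
	}
	defer client.Close()

	var product int
	// Call looks like a local procedure call but runs on the server.
	if err := client.Call("Arith.Multiply", &Args{A: 6, B: 7}, &product); err != nil {
		panic(err)
	}
	fmt.Println(product) // 42
}
```

Note that the client and server run in one process here only for convenience; the same code works across machines.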
&lt;h4 id=&quot;dealing-with-failures&quot;&gt;Dealing with failures &lt;a class=&quot;direct-link&quot; href=&quot;#dealing-with-failures&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;From the perspective of a client, failure means sending a request to the server and not getting a response back within a particular timeout. This can be caused by a number of things, including lost packets, a slow server, a crashed server, or a broken network.&lt;/p&gt;
&lt;p&gt;Dealing with this is tricky because the client would not know the actual status of its request. Possible scenarios are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The server never saw the request.&lt;/li&gt;
&lt;li&gt;The server executed the request but crashed just before sending a reply.&lt;/li&gt;
&lt;li&gt;The server executed a request and sent the reply, but the network died before delivering the reply.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The simplest way to deal with a failure would be to just retransmit the request; however, if the server had already executed the request, resending it could mean the server executes the same request twice, which could lead to unwanted side effects. This failure handling method works well for &lt;em&gt;idempotent requests&lt;/em&gt; i.e. operations that have the same effect when executed multiple times as if they were executed once. Many operations are not idempotent, and so we need a more general approach to handle failures.&lt;/p&gt;
&lt;h4 id=&quot;rpc-semantics&quot;&gt;RPC Semantics &lt;a class=&quot;direct-link&quot; href=&quot;#rpc-semantics&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;An RPC implementation can use any of the following semantics for making requests:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;At-Most-Once:&lt;/strong&gt; At-most-once semantics ensure that the client will not automatically retry a request; resending a request is opt-in for the client. Therefore, without an explicit retry mechanism for a failed request, a request may be lost and never executed. If the request is retried, the server is responsible for detecting duplicate requests and ensuring that only one succeeds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;At-Least-Once:&lt;/strong&gt; Here, a request may be executed one or more times and may not be lost. The client will keep retrying the request until it receives a positive acknowledgement that the request has been executed. This is appropriate for requests with no side effects (like read-only requests) and idempotent operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exactly-Once:&lt;/strong&gt; In this mode, requests can neither be duplicated nor lost. This is harder to achieve and the least fault tolerant because it requires that a response &lt;em&gt;must&lt;/em&gt; be received from the server, and there can be no duplicates. If we have multiple servers and the one handling the initial request crashes, the other servers may not be able to tell whether the request was executed or not by the initial server, and it becomes a challenge agreeing on a decision for that.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;go-rpc&quot;&gt;Go RPC &lt;a class=&quot;direct-link&quot; href=&quot;#go-rpc&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Go RPC guarantees at-most-once semantics. If it doesn&#39;t get a reply, it will just return an error. The client can opt to retry a failed request, but it is up to the server to handle duplicate requests to maintain the at-most-once guarantee; what if the request actually executed but the reply got lost?&lt;/p&gt;
&lt;p&gt;Some complexities related to at-most-once communication between processes are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;How do we guarantee that the ID of a request is unique between multiple clients?&lt;/em&gt; One way to do this is by generating a request ID which combines the unique client ID with a sequence number.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;For detecting duplicates, how long should each request ID be kept for?&lt;/em&gt; We cannot keep all the request IDs indefinitely, so they have to get discarded at a point. A method for handling this could be for each client to include an extra identifier with each request. Let&#39;s call it X. The extra identifier will tell the server that it is safe to delete all request IDs that came before X.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;How do we handle duplicate requests while the original request is still executing?&lt;/em&gt; We could have a &amp;quot;pending&amp;quot; flag next to each executing RPC and wait for it to complete, or simply ignore the new request.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;What if the server crashes and restarts with the duplicate info being kept in memory?&lt;/em&gt; The server could write duplicate info to disk. The server could also replicate information about duplicates across multiple machines.&lt;/li&gt;
&lt;/ul&gt;
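&lt;p&gt;The first two ideas above can be sketched as follows: request IDs combine a client ID with a sequence number, and the server remembers the reply for each executed ID so that a retried duplicate is answered without re-executing. All names here are illustrative, not Go RPC&#39;s internals:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// ReqID combines a unique client ID with a per-client sequence
// number, so request IDs are unique across clients.
type ReqID struct {
	Client string
	Seq    int
}

// Server remembers the reply for each executed request, so a duplicate
// (a retry whose original reply was lost) is answered from the table
// instead of being executed twice: at-most-once semantics.
type Server struct {
	mu       sync.Mutex
	executed map[ReqID]string
	count    int // how many times the operation actually ran
}

func NewServer() *Server {
	return &Server{executed: make(map[ReqID]string)}
}

func (s *Server) Handle(id ReqID, op string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if reply, ok := s.executed[id]; ok {
		return reply // duplicate: return the saved reply, don't re-execute
	}
	s.count++
	reply := "ok: " + op
	s.executed[id] = reply
	return reply
}

func main() {
	s := NewServer()
	id := ReqID{Client: "c1", Seq: 1}
	fmt.Println(s.Handle(id, "append"))
	fmt.Println(s.Handle(id, "append")) // a retried request: same reply
	fmt.Println(s.count)                // the operation executed only once
}
```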
&lt;p&gt;[1] &lt;a href=&quot;https://web.cs.wpi.edu/~cs4514/b98/week8-rpc/week8-rpc.html&quot;&gt;Remote Procedure Call (RPC)&lt;/a&gt; - Lecture notes from Worcester Polytechnic Institute&lt;/p&gt;
&lt;h3 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pdos.csail.mit.edu/6.824/notes/l-rpc.txt&quot;&gt;Lecture 2: Infrastructure - RPC and Thread&lt;/a&gt; - MIT 6.824 Lecture Notes.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://people.cs.rutgers.edu/~pxk/417/notes/rpc.html&quot;&gt;Remote Procedure Calls&lt;/a&gt; by Paul Krzyzanowski&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>MIT 6.824: Lecture 1 - MapReduce</title>
		<link href="https://timilearning.com/posts/mit-6.824/lecture-1-mapreduce/"/>
		<updated>2020-04-13T15:17:18-00:00</updated>
		<id>https://timilearning.com/posts/mit-6.824/lecture-1-mapreduce/</id>
		<content type="html">&lt;h4 id=&quot;background&quot;&gt;Background &lt;a class=&quot;direct-link&quot; href=&quot;#background&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;I started a study group with some of my friends where we&#39;ll be going through &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/schedule.html&quot;&gt;this course&lt;/a&gt;. Over the next couple of weeks, I intend to upload my notes from studying each week&#39;s material.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&quot;mapreduce&quot;&gt;MapReduce &lt;a class=&quot;direct-link&quot; href=&quot;#mapreduce&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This week&#39;s material focused on the MapReduce paradigm for data processing. The material included the &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/papers/mapreduce.pdf&quot;&gt;seminal MapReduce paper&lt;/a&gt; by Jeff Dean and Sanjay Ghemawat, and an accompanying video lecture. Below are my notes from the materials and group discussion.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;MapReduce is a system for parallelizing the computation of a large volume of data across multiple machines in a cluster. It achieves this by exposing a simple API for expressing these computations using two operations: Map and Reduce.&lt;/p&gt;
&lt;p&gt;The Map task takes an input file and outputs a set of intermediate (key, value) pairs. The intermediate values with the same key are then grouped together and processed in the Reduce task for each distinct key.&lt;/p&gt;
&lt;p&gt;Some examples of programs that can be expressed as MapReduce computations are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Word Count in Documents:&lt;/strong&gt; Here, the map function can emit a (key, value) pair for each occurrence of a word like (word, count). The reduce function can then add all the counts for the same word and emit a (word, total count) pair.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed Grep:&lt;/strong&gt; Grep is a regular expression search for a given pattern in a text document. To search across a large volume of documents, we could define a map function which emits a line if it matches the supplied pattern like (pattern, line). The reduce function then outputs all the lines from the intermediate values for the given key.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed Sort:&lt;/strong&gt; We can have a map function which extracts the key from each record and emits a (key, record) pair. Depending on the partitioning and ordering scheme, we can then have a reduce function that emits all the pairs unchanged. We&#39;ll go into more detail on the ordering scheme later on.&lt;/li&gt;
&lt;/ol&gt;
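&lt;p&gt;The word count example can be expressed as a pair of map and reduce functions. The sketch below runs both phases in one process purely for illustration; a real MapReduce run distributes them across machines:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// KV is an intermediate (key, value) pair emitted by Map.
type KV struct {
	Key, Value string
}

// Map emits a ("word", "1") pair for each word in the input split.
func Map(contents string) []KV {
	var out []KV
	for _, w := range strings.Fields(contents) {
		out = append(out, KV{Key: w, Value: "1"})
	}
	return out
}

// Reduce receives all the values for one key and emits the total count.
func Reduce(key string, values []string) string {
	return fmt.Sprint(len(values))
}

// WordCount glues the two together: group intermediate pairs by key,
// then run Reduce once per distinct key.
func WordCount(doc string) map[string]string {
	grouped := make(map[string][]string)
	for _, kv := range Map(doc) {
		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
	}
	out := make(map[string]string)
	for key, values := range grouped {
		out[key] = Reduce(key, values)
	}
	return out
}

func main() {
	counts := WordCount("the quick the lazy the")
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, counts[k]) // lazy 1, quick 1, the 3
	}
}
```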
&lt;h3 id=&quot;implementation-details&quot;&gt;Implementation Details &lt;a class=&quot;direct-link&quot; href=&quot;#implementation-details&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The MapReduce interface can be implemented in many ways, so this section just details the implementation specific to Google at the time of writing this paper.&lt;/p&gt;
&lt;p&gt;The Map function invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The Reduce invocations are split into R pieces based on a partitioning function defined on the intermediate key.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/mapreduce.png&quot; alt=&quot;A sample MapReduce Job&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; A sample MapReduce Job &lt;/p&gt;
&lt;p&gt;The flow of execution when the MapReduce function is called by a user is as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The MapReduce library splits the input into M files and starts up multiple instances of the program on many machines in the cluster.&lt;/li&gt;
&lt;li&gt;One of the instances is the Master and the others are Workers. The master assigns map tasks to the available workers.&lt;/li&gt;
&lt;li&gt;There are M map tasks and R reduce tasks to be assigned to workers.&lt;/li&gt;
&lt;li&gt;When a worker is assigned a map task, it reads the input from its corresponding split, performs the specified operations, and then emits intermediate (key, value) pairs which are buffered in memory.&lt;/li&gt;
&lt;li&gt;These buffered pairs are periodically written to local disk, partitioned into R regions according to the &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-6/#partitioning-of-key-value-data&quot;&gt;partitioning scheme&lt;/a&gt;. The locations of these buffered pairs on local disk are passed to the master, which then forwards these locations to the reduce workers.&lt;/li&gt;
&lt;li&gt;A reduce worker uses &lt;a href=&quot;https://en.wikipedia.org/wiki/Remote_procedure_call&quot;&gt;remote procedure calls&lt;/a&gt; to read the buffered data from local disk based on the locations forwarded by the master. When all intermediate data has been read, it sorts the values according to the key and groups all occurrences of the same key together.&lt;/li&gt;
&lt;li&gt;For each distinct intermediate key, the reduce worker passes its grouped values to the reduce function defined. The output of this reduce function is appended to a final output file for the partition.&lt;/li&gt;
&lt;li&gt;The master wakes up the user program and releases control when all the map and reduce tasks have been completed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The output of a MapReduce job is a set of R files (one per reduce task).&lt;/p&gt;
&lt;h3 id=&quot;dealing-with-faults&quot;&gt;Dealing with Faults &lt;a class=&quot;direct-link&quot; href=&quot;#dealing-with-faults&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Worker Failure:&lt;/strong&gt; The master pings all its workers periodically. If no response is received from a worker after a set period of time, the worker is marked as failed. The tasks assigned to the failed worker are then reassigned to other workers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Master Failure:&lt;/strong&gt; As master failures were rare, their implementation simply aborted the whole execution if the master failed, leaving the execution to be retried from the start. An alternative implementation could have made the master periodically checkpoint its state, so that a retry could pick up from where the execution left off.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;dealing-with-network-resource-scarcity&quot;&gt;Dealing with Network Resource Scarcity &lt;a class=&quot;direct-link&quot; href=&quot;#dealing-with-network-resource-scarcity&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Their implementation at the time of writing the paper used locality as a means of conserving network bandwidth: input files were kept close to where they would be processed, avoiding the network cost of transferring these large files. The master&#39;s scheduling algorithm took file location into account when deciding which workers should process which input files.&lt;/p&gt;
&lt;p&gt;Note: the lecture video for the week explained that as Google&#39;s networking infrastructure was expanded and upgraded in later years, they relied less on this locality optimization.&lt;/p&gt;
&lt;h3 id=&quot;dealing-with-stragglers&quot;&gt;Dealing with Stragglers &lt;a class=&quot;direct-link&quot; href=&quot;#dealing-with-stragglers&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Stragglers are machines that take longer than usual to complete one of the last few map or reduce tasks. They addressed this by having the master schedule &lt;em&gt;backup tasks&lt;/em&gt; when the computation is almost complete. A task is then marked as completed when either the primary or the backup execution completes.&lt;/p&gt;
&lt;h3 id=&quot;some-other-interesting-features&quot;&gt;Some Other Interesting Features &lt;a class=&quot;direct-link&quot; href=&quot;#some-other-interesting-features&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Combiner Function:&lt;/strong&gt; A user can specify a combiner function, which groups all the values for a key on a machine that performs a map task. The combiner function typically has the same code as the reduce function. The advantage of enabling this is that it can reduce the volume of the data being sent over the network to a reduce worker. &lt;em&gt;I wonder why this isn&#39;t enabled by default?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skipping Bad Records:&lt;/strong&gt; Instead of the entire execution failing because of few bad records, there&#39;s a mechanism in place for bad records to be skipped. This is acceptable in some instances e.g. when doing statistical analysis on a large dataset.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ordering Guarantees:&lt;/strong&gt; The guarantee is that within a given reduce partition, the intermediate key value pairs are processed in increasing order of the keys. I was curious about where/how the sorting happens and it looks like in Hadoop MapReduce, data from the mappers are first sorted according to key in the mapper worker using Quicksort. This sorted data is then &lt;em&gt;shuffled&lt;/em&gt; to the reduce workers. The reduce worker merges the sorted data from different mappers into one sorted set, similar to the merge routine in Mergesort.&lt;/li&gt;
&lt;/ul&gt;
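&lt;p&gt;The reduce-side merge of sorted mapper outputs described above works like the merge routine in Mergesort. A small sketch (illustrative, not Hadoop&#39;s actual code):&lt;/p&gt;

```go
package main

import "fmt"

// MergeRuns merges already-sorted runs (one per mapper) into a single
// sorted stream, like the merge routine in Mergesort.
func MergeRuns(runs [][]string) []string {
	idx := make([]int, len(runs)) // a cursor into each run
	var out []string
	for {
		best := -1 // index of the run holding the smallest next key
		for i, run := range runs {
			if idx[i] >= len(run) {
				continue // this run is exhausted
			}
			if best == -1 || run[idx[i]] < runs[best][idx[best]] {
				best = i
			}
		}
		if best == -1 {
			return out // every run is exhausted
		}
		out = append(out, runs[best][idx[best]])
		idx[best]++
	}
}

func main() {
	// Two mappers each produced a sorted run of keys.
	fmt.Println(MergeRuns([][]string{
		{"apple", "cherry"},
		{"banana", "date"},
	}))
}
```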
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Though no longer in use at Google for a number of &lt;a href=&quot;https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/&quot;&gt;reasons&lt;/a&gt;, MapReduce fundamentally changed the way large-scale data processing architectures are built. It abstracted the complexity of dealing with parallelism, fault-tolerance and load balancing by exposing a simple API that allowed programmers without experience with these systems to distribute the processing of large datasets across a cluster of computers.&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Consistency Models</title>
		<link href="https://timilearning.com/posts/consistency-models/"/>
		<updated>2020-04-04T20:59:17-00:00</updated>
		<id>https://timilearning.com/posts/consistency-models/</id>
		<content type="html">&lt;h3 id=&quot;background&quot;&gt;Background &lt;a class=&quot;direct-link&quot; href=&quot;#background&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I was reading the documentation for &lt;a href=&quot;https://cloud.google.com/spanner/docs/true-time-external-consistency#does_cloud_spanner_provide_linearizability&quot;&gt;Google&#39;s Cloud Spanner&lt;/a&gt; recently, and came across the claim that the consistency level guaranteed by the database is &lt;em&gt;External Consistency.&lt;/em&gt; The documentation then went on to state that External Consistency is a stronger guarantee than &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-1/#linearizability&quot;&gt;Linearizability&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I found this confusing because I had come across resources that suggested that these terms referred to the same thing. I brought my confusion about these terms to a colleague, who kindly pointed me to one of the earliest sources of these terms to help clarify it.&lt;/p&gt;
&lt;p&gt;This post contains my notes from studying section 3.1 of &lt;a href=&quot;https://usermanual.wiki/Document/CSL818InformationStorageinaDecentralizedComputerSystem.4233578342&quot;&gt;&#39;Information Storage in a Decentralized Computer System&#39;&lt;/a&gt; by David K Gifford.&lt;/p&gt;
&lt;p&gt;Note that only a small subset of consistency models for concurrent systems are covered here. For a more detailed treatment of this topic, you&#39;ll find &lt;a href=&quot;https://jepsen.io/consistency&quot;&gt;this&lt;/a&gt; article useful.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&quot;serial-consistency&quot;&gt;Serial Consistency &lt;a class=&quot;direct-link&quot; href=&quot;#serial-consistency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Serial Consistency is achieved when a database makes it appear to a transaction that there is no other concurrently executing transaction.&lt;/p&gt;
&lt;p&gt;So if we have two transactions:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://timilearning.com/uploads/t1-t2.PNG&quot; alt=&quot;Sample Transactions T1 and T2 where T1 has steps a11, a12, a13 and T2 has a21, a22 and a23&quot;&gt;
&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; Sample Transactions T&lt;sub&gt;1&lt;/sub&gt; and T&lt;sub&gt;2&lt;/sub&gt; &lt;/p&gt;
&lt;p&gt;The order in which the actions of each transaction are processed is called a &lt;strong&gt;&lt;em&gt;schedule&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;serial schedule&lt;/em&gt; results when the transactions are executed one at a time to completion. For the example above, there are two serial schedules:&lt;/p&gt;
&lt;p&gt;Sa: {a11, a12, a13, a21, a22, a23}&lt;/p&gt;
&lt;p&gt;Sb: {a21, a22, a23, a11, a12, a13}&lt;/p&gt;
&lt;p&gt;A database provides serial consistency if it guarantees that the schedule to process a set of transactions is one of the possible serial schedules.&lt;/p&gt;
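&lt;p&gt;To make this concrete, here is a small Python sketch (illustrative only; the transactions and action names come from the example above). It enumerates the serial schedules for a set of transactions and checks whether a given schedule is one of them:&lt;/p&gt;

```python
from itertools import permutations

# The transactions from the example above, each a list of actions.
T1 = ["a11", "a12", "a13"]
T2 = ["a21", "a22", "a23"]

def serial_schedules(*transactions):
    """Every schedule formed by running whole transactions back to back."""
    result = []
    for order in permutations(transactions):
        schedule = [action for txn in order for action in txn]
        result.append(schedule)
    return result

def is_serially_consistent(schedule, *transactions):
    """A schedule is acceptable if it matches some serial schedule."""
    return schedule in serial_schedules(*transactions)

# Sa and Sb from the text are both serial schedules:
assert is_serially_consistent(T1 + T2, T1, T2)   # Sa
assert is_serially_consistent(T2 + T1, T1, T2)   # Sb
# An interleaved schedule is not serial:
assert not is_serially_consistent(["a11", "a21", "a12", "a22", "a13", "a23"], T1, T2)
```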
&lt;h3 id=&quot;external-consistency-and-linearizability&quot;&gt;External Consistency and Linearizability &lt;a class=&quot;direct-link&quot; href=&quot;#external-consistency-and-linearizability&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;An &lt;em&gt;external schedule&lt;/em&gt; is the &lt;em&gt;unique&lt;/em&gt; serial schedule defined by the actual time order in which transactions complete. Any system that guarantees the schedule it uses to process a set of transactions is equivalent to its external schedule is said to provide &lt;strong&gt;&lt;em&gt;external consistency&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Using the example above, if T&lt;sub&gt;1&lt;/sub&gt; completes before T&lt;sub&gt;2&lt;/sub&gt; starts to commit, then external consistency guarantees that the system will appear as if schedule Sa was used.  This means that no clients observing the database will see a state that contains the effects of T&lt;sub&gt;2&lt;/sub&gt; but not T&lt;sub&gt;1&lt;/sub&gt;.&lt;/p&gt;
&lt;p&gt;Remember that these consistency models should be thought of in the context of a distributed database, meaning that these transactions can execute on different nodes, and clients can observe the database from different nodes as well.&lt;/p&gt;
&lt;p&gt;What makes external consistency stronger than serializability is that serializability allows any of the possible serial schedules to be used, regardless of when each transaction actually committed.&lt;/p&gt;
&lt;p&gt;Using the same example of T&lt;sub&gt;1&lt;/sub&gt; completing before T&lt;sub&gt;2&lt;/sub&gt; commits: in a serializable database, it is possible for a client to read from a database node in a state where it contains the effect of T&lt;sub&gt;2&lt;/sub&gt;, but not T&lt;sub&gt;1&lt;/sub&gt;. This can lead to inconsistencies in the data.&lt;/p&gt;
&lt;p&gt;With external consistency, we&#39;re saying that once a transaction has committed, all subsequent transactions will see the effect of that committed transaction in the DB. Serializability does not provide that guarantee.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-1/#linearizability&quot;&gt;Linearizability&lt;/a&gt;, on the other hand, says nothing about the behaviour of transactions. It is a recency guarantee on a single object: once a committed write has updated the value of an object, all subsequent reads of that object must see the new value.&lt;/p&gt;
&lt;p&gt;In summary, while linearizability is related to external consistency, the two terms describe different things.&lt;/p&gt;
&lt;h3 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Some other useful sources to explore these topics further include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.bailis.org/blog/linearizability-versus-serializability/&quot;&gt;Linearizability vs Serializability&lt;/a&gt; by Peter Bailis&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://irenezhang.net/blog/2015/02/01/consistency.html&quot;&gt;Consistency should be more consistent!&lt;/a&gt; by Irene Zhang&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://jepsen.io/consistency&quot;&gt;Consistency Models&lt;/a&gt; - Jepsen blog&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://cloud.google.com/spanner/docs/true-time-external-consistency#does_cloud_spanner_provide_linearizability&quot;&gt;Cloud Spanner: Truetime and External Consistency&lt;/a&gt; -  Cloud Spanner Documentation&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>Chapter 9 - Consistency and Consensus (Part Two)</title>
		<link href="https://timilearning.com/posts/ddia/part-two/chapter-9-2/"/>
		<updated>2020-03-23T15:11:57-00:00</updated>
		<id>https://timilearning.com/posts/ddia/part-two/chapter-9-2/</id>
		<content type="html">&lt;p&gt;In the &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-1/&quot;&gt;first part&lt;/a&gt; of Chapter 9, we looked at Linearizability and Causality as consistency guarantees and used those topics to discuss the difference between total order and partial order. We also briefly discussed Lamport timestamps and how they are used to enforce a total ordering of operations across multiple nodes.&lt;/p&gt;
&lt;p&gt;We concluded by seeing that it&#39;s not enough to know the total order of operations &lt;em&gt;after&lt;/em&gt; all the operations have been collected. A node might want to make a decision &amp;quot;in the moment&amp;quot;, and so it needs to know what the order is at each point in time.&lt;/p&gt;
&lt;p&gt;This part will focus on &amp;quot;Total Order Broadcast&amp;quot; and &amp;quot;Consensus&amp;quot;, which help to solve the challenge described above.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#total-order-broadcast&quot;&gt;Total Order Broadcast&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#using-total-order-broadcast&quot;&gt;Using total order broadcast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#implementing-linearizable-storage-using-total-order-broadcast&quot;&gt;Implementing linearizable storage using total order broadcast&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#linearizable-writes-using-total-order-broadcast&quot;&gt;Linearizable writes using total order broadcast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#linearizable-reads-using-total-order-broadcast&quot;&gt;Linearizable reads using total order broadcast&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#implementing-total-order-broadcast-using-linearizable-storage&quot;&gt;Implementing total order broadcast using linearizable storage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#distributed-transactions-and-consensus&quot;&gt;Distributed Transactions and Consensus&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#atomic-commit-and-two-phase-commit-%282pc%29&quot;&gt;Atomic Commit and Two-Phase Commit (2PC)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#from-single-node-to-distributed-atomic-commit&quot;&gt;From single-node to distributed atomic commit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#introduction-to-two-phase-commit&quot;&gt;Introduction to Two-phase commit&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#coordinator-failure&quot;&gt;Coordinator Failure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#three-phase-commit&quot;&gt;Three-phase commit&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#distributed-transactions-in-practice&quot;&gt;Distributed Transactions in Practice&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#exactly-once-message-processing&quot;&gt;Exactly-once message processing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#xa-transactions&quot;&gt;XA Transactions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#holding-locks-while-in-doubt&quot;&gt;Holding locks while in doubt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#limitations-of-distributed-transactions&quot;&gt;Limitations of distributed transactions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#fault-tolerant-consensus&quot;&gt;Fault-Tolerant Consensus&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#consensus-algorithms-and-total-order-broadcast&quot;&gt;Consensus algorithms and total order broadcast&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#single-leader-replication-and-consensus&quot;&gt;Single-leader replication and consensus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#epoch-numbering-and-quorums&quot;&gt;Epoch numbering and quorums&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#limitations-of-consensus&quot;&gt;Limitations of consensus&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#membership-and-coordination-services&quot;&gt;Membership and Coordination Services&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#allocating-work-to-nodes&quot;&gt;Allocating work to nodes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#service-discovery&quot;&gt;Service discovery&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&quot;total-order-broadcast&quot;&gt;Total Order Broadcast &lt;a class=&quot;direct-link&quot; href=&quot;#total-order-broadcast&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We discussed earlier how single-leader replication determines a total order of operations by sequencing all operations on a single CPU core on the leader. However, when the required throughput is greater than a single leader can handle, there is the challenge of scaling the system across multiple nodes, as well as handling failover if the leader fails.&lt;/p&gt;
&lt;p&gt;Note that single-leader replication systems often only maintain ordering per partition, not a total order across all partitions; Kafka is an example of such a system. Total ordering across all partitions requires additional coordination.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Total Order Broadcast&lt;/em&gt; (or Atomic Broadcast) is a protocol for exchanging messages between nodes. It requires that the following safety properties are always satisfied:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Reliable delivery&lt;/em&gt;: No messages are lost. A message delivered to one node must be delivered to all the nodes.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Totally ordered delivery:&lt;/em&gt; Messages are delivered to every node in the same order.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From Wikipedia: &lt;em&gt;The broadcast is termed &amp;quot;atomic&amp;quot; because it either eventually completes correctly at all participants, or all participants abort without side effects.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;An algorithm for total order broadcast must ensure that these properties are always satisfied, even in the face of network or node faults. In the face of failures, the algorithm must keep retrying so that messages can get through when the network is repaired.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Q&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;: How can messages be delivered in the same order in multi-leader or leaderless replication systems?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;A:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;In leaderless replication systems, a client typically directly sends its writes to several replicas and then uses a quorum to determine if it&#39;s successful or not. Leaderless replication does not enforce a particular order of writes. Multi-leader systems are typically not linearizable, so it doesn&#39;t apply here.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&quot;using-total-order-broadcast&quot;&gt;Using total order broadcast &lt;a class=&quot;direct-link&quot; href=&quot;#using-total-order-broadcast&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Total order broadcast is exactly what is needed for database replication based on a principle known as &lt;em&gt;state machine replication.&lt;/em&gt; It&#39;s stated as follows:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&amp;quot;If every message represents a write to the database, and every replica processes the same writes in the same order, then the replicas will remain consistent with each other (aside from any temporary replication lag).&amp;quot;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 8950-8951). O&#39;Reilly Media. Kindle Edition.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It can also be used to implement serializable transactions:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;”If every message represents a deterministic transaction to be executed as a stored procedure, and if every node processes those messages in the same order, then the partitions and replicas of the database are kept consistent with each other&amp;quot;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 8957-8958). O&#39;Reilly Media. Kindle Edition.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One thing to note with total order broadcast is that &lt;em&gt;the order is fixed at the time of message delivery&lt;/em&gt;. This means that a node cannot retroactively insert a message into an earlier position if subsequent messages have been delivered. Messages must be delivered in the right order. This makes total order broadcast stronger than timestamp ordering (&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8/#time-of-day-clocks&quot;&gt;since we know that time can move backward&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Total Order broadcast can also be seen as a way of creating a &lt;em&gt;log&lt;/em&gt; (like a replication log, transaction log, or write-ahead log). Delivering a message is like appending to the log, and if all the nodes read from the log, they will see the same sequence of messages.&lt;/p&gt;
&lt;p&gt;Another use of total order broadcast is for implementing &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8/#fencing-tokens&quot;&gt;fencing tokens&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; Each request to acquire the lock can be appended as a message to the log, and the messages can be given a sequence number in the order of their appearance in the log. This sequence number can then be used as a fencing token due to the fact that it&#39;s monotonically increasing.&lt;/p&gt;
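&lt;p&gt;A minimal Python sketch of the fencing-token idea (illustrative only; the &lt;code&gt;LockLog&lt;/code&gt; class and client names are made up): the position a lock request gets in the log is its fencing token, and positions only ever increase.&lt;/p&gt;

```python
from itertools import count

# A shared log hands out monotonically increasing sequence numbers,
# which double as fencing tokens for a lock.
class LockLog:
    def __init__(self):
        self._seq = count(1)   # positions in the log, starting at 1

    def acquire(self, client):
        # Appending the request to the log assigns it the next position;
        # that position is the fencing token.
        return next(self._seq)

log = LockLog()
token_a = log.acquire("client-a")
token_b = log.acquire("client-b")
# Later requests always carry a larger token, so a storage service can
# reject writes that arrive with a stale (smaller) token.
assert token_b > token_a
```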
&lt;h3 id=&quot;implementing-linearizable-storage-using-total-order-broadcast&quot;&gt;Implementing linearizable storage using total order broadcast &lt;a class=&quot;direct-link&quot; href=&quot;#implementing-linearizable-storage-using-total-order-broadcast&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Although total order broadcast is a reasonably strong guarantee, it is not quite as strong as linearizability. However, they are closely related.&lt;/p&gt;
&lt;p&gt;Total Order Broadcast guarantees that messages will be delivered reliably in the same order, but it provides no guarantee about &lt;em&gt;when&lt;/em&gt; a message will be delivered (so a read from a node may return stale data). Linearizability, on the other hand, is a &lt;em&gt;recency guarantee&lt;/em&gt;: it guarantees that a read will see the latest value written.&lt;/p&gt;
&lt;p&gt;These two concepts are closely related, and so we can implement linearizable storage using total order broadcast and vice versa.&lt;/p&gt;
&lt;h4 id=&quot;linearizable-writes-using-total-order-broadcast&quot;&gt;Linearizable writes using total order broadcast &lt;a class=&quot;direct-link&quot; href=&quot;#linearizable-writes-using-total-order-broadcast&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Linearizable writes are instantaneous writes, meaning that once a client has written a value, all other reads to that value must see the newly written value or the value of a later write.&lt;/p&gt;
&lt;p&gt;Imagine that we are building a system where each user has a unique username, and we want to deal with a situation where multiple users concurrently try to grab the same username. (&lt;em&gt;Aside:&lt;/em&gt; Now, the scenario I have in my head is one in which these users can write to different replicas (think multiple leaders) concurrently. I imagine that in a single leader system, we would simply use the first successful operation).&lt;/p&gt;
&lt;p&gt;To ensure linearizable writes in this system using total order broadcast:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each node can append a message to the log indicating the username they want to claim.&lt;/li&gt;
&lt;li&gt;The nodes then read the log, waiting for the message they appended to be delivered back to them.&lt;/li&gt;
&lt;li&gt;If the first message that a node receives with the username it wants is its own message, then the node is successful and can commit the username claim. All other nodes that want to claim this username can then abort their operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This works because if there are several concurrent writes, all the nodes will agree on which came first, and these messages are delivered to all the nodes in the same order.&lt;/p&gt;
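&lt;p&gt;The steps above can be sketched in a few lines of Python (illustrative only; the list standing in for the broadcast log and the node names are made up). Because every node replays the same log in the same order, every node reaches the same verdict about which claim came first:&lt;/p&gt;

```python
# The log stands in for total order broadcast: every node sees the same
# messages in the same order.
log = []  # delivered messages, in total order

def claim_username(node, username):
    log.append({"node": node, "username": username})

def winner(username):
    # Replay the log; the first claim delivered for a username wins.
    for msg in log:
        if msg["username"] == username:
            return msg["node"]
    return None

claim_username("node-1", "timi")   # appended first, so delivered first
claim_username("node-2", "timi")   # delivered second: node-2 must abort
assert winner("timi") == "node-1"
```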
&lt;p&gt;This algorithm does not guarantee linearizable reads though. A client can still get stale reads if they read from an asynchronously updated store.&lt;/p&gt;
&lt;h4 id=&quot;linearizable-reads-using-total-order-broadcast&quot;&gt;Linearizable reads using total order broadcast &lt;a class=&quot;direct-link&quot; href=&quot;#linearizable-reads-using-total-order-broadcast&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;There are a number of options for linearizable reads with total order broadcast:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When you want to read a value, you could append a message to the log, read the log, and then perform the actual read of the value when the message you appended is delivered back to you. I think this works because all nodes have to agree on the order of messages, so you will always see the latest write that happened before your &#39;read message&#39;. If client B sends a &#39;write message&#39; to the log before client A&#39;s &#39;read message&#39; to the same log, client A will see the effects of that &#39;write message&#39;.&lt;/li&gt;
&lt;li&gt;If you can fetch the position of the latest log message in a linearizable way, you can query that position, wait for all the entries up to that position to be delivered to you, and then actually perform the read. This is the idea used in Zookeeper&#39;s sync() operation.&lt;/li&gt;
&lt;li&gt;You can always make reads from a replica that is synchronously updated on writes.&lt;/li&gt;
&lt;/ul&gt;
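&lt;p&gt;Here is a minimal Python sketch of the second option (the idea behind Zookeeper&#39;s sync()). Everything here is made up for illustration: the log is a plain list of totally ordered writes, and fetching its length stands in for querying the latest position in a linearizable way.&lt;/p&gt;

```python
# Totally ordered writes as (key, value) pairs.
log = [("x", 1), ("x", 2), ("x", 3)]

class Replica:
    def __init__(self):
        self.applied = 0        # how many log entries we have applied
        self.state = {}

    def catch_up(self, target):
        # Apply log entries until we have reached the target position.
        while self.applied != target:
            key, value = log[self.applied]
            self.state[key] = value
            self.applied += 1

def linearizable_read(replica, key):
    target = len(log)           # latest position, fetched linearizably
    replica.catch_up(target)    # wait for all entries up to that position
    return replica.state[key]   # now safe: cannot return a stale value

assert linearizable_read(Replica(), "x") == 3
```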
&lt;h3 id=&quot;implementing-total-order-broadcast-using-linearizable-storage&quot;&gt;Implementing total order broadcast using linearizable storage &lt;a class=&quot;direct-link&quot; href=&quot;#implementing-total-order-broadcast-using-linearizable-storage&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;What we&#39;re essentially implementing is a mechanism for generating sequence numbers for each message we want to send. The easiest way to do this is to assume that we have a linearizable register which stores an integer and has an atomic increment-and-get operation.&lt;/p&gt;
&lt;p&gt;The algorithm is this:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&amp;quot;For every message you want to send through total order broadcast, you increment-and-get the linearizable integer, and then attach the value you got from the register as a sequence number to the message.&amp;quot;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 9023-9024). O&#39;Reilly Media. Kindle Edition.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This way, we&#39;ll avoid race conditions on the integer and each message will have a unique sequence number.&lt;/p&gt;
&lt;p&gt;Something to note is that unlike Lamport timestamps, the numbers obtained by incrementing the linearizable register form a sequence with no gaps. The sequence won&#39;t jump from 4 to 6. Therefore, if a node has delivered a message with a sequence number of 4 and receives an incoming message with a sequence number of 6, it must wait for message 5 before it can deliver message 6 (because messages must be delivered in the same order to all nodes, and delivering 6 before 5 would break that order).&lt;/p&gt;
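&lt;p&gt;A minimal Python sketch of this, assuming an in-process lock as a stand-in for a linearizable register (the &lt;code&gt;Register&lt;/code&gt; and &lt;code&gt;Node&lt;/code&gt; classes are made up for illustration). The node holds back out-of-order messages until the gap is filled:&lt;/p&gt;

```python
import threading

class Register:
    """Stand-in for a linearizable atomic increment-and-get register."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value

class Node:
    def __init__(self):
        self.next_expected = 1
        self.buffer = {}        # held-back messages, keyed by sequence number
        self.delivered = []

    def receive(self, seq, msg):
        self.buffer[seq] = msg
        # Deliver as many consecutive messages as we now can.
        while self.next_expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_expected))
            self.next_expected += 1

reg = Register()
m1, m2, m3 = [(reg.increment_and_get(), f"msg-{i}") for i in (1, 2, 3)]
node = Node()
node.receive(*m1)
node.receive(*m3)               # out of order: held back, not delivered
node.receive(*m2)               # gap filled: 2 and 3 delivered together
assert node.delivered == ["msg-1", "msg-2", "msg-3"]
```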
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: It can be proved that a linearizable compare-and-set (or increment-and-get) register and total order broadcast are both equivalent to consensus, meaning that if you can solve one of these problems, you can transform it into a solution for the others.&lt;/p&gt;
&lt;p&gt;We&#39;ll discuss the consensus problem next.&lt;/p&gt;
&lt;h2 id=&quot;distributed-transactions-and-consensus&quot;&gt;Distributed Transactions and Consensus &lt;a class=&quot;direct-link&quot; href=&quot;#distributed-transactions-and-consensus&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Simply put, consensus means getting &lt;em&gt;several nodes to agree on something.&lt;/em&gt; However, it turns out that this is not an easy problem to solve, and it is one of the fundamental problems in distributed computing.&lt;/p&gt;
&lt;p&gt;Some situations where it is important for the nodes to agree include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Leader election:&lt;/em&gt; In a single-leader setup, all the nodes need to agree on which node is the leader. If the nodes don&#39;t agree on who the leader is, it could lead to a split-brain situation in which multiple &#39;leaders&#39; could accept writes, leading to inconsistency and data loss.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Atomic commit:&lt;/em&gt; If a transaction spans several nodes or partitions, there&#39;s a chance that it may fail on some nodes and succeed on others. However, to preserve the atomicity property of ACID transactions, it must either succeed or fail on &lt;em&gt;all&lt;/em&gt; of them. We have to get all the nodes to &lt;em&gt;agree&lt;/em&gt; on the outcome of the transaction. This is known as the &lt;em&gt;atomic commit&lt;/em&gt; problem.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We&#39;ll address the atomic commit problem first before delving into other consensus scenarios.&lt;/p&gt;
&lt;h3 id=&quot;atomic-commit-and-two-phase-commit-(2pc)&quot;&gt;Atomic Commit and Two-Phase Commit (2PC) &lt;a class=&quot;direct-link&quot; href=&quot;#atomic-commit-and-two-phase-commit-(2pc)&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Two-phase commit is the most commonly used algorithm for implementing atomic commit. It is a kind of consensus algorithm, but not a very good one, and we&#39;ll see why soon.&lt;/p&gt;
&lt;p&gt;We learned in &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/&quot;&gt;Chapter 7&lt;/a&gt; that the purpose of transaction atomicity is to prevent the database from getting in an &lt;em&gt;inconsistent state&lt;/em&gt; in the event of failure.&lt;/p&gt;
&lt;h4 id=&quot;from-single-node-to-distributed-atomic-commit&quot;&gt;From single-node to distributed atomic commit &lt;a class=&quot;direct-link&quot; href=&quot;#from-single-node-to-distributed-atomic-commit&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Atomicity for single database node transactions is usually implemented by the storage engine. When a request is made to commit a transaction, the writes in the transaction are made durable (typically using a &lt;a href=&quot;https://timilearning.com/posts/data-storage-on-disk/part-one/#write-ahead-logs&quot;&gt;write-ahead log&lt;/a&gt;&lt;em&gt;)&lt;/em&gt; and then a commit record is appended on disk. If the database crashes during this process, upon restarting, it decides whether to commit or rollback the transaction based on whether or not the commit record was written to the disk before the crash.&lt;/p&gt;
&lt;p&gt;However, if multiple nodes are involved, it&#39;s not sufficient to simply send a commit request to all the nodes and then commit the transaction on each one. Some of the scenarios where multiple nodes could be involved are: a multi-object transaction in a partitioned database, or writing to a &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-6/#partitioning-secondary-indexes-by-term&quot;&gt;term-partitioned index&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It&#39;s possible that the commit succeeds on some nodes and fails on other nodes, which is a violation of the atomicity guarantee. Possible scenarios are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Some nodes may detect a violation of a uniqueness constraint or something similar and may have to abort, while other nodes are able to commit successfully.&lt;/li&gt;
&lt;li&gt;Some commit requests might get lost in the network, and may eventually abort due to a timeout, while other requests are successful.&lt;/li&gt;
&lt;li&gt;Some nodes may crash before the commit record is fully written and then have to roll back on recovery, while other nodes successfully commit.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If some nodes commit a transaction while others abort it, the nodes will be in an inconsistent state. Note that a transaction commit on a node must be irrevocable. It cannot be retracted once it has been committed. The reason for this is that data becomes visible to other transactions once it has been committed by a transaction, and other clients may now rely on that data. Therefore, it&#39;s important that a node commits a transaction only when it is certain that all other nodes in the transaction will commit.&lt;/p&gt;
&lt;h4 id=&quot;introduction-to-two-phase-commit&quot;&gt;Introduction to Two-phase commit &lt;a class=&quot;direct-link&quot; href=&quot;#introduction-to-two-phase-commit&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Two-phase commit (or 2PC) is an algorithm used for achieving atomic transaction commit when multiple nodes are involved. &#39;Atomic&#39; in the sense that either all nodes commit or all abort.&lt;/p&gt;
&lt;p&gt;The key thing here is that the commit process is split into two phases: the &lt;em&gt;prepare&lt;/em&gt; phase and the &lt;em&gt;actual commit&lt;/em&gt; phase.&lt;/p&gt;
&lt;p&gt;It achieves atomicity across multiple nodes by introducing a new component known as &lt;em&gt;the coordinator&lt;/em&gt;. The coordinator can run in the same process as the service requesting the transaction or in an entirely different process. When the application is ready to commit a transaction, the two phases are as follows:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The coordinator sends a &lt;em&gt;prepare&lt;/em&gt; request to all the nodes participating in the transaction, for which the nodes have to respond with essentially a &#39;YES&#39; or &#39;NO&#39; message.&lt;/li&gt;
&lt;li&gt;If all the participants reply &#39;YES&#39;, then the coordinator will send a &lt;em&gt;commit&lt;/em&gt; request in the second phase for them to actually perform the commit. However, if &lt;em&gt;any&lt;/em&gt; of the nodes reply &#39;NO&#39;, the coordinator sends an &lt;em&gt;abort&lt;/em&gt; request to all the participants.&lt;/li&gt;
&lt;/ol&gt;
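&lt;p&gt;The two phases above can be sketched in Python (illustrative only; the &lt;code&gt;Participant&lt;/code&gt; objects and their prepare/commit/abort methods are made up, and real 2PC would also persist the decision and retry over the network):&lt;/p&gt;

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.outcome = None

    def prepare(self):
        # Phase 1: answering YES is a promise to commit under all circumstances.
        return "YES" if self.can_commit else "NO"

    def commit(self):
        self.outcome = "committed"

    def abort(self):
        self.outcome = "aborted"

def two_phase_commit(participants):
    # Phase 1: the coordinator collects votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every participant voted YES.
    if all(v == "YES" for v in votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

ok = [Participant(), Participant()]
assert two_phase_commit(ok) == "committed"

mixed = [Participant(), Participant(can_commit=False)]
assert two_phase_commit(mixed) == "aborted"
assert mixed[0].outcome == "aborted"   # a single NO aborts everyone
```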
&lt;p&gt;In case it&#39;s still not clear how this protocol ensures atomicity while one-phase commit across multiple nodes does not, note that there are two essential &amp;quot;points of no return&amp;quot;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When a participant responds with &amp;quot;YES&amp;quot;, it means that it must be able to commit under all circumstances. A power failure, crash, or memory issue cannot be an excuse for refusing to commit later. It &lt;em&gt;must&lt;/em&gt; definitely be able to commit the transaction without error if needed.&lt;/li&gt;
&lt;li&gt;When the coordinator decides and that decision is written to disk, the decision is irrevocable. It doesn&#39;t matter if the commit or abort request fails at first; it must be retried forever until it succeeds. If a participant crashes before it can complete the commit/abort request, the request is retried and the transaction is committed (or aborted) when the participant recovers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;coordinator-failure&quot;&gt;Coordinator Failure &lt;a class=&quot;direct-link&quot; href=&quot;#coordinator-failure&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;If any of the &lt;em&gt;prepare&lt;/em&gt; requests fails or times out during a 2PC, the coordinator will abort the transaction. If any commit or abort request fails, the coordinator will retry them indefinitely.&lt;/p&gt;
&lt;p&gt;If the coordinator fails before it can send a prepare request, a participant can safely abort the transaction. However, once a participant has received a prepare request and voted &amp;quot;YES&amp;quot;, it can no longer abort by itself. It has to wait to hear from the coordinator about whether or not it should commit the transaction. The &lt;em&gt;downside&lt;/em&gt; of this is that if the coordinator crashes or the network fails after a participant has responded &amp;quot;YES&amp;quot;, the participant can do nothing but wait. In this state, it is said to be &lt;em&gt;in doubt&lt;/em&gt; or &lt;em&gt;uncertain.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The reason why a participant has to wait for the coordinator in the event of a failure is that it does not know whether the failure extends to all participants or just itself. It&#39;s possible that the network failed after the commit request was sent to one of the participants. If the &lt;em&gt;in-doubt participants&lt;/em&gt; then decide to abort after a timeout because they have not heard from the coordinator, the database will be left in an inconsistent state.&lt;/p&gt;
&lt;p&gt;In principle, the in-doubt participants could communicate among themselves to find out how each participant voted and then come to an agreement, but that is not part of the 2PC protocol.&lt;/p&gt;
&lt;p&gt;This possibility of failure is why the coordinator must write its decision to a transaction log on disk before sending the request to the participants. When it recovers from a failure, it can read its transaction log to determine the status of all in-doubt transactions. Transactions without a commit record in the coordinator&#39;s log are aborted. In essence, the commit point of 2PC is a regular single-node atomic commit on the coordinator.&lt;/p&gt;
&lt;h5 id=&quot;three-phase-commit&quot;&gt;Three-phase commit &lt;a class=&quot;direct-link&quot; href=&quot;#three-phase-commit&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Two-phase commit is referred to as a &lt;em&gt;blocking&lt;/em&gt; atomic commit protocol because of the fact that it can get stuck waiting for the coordinator to recover.&lt;/p&gt;
&lt;p&gt;An alternative to 2PC that has been proposed is an algorithm called &lt;em&gt;three-phase commit (3PC).&lt;/em&gt; The idea here is that it assumes a network with bounded delays and nodes with bounded response times. This means that when a delay exceeds that bound, a participant can safely assume that the coordinator has crashed.&lt;/p&gt;
&lt;p&gt;However, most practical systems have unbounded network delays and process pauses, and so it cannot guarantee atomicity. If we wrongly declare the coordinator to be dead, the coordinator could resume and end up sending commit or abort requests, even when the participants have already decided. &lt;em&gt;(&lt;strong&gt;Q&lt;/strong&gt;: I wonder if this is something that can be avoided by ensuring that once a coordinator has been declared dead for a particular transaction, it cannot come back and send requests? Might be possible through some form of sequence numbers).&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This difficulty in coming up with a &lt;em&gt;perfect failure detector&lt;/em&gt; is why 2PC continues to be used today.&lt;/p&gt;
&lt;h3 id=&quot;distributed-transactions-in-practice&quot;&gt;Distributed Transactions in Practice &lt;a class=&quot;direct-link&quot; href=&quot;#distributed-transactions-in-practice&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Distributed transactions, especially those implemented with two-phase commit, are contentious because of the performance implications and operational problems they cause. This has led to many cloud services choosing not to implement them.&lt;/p&gt;
&lt;p&gt;However, despite these limitations, it&#39;s useful to examine them in more detail as there are lessons that can be learned from them.&lt;/p&gt;
&lt;p&gt;There are two types of distributed transactions, which often get conflated:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Database-internal distributed transactions:&lt;/em&gt; This refers to transactions performed by a distributed database that spans multiple replicas or partitions. VoltDB and MySQL Cluster&#39;s NDB storage engine support such transactions. Here, all the nodes participating in the transaction are running the same database software.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Heterogeneous distributed transactions:&lt;/em&gt; Here, the participants are two or more different technologies. For example, we could have two databases from different vendors, or even non-database systems such as message brokers. Although the systems may be entirely different under the hood, a distributed transaction has to ensure atomic commit across all of them.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;exactly-once-message-processing&quot;&gt;Exactly-once message processing &lt;a class=&quot;direct-link&quot; href=&quot;#exactly-once-message-processing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;With heterogeneous transactions, we can integrate diverse systems in powerful ways. For example, we can perform a transaction that spans a message queue and a database. Say we want to acknowledge a message from a queue as processed if and only if the transaction for processing the message committed successfully: we can implement this by atomically committing the message acknowledgment and the database writes in a single distributed transaction.&lt;/p&gt;
&lt;p&gt;If the transaction fails and the message is not acknowledged, the message broker can safely redeliver the message later.&lt;/p&gt;
&lt;p&gt;An advantage of atomically committing a message together with the side effects of its processing is that it ensures that the message is &lt;em&gt;effectively&lt;/em&gt; processed exactly once. If the transaction fails, the effects of processing the message can simply be rolled back.&lt;/p&gt;
&lt;p&gt;However, this is only possible if all the systems involved in the transaction are able to use the same atomic commit protocol. For example, if a side effect of processing a message involves sending an email and the email server does not support two-phase commit, it will be difficult to roll back the email. Processing the message multiple times may then result in multiple emails being sent.&lt;/p&gt;
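&lt;p&gt;A minimal sketch of the idea, with hypothetical in-memory stand-ins for the database and the message broker:&lt;/p&gt;

```python
# Sketch: effectively-exactly-once processing by committing the message
# acknowledgment and the database write atomically. All names are
# illustrative, not a real transaction API.

class Transaction:
    def __init__(self, db, queue):
        self.db, self.queue = db, queue
        self.pending_writes, self.pending_acks = {}, []

    def write(self, key, value):
        self.pending_writes[key] = value    # buffered until commit

    def ack(self, message_id):
        self.pending_acks.append(message_id)

    def commit(self):
        # Atomic commit point: either both effects happen or neither does.
        self.db.update(self.pending_writes)
        for message_id in self.pending_acks:
            self.queue.discard(message_id)

def process(message_id, payload, db, queue):
    txn = Transaction(db, queue)
    txn.write(payload["key"], payload["value"])
    txn.ack(message_id)     # the ack is part of the same transaction
    txn.commit()

db, queue = {}, {"m1"}
process("m1", {"key": "balance", "value": 100}, db, queue)
print(db, queue)
# {'balance': 100} set()
```

<p>If `commit()` is never reached, neither the write nor the ack takes effect, and the broker can safely redeliver the message.</p>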
&lt;p&gt;Next, we&#39;ll discuss the atomic commit protocol that allows such heterogeneous distributed transactions.&lt;/p&gt;
&lt;h4 id=&quot;xa-transactions&quot;&gt;XA Transactions &lt;a class=&quot;direct-link&quot; href=&quot;#xa-transactions&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;XA (&lt;em&gt;eXtended Architecture)&lt;/em&gt; is a standard for implementing two-phase commit across heterogeneous technologies.&lt;/p&gt;
&lt;p&gt;XA is a C API for interacting with a transaction coordinator, but bindings for the API exist in other languages.&lt;/p&gt;
&lt;p&gt;It assumes that communication between your application and the participant databases/messaging services is done through a network driver (like JDBC) or a client library which supports XA. If the driver does support XA, it will call the XA API to find out whether an operation should be part of a distributed transaction - and if so, it sends the necessary information to the participant database server. The driver also exposes callbacks needed by the coordinator to interact with the participant, through which it can ask a participant to prepare, commit, or abort.&lt;/p&gt;
&lt;p&gt;The transaction coordinator is what implements the XA API. The coordinator is usually just a library that&#39;s loaded into the same process as the application issuing the transaction. It keeps track of the participants involved in a transaction, their responses after asking them to prepare, and then uses a log to keep track of its commit/abort decision for each transaction.&lt;/p&gt;
&lt;p&gt;Note that a participant database cannot contact the coordinator directly. All of the communication must go through its client library through the XA callbacks.&lt;/p&gt;
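&lt;p&gt;Putting the coordinator&#39;s role together in a sketch (the prepare/commit/abort interface below is a simplification I am assuming, not the real XA C API):&lt;/p&gt;

```python
# Sketch: a coordinator driving two-phase commit through XA-style
# callbacks exposed by each participant's driver.

class FakeParticipant:
    def __init__(self, vote):
        self.vote, self.state = vote, "active"

    def prepare(self):
        return self.vote          # phase 1: vote YES (True) or NO (False)

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def run_transaction(participants, decision_log):
    votes = [p.prepare() for p in participants]     # phase 1: prepare all
    decision = "commit" if all(votes) else "abort"  # any NO aborts everyone
    decision_log.append(decision)                   # log before phase 2
    for p in participants:                          # phase 2: enact decision
        p.commit() if decision == "commit" else p.abort()
    return decision

log = []
a, b = FakeParticipant(True), FakeParticipant(False)
print(run_transaction([a, b], log), a.state, b.state)
# abort aborted aborted
```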
&lt;h4 id=&quot;holding-locks-while-in-doubt&quot;&gt;Holding locks while in doubt &lt;a class=&quot;direct-link&quot; href=&quot;#holding-locks-while-in-doubt&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The reason we care so much about transactions not being stuck in doubt is &lt;em&gt;locking.&lt;/em&gt; Transactions often need to take row-level locks on any rows they modify, to prevent dirty writes. These locks must be held until the transaction commits or aborts.&lt;/p&gt;
&lt;p&gt;If we&#39;re using a two-phase commit protocol and the coordinator crashes, the locks will be held until the coordinator is restarted. No other transaction can modify these rows while the locks are held.&lt;/p&gt;
&lt;p&gt;The impact of this is that it can lead to large parts of your application being unavailable: If other transactions want to access the rows held by an in-doubt transaction, they will be blocked until the transaction is resolved.&lt;/p&gt;
&lt;h4 id=&quot;limitations-of-distributed-transactions&quot;&gt;Limitations of distributed transactions &lt;a class=&quot;direct-link&quot; href=&quot;#limitations-of-distributed-transactions&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;While XA transactions are useful for coordinating transactions across heterogeneous data systems, we have seen that they can introduce major operational problems. One key insight here is that the transaction coordinator is a kind of database itself (in the sense that it keeps a durably persisted transaction log), and so it needs to be treated with the same level of importance as other databases. Some of the other limitations of distributed transactions are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;2PC needs &lt;em&gt;all&lt;/em&gt; participants to respond before it can commit a transaction. As a result, if &lt;em&gt;any&lt;/em&gt; part of the system is broken, the transaction will fail. This means that distributed transactions have a tendency of &lt;em&gt;amplifying failures&lt;/em&gt;, which is not what we want when building fault-tolerant systems.&lt;/li&gt;
&lt;li&gt;If the coordinator is not replicated across multiple machines, it becomes a single point of failure for the system.&lt;/li&gt;
&lt;li&gt;XA needs to be compatible across a wide range of data systems, so it is a lowest common denominator: it cannot have features that are specific to any one system. For example, it cannot detect deadlocks across different systems, as that would require a standardized protocol through which systems could tell each other which locks each transaction holds. It also cannot work with &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#serializable-snapshot-isolation&quot;&gt;Serializable Snapshot Isolation&lt;/a&gt;, as that would require a protocol for identifying conflicts across multiple systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;fault-tolerant-consensus&quot;&gt;Fault-Tolerant Consensus &lt;a class=&quot;direct-link&quot; href=&quot;#fault-tolerant-consensus&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In simple terms, consensus means getting several nodes to agree on something. For example, if several people concurrently try to book the same meeting room, or to register the same username, a consensus algorithm can be used to determine which one of them wins.&lt;/p&gt;
&lt;p&gt;In formal terms, we describe the consensus problem like this: one or more nodes may &lt;em&gt;propose&lt;/em&gt; values, and the role of the consensus algorithm is to &lt;em&gt;decide&lt;/em&gt; on one of those values. In the case of booking a meeting room, each node handling a user request may propose the name of the user making the request, and the consensus algorithm will decide which user gets the room.&lt;/p&gt;
&lt;p&gt;A consensus algorithm must satisfy the following properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Uniform agreement:&lt;/em&gt; No two nodes decide differently.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Integrity:&lt;/em&gt; No node decides twice.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Validity:&lt;/em&gt; If a node decides a value &lt;em&gt;v,&lt;/em&gt; then &lt;em&gt;v&lt;/em&gt; was proposed by some node.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Termination:&lt;/em&gt; Every node that does not crash eventually decides some value.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The core idea of consensus is captured in the uniform agreement and integrity properties: everyone must decide on the same outcome, and once the outcome has been decided, you cannot change your mind.&lt;/p&gt;
&lt;p&gt;The validity property is mostly to rule out trivial solutions such as an algorithm that will always decide &lt;em&gt;null&lt;/em&gt; regardless of what was proposed. An algorithm like that would satisfy the first two properties, but not the validity property.&lt;/p&gt;
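&lt;p&gt;The three safety properties can be checked mechanically over a recorded run. In this sketch, each node&#39;s decisions are captured in a hypothetical trace format of my own:&lt;/p&gt;

```python
# Sketch: checking the safety properties of consensus over a recorded run.
# 'decisions' maps node -> list of values it decided; 'proposed' is the
# set of values that some node actually proposed.

def check_safety(decisions, proposed):
    decided = [v for values in decisions.values() for v in values]
    # Uniform agreement: no two nodes decide differently.
    agreement = len(set(decided)) in (0, 1)
    # Integrity: no node decides twice.
    integrity = all(len(values) in (0, 1) for values in decisions.values())
    # Validity: every decided value was proposed by some node.
    validity = all(v in proposed for v in decided)
    return agreement and integrity and validity

ok = {"n1": ["x"], "n2": ["x"], "n3": ["x"]}
bad = {"n1": ["x"], "n2": ["y"]}   # two nodes decided differently
print(check_safety(ok, {"x", "y"}), check_safety(bad, {"x", "y"}))
# True False
```

<p>A trivial always-decide-null algorithm would pass agreement and integrity here, but fail validity, which is exactly why the validity property exists.</p>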
&lt;p&gt;The termination property is what ensures fault tolerance in consensus-based systems. Without this property, we could designate one node as the &amp;quot;dictator&amp;quot; and let it make all the decisions. However, if that node fails, the system will not be able to make a decision. We saw this situation with two-phase commit, which leaves participants in doubt when the coordinator fails.&lt;/p&gt;
&lt;p&gt;What the termination property means is that a consensus algorithm cannot sit idle and do nothing forever i.e. it must make progress. If some nodes fail, the other nodes must reach a decision. &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8/#safety-and-liveness&quot;&gt;Note that termination is a liveness property, while the other three are safety properties.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The consensus system model assumes that when a node &amp;quot;crashes&amp;quot;, it disappears and never comes back, which means that any algorithm that must wait for a node to recover cannot satisfy the termination property. Note, however, that termination is only guaranteed as long as fewer than half of the nodes have crashed.&lt;/p&gt;
&lt;p&gt;Note that the distinction between the safety and the liveness properties means that even if the termination property is not met, the failure to terminate cannot corrupt the consensus system by causing it to make invalid decisions. In addition, most consensus algorithms assume that there are no &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8/#byzantine-faults&quot;&gt;Byzantine faults&lt;/a&gt;. This means that if a node is Byzantine-faulty, it may break the safety properties of the protocol.&lt;/p&gt;
&lt;h4 id=&quot;consensus-algorithms-and-total-order-broadcast&quot;&gt;Consensus algorithms and total order broadcast &lt;a class=&quot;direct-link&quot; href=&quot;#consensus-algorithms-and-total-order-broadcast&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The most popular fault-tolerant consensus algorithms are Paxos, Zab, Raft, and Viewstamped Replication. However, most of these algorithms do not directly use the formal model described above, i.e. proposing and deciding on a single value while satisfying the liveness and safety properties. Instead, they decide on a &lt;em&gt;sequence&lt;/em&gt; of values, which makes them &lt;em&gt;total order broadcast&lt;/em&gt; algorithms.&lt;/p&gt;
&lt;p&gt;Recall from the discussion earlier that the following properties must be met for total order broadcast:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Messages must be delivered to all nodes in the same order.&lt;/li&gt;
&lt;li&gt;No messages are lost.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If we look closely at these properties, total order broadcast can be seen as performing several rounds of consensus as all the nodes have to &lt;em&gt;agree&lt;/em&gt; on what message goes next in the total order sequence. Each consensus decision can be seen as corresponding to one message delivery.&lt;/p&gt;
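&lt;p&gt;A sketch of this correspondence: if each slot of the sequence is filled by a consensus decision, every node that walks the slots in order delivers the same messages in the same order. The slot-to-value map here stands in for running a full consensus round per slot:&lt;/p&gt;

```python
# Sketch: total order broadcast as one consensus decision per sequence
# slot. 'decided_slots' is a hypothetical stand-in for the outcome of
# repeated consensus rounds: slot i -> agreed message for that slot.

def deliver_in_order(decided_slots):
    """Every node runs this loop and delivers the same messages in the
    same order, because the value in each slot was agreed by consensus."""
    delivered = []
    slot = 0
    while slot in decided_slots:
        delivered.append(decided_slots[slot])  # decision for this slot
        slot += 1
    return delivered

decided_slots = {0: "msg-a", 1: "msg-b", 2: "msg-c"}
# Two replicas reading the same decided slots derive identical sequences:
print(deliver_in_order(decided_slots) == deliver_in_order(decided_slots))
# True
```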
&lt;p&gt;&lt;em&gt;Viewstamped Replication, Raft, and Zab implement total order broadcast directly, because that is more efficient than doing repeated rounds of one-value-at-a-time consensus. In the case of Paxos, this optimization is known as Multi-Paxos.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 9474-9476). O&#39;Reilly Media. Kindle Edition.&lt;/em&gt;&lt;/p&gt;
&lt;h4 id=&quot;single-leader-replication-and-consensus&quot;&gt;Single-leader replication and consensus &lt;a class=&quot;direct-link&quot; href=&quot;#single-leader-replication-and-consensus&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;We&#39;ve learned that single-leader replication takes all the writes to the leader and applies them to the followers in the same order, thereby keeping the replicas up to date. This is the same idea as total order broadcast, but interestingly, we haven&#39;t yet discussed consensus in the context of single-leader replication.&lt;/p&gt;
&lt;p&gt;Consensus in single-leader replication depends on how the leader is chosen. If the leader is always chosen by a manual operator, there&#39;s a risk that it will not satisfy the termination property of consensus if the leader is unavailable for any reason.&lt;/p&gt;
&lt;p&gt;Alternatively, some databases perform automatic leader election and failover by promoting a new leader if the old leader fails. However, in these systems, there is a risk of split-brain (where two nodes could think they&#39;re the leader) and so we still need all the nodes to &lt;em&gt;agree&lt;/em&gt; on who the leader is.&lt;/p&gt;
&lt;p&gt;So it looks like a cycle:&lt;/p&gt;
&lt;p&gt;Consensus algorithms are actually total order broadcast algorithms -&gt; total order broadcast algorithms are like single-leader replication -&gt; single-leader replication needs consensus to determine the leader -&gt; (repeat cycle)&lt;/p&gt;
&lt;p&gt;How do we break this cycle, where it seems that to solve consensus, we must first solve consensus? We&#39;ll discuss that next.&lt;/p&gt;
&lt;h4 id=&quot;epoch-numbering-and-quorums&quot;&gt;Epoch numbering and quorums &lt;a class=&quot;direct-link&quot; href=&quot;#epoch-numbering-and-quorums&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The consensus protocols discussed above all use a leader internally. A key point to note, however, is that they don&#39;t guarantee that the leader is unique. They provide a weaker guarantee instead: the protocols define a monotonically increasing &lt;em&gt;epoch number&lt;/em&gt; and guarantee that within each epoch, the leader is unique.&lt;/p&gt;
&lt;p&gt;Whenever the current leader is thought to be dead, the nodes start a vote to elect a new leader. In each election round, the epoch number is incremented. If we have two leaders belonging to different epochs, the one with the higher epoch number will prevail.&lt;/p&gt;
&lt;p&gt;Before a leader can decide anything, it must be sure that there is no leader with a higher epoch number than it. It does this by collecting votes from a &lt;em&gt;quorum&lt;/em&gt; (typically the majority, but not always) of nodes for every decision that it wants to make. A node will vote for a proposal &lt;em&gt;only&lt;/em&gt; if it is not aware of another leader with a higher epoch.&lt;/p&gt;
&lt;p&gt;Therefore, we have two voting rounds in consensus protocols: one to elect a leader, and another to vote on a leader&#39;s proposal. The important thing is that the quorums of nodes participating in the two rounds must overlap: if a vote on a proposal succeeds, at least one of the nodes that voted for it must also have voted in the most recent leader election.&lt;/p&gt;
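&lt;p&gt;A sketch of the epoch rule in the proposal round (a deliberately stripped-down model of my own, not a full consensus implementation):&lt;/p&gt;

```python
# Sketch: epoch-based voting. A node votes for a proposal only if it has
# not already promised a higher epoch; a proposal passes only with a
# strict majority quorum of votes.

class Node:
    def __init__(self):
        self.promised_epoch = 0

    def vote(self, epoch):
        # Reject any proposal from a leader with a stale (lower) epoch.
        if epoch >= self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

def propose(nodes, epoch):
    votes = sum(1 for n in nodes if n.vote(epoch))
    return 2 * votes > len(nodes)   # strict majority quorum

nodes = [Node() for _ in range(5)]
print(propose(nodes, epoch=2))   # True: fresh epoch, all nodes vote
print(propose(nodes, epoch=1))   # False: a stale leader from epoch 1
```

<p>Because any two majorities overlap, a stale leader cannot assemble a quorum once a majority of nodes has promised a higher epoch.</p>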
&lt;p&gt;The biggest differences between 2PC and fault-tolerant consensus algorithms are that the coordinator in 2PC is not elected, and that consensus algorithms only require votes from a majority of nodes, unlike 2PC where all the participants must vote &amp;quot;YES&amp;quot;. In addition, consensus algorithms define a recovery process through which nodes get into a consistent state after a new leader is elected. These differences are what make consensus algorithms more fault-tolerant.&lt;/p&gt;
&lt;h4 id=&quot;limitations-of-consensus&quot;&gt;Limitations of consensus &lt;a class=&quot;direct-link&quot; href=&quot;#limitations-of-consensus&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Consensus algorithms bring concrete safety properties (agreement, integrity, and validity) and provide total order broadcast, and therefore linearizable operations, in a fault-tolerant way. Even so, they are not used everywhere, because these benefits come at a cost.&lt;/p&gt;
&lt;p&gt;A potential downside of consensus systems is that they require a strict majority to operate. This means that to tolerate one failure, you need three nodes, and to tolerate two failures, a minimum of five nodes are needed.&lt;/p&gt;
&lt;p&gt;Another challenge is that they rely on timeouts to detect failed nodes, and so in a system with variable network delays, it&#39;s possible for a node to falsely think that the leader has failed. This won&#39;t affect the safety properties, but it can lead to frequent leader elections which could harm system performance.&lt;/p&gt;
&lt;h3 id=&quot;membership-and-coordination-services&quot;&gt;Membership and Coordination Services &lt;a class=&quot;direct-link&quot; href=&quot;#membership-and-coordination-services&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Zookeeper and etcd are typically described as &amp;quot;distributed key-value stores&amp;quot; or &amp;quot;coordination and configuration services&amp;quot;. These services look like databases: you can read and write the value of a given key, or iterate over keys.&lt;/p&gt;
&lt;p&gt;However, it&#39;s important to note that these systems are not designed to be used as a general-purpose database. Zookeeper and etcd are designed to hold small amounts of data &lt;em&gt;in memory,&lt;/em&gt; so all of your application&#39;s data cannot be stored there. This small amount of data is then replicated across all the nodes using a &lt;em&gt;fault-tolerant total order broadcast algorithm.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Some of the features provided by these services are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Linearizable atomic operations&lt;/em&gt;: Zookeeper can be used to implement distributed locks using an atomic compare-and-set operation. If several nodes concurrently try to obtain a lock on a row, it can help to guarantee that only one of them will succeed and the operation will be atomic and linearizable.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Total ordering of operations&lt;/em&gt;: We&#39;ve discussed &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8/#fencing-tokens&quot;&gt;fencing tokens&lt;/a&gt; before, which can be used to prevent the clients from conflicting with each other when they want to access a resource protected by a lock or lease. Zookeeper helps to provide this by giving each operation a monotonically increasing transaction ID and version number.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Failure detection:&lt;/em&gt; Clients maintain a long-lived session with Zookeeper servers, and both client and server periodically exchange &#39;heartbeats&#39; to check that the other node is alive. If the heartbeats cease for a duration longer than the session timeout, Zookeeper will declare the session to be dead.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Change notifications:&lt;/em&gt; With Zookeeper, clients can be made aware of when other nodes (clients) join the cluster since the new node will write to Zookeeper. A client can also be made aware of when another client leaves the cluster.&lt;/li&gt;
&lt;/ul&gt;
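&lt;p&gt;As an illustration of the first feature, here is a lock built on a linearizable compare-and-set register. This mimics the idea rather than Zookeeper&#39;s actual API:&lt;/p&gt;

```python
# Sketch: a lock built from a linearizable compare-and-set operation, in
# the spirit of what Zookeeper provides (not Zookeeper's real interface).
# In a real deployment the register would be replicated via total order
# broadcast; this single-process model only shows the CAS semantics.

class Register:
    """A single register supporting an atomic compare-and-set."""
    def __init__(self):
        self.value = None

    def compare_and_set(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False

lock = Register()
# Two clients race to acquire the lock; CAS guarantees exactly one winner.
print(lock.compare_and_set(None, "client-1"))  # True: acquired
print(lock.compare_and_set(None, "client-2"))  # False: already held
```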
&lt;p&gt;Note that only the linearizable atomic operations here require consensus, but the other features also make Zookeeper useful for coordination among distributed systems. Also note that an application developer will rarely interact with Zookeeper directly. It is often relied on indirectly via another project like Kafka, HBase, or Hadoop YARN.&lt;/p&gt;
&lt;h4 id=&quot;allocating-work-to-nodes&quot;&gt;Allocating work to nodes &lt;a class=&quot;direct-link&quot; href=&quot;#allocating-work-to-nodes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When new nodes join a partitioned cluster, some of the partitions need to be moved from existing nodes to the new ones in order to rebalance the load. Similarly, when nodes fail or are removed from the cluster, the partitions that they held have to be moved to the remaining nodes. Zookeeper can help to achieve tasks like this through the use of atomic operations, change notifications and ephemeral nodes.&lt;/p&gt;
&lt;p&gt;An important thing to note is that Zookeeper typically manages data that is quite &lt;em&gt;slow-changing&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;It represents information like “the node running on IP address 10.1.1.23 is the leader for partition 7,” and such assignments usually change on a timescale of minutes or hours.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 9615-9616). O&#39;Reilly Media. Kindle Edition.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It is not intended for storing the runtime state of an application, which is likely to change thousands or millions of times per second. Apache BookKeeper is a tool used to replicate runtime state to other nodes.&lt;/p&gt;
&lt;h4 id=&quot;service-discovery&quot;&gt;Service discovery &lt;a class=&quot;direct-link&quot; href=&quot;#service-discovery&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Service discovery is the process of finding out the IP address that you need to connect to in order to reach a service. Zookeeper, Consul, and etcd are often used for service discovery.&lt;/p&gt;
&lt;p&gt;The main idea is that services register their network endpoints in a service registry, from which they can be discovered by other services. The read requests for a service&#39;s endpoint do not need to be linearizable (DNS, the traditional method of retrieving the IP address for a service name, does not offer linearizable reads, in exchange for availability).&lt;/p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;This will be the last set of notes I&#39;ll post from the book in a while. The last section of the book is on &amp;quot;Derived Data&amp;quot; and is a lot more practical than the theory we have discussed so far. In the meantime, I intend to post another set of notes from &lt;a href=&quot;https://pdos.csail.mit.edu/6.824/index.html&quot;&gt;this&lt;/a&gt; course which I will be starting soon.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Last updated on 19-06-2020 to fix a few typos.&lt;/em&gt;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Chapter 9 - Consistency and Consensus (Part One)</title>
		<link href="https://timilearning.com/posts/ddia/part-two/chapter-9-1/"/>
		<updated>2020-03-14T22:19:35-00:00</updated>
		<id>https://timilearning.com/posts/ddia/part-two/chapter-9-1/</id>
		<content type="html">&lt;p&gt;Notes from Chapter 9 of Martin Kleppmann&#39;s &#39;Designing Data-Intensive Applications&#39; book. This chapter is split into two parts.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In this chapter, we focus on some of the abstractions that applications can rely on in building fault-tolerant distributed systems. One of these is Consensus. Once there&#39;s a consensus implementation, applications can use it for things like leader election and state machine replication.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#consistency-guarantees&quot;&gt;Consistency Guarantees&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#linearizability&quot;&gt;Linearizability&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#linearizability-vs-serializability&quot;&gt;Linearizability vs Serializability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#relying-on-linearizability&quot;&gt;Relying on Linearizability&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#locking-and-leader-election&quot;&gt;Locking and leader election&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#constraints-and-uniqueness-guarantees&quot;&gt;Constraints and uniqueness guarantees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#implementing-linearizable-systems&quot;&gt;Implementing Linearizable Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-cost-of-linearizability&quot;&gt;The Cost of Linearizability&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#the-cap-theorem&quot;&gt;The CAP Theorem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#linearizability-and-network-delays&quot;&gt;Linearizability and network delays&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#ordering-guarantees&quot;&gt;Ordering Guarantees&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#ordering-and-causality&quot;&gt;Ordering and Causality&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#the-causal-order-is-not-a-total-order&quot;&gt;The causal order is not a total order&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#linearizability-is-stronger-than-causal-consistency&quot;&gt;Linearizability is stronger than causal consistency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#capturing-causal-dependencies&quot;&gt;Capturing causal dependencies&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sequence-number-ordering&quot;&gt;Sequence Number Ordering&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#noncausal-sequence-number-generators&quot;&gt;Noncausal sequence number generators&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#lamport-timestamps&quot;&gt;Lamport Timestamps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#timestamp-ordering-is-not-sufficient&quot;&gt;Timestamp ordering is not sufficient&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;consistency-guarantees&quot;&gt;Consistency Guarantees &lt;a class=&quot;direct-link&quot; href=&quot;#consistency-guarantees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;We&#39;ve discussed &lt;em&gt;eventual consistency&lt;/em&gt; in some of the earlier chapters as one of the consistency guarantees provided by some applications. It means that even though there might be delays in replicating data across multiple nodes, the data will &lt;em&gt;eventually&lt;/em&gt; get to all nodes.&lt;/p&gt;
&lt;p&gt;However, it is a very weak guarantee, as it doesn&#39;t say &lt;em&gt;when&lt;/em&gt; the replicas will converge; it just says that they will.&lt;/p&gt;
&lt;p&gt;There are stronger consistency guarantees that can be provided, which we&#39;ll touch on in this chapter, but these come at a cost. These stronger guarantees often have worse performance or are less fault-tolerant than systems with weaker guarantees.&lt;/p&gt;
&lt;h3 id=&quot;linearizability&quot;&gt;Linearizability &lt;a class=&quot;direct-link&quot; href=&quot;#linearizability&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The idea behind Linearizability is that the database should always appear as if there is only one copy of the data. This means that making the same request on multiple nodes should always give the same response as long as no update is made between those requests.&lt;/p&gt;
&lt;p&gt;It is also a &lt;em&gt;recency guarantee,&lt;/em&gt; meaning that the value read must be the most recent or up-to-date value, and is not from a stale cache. Basically, as soon as a client successfully completes a write, all other clients must see the value just written.&lt;/p&gt;
&lt;p&gt;If one client&#39;s read returns a new value, all subsequent reads must also return the new value.&lt;/p&gt;
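&lt;p&gt;That recency rule can be expressed as a check over the values that readers of a single key observe, in order (the trace format here is a toy of my own, not a full linearizability checker):&lt;/p&gt;

```python
# Sketch: the recency rule for reads of a single key. Once any client has
# observed the new value, no later read may return the old one.

def respects_recency(reads, old, new):
    """reads: the values returned by successive reads, in real-time order."""
    seen_new = False
    for value in reads:
        if value == new:
            seen_new = True
        elif seen_new and value == old:
            return False    # a stale read after the new value was observed
    return True

print(respects_recency(["old", "new", "new"], "old", "new"))  # True
print(respects_recency(["new", "old"], "old", "new"))         # False
```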
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: In the book, Linearizability is said to also be known as atomic consistency, strong consistency, immediate consistency or external consistency. However, &lt;a href=&quot;https://cloud.google.com/spanner/docs/true-time-external-consistency&quot;&gt;Google&#39;s Cloud Spanner&lt;/a&gt; has a different idea and distinguishes between some of those terms. This distinction is explained in &lt;a href=&quot;https://timilearning.com/posts/consistency-models/&quot;&gt;another post&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&quot;linearizability-vs-serializability&quot;&gt;Linearizability vs Serializability &lt;a class=&quot;direct-link&quot; href=&quot;#linearizability-vs-serializability&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Linearizability is a recency guarantee on reads and writes of a single object. This guarantee does not group multiple operations together into a transaction (meaning it cannot protect against a problem like &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#write-skew-and-phantoms&quot;&gt;write skew&lt;/a&gt;, where a transaction makes a write based on a value it read earlier that has now been updated by another concurrently running transaction).&lt;/p&gt;
&lt;p&gt;Serializability is an isolation property of transactions that guarantees that transactions behave the same as if they had executed in &lt;em&gt;some&lt;/em&gt; serial order i.e. each transaction is completed before the next one starts. There is no guarantee on &lt;em&gt;what&lt;/em&gt; serial order these transactions appear to run in, all that matters is that it is a serial order.&lt;/p&gt;
&lt;p&gt;When a database provides both serializability and linearizability, the guarantee is known as &lt;em&gt;strict serializability&lt;/em&gt; or &lt;em&gt;strong one-copy serializability&lt;/em&gt;. I believe external consistency and strict serializability provide the same guarantees.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#two-phase-locking-(2pl)&quot;&gt;Two Phase-Locking&lt;/a&gt; and &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#actual-serial-execution&quot;&gt;Actual Serial Execution&lt;/a&gt; are implementations of serializability that are also linearizable. However, &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#serializable-snapshot-isolation&quot;&gt;serializable snapshot isolation&lt;/a&gt; is not linearizable, since a transaction will be reading values from a consistent snapshot.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Question&lt;/em&gt;: If serializable snapshot isolation is well implemented by ensuring that it detects writes in a transaction that may affect prior reads (from a consistent snapshot) or that it detects stale reads, wouldn&#39;t that make it linearizable as one of these transactions will be aborted and will thus preserve the recency guarantee?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Answer&lt;/em&gt;: I&#39;m guessing the risk here is that the stale read might have returned a value now being used outside of the database, which then violates the linearizability guarantee.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Note that a stale read is &lt;em&gt;not&lt;/em&gt; a violation of serializability, see &lt;a href=&quot;https://fauna.com/blog/serializability-vs-strict-serializability-the-dirty-secret-of-database-isolation-levels&quot;&gt;here&lt;/a&gt;. If Transactions A &amp;amp; B are concurrent and Transaction A commits before Transaction B, serializability is still preserved if the database makes it look like the operations in Transaction B happened before those in Transaction A. The key thing is that the transactions appear to be executed one after the other.&lt;/p&gt;
&lt;h4 id=&quot;relying-on-linearizability&quot;&gt;Relying on Linearizability &lt;a class=&quot;direct-link&quot; href=&quot;#relying-on-linearizability&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;As good as linearizability is as a guarantee, it is not critical for all applications. However, there are examples of where linearizability is important for making a system work correctly, and we&#39;ll cover them here.&lt;/p&gt;
&lt;h5 id=&quot;locking-and-leader-election&quot;&gt;Locking and leader election &lt;a class=&quot;direct-link&quot; href=&quot;#locking-and-leader-election&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;A system with a single-leader replication model must ensure that there&#39;s only ever one leader at a time. One way to implement leader election is by using a lock. All the eligible nodes start up and try to acquire a lock and the successful one becomes the leader.&lt;/p&gt;
&lt;p&gt;This lock must be linearizable: once a node owns the lock, all the other nodes must see that it is that node that owns the lock.&lt;/p&gt;
&lt;p&gt;Apache ZooKeeper and etcd are often used to implement distributed locks and leader election.&lt;/p&gt;
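&lt;p&gt;As a toy sketch of why the lock must be linearizable (this is not how ZooKeeper or etcd actually implement locks), the essential requirement is an atomic compare-and-set against a single copy of the lock state:&lt;/p&gt;

```python
import threading

class LinearizableLock:
    """Toy model of a lock service: one copy of the state,
    updated atomically, so every node sees the same owner."""

    def __init__(self):
        self._owner = None
        self._mu = threading.Lock()

    def try_acquire(self, node_id):
        # atomic compare-and-set: succeed only if the lock is free
        with self._mu:
            if self._owner is None:
                self._owner = node_id
                return True
            return False

    def owner(self):
        with self._mu:
            return self._owner
```

&lt;p&gt;If two candidate nodes race to call &lt;code&gt;try_acquire&lt;/code&gt;, exactly one succeeds, and every subsequent read agrees on who the leader is.&lt;/p&gt;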
&lt;h5 id=&quot;constraints-and-uniqueness-guarantees&quot;&gt;Constraints and uniqueness guarantees &lt;a class=&quot;direct-link&quot; href=&quot;#constraints-and-uniqueness-guarantees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;When multiple users are concurrently trying to register a value that must be unique, each user can be thought of as acquiring a lock on that value. E.g. a username or email address system.&lt;/p&gt;
&lt;p&gt;We see similar issues in examples like ensuring that a bank account never goes negative, not selling more items than are available in stock, and not booking the same seat on a flight or in a theater for two people concurrently. For these constraints to be enforced properly, there needs to be a single up-to-date value (the account balance, the stock level, the seat occupancy) that all nodes agree on.&lt;/p&gt;
&lt;p&gt;However, note that some of these constraints can be treated loosely and are not always critical, so linearizability may not be needed.&lt;/p&gt;
&lt;h4 id=&quot;implementing-linearizable-systems&quot;&gt;Implementing Linearizable Systems &lt;a class=&quot;direct-link&quot; href=&quot;#implementing-linearizable-systems&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Seeing that linearizability means the system behaves as if there is only one copy of the data, the simplest way to implement it would be to actually have just one copy of the data. However, that won&#39;t be fault-tolerant: if the node holding the single copy fails, the data becomes unavailable.&lt;/p&gt;
&lt;p&gt;Since replication is the most common way to make a system fault-tolerant, we&#39;ll compare different replication methods here and discuss whether they can be made linearizable.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Single-leader replication (potentially linearizable):&lt;/em&gt; If we make every read from the leader or from synchronously updated followers, the system has the potential to be linearizable. However, there is no absolute guarantee as the system can still be non-linearizable either by design (because it uses snapshot isolation) or due to concurrency bugs.&lt;/p&gt;
&lt;p&gt;Using the leader for reads also assumes that we always know for certain who the leader is. Issues like split-brain mean that a single-leader system can still violate linearizability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Multi-leader replication (not linearizable):&lt;/em&gt; These systems are generally not linearizable since they can process writes concurrently, and the writes are typically replicated asynchronously to other nodes. This means that clients can see different values for a register (a single object) if they read from different nodes.
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Consensus algorithms (linearizable):&lt;/em&gt; We haven&#39;t dealt with these yet, but such systems are typically linearizable. They are similar to single-leader replication, but they contain additional measures to prevent stale replicas and split-brain. As a result, consensus protocols are used to implement linearizable storage safely. ZooKeeper and etcd work this way.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Leaderless replication (probably not linearizable):&lt;/em&gt; Recall from &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-5/#leaderless-replication&quot;&gt;Chapter 5&lt;/a&gt; that these systems typically require quorum reads and writes where w + r &amp;gt; n. While such quorums can be linearizable, they are almost certainly not under certain configurations, such as when &amp;quot;last write wins&amp;quot; conflict resolution based on time-of-day clocks is used: clock timestamps are not guaranteed to be consistent with the actual ordering of events, due to &lt;em&gt;clock skew&lt;/em&gt;. Non-linearizability is also almost guaranteed when &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-5/#sloppy-quorums-and-hinted-handoff&quot;&gt;sloppy quorums&lt;/a&gt; are used.&lt;/p&gt;
&lt;p&gt;Even with strict quorums, there is the possibility of non-linearizability due to concurrency bugs. If we have 3 nodes in a cluster and set w = 3 and r = 2, the quorum condition is met. However, if a client is writing to 3 nodes and two clients concurrently read from 2 of those 3 nodes, they may see different values for a register as a result of network delays in writing to all the nodes.&lt;/p&gt;
&lt;p&gt;However, it is possible to make these dynamo-style quorums linearizable at the cost of reduced performance. To do this, a reader must perform &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-5/#read-repair-and-anti-entropy&quot;&gt;read repair&lt;/a&gt; synchronously before returning results, and a writer must read the latest state of a quorum of nodes before sending its write.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
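&lt;p&gt;Here is a minimal sketch of that synchronous read-repair idea; the &lt;code&gt;Replica&lt;/code&gt; class and its versioning scheme are hypothetical stand-ins, not Dynamo&#39;s actual protocol:&lt;/p&gt;

```python
import operator

class Replica:
    """A hypothetical replica storing a single versioned register."""

    def __init__(self):
        self.version, self.value = 0, None

    def get(self):
        return (self.version, self.value)

    def put(self, version, value):
        # keep whichever version is newer
        if operator.gt(version, self.version):
            self.version, self.value = version, value

def quorum_read(replicas, r):
    # read from r replicas and take the highest-versioned response
    responses = [rep.get() for rep in replicas[:r]]
    version, value = max(responses)
    # synchronous read repair: bring stale replicas up to date
    # before returning the result to the client
    for rep in replicas[:r]:
        rep.put(version, value)
    return value
```

&lt;p&gt;In this sketch, the repair step is what restores the recency guarantee: after a read returns, a later read of the same quorum can no longer observe an older value.&lt;/p&gt;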
&lt;h4 id=&quot;the-cost-of-linearizability&quot;&gt;The Cost of Linearizability &lt;a class=&quot;direct-link&quot; href=&quot;#the-cost-of-linearizability&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;While linearizability is often desirable, the performance costs mean that it is not always an ideal guarantee.&lt;/p&gt;
&lt;p&gt;Consider a scenario where we have two data centers and there&#39;s a network interruption between those data centers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In a multi-leader database setup, the operations can continue in each data center normally since the writes can be queued up until the network link is restored and replication can happen asynchronously.
&lt;/li&gt;
&lt;li&gt;In a single-leader setup, the leader must be in one of the data centers. Therefore, clients connected to a follower data center will not be able to contact the leader and cannot make any writes, nor any linearizable reads (their reads will be stale if the leader keeps getting updated). An application that requires linearizable reads and writes will become unavailable in the data centers which cannot contact the leader.
&lt;/li&gt;
&lt;li&gt;Clients that can contact the leader data center directly will not witness any problems, since the application continues to work normally there.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;the-cap-theorem&quot;&gt;The CAP Theorem &lt;a class=&quot;direct-link&quot; href=&quot;#the-cap-theorem&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The CAP theorem is a popular theorem in Distributed Systems that is often misunderstood. It describes a trade-off in building distributed systems. In relation to the scenario above, this trade-off is as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If an application &lt;em&gt;requires&lt;/em&gt; linearizability and some replicas are disconnected from other replicas due to a network problem, then those replicas cannot process requests while they are disconnected: the replicas must either wait until the network problem is fixed or return an error. These replicas are then &lt;em&gt;unavailable.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;If the application &lt;em&gt;does not require&lt;/em&gt; linearizability, it can be written in a way that each replica can process requests independently even when disconnected from other replicas. Therefore, the application can remain available in the face of a network problem, but the behaviour is not &lt;em&gt;linearizable.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the original definition of the CAP Theorem, the behaviour described for Consistency is linearizability. Availability means that any non-failing node must return a response that contains the results of the requested work i.e., not a 500 error or a timeout message.&lt;/p&gt;
&lt;p&gt;Therefore, in the face of network partitions or faults, a system has to choose between either total availability or total linearizability. That&#39;s the CAP Theorem in simple terms.&lt;/p&gt;
&lt;p&gt;Applications that do not require linearizability are more tolerant of network problems since the nodes can continue to serve requests.&lt;/p&gt;
&lt;p&gt;Note that while the CAP Theorem has been useful, its definition is quite narrow in scope: it considers only linearizability as the consistency model and network partitions (i.e. nodes in a network disconnected from each other) as the only kind of fault, and it says nothing about network delays or dead nodes.&lt;/p&gt;
&lt;p&gt;You can read a critique of the CAP theorem in &lt;a href=&quot;https://arxiv.org/abs/1509.05393&quot;&gt;this&lt;/a&gt; article, which also proposes alternative ways to analyze systems.&lt;/p&gt;
&lt;h5 id=&quot;linearizability-and-network-delays&quot;&gt;Linearizability and network delays &lt;a class=&quot;direct-link&quot; href=&quot;#linearizability-and-network-delays&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Fault tolerance is not the only reason for dropping linearizability; performance is another reason why it sometimes gets dropped.&lt;/p&gt;
&lt;p&gt;Interestingly, RAM on a modern multi-core CPU is not linearizable. This means that if a thread running on one CPU core writes to a memory address, a thread on another CPU core is not guaranteed to read the latest value written (unless a fence or memory barrier is used).&lt;/p&gt;
&lt;p&gt;From &lt;a href=&quot;https://stackoverflow.com/questions/286629/what-is-a-memory-fence/286705#286705&quot;&gt;StackOverflow&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A memory fence/barrier is a class of instructions that mean memory reads/writes occur in the order you expect. For example a &#39;full fence&#39; means all reads/writes before the fence are committed before those after the fence.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This happens because every CPU core has its own memory cache and store buffer, and memory access goes to the cache by default. Changes are asynchronously written out to main memory. Accessing data in the cache is faster than going to the main memory, so this feature is useful for good &lt;em&gt;performance&lt;/em&gt; on modern CPUs.&lt;/p&gt;
&lt;p&gt;We can&#39;t say that this tradeoff was made for availability purposes, because we wouldn&#39;t expect one CPU core to continue to function properly while disconnected from the rest of the computer.&lt;/p&gt;
&lt;p&gt;Linearizability is always slow, not just during a network fault. There&#39;s a proof in &lt;a href=&quot;http://courses.csail.mit.edu/6.852/01/papers/p91-attiya.pdf&quot;&gt;this&lt;/a&gt; paper that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network.&lt;/p&gt;
&lt;p&gt;The response time will certainly be high in networks with highly variable delays. Weaker consistency models can be much faster than linearizability and as it is with everything, there&#39;s always a tradeoff.&lt;/p&gt;
&lt;p&gt;In Chapter 12, there are some approaches suggested for avoiding linearizability without sacrificing correctness.&lt;/p&gt;
&lt;h2 id=&quot;ordering-guarantees&quot;&gt;Ordering Guarantees &lt;a class=&quot;direct-link&quot; href=&quot;#ordering-guarantees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Ordering has been mentioned a lot in this book because it is such a fundamental idea in distributed systems. Some of the contexts in which we&#39;ve discussed it so far are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For single-leader replication. The main purpose of the leader is to determine the &lt;em&gt;order of writes&lt;/em&gt; in the replication log i.e. the order in which followers apply writes. Without a single leader, we can have conflicts due to concurrent operations.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7/#serializability&quot;&gt;Serializability&lt;/a&gt;: Serializability is about ensuring that transactions behave as if they were executed in &lt;em&gt;some sequential order.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Timestamps and clocks in distributed systems are an attempt to introduce order into a disorderly world e.g. to determine which one of two writes happened later.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;ordering-and-causality&quot;&gt;Ordering and Causality &lt;a class=&quot;direct-link&quot; href=&quot;#ordering-and-causality&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;One of the reasons why ordering keeps coming up is that it helps preserve &lt;em&gt;causality.&lt;/em&gt; With causality, an ordering of events is guaranteed such that cause always comes before effect. If one event happened before another, causality ensures that this relationship is captured, i.e. the &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-5/#the-%22happens-before%22-relationship-and-concurrency&quot;&gt;happens-before relationship&lt;/a&gt;. This matters because if one event happens as a result of another and that order is not captured, the system can end up in an inconsistent state. Some examples of this are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If a question leads to an answer, then an observer should not see the answer before the question.
&lt;/li&gt;
&lt;li&gt;When a row is first created and then updated, a replica should not see the instruction to update the row before the creation instruction.
&lt;/li&gt;
&lt;li&gt;When we discussed snapshot isolation for transactions, we mentioned that the idea is for a transaction to read from a consistent snapshot. Consistent here means &lt;em&gt;consistent with causality&lt;/em&gt; i.e. when we read from a snapshot, the effects of all the operations that happened &lt;em&gt;causally&lt;/em&gt; before the snapshot was taken are visible in that snapshot, but no operations that happened causally afterward can be seen.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A system that obeys the ordering imposed by causality is said to be &lt;em&gt;causally consistent.&lt;/em&gt; For example, snapshot isolation provides causal consistency, since when you read some data from it, you must also be able to see any data that causally precedes it (assuming it has not been deleted within the transaction).&lt;/p&gt;
&lt;h5 id=&quot;the-causal-order-is-not-a-total-order&quot;&gt;The causal order is not a total order &lt;a class=&quot;direct-link&quot; href=&quot;#the-causal-order-is-not-a-total-order&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;If elements are in a &lt;em&gt;total order,&lt;/em&gt; it means that they can always be compared. That is, with any two elements, you can always see which one is greater and which is smaller.&lt;/p&gt;
&lt;p&gt;With a &lt;em&gt;partial order&lt;/em&gt;, we can sometimes compare the elements and say which is bigger or smaller, but in other cases the elements are incomparable. For example, mathematical sets are not totally ordered: you can&#39;t compare {&lt;em&gt;a, b&lt;/em&gt;} with {&lt;em&gt;b, c&lt;/em&gt;}.&lt;/p&gt;
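&lt;p&gt;The set example can be checked directly, using subset inclusion as the ordering on sets:&lt;/p&gt;

```python
a, b = {'a', 'b'}, {'b', 'c'}

# neither set contains the other, so the two are incomparable:
# subset inclusion is only a partial order
assert not a.issubset(b)
assert not b.issubset(a)

# numbers, by contrast, are totally ordered: any two can be compared
assert max(3, 5) == 5
```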
&lt;p&gt;This difference between total order and a partial order is reflected when we compare Linearizability and Causality as consistency models:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Linearizability&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We have a &lt;em&gt;total order&lt;/em&gt; of operations in a linearizable system. If the system behaves as if there is only one copy of the data, and every operation is atomic (meaning we can always point to before and after that operation), then we can always say which operation happened first.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Causality&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Two operations are ordered if they are causally related (i.e. we can say which happened before the other), but are incomparable if they are concurrent. With concurrent operations, we can&#39;t say that one happened before the other.&lt;/p&gt;
&lt;p&gt;This definition means that &lt;em&gt;there is no concurrency in a linearizable database&lt;/em&gt;. We can always say which operations happened before the other.&lt;/p&gt;
&lt;p&gt;The version history of a system like Git is similar to a graph of causal dependencies. One commit often happens after another, but sometimes they branch off, and we create merges when those concurrently created commits are combined.&lt;/p&gt;
&lt;h5 id=&quot;linearizability-is-stronger-than-causal-consistency&quot;&gt;Linearizability is stronger than causal consistency &lt;a class=&quot;direct-link&quot; href=&quot;#linearizability-is-stronger-than-causal-consistency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The relationship between linearizability and causal order is that linearizability &lt;em&gt;implies&lt;/em&gt; causality. Any system that is linearizable will preserve causality out of the box.&lt;/p&gt;
&lt;p&gt;This is part of what makes linearizable systems easy to understand. However, given the cost of linearizability that we&#39;ve discussed above, many distributed systems have dropped linearizability.&lt;/p&gt;
&lt;p&gt;Fortunately, linearizability is not the only way of preserving causality. &lt;a href=&quot;https://jepsen.io/consistency/models/causal&quot;&gt;Causal consistency&lt;/a&gt; is actually the strongest possible consistency model that does not slow down due to network delays, and also remains available in the face of network failures. The caveat here is that in the face of network failures, clients must stick to the same server, given that the server captures the effect of all operations that happened causally before the partition.&lt;/p&gt;
&lt;h5 id=&quot;capturing-causal-dependencies&quot;&gt;Capturing causal dependencies &lt;a class=&quot;direct-link&quot; href=&quot;#capturing-causal-dependencies&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;&lt;em&gt;Causal consistency captures the notion that causally-related operations should appear in the same order on all processes—though processes may disagree about the order of causally independent operations - Jepsen&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For a causally consistent database, when a replica processes an operation, it needs to ensure that all the operations that happened before it have already been processed; if a preceding operation is missing, the system must hold off on processing the later one until the preceding operation has been processed.&lt;/p&gt;
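&lt;p&gt;A sketch of that &amp;quot;hold off until the dependencies are processed&amp;quot; rule, using explicit operation IDs and dependency lists as stand-ins for whatever metadata a real system would track:&lt;/p&gt;

```python
class CausalReplica:
    """Applies an operation only after everything it causally
    depends on has already been applied."""

    def __init__(self):
        self.applied = set()
        self.pending = []

    def deliver(self, op_id, deps, apply_fn):
        self.pending.append((op_id, deps, apply_fn))
        self._drain()

    def _drain(self):
        # keep sweeping the pending list until no more operations
        # become applicable
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                op_id, deps, apply_fn = entry
                if set(deps).issubset(self.applied):
                    apply_fn()
                    self.applied.add(op_id)
                    self.pending.remove(entry)
                    progress = True
```

&lt;p&gt;If the answer to a question arrives first, it sits in &lt;code&gt;pending&lt;/code&gt; until the question itself has been applied.&lt;/p&gt;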
&lt;p&gt;The hard part is determining how to describe the &amp;quot;knowledge&amp;quot; of a node in a system. If a node had seen the value of X when it issued the write Y, X and Y must be causally related.&lt;/p&gt;
&lt;p&gt;We discussed &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-5/#detecting-concurrent-writes&quot;&gt;&#39;Detecting Concurrent Writes&#39;&lt;/a&gt; earlier where we focused on causality in a leaderless datastore and detecting concurrent writes to the same key in order to prevent lost updates. For causal consistency though, we need to go beyond just keeping track of a single key, but instead tracking causal dependencies across the entire database.&lt;/p&gt;
&lt;p&gt;To determine causal ordering, the database needs to keep track of which version of the data was read by an application.&lt;/p&gt;
&lt;h4 id=&quot;sequence-number-ordering&quot;&gt;Sequence Number Ordering &lt;a class=&quot;direct-link&quot; href=&quot;#sequence-number-ordering&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A good way of keeping track of causal dependencies in a database is by using &lt;em&gt;sequence numbers&lt;/em&gt; or &lt;em&gt;timestamps&lt;/em&gt; to order the events. This timestamp can be a &lt;em&gt;logical clock&lt;/em&gt; which is an algorithm that generates monotonically increasing numbers for each operation. These sequence numbers provide a total order meaning that if we have two sequence numbers, we can always determine which is greater.&lt;/p&gt;
&lt;p&gt;The important thing is to create sequence numbers in a total order that is &lt;em&gt;consistent with causality&lt;/em&gt; meaning that if operation A causally happened before B, then the sequence number for A must be lower than that of B. We can order concurrent operations arbitrarily.&lt;/p&gt;
&lt;p&gt;With single-leader databases, the replication log defines a total order of write operations that is consistent with causality. Here, the leader can assign a monotonically increasing sequence number to each operation in the log. A follower that applies the writes in the order they appear in the replication log will always be in a causally consistent state.&lt;/p&gt;
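&lt;p&gt;Reduced to a sketch, the leader&#39;s role here is just to stamp log entries:&lt;/p&gt;

```python
class Leader:
    """Assigns a monotonically increasing sequence number to
    every write appended to the replication log."""

    def __init__(self):
        self.seq = 0
        self.log = []

    def append(self, op):
        self.seq += 1
        self.log.append((self.seq, op))
        return self.seq
```

&lt;p&gt;A follower that applies entries in sequence-number order can never see an update to a row before the write that created it.&lt;/p&gt;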
&lt;h5 id=&quot;noncausal-sequence-number-generators&quot;&gt;Noncausal sequence number generators &lt;a class=&quot;direct-link&quot; href=&quot;#noncausal-sequence-number-generators&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;In a multi-leader or leaderless database, generating sequence numbers for operations can be done in different ways such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ensuring that each node generates an independent set of sequence numbers e.g. if we have two nodes, one node can generate even numbers while the other can generate odd numbers.&lt;/li&gt;
&lt;li&gt;A timestamp from a time-of-day clock can be attached to each operation. We&#39;ve &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8/#timestamps-for-ordering-events.&quot;&gt;discussed&lt;/a&gt; previously why this is unreliable.&lt;/li&gt;
&lt;li&gt;We can preallocate blocks of sequence numbers. E.g node A could claim a block of numbers from 1 to 1000, and node B could claim the block from 1001 to 2000.&lt;/li&gt;
&lt;/ul&gt;
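&lt;p&gt;The first option is a one-liner in Python, with each node drawing from its own residue class of integers:&lt;/p&gt;

```python
import itertools

def node_counter(node_id, num_nodes):
    # each node hands out its own arithmetic progression: with two
    # nodes, node 0 yields 0, 2, 4, ... and node 1 yields 1, 3, 5, ...
    return itertools.count(start=node_id, step=num_nodes)
```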
&lt;p&gt;However, while these options perform better than pushing all operations through a single leader that increments a counter, the problem is that the sequence numbers they generate are &lt;em&gt;not consistent with causality&lt;/em&gt;: they do not capture ordering across different nodes.&lt;/p&gt;
&lt;p&gt;If we used the third option, for example, an operation numbered at 1100 on node B could have happened before operation 50 on node A if they process a different number of operations per second. There is no way to capture that using these methods.&lt;/p&gt;
&lt;h5 id=&quot;lamport-timestamps&quot;&gt;Lamport Timestamps &lt;a class=&quot;direct-link&quot; href=&quot;#lamport-timestamps&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;This is one of the most important topics in the field of distributed systems. It’s a simple method for generating sequence numbers across multiple nodes that &lt;em&gt;is&lt;/em&gt; consistent with causality.&lt;/p&gt;
&lt;p&gt;The idea here is that each node has a unique identifier, and also keeps a counter of the number of operations it has processed. The Lamport timestamp is then a pair of (counter, nodeID). Multiple nodes can have the same counter value, but including the node ID in the timestamp makes it unique.&lt;/p&gt;
&lt;p&gt;Lamport timestamps provide a total ordering: if there are two timestamps, the one with the greater counter value is the greater timestamp; if the counter values are the same, then we pick the one with the greater node ID as the greater timestamp.&lt;/p&gt;
&lt;p&gt;Quoting the book, what makes Lamport timestamps consistent with causality is the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With every operation, the node increases the maximum counter value it has seen by 1.&lt;/p&gt;
&lt;p&gt;Consider the diagram below:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://timilearning.com/uploads/lamport.png&quot; alt=&quot;Lamport Timestamp Illustration&quot;&gt;
&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
Figure 1 - Lamport Timestamps Illustration.
&lt;/p&gt;
&lt;p&gt;In this figure, the nodes and clients initially have a counter value of 0:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When client A first writes to node 1, node 1 increases its counter value to 1. (1, 1)
&lt;/li&gt;
&lt;li&gt;Client A then writes to node 2, providing node 2 with its counter value of 1. Node 2&#39;s current counter value is 0, so it first sets its value to 1, increases it to 2 for the new operation and then returns the new value.  (2, 2)
&lt;/li&gt;
&lt;li&gt;When client B sends its request to node 2, node 2 has a greater counter value than client B, so it increases its current value to 3 for the new operation and returns it to client B. (3, 2)
&lt;/li&gt;
&lt;li&gt;Finally, client A writes to node 1 and that returns a new counter value. (3, 1)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A possible ordering of these operations is (1,1) -&amp;gt; (2, 2) -&amp;gt; (3, 1) -&amp;gt; (3,2), if in the case of the same counter value, our ordering scheme gives precedence to the node with the lower ID value.&lt;/p&gt;
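&lt;p&gt;The rules above are short enough to sketch directly; this toy &lt;code&gt;LamportClock&lt;/code&gt; replays the scenario in Figure 1:&lt;/p&gt;

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # a local operation: increment the counter and stamp the
        # operation with (counter, node ID)
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, seen_counter):
        # a request arrived carrying the sender's maximum counter:
        # jump to the larger of the two, then tick for this operation
        self.counter = max(self.counter, seen_counter)
        return self.tick()
```

&lt;p&gt;Replaying the figure, client A starts with a maximum of 0, carries forward each returned counter, and client B starts fresh at node 2. Sorting the resulting (counter, node ID) pairs yields exactly the total order described above: (1, 1), (2, 2), (3, 1), (3, 2).&lt;/p&gt;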
&lt;p&gt;This ordering showcases a limitation of Lamport timestamps. Even though operation (3,2) appears to complete before (3,1), the ordering does not reflect that.&lt;/p&gt;
&lt;p&gt;The fact that those two have the same counter value means that they are concurrent and the operations do not know about each other, but Lamport timestamps must enforce a total ordering. With the ordering from Lamport timestamps, &lt;em&gt;you cannot tell whether two operations are concurrent or whether they are causally dependent.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Basically, if two events are causally related, the Lamport timestamp ordering will always obey causality. But if one event appears before another in the ordering, it does not mean that they are causally related.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://martinfowler.com/articles/patterns-of-distributed-systems/version-vector.html&quot;&gt;Version Vectors&lt;/a&gt; can help distinguish whether two operations are concurrent or whether one causally depends on the other, but Lamport timestamps have the advantage that they are more compact.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Aside&lt;/em&gt;: I think Lamport timestamp ordering is also sufficient to provide &lt;a href=&quot;https://jepsen.io/consistency/models/sequential&quot;&gt;sequential consistency&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Further Reading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://deque.blog/2018/09/13/distributed-agreement-on-random-order-fun-with-lamport-timestamps/&quot;&gt;Distributed Agreement on Random Order – Fun with Lamport Timestamps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://sergeiturukin.com/2017/06/29/eventual-consistency.html&quot;&gt;Consistency, causal and eventual&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;timestamp-ordering-is-not-sufficient&quot;&gt;Timestamp ordering is not sufficient &lt;a class=&quot;direct-link&quot; href=&quot;#timestamp-ordering-is-not-sufficient&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Although Lamport timestamps are great for defining a total order that is consistent with causality, they do not solve some common problems in distributed systems.&lt;/p&gt;
&lt;p&gt;The key thing to note here is that they only define a total order of operations &lt;em&gt;after&lt;/em&gt; you have collected all the operations. If a node needs to decide &lt;em&gt;right now&lt;/em&gt; whether to accept an operation, it might need to check with every other node that no concurrently executing operation could affect its decision. Any of the other nodes being down would bring the system to a halt, which is bad for fault tolerance.&lt;/p&gt;
&lt;p&gt;For example, if two users concurrently try to create an account with the same username, only one of them should succeed. It might seem as though we could simply pick the one with the lower timestamp as the winner and let the one with the greater timestamp fail. However, if a node needs to decide &lt;em&gt;right now&lt;/em&gt;, it might simply not be aware that another node is in the process of concurrently creating an account, or might not know what timestamp the other node may assign to the operation.&lt;/p&gt;
&lt;p&gt;It&#39;s not enough to have a total ordering of operations; it&#39;s also important to know &lt;em&gt;when&lt;/em&gt; that order is finalized, i.e. what the order is at each point in time.&lt;/p&gt;
&lt;p&gt;In the &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-2/&quot;&gt;second part&lt;/a&gt; of these notes, we&#39;ll look at ways to solve the challenge of knowing the order of operations at each point in time.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Last Updated: 15-12-2022.&lt;/em&gt;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Data Storage on Your Computer&#39;s Disk - Part 2; On Indexes</title>
		<link href="https://timilearning.com/posts/data-storage-on-disk/part-two/"/>
		<updated>2020-02-09T16:52:25-00:00</updated>
		<id>https://timilearning.com/posts/data-storage-on-disk/part-two/</id>
		<content type="html">&lt;p&gt;This is the second part of the series on &#39;Data Storage on Disk&#39;. This post will focus on database indexes and the underlying data structure of in many relational databases today: &lt;em&gt;B-Trees.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#recap-of-previous-post.&quot;&gt;Recap of Previous Post.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#heaps&quot;&gt;Heaps&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#navigating-through-a-heap&quot;&gt;Navigating Through a Heap&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#indexes&quot;&gt;Indexes&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#b-trees&quot;&gt;B-Trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#%22so-how-does-the-index-relate-to-the-actual-data-being-stored%3F%22&quot;&gt;&amp;quot;So how does the index relate to the actual data being stored?&amp;quot;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-amplification-in-b-trees&quot;&gt;Write Amplification in B-Trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#b%2B-trees&quot;&gt;B+ Trees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#on-heaps&quot;&gt;On Heaps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#on-indexes-and-b-trees&quot;&gt;On Indexes and B-Trees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;recap-of-previous-post.&quot;&gt;Recap of Previous Post. &lt;a class=&quot;direct-link&quot; href=&quot;#recap-of-previous-post.&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In the &lt;a href=&quot;https://timilearning.com/posts/data-storage-on-disk/part-one/&quot;&gt;previous post&lt;/a&gt;, we learned that rows in your database table are mapped as &lt;em&gt;records&lt;/em&gt; internally, and those records are organized into &lt;em&gt;pages&lt;/em&gt; on your computer&#39;s disk. We also learned about Temporal and Spatial Locality, which are two principles that help determine how data pages are loaded from disk into the in-memory cache.&lt;/p&gt;
&lt;p&gt;We concluded by learning about write-ahead logs, and why they are particularly useful in the context of database transactions.&lt;/p&gt;
&lt;p&gt;In this post, we&#39;ll dive even deeper and learn about how records are organized within pages, how pages are organized within files on disk, and how that organization makes it faster to search for records in your database. In short, we&#39;ll learn what database indexes are.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In many relational databases today, pages belonging to a table can be organized on your computer&#39;s disk in two main ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;As a Heap (a table without a Clustered Index - &lt;em&gt;we&#39;ll learn what this means soon&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;As a Clustered Index&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;heaps&quot;&gt;Heaps &lt;a class=&quot;direct-link&quot; href=&quot;#heaps&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When the pages of a table are in a heap, it means that they are structured in the database without any &lt;em&gt;logical order&lt;/em&gt;. The records that belong to the pages in the heap also have no logical order. A row inserted in a heap table is &#39;heaped&#39; on the existing rows until a page gets filled up.&lt;/p&gt;
&lt;p&gt;To make this clearer, let&#39;s look at the image below which represents a Person table structured as a heap, with each row containing values for the Name, Age and Location columns. In this example, there is no obvious sort order of the rows. A table with a logical sort order would have these records sorted either alphabetically by the Name or Location columns, or numerically by the Age column.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://timilearning.com/uploads/new-heap-2.png&quot; alt=&quot;Representation of records in a Heap table&quot;&gt;
&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
Figure 1 - Representation of records in a Heap table.
&lt;/p&gt;
&lt;h4 id=&quot;navigating-through-a-heap&quot;&gt;Navigating Through a Heap &lt;a class=&quot;direct-link&quot; href=&quot;#navigating-through-a-heap&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;With no logical order to how records are arranged, it becomes harder to locate individual records in a heap. Records are identified by a row ID (RID), a pointer made up of the file ID, page ID and slot ID where the record is located, represented as (FileID:PageID:SlotID).&lt;/p&gt;
&lt;p&gt;To keep track of the location of data pages which belong to a heap table, SQLServer uses a special kind of page known as an Index Allocation Map (IAM) page. An IAM page can manage data of only one heap table, and each heap table in a database is assigned at least one IAM page.&lt;/p&gt;
&lt;p&gt;An IAM page tracks about 4GB of data which is equivalent to ~500K pages. If the size of the heap table data exceeds 4GB, another IAM page is allocated to the heap table to track the next 4GB of data. The first IAM page then contains a pointer to the next one.&lt;/p&gt;
&lt;p&gt;Note that an IAM page does not track individual data pages. Instead, it divides the pages into groups of eight known as &lt;em&gt;extents.&lt;/em&gt; An extent is the smallest unit that an IAM page will track. This abstraction makes it easier to manage the pages.&lt;/p&gt;
&lt;p&gt;To scan a heap table, SQLServer will point to the first IAM page of the table and then scan each data page in each extent that the IAM page tracks.&lt;/p&gt;
&lt;p&gt;The IAM page is useful here because it links the data pages of a heap table: since the data pages themselves contain no links to one another, the IAM page is the only way to navigate between the pages in the table.&lt;/p&gt;
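&lt;p&gt;As a rough sketch (in Python, with illustrative names and sizes, not SQLServer&#39;s actual on-disk format), the IAM chain and a heap scan look something like this:&lt;/p&gt;

```python
# Sketch of IAM-style tracking: data pages are grouped into extents of
# eight, and the IAM page records whole extents, not individual pages.
EXTENT_SIZE = 8
data_pages = list(range(1, 25))  # 24 data page IDs
extents = [data_pages[i:i + EXTENT_SIZE]
           for i in range(0, len(data_pages), EXTENT_SIZE)]

# IAM pages are chained: a second IAM page would track the next ~4GB.
iam_page = {"extents": extents, "next_iam": None}

def scan_heap(iam):
    # A heap scan visits every data page of every extent each IAM tracks.
    while iam is not None:
        for extent in iam["extents"]:
            yield from extent
        iam = iam["next_iam"]

scanned = list(scan_heap(iam_page))  # all 24 pages, in tracking order
```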
&lt;p&gt;To find any row stored in a heap in the absence of an &lt;em&gt;index&lt;/em&gt; (I know I&#39;m getting ahead of myself here), the entire heap table must be scanned row-by-row. This is &lt;em&gt;less than ideal&lt;/em&gt; for a large table.&lt;/p&gt;
&lt;p&gt;In addition, when you make a query for a set of records in a heap table, there is no guaranteed order for the set of results because there is no order to how they are stored.&lt;/p&gt;
&lt;p&gt;Note that records are initially stored in insertion order, but they can later be moved around within the heap to store them more efficiently (e.g. if a record is updated and needs more space than it was initially allocated), so the storage order is unpredictable.&lt;/p&gt;
&lt;p&gt;When a record is moved around in a heap, a &lt;em&gt;forwarding pointer&lt;/em&gt; is usually left behind in its old location. This pointer serves to redirect any references made to the record in its old position to the new position.&lt;/p&gt;
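&lt;p&gt;A small sketch of that redirection (in Python; the RID layout and helper names are made up for illustration):&lt;/p&gt;

```python
# A heap modelled as a map from RID (FileID, PageID, SlotID) to record.
heap = {
    (1, 4, 0): {"row": ("Ada", 36, "London")},
    (1, 4, 1): {"row": ("Bayo", 28, "Lagos")},
}

def move_record(heap, old_rid, new_rid):
    # Relocate a record, leaving a forwarding pointer at the old RID so
    # existing references to it stay valid.
    heap[new_rid] = heap.pop(old_rid)
    heap[old_rid] = {"forward_to": new_rid}

def lookup(heap, rid):
    entry = heap[rid]
    while "forward_to" in entry:  # follow any forwarding pointers
        entry = heap[entry["forward_to"]]
    return entry["row"]

# The record grew after an update and was moved to a new page...
move_record(heap, (1, 4, 1), (1, 9, 0))
# ...but a lookup through the old RID still finds it.
row = lookup(heap, (1, 4, 1))
```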
&lt;p&gt;We&#39;ll briefly resume the discussion on Heap Tables once we&#39;ve covered Clustered and Non-Clustered indexes below.&lt;/p&gt;
&lt;h3 id=&quot;indexes&quot;&gt;Indexes &lt;a class=&quot;direct-link&quot; href=&quot;#indexes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;So far, we have learned that data rows in your database table are stored as records on disk and organized into data pages. Groups of these pages then make up a table.&lt;/p&gt;
&lt;p&gt;We&#39;ve also briefly covered the heap table as one way that pages are organized for a table. When discussing heaps, we saw that the lack of a logical order in arranging the records on a page (and the pages on the disk) makes it more difficult to search for a particular record in your table. You would have to scan each row one by one.&lt;/p&gt;
&lt;p&gt;Now, what if there was a way to organize your records on a page, and your pages on the disk, to make it easy to find a record or group of records? Well, of course there is: enter &lt;em&gt;Indexes.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;An index in a database is similar to an index at the back of a textbook. If you&#39;re looking for a particular topic in your textbook and you&#39;re not sure of what page it&#39;s covered on, you would use the index at the back of the book because you know it&#39;s sorted alphabetically and it makes finding stuff easier.&lt;/p&gt;
&lt;p&gt;In databases, an index works the same way, except that the &#39;topic&#39; you&#39;re searching for in your textbook is now analogous to a &#39;value&#39; in your database column. When you create an index on a column, you are making all the values on that column ordered in a way that makes it easier to access each one.&lt;/p&gt;
&lt;p&gt;To understand how indexes work, let&#39;s talk about a data structure that is the backbone of many relational databases today known as the &lt;strong&gt;B-Tree.&lt;/strong&gt;&lt;/p&gt;
&lt;h4 id=&quot;b-trees&quot;&gt;B-Trees &lt;a class=&quot;direct-link&quot; href=&quot;#b-trees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;For the unfamiliar, a &lt;a href=&quot;https://en.wikipedia.org/wiki/Tree_(data_structure)&quot;&gt;Tree&lt;/a&gt; is a data structure made up of elements known as &lt;em&gt;nodes.&lt;/em&gt; There is a node at the top known as the &lt;em&gt;root&lt;/em&gt; node. The root node can act as a &lt;em&gt;parent&lt;/em&gt; to other nodes known as its &lt;em&gt;children.&lt;/em&gt; Each child node can then be a parent to other nodes in the tree. This cycle continues until we get to the &lt;em&gt;leaf&lt;/em&gt; nodes. A leaf node is a node without any children. In the diagram below, Node A is the root node and Nodes D, E and F are the leaf nodes in that tree. Nodes B and C are what are known as &lt;em&gt;intermediate&lt;/em&gt; nodes.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/binaryTree.png&quot; alt=&quot;Tree Data Structure&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 2 - Tree Data Structure &lt;sup&gt;[1]&lt;/sup&gt;. &lt;/p&gt;
&lt;p&gt;A B-Tree is a special kind of tree that is widely used in implementing database indexes. Some of the properties which make it specific to this use case are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each node in a B-Tree is a database page (could be a data page or an index page).&lt;/li&gt;
&lt;li&gt;Each page is responsible for keeping track of an ordered range of keys. The keys correspond to different values for a column (or set of columns) that we create an index on and are typically arranged in ascending order.&lt;/li&gt;
&lt;li&gt;A page contains either the row for a specific key in its range or a reference to a child page where the key can be found.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In case you&#39;re wondering, no one knows for sure what the &#39;B&#39; in B-Tree stands for (except maybe the creators).&lt;/p&gt;
&lt;p&gt;Let&#39;s look at the diagram below to make this clearer:&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/new-btree.png&quot; alt=&quot;Representation of records in a B-Tree&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 3 - Representation of records in a B-Tree. &lt;/p&gt;
&lt;p&gt;This diagram is an example representing an index on a &#39;name&#39; column in a table. Suppose you want to retrieve the row which has the column&#39;s value as &#39;Dan&#39;. A search on this tree will proceed as follows:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step One&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Starting from the root page of the tree, we see that it contains the record for the name &#39;Bayo&#39; and references (&lt;em&gt;ref)&lt;/em&gt; to child pages for other records. If we were searching for &#39;Bayo&#39;, the search would end here! But we&#39;re not, and so the search continues.&lt;/p&gt;
&lt;p&gt;An interesting property of this tree is that child pages to the left of the current page contain keys which are less than the smallest key on the current page, and right side child page keys are greater than or equal to the largest key on the current page.&lt;/p&gt;
&lt;p&gt;With that in mind, &#39;Dan&#39; comes after &#39;Bayo&#39; in alphabetical order, and so we look at the right side child page.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step Two&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;On the next level of the tree, we see that &#39;Ben&#39; and &#39;Dele&#39; are the only name records located on the right side child page of the root. However, the page contains references to the names that come before &#39;Ben&#39;, fall between &#39;Ben&#39; and &#39;Dele&#39;, and come after &#39;Dele&#39;. &#39;Dan&#39; falls between those two names and so we follow the reference to the child page on the final level.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step Three&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When we follow that &#39;ref&#39; entry, it leads us to a child page where the &#39;Dan&#39; entry can be found! That marks the end of our search. This child page is known as a &lt;em&gt;leaf&lt;/em&gt; page as it contains no references to other children.&lt;/p&gt;
&lt;p&gt;Now, this is a really simple example that does not cover all the nuances of a B-Tree, but the idea is the same even for more complex examples.&lt;/p&gt;
&lt;p&gt;Some key things to note are that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The root page can have more than one key on it.&lt;/li&gt;
&lt;li&gt;We eliminate half of the possibilities at each level of a search. This is the binary search algorithm in action.&lt;/li&gt;
&lt;li&gt;The advantage of the B-Tree over Heap Tables for searching is that it reduces our worst-case lookup time for a key from O(n) to O(log n). This means that while searching a heap table with a million records could involve a million operations in the worst case, searching a B-Tree with the same number of records will require only about 20 operations at worst.&lt;/li&gt;
&lt;/ul&gt;
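&lt;p&gt;The three-step search above can be sketched in a few lines of Python. The node layout here is simplified (one page per node, with names mirroring Figure 3), so treat it as an illustration rather than a real B-Tree implementation:&lt;/p&gt;

```python
import bisect

# Each node stands in for a database page: a sorted list of keys plus
# references to child pages (empty for a leaf page).
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def search(node, key, depth=0):
    if key in node.keys:       # the key lives on this page
        return depth
    if not node.children:      # leaf page and the key is absent
        return None
    # Descend into the child page whose key range covers the key.
    i = bisect.bisect_left(node.keys, key)
    return search(node.children[i], key, depth + 1)

tree = Node(["Bayo"], [
    Node(["Ade"]),                           # keys before "Bayo"
    Node(["Ben", "Dele"], [                  # keys from "Bayo" onwards
        Node(["Bela"]), Node(["Dan"]), Node(["Femi"]),
    ]),
])

found_at = search(tree, "Dan")  # steps two and three: two levels down
```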
&lt;h4 id=&quot;%22so-how-does-the-index-relate-to-the-actual-data-being-stored%3F%22&quot;&gt;&amp;quot;So how does the index relate to the actual data being stored?&amp;quot; &lt;a class=&quot;direct-link&quot; href=&quot;#%22so-how-does-the-index-relate-to-the-actual-data-being-stored%3F%22&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;This was a question that I had after first learning about B-Trees. I wasn&#39;t sure if the indexing structure contained the data itself, or if it was just a pointer to the data stored somewhere else.&lt;/p&gt;
&lt;p&gt;Well, the answer is that it can be either, depending on what type of index is used. Two of the ways in which an indexed database table can be organized are as a &lt;em&gt;clustered index&lt;/em&gt; and/or &lt;em&gt;non-clustered indexes.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For a clustered index, the data associated with a key (column value) is stored on the same page as the key. The leaf nodes for a clustered index are always data pages. Once you can access a particular key, you have the information for the rest of the row stored on the same page. In MySQL and SQLServer, the primary key of a table is always a clustered index, and there&#39;s typically only one clustered index per table to avoid duplicating data.&lt;/p&gt;
&lt;p&gt;On the other hand, non-clustered indexes store the key and a &lt;em&gt;pointer&lt;/em&gt; to the underlying data located somewhere else. The leaf nodes here are not data pages, but index pages which contain a pointer for individual rows. This pointer can be to a &lt;em&gt;heap&lt;/em&gt; or to a &lt;em&gt;clustered index.&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If the pointer is to a heap, then the pointer is the row identifier (RID) used to locate the record in the heap table.&lt;/p&gt;
&lt;p&gt;This is what makes the &lt;em&gt;forwarding pointer&lt;/em&gt; discussed under the &#39;Heaps&#39; section useful. Imagine that we have 20 references to a particular record in a heap across multiple non-clustered indexes. Without the forwarding pointer, if the record gets moved around, we would have to update the RID across multiple indexes. Fortunately, we can still retain our reference to the old RID in our indexes because of the presence of the forwarding pointer at that location.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the pointer is to a clustered index, then it points to the location of the primary key on the clustered index. MySQL&#39;s InnoDB storage engine and SQLServer point non-clustered indexes to a clustered index.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
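&lt;p&gt;Here&#39;s a toy illustration of the difference (in Python; the tables and column names are invented): the clustered index maps a key directly to the full row, while a non-clustered index maps a key to a pointer that must then be followed.&lt;/p&gt;

```python
# Clustered index: the leaf level holds the rows themselves,
# keyed by the primary key.
clustered = {
    1: ("Ada", 36, "London"),
    2: ("Bayo", 28, "Lagos"),
}

# Non-clustered index on "name": the leaf level holds pointers
# (here, primary keys) into the clustered index.
by_name = {"Ada": 1, "Bayo": 2}

def find_by_name(name):
    pk = by_name[name]     # step 1: index seek yields a pointer
    return clustered[pk]   # step 2: follow it into the clustered index

row = find_by_name("Bayo")
```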
&lt;h4 id=&quot;write-amplification-in-b-trees&quot;&gt;Write Amplification in B-Trees &lt;a class=&quot;direct-link&quot; href=&quot;#write-amplification-in-b-trees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;For all the advantages of the standard B-Tree over a Heap table, it is not a perfect structure. One disadvantage it has is that one write to the database may involve multiple write operations to the disk.&lt;/p&gt;
&lt;p&gt;To add a new key to a B-Tree, you first need to find the page that holds the range of keys that the new key falls in. If that page does not have enough free space to hold the key, it is split into two &lt;em&gt;child pages&lt;/em&gt;, and its parent page is then updated to hold references to these new pages.&lt;/p&gt;
&lt;p&gt;The fact that one operation may require multiple pages to be overwritten is dangerous, because if the database crashes in the middle of overwriting the pages, the index can be left in an inconsistent state.&lt;/p&gt;
&lt;p&gt;To prevent this inconsistency, B-Tree implementations also include a &lt;em&gt;write-ahead log.&lt;/em&gt; Before any changes are applied to pages in the tree, they must first be written to the durable write-ahead log.&lt;/p&gt;
&lt;p&gt;This event - where one write to the database can lead to multiple writes on disk - is known as &lt;em&gt;write amplification.&lt;/em&gt;&lt;/p&gt;
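&lt;p&gt;A toy sketch of both ideas together (in Python, with made-up page names and a capacity of two keys per page): a single logical insert triggers a page split, so several pages are rewritten, and the change is appended to the log before any page is touched.&lt;/p&gt;

```python
wal = []                           # durable, append-only write-ahead log
pages = {"root": ["Ada", "Bayo"]}  # a full page: capacity is two keys

def insert_with_split(key):
    # One logical insert that overflows "root": split its keys across two
    # new child pages and rewrite the parent. That is three page writes
    # for a single database write, i.e. write amplification.
    keys = sorted(pages["root"] + [key])
    changes = {"left": keys[:1], "right": keys[2:], "root": [keys[1]]}
    wal.append(("insert", key, changes))  # 1. log the change first...
    pages.update(changes)                 # 2. ...then overwrite the pages

insert_with_split("Dan")
```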
&lt;h4 id=&quot;b%2B-trees&quot;&gt;B+ Trees &lt;a class=&quot;direct-link&quot; href=&quot;#b%2B-trees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Many relational databases today actually use a variant of the B-Tree known as the B+ Tree. The B+ Tree representation of the standard B-Tree in Figure 3 is shown below.&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt; &lt;img src=&quot;https://timilearning.com/uploads/bplustree.png&quot; alt=&quot;Representation of records in a B+Tree&quot;&gt; &lt;/p&gt; &lt;p align=&quot;center&quot;&gt; Figure 4 - Representation of records in a B+ Tree. &lt;/p&gt;
&lt;p&gt;The main differences are these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Only&lt;/em&gt; the leaf nodes in a B+ tree contain data for the records. In a standard B-Tree, an internal/intermediate page may contain both the data for some records as well as references to child nodes. The advantage of not storing any data on the internal nodes is that more keys can fit on each page, which can lead to fewer page splits and reduce the depth of the tree.&lt;/li&gt;
&lt;li&gt;Each leaf page in a B+ Tree is linked to its neighbors.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This has the advantage that doing a full scan of all the records in a tree will only require a linear pass through the leaf nodes. Doing this in a B-Tree will require traversing through all the levels in the tree as each level can contain the data for a record.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The downside of B+ Trees compared to B-Trees is that unlike with B-Trees where you can have an &#39;early exit&#39; if you find a key&#39;s data on an internal page, you would have to traverse through all the levels of a B+ Tree to get to the leaf page which has data for a key. However, I reckon that majority of the keys will be on leaf pages for both trees anyway, which is why most databases opt for the B+ Tree instead.&lt;/li&gt;
&lt;/ul&gt;
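&lt;p&gt;The sibling links are what make the full scan cheap. A minimal sketch (in Python; real leaf pages would hold rows, not just keys):&lt;/p&gt;

```python
# B+ Tree leaf pages linked to their right-hand neighbours: a full scan
# is a linear pass over the leaf level, never revisiting internal pages.
class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None   # link to the right-hand sibling leaf

leaves = [Leaf(["Ada", "Bayo"]), Leaf(["Ben", "Dan"]), Leaf(["Dele", "Femi"])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right      # wire up the sibling links

def full_scan(first_leaf):
    leaf, out = first_leaf, []
    while leaf is not None:
        out.extend(leaf.keys)
        leaf = leaf.next
    return out

scanned = full_scan(leaves[0])  # every key, already in sorted order
```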
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We&#39;re not done with the discussion on B-Trees. In the next and final post of the series, we&#39;ll learn about LSM Trees, which are another type of indexing structure commonly used today. We&#39;ll compare them with B-Trees by exploring the pros and cons of each. We&#39;ll briefly learn about other types of indexes and some considerations when choosing indexes for your database table.&lt;/p&gt;
&lt;p&gt;[1] By Victor S.Adamchik, CMU, 2009 - Own work, &lt;a href=&quot;https://www.cs.cmu.edu/~adamchik/15-121/lectures/Trees/trees.html&quot; title=&quot;https://www.cs.cmu.edu/~adamchik/15-121/lectures/Trees/trees.html&quot;&gt;https://www.cs.cmu.edu/~adamchik/15-121/lectures/Trees/trees.html&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;h5 id=&quot;on-heaps&quot;&gt;On Heaps &lt;a class=&quot;direct-link&quot; href=&quot;#on-heaps&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://dba.stackexchange.com/questions/28370/what-are-valid-usage-scenarios-for-heap-tables&quot;&gt;What are valid usage scenarios for Heap Tables?&lt;/a&gt; - StackExchange Discussion&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.sqlbadpractices.com/heap-tables/&quot;&gt;Heap Tables&lt;/a&gt; by Francois. Useful post on why not to use Heap Tables. The comments section is also interesting as it presents a different view.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/indexes/heaps-tables-without-clustered-indexes?view=sql-server-ver15&quot;&gt;Heaps (Tables without Clustered Indexes)&lt;/a&gt; - Microsoft Docs&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2005/administrator/cc917672(v=technet.10)?redirectedfrom=MSDN&quot;&gt;SQL Server Best Practices Article&lt;/a&gt; by Burzin Patel and Sanjay Mishra. Contains an interesting analysis of the performance differences between heaps and clustered indexes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;on-indexes-and-b-trees&quot;&gt;On Indexes and B-Trees &lt;a class=&quot;direct-link&quot; href=&quot;#on-indexes-and-b-trees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://dataintensive.net/&quot;&gt;Designing Data-Intensive Applications&lt;/a&gt; by Martin Kleppmann.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/&quot;&gt;SQL Server Storage Internals 101&lt;/a&gt; by Mark S Rasmussen (Also useful for Heaps)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://dba.stackexchange.com/a/23549&quot;&gt;Why we have Non-Clustered Indexes that point to Clustered Indexes &lt;/a&gt; - StackExchange Discussion.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://use-the-index-luke.com/sql/anatomy/the-tree&quot;&gt;Anatomy of an SQL Index&lt;/a&gt; by Markus Winand.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.cs.usfca.edu/~galles/visualization/BTree.html&quot;&gt;B-Trees&lt;/a&gt; and &lt;a href=&quot;https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html&quot;&gt;B+ Trees&lt;/a&gt; visualizations - I found these very useful for understanding the topics.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://stackoverflow.com/questions/870218/differences-between-b-trees-and-b-trees&quot;&gt;B-Tree vs B+ Tree&lt;/a&gt; - Useful discussion on StackOverflow.&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>Chapter 8 - The Trouble with Distributed Systems</title>
		<link href="https://timilearning.com/posts/ddia/part-two/chapter-8/"/>
		<updated>2019-12-26T21:19:03-00:00</updated>
		<id>https://timilearning.com/posts/ddia/part-two/chapter-8/</id>
		<content type="html">&lt;p&gt;Notes from Chapter 8 of Martin Kleppmann&#39;s &#39;Designing Data-Intensive Applications&#39; book&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In this chapter, we&#39;ll look at the things that may go wrong in distributed systems. We&#39;ll cover problems with network, clocks and timing issues, and other faults.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#faults-and-partial-failures&quot;&gt;Faults and Partial Failures&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#cloud-computing-and-supercomputing&quot;&gt;Cloud Computing and Supercomputing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#unreliable-networks&quot;&gt;Unreliable Networks&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#network-faults-in-practice&quot;&gt;Network Faults in Practice&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#detecting-faults&quot;&gt;Detecting Faults&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#timeouts-and-unbounded-delays&quot;&gt;Timeouts and Unbounded Delays&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#network-congestion-and-queueing&quot;&gt;Network Congestion and Queueing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#_aside%3A_-tcp-vs-udp%28transmission-control-protocol-vs-user-datagram-protocol%29&quot;&gt;&lt;em&gt;Aside:&lt;/em&gt; TCP vs UDP(Transmission Control Protocol vs User-Datagram Protocol)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#synchronous-versus-asynchronous-networks&quot;&gt;Synchronous Versus Asynchronous Networks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#unreliable-clocks&quot;&gt;Unreliable Clocks&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#monotonic-vs-time-of-day-clocks&quot;&gt;Monotonic vs Time-of-Day Clocks&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#time-of-day-clocks&quot;&gt;Time-of-day clocks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#monotonic-clocks&quot;&gt;Monotonic Clocks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#clock-synchronization-and-accuracy&quot;&gt;Clock Synchronization and Accuracy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#relying-on-synchronized-clocks&quot;&gt;Relying on Synchronized Clocks&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#timestamps-for-ordering-events.&quot;&gt;Timestamps for ordering events.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#clock-readings-have-a-confidence-interval&quot;&gt;Clock readings have a confidence interval&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#synchronized-clocks-for-global-snapshots&quot;&gt;Synchronized clocks for global snapshots&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#process-pauses&quot;&gt;Process Pauses&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#knowledge%2C-truth%2C-and-lies&quot;&gt;Knowledge, Truth, and Lies&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#the-truth-is-defined-by-the-majority&quot;&gt;The Truth is Defined by the Majority&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#the-leader-and-the-lock&quot;&gt;The Leader and the Lock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#fencing-tokens&quot;&gt;Fencing tokens&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#byzantine-faults&quot;&gt;Byzantine Faults&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#weak-forms-of-lying&quot;&gt;Weak forms of lying&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#system-model-and-reality&quot;&gt;System Model and Reality&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#system-model-for-timing-assumptions&quot;&gt;System Model for Timing Assumptions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#system-model-for-node-failures&quot;&gt;System Model for Node Failures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#correctness-of-an-algorithm&quot;&gt;Correctness of an algorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#safety-and-liveness&quot;&gt;Safety and liveness&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#mapping-system-models-to-the-real-world&quot;&gt;Mapping system models to the real world&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;faults-and-partial-failures&quot;&gt;Faults and Partial Failures &lt;a class=&quot;direct-link&quot; href=&quot;#faults-and-partial-failures&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Distributed systems differ from single-node computers: a single-node computer either works completely or is completely broken, whereas in a distributed system we can have &lt;em&gt;partial failures&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;What makes partial failures harder to deal with is that they are &lt;em&gt;nondeterministic&lt;/em&gt;: the same operation may sometimes work and sometimes fail.&lt;/p&gt;
&lt;h4 id=&quot;cloud-computing-and-supercomputing&quot;&gt;Cloud Computing and Supercomputing &lt;a class=&quot;direct-link&quot; href=&quot;#cloud-computing-and-supercomputing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;High-performance computing (HPC) and cloud computing sit at opposite ends of the spectrum of philosophies for building large-scale computing systems.&lt;/p&gt;
&lt;p&gt;High-performance computers, or supercomputers, have thousands of CPUs used for computationally expensive tasks like weather forecasting. In general, a job will checkpoint the state of its computation and store it durably from time to time. If a node fails, the whole cluster workload is brought down, and the computation is restarted from the last checkpoint. This makes supercomputers similar to single-node computers.&lt;/p&gt;
&lt;p&gt;Nowadays, many internet services need high availability. It&#39;s not acceptable to bring down the cluster due to failure in a node.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;To make distributed systems work, we must endeavor to build a reliable system from unreliable components.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We sometimes like to assume that faults are rare and then don&#39;t account for them, but we must design with fault tolerance in mind.&lt;/p&gt;
&lt;p&gt;Building a reliable system from unreliable components is not a unique idea to distributed systems, and is used in other areas as well. E.g.:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In information theory, the idea is to have reliable communication over an unreliable channel. We achieve this using error correcting codes.&lt;/li&gt;
&lt;li&gt;In networking, IP is unreliable as it may drop, delay, duplicate, or reorder packets. TCP provides a more reliable layer on top of that as it re-transmits missing packets, eliminates duplicates, and reassembles packets in the right order.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that even though a system can be more reliable than the parts that it&#39;s made of, there&#39;s a limit to the level of reliability that can be attained. Error-correcting codes can only deal with a number of single-bit errors and TCP cannot remove delays in the network.&lt;/p&gt;
&lt;h3 id=&quot;unreliable-networks&quot;&gt;Unreliable Networks &lt;a class=&quot;direct-link&quot; href=&quot;#unreliable-networks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;As stated earlier, this book focuses on shared-nothing systems which communicate with each other via a network. The advantage of this approach is that it is comparatively cheap, as it requires no special hardware. We can have a bunch of regular machines as part of the system.&lt;/p&gt;
&lt;p&gt;Note that the internet and most internal networks in datacenters are &lt;em&gt;asynchronous packet networks.&lt;/em&gt; This means that one node can send a message to another node, but has no guarantee about when the message will arrive, or whether it will arrive at all. Unfortunately, many things could go wrong:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The request may have been lost, e.g. because the network cable was unplugged.&lt;/li&gt;
&lt;li&gt;The remote node may have successfully processed the request, but the response got lost on the network, e.g. due to a misconfigured network switch.&lt;/li&gt;
&lt;li&gt;The remote node may have failed.&lt;/li&gt;
&lt;li&gt;The request may be waiting in a queue to be delivered later, e.g. if the network or the recipient is overloaded.&lt;/li&gt;
&lt;li&gt;The response from the remote node may have been delayed and will be delivered later.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence, the sender cannot tell whether the packet was delivered unless it receives a response message from the recipient. In an asynchronous network, these failure modes are indistinguishable from one another.&lt;/p&gt;
&lt;p&gt;These issues are typically handled with a &lt;em&gt;timeout&lt;/em&gt;, but that still gives no information about the state of the request.&lt;/p&gt;
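&lt;p&gt;As a rough sketch of the timeout-and-retry pattern (illustrative code, not from the book): even when every attempt fails, the client still cannot tell whether the remote node processed the request.&lt;/p&gt;

```python
import socket

# Hypothetical helper: send a request with a timeout, retrying a bounded
# number of times. A failure here does NOT mean the request was unprocessed;
# the response may simply have been lost or delayed.
def request_with_timeout(host, port, payload, timeout_s=1.0, retries=3):
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout_s) as sock:
                sock.sendall(payload)
                return sock.recv(4096)  # raises socket.timeout if no response arrives
        except OSError:  # covers timeouts, refused connections, and resets
            continue
    raise TimeoutError("no response after %d attempts" % retries)
```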
&lt;h4 id=&quot;network-faults-in-practice&quot;&gt;Network Faults in Practice &lt;a class=&quot;direct-link&quot; href=&quot;#network-faults-in-practice&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Although network faults may be rare, the fact that they can happen means that software needs to be able to handle them. Handling network faults does not necessarily mean &lt;em&gt;tolerating&lt;/em&gt; them; a simple approach may just be to show an error message to users. However, there has to be work done to understand how the software reacts to network problems, and to ensure that the system can recover from them.&lt;/p&gt;
&lt;p&gt;It might be a good idea to deliberately trigger network problems and test the system&#39;s response, as Netflix&#39;s Chaos Monkey does.&lt;/p&gt;
&lt;h4 id=&quot;detecting-faults&quot;&gt;Detecting Faults &lt;a class=&quot;direct-link&quot; href=&quot;#detecting-faults&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;It&#39;s important to automatically detect network faults, as they might be linked to faulty nodes. Detecting faults quickly ensures that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A load balancer can stop sending requests to a dead node.&lt;/li&gt;
&lt;li&gt;A new follower can be promoted to leader if the leader fails in a single-leader replication setup.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Due to the uncertainty about the network, it&#39;s difficult to tell whether a node is working or not. There are some specific ways to tell though, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If the machine on which the node is running is reachable, but no process is listening on the destination port (e.g. because the process crashed), the OS will close or refuse TCP connections. However, if the node crashed while processing the request, there&#39;s no way of knowing how much data was processed by the remote node.&lt;/li&gt;
&lt;li&gt;If you have access to the management interface of the network switches in your datacenter, they can be queried to detect link failures at hardware level. Of course this is not applicable if you&#39;re connecting over the internet or have no access to the datacenter.&lt;/li&gt;
&lt;li&gt;If a process on a node has crashed but the node&#39;s OS is still running, a script can notify the other nodes about the crash so that another node can take over quickly. HBase uses this approach.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;timeouts-and-unbounded-delays&quot;&gt;Timeouts and Unbounded Delays &lt;a class=&quot;direct-link&quot; href=&quot;#timeouts-and-unbounded-delays&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;We have mentioned that timeouts are often used to detect a fault. However, there is no simple answer to how long a timeout should be. It simply depends.&lt;/p&gt;
&lt;p&gt;With a long timeout, there can be a long wait until a node is declared dead, which means users may have to wait a while or see error messages.&lt;/p&gt;
&lt;p&gt;On the other hand, with a short timeout, nodes can be declared dead prematurely, even when they only suffer a temporary slowdown (e.g. due to a load spike on the node or the network). Declaring a node dead prematurely has downsides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Its responsibilities need to be transferred to other nodes, which can place additional load on the other nodes and the network. This can lead to a cascading failure as other nodes can become slow to respond.&lt;/li&gt;
&lt;li&gt;If the node is in the middle of performing an action and another node takes over, the action may be performed twice.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In an ideal system, we could guarantee that the maximum delay for packet transmission is &lt;em&gt;d&lt;/em&gt;, and that a node always handles a request within time &lt;em&gt;r&lt;/em&gt;. In such a system, a timeout of 2&lt;em&gt;d&lt;/em&gt; + &lt;em&gt;r&lt;/em&gt; would be reasonable.&lt;/p&gt;
&lt;p&gt;However, in most systems, we do not have either of those guarantees. Asynchronous networks have &lt;em&gt;unbounded delays.&lt;/em&gt;&lt;/p&gt;
&lt;h5 id=&quot;network-congestion-and-queueing&quot;&gt;Network Congestion and Queueing &lt;a class=&quot;direct-link&quot; href=&quot;#network-congestion-and-queueing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Queueing is the most common cause of variability in network packet delays (i.e. unbounded delays). Queues can form at different points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If several nodes try to send packets to the same destination, the packets must be queued up by the network switch and fed into the destination network link one by one. On a busy network link, a packet may have to wait a while before it can get a slot. A packet can even get dropped if the switch queue fills up, and it needs to be resent. This can happen even if the network is functioning properly.&lt;/li&gt;
&lt;li&gt;When a packet reaches its destination node, if all the CPU cores are currently too busy to handle the request, the request needs to be queued by the operating system until it can handle it.&lt;/li&gt;
&lt;li&gt;TCP performs &lt;em&gt;flow control&lt;/em&gt;, where a node limits its rate of sending to avoid overloading a network link or the receiving node. This means additional queueing at the sender even before the data enters the network.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;aside%3A-tcp-vs-udp(transmission-control-protocol-vs-user-datagram-protocol)&quot;&gt;&lt;em&gt;Aside:&lt;/em&gt; TCP vs UDP (Transmission Control Protocol vs User Datagram Protocol) &lt;a class=&quot;direct-link&quot; href=&quot;#aside%3A-tcp-vs-udp(transmission-control-protocol-vs-user-datagram-protocol)&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;TCP is a reliable network transmission protocol, while UDP is unreliable. Concretely, TCP implements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Flow Control&lt;/li&gt;
&lt;li&gt;Acknowledgement and Retransmission&lt;/li&gt;
&lt;li&gt;Sequencing: Ensuring that messages arrive in the right order even if packets are dropped.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Any messages not acknowledged will be retransmitted in TCP.&lt;/p&gt;
&lt;p&gt;UDP is used in latency-sensitive applications like videoconferencing and Voice over IP, where there&#39;s less tolerance for delays. Since delayed data is probably worthless in these applications, UDP does not try to retransmit it. E.g. in phone calls, instead of retransmitting, it simply fills the missing packet&#39;s time slot with silence. The retry happens at the human layer: &amp;quot;Could you repeat that please?&amp;quot;.&lt;/p&gt;
&lt;p&gt;In essence, timeouts should typically be chosen experimentally: measure the distribution of network round trip times over an extended period, and over many machines to determine the expected variability of delays.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;Phi Accrual failure detector&lt;/em&gt;, used in Akka and Cassandra, measures response times and automatically adjusts timeouts based on the observed response time distribution.&lt;/p&gt;
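&lt;p&gt;The experimental approach can be sketched as follows; the nearest-rank percentile and the 2x safety margin are illustrative choices, not values from the book:&lt;/p&gt;

```python
import math

# A sketch of choosing a timeout experimentally: collect round-trip times
# over many requests, then set the timeout to a high percentile of the
# observed distribution plus a safety margin, rather than guessing.
def suggest_timeout(rtts_ms, percentile=99, margin=2.0):
    ordered = sorted(rtts_ms)
    # nearest-rank percentile: index of the smallest value covering p% of samples
    rank = math.ceil(len(ordered) * percentile / 100.0) - 1
    return ordered[rank] * margin

rtts = [12, 15, 11, 14, 90, 13, 16, 12, 14, 13]
print(suggest_timeout(rtts))  # → 180.0 (the outlier RTT of 90 ms, doubled)
```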
&lt;h4 id=&quot;synchronous-versus-asynchronous-networks&quot;&gt;Synchronous Versus Asynchronous Networks &lt;a class=&quot;direct-link&quot; href=&quot;#synchronous-versus-asynchronous-networks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A question that you might have is: &lt;em&gt;why don&#39;t we make the network reliable at a hardware level so the software does not need to worry about it?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To address this, it&#39;s worth looking at the traditional fixed-line telephone network (non-cellular, non-VoIP) which is apparently very reliable and rarely drops messages.&lt;/p&gt;
&lt;p&gt;The way it works is that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When a call is made over the network, it creates a &lt;em&gt;circuit.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;This circuit has a fixed, guaranteed bandwidth for the call which remains in place until the call ends.&lt;/li&gt;
&lt;li&gt;This network is synchronous, and it does not suffer from queueing, since the required bandwidth for the call has already been reserved. Because there is no queueing, it has a &lt;em&gt;bounded delay.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that this approach differs from a TCP connection: a circuit reserves a fixed amount of bandwidth that no one else can use while it is established, whereas TCP packets grab whatever network bandwidth is available.&lt;/p&gt;
&lt;p&gt;Datacenter networks and the internet make use of the TCP approach of packet switching rather than establishing circuits, because they are optimizing for &lt;em&gt;bursty traffic.&lt;/em&gt; Unlike an audio or video call where the number of bits transferred per second is fairly constant, the traffic through the internet is unpredictable. We could be requesting a web page, or sending an email, or transferring a file etc. The goal is to just complete it as quickly as possible.&lt;/p&gt;
&lt;p&gt;Using circuits for bursty data transfer will waste network capacity and could make transfers unnecessarily slow, as we would have to guess how much bandwidth to allocate beforehand. TCP dynamically adapts the data transfer rate to the available network capacity.&lt;/p&gt;
&lt;p&gt;There&#39;s ongoing research to use &lt;em&gt;quality of service&lt;/em&gt; and &lt;em&gt;admission control&lt;/em&gt; to emulate circuit switching on packet networks, or provide statistically bounded delays.&lt;/p&gt;
&lt;h3 id=&quot;unreliable-clocks&quot;&gt;Unreliable Clocks &lt;a class=&quot;direct-link&quot; href=&quot;#unreliable-clocks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Clocks and time are important in distributed systems. Applications use clocks to answer questions like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Has a request timed out yet?&lt;/li&gt;
&lt;li&gt;When was a request received?&lt;/li&gt;
&lt;li&gt;How long did a user spend on a site?&lt;/li&gt;
&lt;li&gt;When does a cache entry expire? Etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some questions measure &lt;em&gt;duration&lt;/em&gt;, while some describe &lt;em&gt;points in time.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Time is tricky because each machine on a network has its own clock, and some may be faster or slower than others. Clocks can be synchronized to a degree though, by using the Network Time Protocol (NTP). It works by adjusting clocks using the time reported from a group of servers, which in turn get their time from a more accurate source such as a GPS receiver.&lt;/p&gt;
&lt;h4 id=&quot;monotonic-vs-time-of-day-clocks&quot;&gt;Monotonic vs Time-of-Day Clocks &lt;a class=&quot;direct-link&quot; href=&quot;#monotonic-vs-time-of-day-clocks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Modern computers have at least two different kinds of clocks: a &lt;em&gt;time-of-day clock&lt;/em&gt; and a &lt;em&gt;monotonic clock.&lt;/em&gt;&lt;/p&gt;
&lt;h5 id=&quot;time-of-day-clocks&quot;&gt;Time-of-day clocks &lt;a class=&quot;direct-link&quot; href=&quot;#time-of-day-clocks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;These are like standard clocks: they return the current date and time according to a calendar. They are typically synchronized with NTP, which means that timestamps should ideally match across machines.&lt;/p&gt;
&lt;p&gt;Note that if the local clock is too far ahead of NTP, it may appear to jump back to a previous point in time. It could also jump due to leap seconds. The tendency of these clocks to jump makes them unsuitable for measuring elapsed time.&lt;/p&gt;
&lt;h5 id=&quot;monotonic-clocks&quot;&gt;Monotonic Clocks &lt;a class=&quot;direct-link&quot; href=&quot;#monotonic-clocks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;These clocks are suitable for measuring a duration like a timeout or response time. They are guaranteed to move forward in time (unlike a time-of-day clock which may jump back in time).&lt;/p&gt;
&lt;p&gt;&lt;code&gt;System.nanoTime()&lt;/code&gt; in Java returns a monotonic clock reading. With a monotonic clock, you can check the value at one point in time, perform an action, check the value again, and use the difference between the two readings to measure the time elapsed between the two checks.&lt;/p&gt;
&lt;p&gt;Monotonic clocks are fine for measuring elapsed time on a single node, because they require no synchronization between nodes&#39; clocks; the absolute value of a monotonic clock is meaningless across machines.&lt;/p&gt;
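&lt;p&gt;In Python, for instance, &lt;code&gt;time.monotonic()&lt;/code&gt; plays the same role as Java&#39;s &lt;code&gt;System.nanoTime()&lt;/code&gt;:&lt;/p&gt;

```python
import time

# Measuring elapsed time with a monotonic clock. The absolute reading is
# meaningless on its own; only the difference between two readings taken
# on the same node is useful.
start = time.monotonic()
time.sleep(0.05)                    # stand-in for the action being timed
elapsed = time.monotonic() - start
assert elapsed >= 0.0               # a monotonic clock never runs backward
print(f"action took {elapsed:.3f}s")
```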
&lt;h4 id=&quot;clock-synchronization-and-accuracy&quot;&gt;Clock Synchronization and Accuracy &lt;a class=&quot;direct-link&quot; href=&quot;#clock-synchronization-and-accuracy&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Unlike monotonic clocks which don&#39;t need synchronization, time-of-day clocks must be synchronized with an NTP server or another external time source. However, NTP and quartz hardware clocks are not as reliable or accurate as one might hope. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The quartz clock in a computer isn&#39;t very accurate. It may go faster or slower than it should. It can vary depending on the temperature of the machine. Basically, it &lt;em&gt;drifts.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;If the computer&#39;s clock drifts too far from an NTP server, it may refuse to synchronize, or the clock may be forcibly reset. Any application observing this clock may see time go backward or suddenly jump forward.&lt;/li&gt;
&lt;li&gt;NTP synchronization is only as good as the network delay, so there&#39;s a limit to accuracy on a congested network with variable packet delays. An &lt;a href=&quot;https://iopscience.iop.org/0143-0807/23/4/103/&quot;&gt;experiment&lt;/a&gt; showed that a minimum error of 35 ms is achievable when synchronizing over the internet, though the error can reach a second or more during occasional spikes in network delay.&lt;/li&gt;
&lt;li&gt;Leap seconds will lead to a minute that is 59 seconds or 61 seconds long. If a system isn&#39;t designed to handle leap seconds, this can lead to a crash. (&lt;a href=&quot;http://www.somebits.com/weblog/tech/bad/leap-second-2012.html&quot; title=&quot;http://www.somebits.com/weblog/tech/bad/leap-second-2012.html&quot;&gt;http://www.somebits.com/weblog/tech/bad/leap-second-2012.html&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The hardware clock is virtualized in virtual machines. When multiple virtual machines share a CPU core, each VM is paused for tens of milliseconds while another VM is running. For an application running on a VM, it can look like the clock suddenly jumped forward.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Nevertheless, it&#39;s possible to achieve very good clock accuracy with significant investment into resources. For example, the MiFID II European regulation for financial companies mandates that HFT funds synchronize their clocks within 100 microseconds of UTC, to help debug market anomalies like &amp;quot;flash crashes&amp;quot;.&lt;/p&gt;
&lt;h4 id=&quot;relying-on-synchronized-clocks&quot;&gt;Relying on Synchronized Clocks &lt;a class=&quot;direct-link&quot; href=&quot;#relying-on-synchronized-clocks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;While clocks may seem simple and straightforward, they have a good number of pitfalls. Some of the issues that may arise are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Time-of-day clocks can move backward in time.&lt;/li&gt;
&lt;li&gt;A day may not have exactly 86,400 seconds.&lt;/li&gt;
&lt;li&gt;The time on one node may differ from the time on another.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As with unreliable networks, robust software must be prepared to deal with incorrect clocks. Dealing with incorrect clocks can be even trickier because the problems they cause easily go unnoticed. A faulty CPU or misconfigured network is easier to detect, as the system would not work at all; with a defective clock, things generally look fine. We&#39;re more likely to experience silent and subtle data loss than a dramatic crash.&lt;/p&gt;
&lt;p&gt;Therefore, if software requires synchronized clocks, it&#39;s essential to monitor the clock offsets between all machines. A node whose clock drifts too far from the others should be declared dead and removed from the cluster.&lt;/p&gt;
&lt;h5 id=&quot;timestamps-for-ordering-events.&quot;&gt;Timestamps for ordering events. &lt;a class=&quot;direct-link&quot; href=&quot;#timestamps-for-ordering-events.&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Time-of-day clocks are commonly used for ordering events in some systems, often with the &lt;em&gt;last write wins&lt;/em&gt; (LWW) conflict resolution strategy, e.g. in multi-leader and leaderless databases such as Cassandra and Riak. Some implementations generate the timestamp on the client rather than the server, but this does not change the fundamental problems of LWW, which include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Writes can mysteriously disappear.&lt;/li&gt;
&lt;li&gt;It&#39;s impossible to distinguish between concurrent writes and causal writes (where one write depends on another)&lt;/li&gt;
&lt;li&gt;Two nodes can independently generate writes with the same timestamp.&lt;/li&gt;
&lt;/ul&gt;
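&lt;p&gt;A toy sketch of how LWW loses a write under clock skew (the timestamps and values below are made up for illustration):&lt;/p&gt;

```python
# A sketch of last-write-wins resolution keyed on wall-clock timestamps.
# If node B's clock lags node A's, B's causally *later* write can carry
# an *earlier* timestamp and silently lose.
def lww_merge(replica, key, value, timestamp):
    current = replica.get(key)
    if current is None or timestamp > current[0]:
        replica[key] = (timestamp, value)

replica = {}
lww_merge(replica, "x", "from A", timestamp=100)  # A's clock reads 100
lww_merge(replica, "x", "from B", timestamp=95)   # B wrote later, but its clock lags
print(replica["x"])  # → (100, 'from A'): B's write has mysteriously disappeared
```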
&lt;p&gt;&lt;em&gt;Q&lt;/em&gt;: Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur?&lt;/p&gt;
&lt;p&gt;&lt;em&gt;A&lt;/em&gt;: Probably not. NTP&#39;s synchronization accuracy is also limited by the network round-trip time, in addition to other error sources like quartz drift.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Logical clocks&lt;/em&gt; are a safer alternative for ordering events than an oscillating quartz crystal. They measure the relative ordering of events, rather than the actual elapsed time that &lt;em&gt;physical clocks&lt;/em&gt; (like time-of-day and monotonic clocks) measure.&lt;/p&gt;
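&lt;p&gt;A minimal sketch of one such logical clock, a Lamport clock (illustrative code, not from the book):&lt;/p&gt;

```python
# A minimal Lamport logical clock: each node keeps a counter, increments it
# on every local event, and on receiving a message sets its counter to
# max(local, received) + 1. This orders events by causality, not wall time.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                  # a local event occurred
        self.time += 1
        return self.time

    def receive(self, sender_time):  # a message arrived with the sender's time
        self.time = max(self.time, sender_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.tick()        # a's event gets timestamp 1
t2 = b.receive(t1)   # b's timestamp jumps to 2: causally after a's event
assert t2 > t1
```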
&lt;h5 id=&quot;clock-readings-have-a-confidence-interval&quot;&gt;Clock readings have a confidence interval &lt;a class=&quot;direct-link&quot; href=&quot;#clock-readings-have-a-confidence-interval&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Clock readings typically have an uncertainty range, like a margin of error. However, most systems don&#39;t expose this uncertainty. An exception to this is Google&#39;s TrueTime API which is used in Spanner, and gives a confidence interval on the local clock.&lt;/p&gt;
&lt;h5 id=&quot;synchronized-clocks-for-global-snapshots&quot;&gt;Synchronized clocks for global snapshots &lt;a class=&quot;direct-link&quot; href=&quot;#synchronized-clocks-for-global-snapshots&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Snapshot isolation is commonly implemented by giving each transaction a monotonically increasing ID. If a write happened later than the snapshot (i.e. its transaction ID is greater than the snapshot&#39;s), it is invisible to the snapshot transaction. This is easy to manage on a single-node database, as we can use a simple counter.&lt;/p&gt;
&lt;p&gt;For a distributed database though, it is more difficult to coordinate a monotonically increasing transaction ID. The transaction ID must reflect causality. If transaction B reads a value written by transaction A, B must have a higher transaction ID than A for it to be consistent.&lt;/p&gt;
&lt;p&gt;If we didn’t have uncertainty about clock accuracy, the timestamps from the synchronized time-of-day clocks would be suitable as transaction IDs as later transactions will have a higher timestamp.&lt;/p&gt;
&lt;p&gt;Google&#39;s Spanner nevertheless implements snapshot isolation across datacenters:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Spanner implements snapshot isolation across datacenters in this way. It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the following observation: if you have two confidence intervals, each consisting of an earliest and latest possible timestamp (A = [A&lt;sub&gt;earliest&lt;/sub&gt;, A&lt;sub&gt;latest&lt;/sub&gt;] and B = [B&lt;sub&gt;earliest&lt;/sub&gt;, B&lt;sub&gt;latest&lt;/sub&gt;]), and those two intervals do not overlap (i.e., A&lt;sub&gt;earliest&lt;/sub&gt;&amp;lt; A&lt;sub&gt;latest&lt;/sub&gt; &amp;lt; B&lt;sub&gt;earliest&lt;/sub&gt; &amp;lt; B&lt;sub&gt;latest&lt;/sub&gt;), then B definitely happened after A — there can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;em&gt;Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 7547-7554). O&#39;Reilly Media. Kindle Edition.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To ensure that transaction timestamps reflect causality, Spanner waits for the length of the confidence interval before committing a read-write transaction. This guarantees that any transaction which may read the data starts at a sufficiently later time, so the confidence intervals do not overlap. For example, if the confidence interval is 7 ms, a read-write transaction will wait 7 ms before committing. Remember that with snapshot isolation, a transaction can&#39;t read anything that wasn&#39;t committed when it started. Therefore, we can be sure that any transaction reading the now-committed data happened at a sufficiently later time.&lt;/p&gt;
&lt;p&gt;To keep the wait time as small as possible, Google uses a GPS receiver in each datacenter, which allows clocks to be synchronized within about 7ms.&lt;/p&gt;
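&lt;p&gt;The non-overlap rule from the quoted passage can be sketched as a simple interval check (illustrative only):&lt;/p&gt;

```python
# Illustrative check of Spanner's reasoning: B definitely happened after A
# only if B's earliest possible timestamp is later than A's latest.
def definitely_after(a, b):
    a_earliest, a_latest = a
    b_earliest, b_latest = b
    return b_earliest > a_latest

assert definitely_after((10, 17), (18, 25))      # disjoint intervals: B is after A
assert not definitely_after((10, 17), (15, 22))  # overlapping: order unknown
```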
&lt;h4 id=&quot;process-pauses&quot;&gt;Process Pauses &lt;a class=&quot;direct-link&quot; href=&quot;#process-pauses&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A node in a distributed system must assume that its execution can be paused for a significant amount of time at any point, even in the middle of a function. When this pause happens, the rest of the system keeps moving and may declare the paused node dead because it&#39;s not responding. This paused node may eventually continue running, without noticing that it was asleep until it checks the clock later.&lt;/p&gt;
&lt;p&gt;A distributed system must be designed to tolerate these pauses, which can be caused by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Garbage collectors which stop all running threads.&lt;/li&gt;
&lt;li&gt;In virtualized environments, a VM can be suspended and resumed e.g. for live migration of a VM from one host to another without a reboot. Suspending the VM means pausing the execution of all processes and saving memory contents to disk. Resuming it means restoring the memory contents and continuing execution.&lt;/li&gt;
&lt;li&gt;On laptops, the execution of a process could be paused and resumed arbitrarily e.g. when a user closes their laptop lid.&lt;/li&gt;
&lt;li&gt;IO operations could also lead to delays.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There&#39;s active research into limiting the impact of Garbage Collection pauses. Some of the options are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Treat GC pauses like brief planned node outages. When a node requires a GC pause, the runtime can warn the application and stop sending requests to that node. It could also wait for it to process outstanding requests and perform GC while no requests are in progress. This hides the GC pauses from clients and reduces high percentiles of response times. This approach is used in some latency-sensitive financial trading systems (&lt;a href=&quot;https://cdn2.hubspot.net/hubfs/1624455/Website_2016/Content/Whitepapers/Cinnober%20on%20GC%20pause%20free%20Java%20applications.pdf&quot; title=&quot;https://cdn2.hubspot.net/hubfs/1624455/Website_2016/Content/Whitepapers/Cinnober%20on%20GC%20pause%20free%20Java%20applications.pdf&quot;&gt;https://cdn2.hubspot.net/hubfs/1624455/Website_2016/Content/Whitepapers/Cinnober%20on%20GC%20pause%20free%20Java%20applications.pdf&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Use the GC only for short-lived objects that are fast to collect, and restart processes periodically, before they accumulate enough long-lived objects to require a full GC. &lt;em&gt;One node can be restarted at a time, and traffic can be shifted away from the node before the planned restart, like in a rolling upgrade.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These options can&#39;t fully prevent GC pauses, but they can reduce their impact on the application.&lt;/p&gt;
&lt;h3 id=&quot;knowledge%2C-truth%2C-and-lies&quot;&gt;Knowledge, Truth, and Lies &lt;a class=&quot;direct-link&quot; href=&quot;#knowledge%2C-truth%2C-and-lies&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;So far, we have discussed some of the distributed systems problems that can occur, which include: unreliable networks, unreliable clocks, faulty nodes, processing pauses etc. We&#39;ve also discussed how distributed systems differ from programs running on a single node: there&#39;s no shared memory, there&#39;s only message passing via an unreliable network with variable delays.&lt;/p&gt;
&lt;p&gt;As a result of these issues, a node in a distributed system cannot &lt;em&gt;know&lt;/em&gt; anything for sure. It can only guess based on the messages it receives (or doesn&#39;t receive) via the network. To make decisions reliably despite this, the nodes must reach consensus.&lt;/p&gt;
&lt;p&gt;In this section, we&#39;ll explore the concept of knowledge and truth, and guarantees we can provide under certain assumptions in a distributed system.&lt;/p&gt;
&lt;h4 id=&quot;the-truth-is-defined-by-the-majority&quot;&gt;The Truth is Defined by the Majority &lt;a class=&quot;direct-link&quot; href=&quot;#the-truth-is-defined-by-the-majority&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A node cannot trust its assessment of a situation. A node may think it&#39;s the leader, while the other nodes have elected a new one; it may think it&#39;s alive, while other nodes have declared it dead. As a result, many distributed algorithms rely on a &lt;em&gt;quorum&lt;/em&gt; for making decisions i.e. decisions require a minimum number of votes from several nodes in order to reduce dependence on a single node.&lt;/p&gt;
&lt;p&gt;The quorum is typically an absolute majority of more than half the nodes. This is typically safe because there can only be one majority in a system at a time.&lt;/p&gt;
&lt;h5 id=&quot;the-leader-and-the-lock&quot;&gt;The Leader and the Lock &lt;a class=&quot;direct-link&quot; href=&quot;#the-leader-and-the-lock&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;A system often requires that there be only one of a particular thing. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Only one leader for a database partition.&lt;/li&gt;
&lt;li&gt;Only one transaction is allowed to hold the lock for a particular resource or object.&lt;/li&gt;
&lt;li&gt;Only one user is allowed to register a particular username.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because a node can believe it&#39;s the &amp;quot;chosen one&amp;quot; even when it isn&#39;t, the system must be designed to handle such situations and avoid problems like split-brain.&lt;/p&gt;
&lt;h5 id=&quot;fencing-tokens&quot;&gt;Fencing tokens &lt;a class=&quot;direct-link&quot; href=&quot;#fencing-tokens&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;One way systems handle a node that falsely believes it is &amp;quot;the chosen one&amp;quot;, and thereby disrupts the rest of the system, is with fencing tokens.&lt;/p&gt;
&lt;p&gt;Basically, each time a lock server grants a lock or a lease, it also generates a fencing token: a number that increases every time a lock is granted. We can then require that any client which wants to send a write request to the storage service must include its current fencing token.&lt;/p&gt;
&lt;p&gt;The storage service then checks the token on every write request and rejects any request whose token is lower than one it has already processed.&lt;/p&gt;
&lt;p&gt;For applications using ZooKeeper as a lock service, the transaction ID &lt;code&gt;zxid&lt;/code&gt; or the node version &lt;code&gt;cversion&lt;/code&gt; can be used as the fencing token, since they are guaranteed to be monotonically increasing, which is the required property for a fencing token.&lt;/p&gt;
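&lt;p&gt;The storage-side check can be sketched as a toy model (this is not ZooKeeper&#39;s API; the names here are invented for illustration):&lt;/p&gt;

```python
# A sketch of fencing-token checks on the storage side: the service remembers
# the highest token it has seen and rejects writes carrying an older token,
# blocking a client whose lease expired during a long pause.
class Storage:
    def __init__(self):
        self.max_token = 0
        self.data = {}

    def write(self, key, value, token):
        if token >= self.max_token:
            self.max_token = token
            self.data[key] = value
            return True
        return False  # stale token: this client's lock was superseded

s = Storage()
assert s.write("file", "v1", token=33)      # client 1 holds token 33
assert s.write("file", "v2", token=34)      # client 2 was granted a newer lock
assert not s.write("file", "v3", token=33)  # client 1 wakes from a pause: rejected
```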
&lt;h4 id=&quot;byzantine-faults&quot;&gt;Byzantine Faults &lt;a class=&quot;direct-link&quot; href=&quot;#byzantine-faults&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Fencing tokens can help detect and block a node that is acting in error inadvertently (e.g. because it hasn&#39;t yet realized that its lease has expired). However, a node that deliberately wants to subvert the system could simply send messages with a fake fencing token.&lt;/p&gt;
&lt;p&gt;In this book, nodes are assumed to be unreliable but honest: any node that &lt;em&gt;does&lt;/em&gt; respond is assumed to be telling the truth to the best of its knowledge.&lt;/p&gt;
&lt;p&gt;If there&#39;s a risk that nodes may &amp;quot;lie&amp;quot; (e.g. by sending corrupted messages or faulty responses), it becomes a much harder problem to deal with. That behavior is known as a &lt;em&gt;Byzantine fault&lt;/em&gt; and systems that are designed to handle these faults are &lt;em&gt;Byzantine Fault Tolerant Systems.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 7812-7813). O&#39;Reilly Media. Kindle Edition.&lt;/p&gt;
&lt;p&gt;Dealing with Byzantine faults is relevant in specific circumstances like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In an aerospace environment, data in a computer&#39;s memory may become corrupted due to radiation, leading it to respond to other nodes in unpredictable ways. The system has to be equipped to handle this to prevent plane crashes. Therefore, flight control systems must tolerate Byzantine faults.&lt;/li&gt;
&lt;li&gt;In a system with multiple participating organizations (e.g peer-to-peer networks like Bitcoin), some participants may attempt to cheat or defraud others. It&#39;s not safe to simply trust another node&#39;s messages.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In most server-side data systems, however, the cost of deploying Byzantine fault-tolerant solutions makes them impractical. Web applications do need other controls to prevent malicious behavior, which is why input validation, sanitization, and output escaping are important.&lt;/p&gt;
&lt;h5 id=&quot;weak-forms-of-lying&quot;&gt;Weak forms of lying &lt;a class=&quot;direct-link&quot; href=&quot;#weak-forms-of-lying&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;There are weaker forms of &amp;quot;lying&amp;quot; which are not full-blown Byzantine faults that we can protect against. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Network packets may get corrupted due to hardware issues or bugs in the OS. Such corruption is usually caught by the checksums built into TCP and UDP, but it sometimes evades detection. Checksums in the application-level protocol are a simple measure that can protect against it.&lt;/li&gt;
&lt;li&gt;A publicly accessible application must carefully sanitize all inputs from users.&lt;/li&gt;
&lt;li&gt;NTP clients can be synchronized with multiple server addresses instead of just one, which makes them more robust: one erroneous server among several good ones has far less impact than if it were the only server.&lt;/li&gt;
&lt;/ul&gt;
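&lt;p&gt;As a rough sketch of the first point, an application-level checksum can be as simple as prepending a CRC to each message and verifying it on receipt. The helper names below are made up for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import struct
import zlib

def encode_message(payload):
    # Prepend a CRC-32 of the payload so the receiver can detect
    # corruption that slipped past the TCP/UDP checksums.
    return struct.pack(&#39;&gt;I&#39;, zlib.crc32(payload)) + payload

def decode_message(message):
    (expected,) = struct.unpack(&#39;&gt;I&#39;, message[:4])
    payload = message[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError(&#39;checksum mismatch: message corrupted in transit&#39;)
    return payload
&lt;/code&gt;&lt;/pre&gt;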
&lt;h4 id=&quot;system-model-and-reality&quot;&gt;System Model and Reality &lt;a class=&quot;direct-link&quot; href=&quot;#system-model-and-reality&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When coming up with distributed systems algorithms, we need to write them in a way that doesn&#39;t depend too much on the hardware and software details. Basically, we need an abstraction for what the algorithm may assume, and the types of faults that we can expect in a system. This abstraction is known as a &lt;em&gt;system model.&lt;/em&gt;&lt;/p&gt;
&lt;h5 id=&quot;system-model-for-timing-assumptions&quot;&gt;System Model for Timing Assumptions &lt;a class=&quot;direct-link&quot; href=&quot;#system-model-for-timing-assumptions&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Synchronous model:&lt;/em&gt; In this model, we assume that there&#39;s a bounded network delay, bounded process pause, and bounded clock error. That is, although there might be errors or delays, it will never exceed a fixed upper bound.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Partially synchronous model:&lt;/em&gt; This model assumes that the system behaves like a synchronous model &lt;em&gt;most of the time,&lt;/em&gt; but sometimes exceeds the bounds for network delay, process pauses, and clock drift. This is the &lt;strong&gt;most realistic&lt;/strong&gt; model for timing assumptions, since many systems work correctly most of the time but occasionally exceed the upper bound.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Asynchronous model:&lt;/em&gt; Here, the system is not allowed to make any timing assumptions. It does not have a clock, and so it doesn&#39;t use timeouts. This model is very restrictive.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;system-model-for-node-failures&quot;&gt;System Model for Node Failures &lt;a class=&quot;direct-link&quot; href=&quot;#system-model-for-node-failures&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Crash-stop faults:&lt;/em&gt; This model assumes that a node can fail only by crashing, and that the node never comes back. That is, once it stops responding, it&#39;s gone forever.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Crash-recovery faults:&lt;/em&gt; This model assumes that nodes may crash at any moment, but also start responding again after some unknown time. Nodes here are assumed to have stable storage that gets preserved across crashes, and the in-memory state is assumed to be lost.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Byzantine faults:&lt;/em&gt; Nodes may do anything, including trying to trick other nodes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For modeling real systems, the most useful combination is generally the partially synchronous model with crash-recovery faults.&lt;/p&gt;
&lt;h5 id=&quot;correctness-of-an-algorithm&quot;&gt;Correctness of an algorithm &lt;a class=&quot;direct-link&quot; href=&quot;#correctness-of-an-algorithm&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;For an algorithm to be &lt;em&gt;correct,&lt;/em&gt; it must have certain &lt;em&gt;properties.&lt;/em&gt; For example, for a sorting algorithm to be correct, its output must satisfy certain properties, such as that of any two elements, the one further to the left is smaller than the one further to the right.&lt;/p&gt;
&lt;p&gt;Likewise, we can define properties for what it means for a distributed algorithm to be correct. For example, for generating fencing tokens, the algorithm may be required to satisfy the following properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Uniqueness:&lt;/em&gt; No two requests for a fencing token must return the same value.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Monotonic sequence:&lt;/em&gt; Token values must always increase over time.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Availability:&lt;/em&gt; A node that requests a fencing token and doesn&#39;t crash must eventually receive a response.&lt;/li&gt;
&lt;/ul&gt;
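&lt;p&gt;A minimal single-node sketch of a token generator satisfying the first two properties (the class and method names here are made up; availability in a distributed setting would additionally require replicating this state, e.g. via a consensus algorithm):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import threading

class FencingTokenService:
    def __init__(self):
        self._lock = threading.Lock()
        self._last_token = 0

    def next_token(self):
        # The lock makes each increment atomic, so tokens are unique
        # and strictly increasing (uniqueness + monotonic sequence).
        with self._lock:
            self._last_token += 1
            return self._last_token
&lt;/code&gt;&lt;/pre&gt;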
&lt;p&gt;It&#39;s possible for some properties to hold while others don&#39;t. How do we distinguish between properties that must always hold and those that can tolerate caveats? The next section helps to answer that.&lt;/p&gt;
&lt;h5 id=&quot;safety-and-liveness&quot;&gt;Safety and liveness &lt;a class=&quot;direct-link&quot; href=&quot;#safety-and-liveness&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;There are two different kinds of properties that we can distinguish between: &lt;em&gt;safety&lt;/em&gt; and &lt;em&gt;liveness&lt;/em&gt; properties. In the example above for a fencing token, &lt;em&gt;uniqueness&lt;/em&gt; and &lt;em&gt;monotonic sequence&lt;/em&gt; are safety properties, while &lt;em&gt;availability&lt;/em&gt; is a liveness property.&lt;/p&gt;
&lt;p&gt;Safety properties are informally defined as: &lt;em&gt;nothing bad happens&lt;/em&gt;, while liveness properties are defined as &lt;em&gt;something good eventually happens.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;These informal definitions are subjective (what&#39;s good or bad, really) and it’s best not to read too much into them.&lt;/p&gt;
&lt;p&gt;The actual definitions of safety and liveness are precise and mathematical:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;If a safety property is violated, we can point at a particular point in time at which it was broken (for example, if the uniqueness property was violated, we can identify the particular operation in which a duplicate fencing token was returned). After a safety property has been violated, the violation cannot be undone — the damage is already done.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;A liveness property works the other way round: it may not hold at some point in time (for example, a node may have sent a request but not yet received a response), but there is always hope that it may be satisfied in the future (namely by receiving a response).&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 7924-7927). O&#39;Reilly Media. Kindle Edition.&lt;/p&gt;
&lt;p&gt;For distributed algorithms, it is commonly required that safety properties &lt;em&gt;always&lt;/em&gt; hold, in all possible situations of a system model. That is, even if all nodes crash or the entire network fails, the algorithm must never return a wrong result.&lt;/p&gt;
&lt;p&gt;On the other hand, we are allowed to make caveats with liveness properties. E.g. we could say that a request will only receive a response if a majority of nodes have not crashed, and only if the network eventually recovers from an outage.&lt;/p&gt;
&lt;h5 id=&quot;mapping-system-models-to-the-real-world&quot;&gt;Mapping system models to the real world &lt;a class=&quot;direct-link&quot; href=&quot;#mapping-system-models-to-the-real-world&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;While these theoretical abstractions are useful, reality must also be considered when designing algorithms. We may sometimes have to include code to handle situations where something that was assumed to be impossible actually happens.&lt;/p&gt;
&lt;p&gt;Proving the correctness of an algorithm does not mean that the implementation on a real system will always behave correctly. However, theoretical analysis is still a good first step because it can uncover problems that may remain hidden for a long time.&lt;/p&gt;
&lt;p&gt;Theoretical analysis and empirical testing are equally important.&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Chapter 7 - Transactions</title>
		<link href="https://timilearning.com/posts/ddia/part-two/chapter-7/"/>
		<updated>2019-12-15T01:52:41-00:00</updated>
		<id>https://timilearning.com/posts/ddia/part-two/chapter-7/</id>
		<content type="html">&lt;p&gt;My notes from Chapter 7 of &#39;Designing Data-Intensive Applications&#39; by Martin Kleppmann.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#the-meaning-of-acid&quot;&gt;The Meaning of ACID&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#single-object-and-multi-object-operations&quot;&gt;Single-Object and Multi-Object Operations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#weak-isolation-levels&quot;&gt;Weak Isolation Levels&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#read-committed&quot;&gt;Read Committed&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#dirty-reads&quot;&gt;Dirty Reads&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#dirty-writes&quot;&gt;Dirty Writes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#implementing-read-committed&quot;&gt;Implementing read committed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#snapshot-isolation-and-repeatable-read&quot;&gt;Snapshot Isolation and Repeatable Read&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#implementing-snapshot-isolation&quot;&gt;Implementing snapshot isolation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#indexes-and-snapshot-isolation&quot;&gt;Indexes and snapshot isolation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#repeatable-read-and-naming-confusion&quot;&gt;Repeatable read and naming confusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#preventing-lost-updates&quot;&gt;Preventing Lost Updates&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#automatically-detecting-lost-updates&quot;&gt;Automatically detecting lost updates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#compare-and-set&quot;&gt;Compare-and-set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conflict-resolution-and-replication&quot;&gt;Conflict resolution and replication&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-skew-and-phantoms&quot;&gt;Write Skew and Phantoms&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#materializing-conflicts&quot;&gt;Materializing Conflicts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#serializability&quot;&gt;Serializability&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#actual-serial-execution&quot;&gt;Actual Serial Execution&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#encapsulating-transactions-in-stored-procedures&quot;&gt;Encapsulating transactions in stored procedures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#partitioning&quot;&gt;Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#summary-of-serial-execution&quot;&gt;Summary of serial execution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#two-phase-locking-2pl&quot;&gt;Two-Phase Locking (2PL)&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#implementation-of-two-phase-locking&quot;&gt;Implementation of two-phase locking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#performance-of-two-phase-locking&quot;&gt;Performance of two-phase locking&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#predicate-locks&quot;&gt;Predicate Locks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#index-range-locks&quot;&gt;Index Range Locks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#serializable-snapshot-isolation&quot;&gt;Serializable Snapshot Isolation&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#decisions-based-on-an-outdated-premise&quot;&gt;Decisions based on an outdated premise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#performance-of-serializable-snapshot-isolation&quot;&gt;Performance of serializable snapshot isolation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Transactions were created to simplify the programming model for applications accessing a database.&lt;/p&gt;
&lt;p&gt;All the reads and writes in a transaction are executed as one operation: either the entire operation succeeds (&lt;em&gt;commit&lt;/em&gt;) or it fails (&lt;em&gt;abort&lt;/em&gt;, &lt;em&gt;rollback&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;NoSQL databases started gaining popularity in the late 2000s and aimed to improve the status quo of relational databases by offering new data models, and including replication and partitioning by default. However, many of these new models didn&#39;t implement transactions, or watered down the meaning of the word to describe a weaker set of guarantees than had previously been understood.&lt;/p&gt;
&lt;p&gt;As these distributed databases emerged, the belief that transactions oppose scalability became popular: that a system would have to abandon transactions in order to maintain good performance and high availability. This is not true, though.&lt;/p&gt;
&lt;p&gt;Like every technical design choice, there are advantages and disadvantages of using transactions. It&#39;s a tradeoff.&lt;/p&gt;
&lt;h3 id=&quot;the-meaning-of-acid&quot;&gt;The Meaning of ACID &lt;a class=&quot;direct-link&quot; href=&quot;#the-meaning-of-acid&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;ACID&lt;/strong&gt; stands for &lt;em&gt;Atomicity, Consistency, Isolation,&lt;/em&gt; and &lt;em&gt;Durability.&lt;/em&gt; It is often used to describe the safety guarantees provided by transactions.&lt;/p&gt;
&lt;p&gt;However, different databases have different implementations of ACID, and there are ambiguous definitions of a term like Isolation. ACID has essentially become a marketing term now.&lt;/p&gt;
&lt;p&gt;Systems that don&#39;t meet the ACID criteria are sometimes called &lt;em&gt;BASE: Basically Available, Soft state,&lt;/em&gt; and &lt;em&gt;Eventual Consistency,&lt;/em&gt; which can mean almost anything you want.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Atomicity&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The word &lt;em&gt;atomic&lt;/em&gt; in general means something that cannot be broken down into smaller parts.&lt;/p&gt;
&lt;p&gt;In the context of a database transaction, atomicity refers to the ability to abort a transaction on error and have all the writes from the transaction discarded.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Consistency in the context of ACID is an application-level constraint. It means that there are certain statements about your data (&lt;em&gt;invariants&lt;/em&gt;) that must always be true.&lt;/p&gt;
&lt;p&gt;It is up to the application to define what those invariants are so the transaction can preserve them correctly.&lt;/p&gt;
&lt;p&gt;Unlike the other terms, consistency is a property of the application, not the database: it is up to the application to preserve its invariants, though the database can help enforce some of them through constraints like uniqueness.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Isolation&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Isolation is the property of ensuring that concurrently executing transactions are isolated from each other i.e. they don&#39;t step on each other&#39;s toes. They pretend like they don&#39;t know about each other.&lt;/p&gt;
&lt;p&gt;Isolation ensures that when concurrently executing transactions are committed, the result is the same as if they had run &lt;em&gt;serially,&lt;/em&gt; even though they may have run concurrently.&lt;/p&gt;
&lt;p&gt;In practice though, serializable isolation is rarely used because of the performance penalty it carries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Durability&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Durability is the promise that when a transaction is committed successfully, any data that has been written will not be forgotten, even in the event of a hardware fault or database crashes.&lt;/p&gt;
&lt;p&gt;In single node databases, durability means that the data has been written to nonvolatile storage like a hard drive or SSD. It also usually involves a write-ahead log or similar which helps with recovery in case the data structures on disk are corrupted.&lt;/p&gt;
&lt;p&gt;In a replicated database, durability often means that data has been successfully copied to a number of nodes.&lt;/p&gt;
&lt;p&gt;However, perfect durability does not exist. If all the hard disks and backups are destroyed at the same time, there&#39;s nothing the database can do to save you. In replicated systems, for example, faults can be correlated (say, a power outage or a bug that crashes every node given a particular input) and can knock out all replicas at once.&lt;/p&gt;
&lt;h3 id=&quot;single-object-and-multi-object-operations&quot;&gt;Single-Object and Multi-Object Operations &lt;a class=&quot;direct-link&quot; href=&quot;#single-object-and-multi-object-operations&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The definitions of &lt;em&gt;atomicity&lt;/em&gt; and &lt;em&gt;isolation&lt;/em&gt; so far assume that several objects (rows, documents, records) will be modified at once. These are known as &lt;em&gt;multi-object transactions&lt;/em&gt; and are often needed if several pieces of data are to be kept in sync.&lt;/p&gt;
&lt;p&gt;We need a way to determine which read and write operations belong to the same transaction.&lt;/p&gt;
&lt;p&gt;In relational databases, this is done based on the client&#39;s TCP connection to the database server: on any particular connection, everything between BEGIN TRANSACTION and a COMMIT is considered to be part of the same transaction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Single-object writes&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Atomicity and isolation also apply when a single object is being changed. E.g.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If a 20KB JSON document is being written to a database and the network connection is interrupted after the first 10KB have been sent, does the database store the 10KB fragment of JSON?&lt;/li&gt;
&lt;li&gt;If power fails while the database is in the middle of overwriting the previous value on disk, will we have the previous and new values spliced together?&lt;/li&gt;
&lt;li&gt;If another client reads a document while it&#39;s being updated, will it see a partially updated value?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These issues are why storage engines almost universally aim to provide atomicity and isolation on the level of a single object (such as a key-value pair) on one node.&lt;/p&gt;
&lt;p&gt;Atomicity can be implemented by using a log for crash recovery, while isolation can be implemented using a lock on each object.&lt;/p&gt;
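&lt;p&gt;For a single object stored as a file, one simple way to get the all-or-nothing guarantee, without a full log, is to write the new value to a temporary file and atomically rename it over the old one; a crash mid-write then leaves either the old value or the new one, never a splice. A sketch (the function name is made up):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json
import os

def atomic_write(path, value):
    tmp = path + &#39;.tmp&#39;
    with open(tmp, &#39;w&#39;) as f:
        json.dump(value, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp, path)  # atomic rename: old or new, nothing in between
&lt;/code&gt;&lt;/pre&gt;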
&lt;p&gt;These single object operations are useful, but are not transactions in the typical sense of the word. A transaction is generally considered as a mechanism for grouping multiple operations on multiple objects into a single unit of execution.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The need for multi-object transactions&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Some distributed datastores have abandoned multi-object transactions because they are difficult to implement across partitions, and can get in the way in scenarios where very high availability or performance is required.&lt;/p&gt;
&lt;p&gt;There are some use cases where multi-object operations need to be coordinated e.g.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When we are adding new rows to a table which have references to a row in another table using foreign keys. The foreign keys have to be coordinated across the tables and must be correct and up to date.&lt;/li&gt;
&lt;li&gt;In a document data model, when denormalized information needs to be updated, several documents often need to be updated in one go.&lt;/li&gt;
&lt;li&gt;In databases with secondary indexes (i.e. almost everything except pure key-value stores), the indexes also need to be updated every time a value changes, so that they point at the new records.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These can be implemented without transactions, but error handling is more complex without atomicity and isolation (to prevent concurrency problems).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Handling Errors And Aborts&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The idea of atomicity is to make it so that retrying failed transactions is safe. However, it&#39;s not so straightforward. There are a number of things to take into consideration:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Suppose the transaction actually succeeded, but the network failed while the server was acknowledging the commit, so the client thinks it failed. Retrying the transaction will then cause it to be performed twice - unless there&#39;s a de-duplication mechanism in place.&lt;/li&gt;
&lt;li&gt;If the error is due to overload, retrying the transaction will only compound the problem.&lt;/li&gt;
&lt;li&gt;If a transaction has side effects outside of the database, those side effects may happen even if the transaction is aborted. E.g. Sending an email.&lt;/li&gt;
&lt;/ul&gt;
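&lt;p&gt;The first point is usually addressed by having the client attach a unique ID to each logical operation so the server can de-duplicate retries. A toy sketch (the class and method names are made up):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class AccountServer:
    def __init__(self):
        self.balance = 0
        self._results = {}  # request ID -&gt; result of the first attempt

    def apply_credit(self, request_id, amount):
        # A retried request replays the stored result instead of
        # crediting the account a second time.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self._results[request_id]
&lt;/code&gt;&lt;/pre&gt;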
&lt;h3 id=&quot;weak-isolation-levels&quot;&gt;Weak Isolation Levels &lt;a class=&quot;direct-link&quot; href=&quot;#weak-isolation-levels&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Concurrency issues (e.g. race conditions) happen when one transaction reads data that is concurrently modified by another transaction, or when two transactions simultaneously modify the same data.&lt;/p&gt;
&lt;p&gt;Concurrency issues are often hard to reproduce or test.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Transaction Isolation&lt;/em&gt; is the means by which databases typically try to hide concurrency issues from application developers.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Serializable Isolation&lt;/em&gt; is the ideal isolation required, as the database guarantees that transactions have the same effect as if they ran serially (one after another). However, this form has a performance cost which most databases don&#39;t want to pay. As a result, databases use weaker levels of isolation which prevent against some concurrency issues, but not all.&lt;/p&gt;
&lt;p&gt;Concurrency bugs caused by weak transaction isolation are not just theoretical. They are real issues which have led to loss of money, data corruption, and even an investigation by financial auditors. (e.g. &lt;a href=&quot;https://bitcointalk.org/index.php?topic=499580&quot; title=&quot;https://bitcointalk.org/index.php?topic=499580&quot;&gt;https://bitcointalk.org/index.php?topic=499580&lt;/a&gt;)&lt;/p&gt;
&lt;h4 id=&quot;read-committed&quot;&gt;Read Committed &lt;a class=&quot;direct-link&quot; href=&quot;#read-committed&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The core characteristics of this isolation level are that it prevents &lt;em&gt;dirty reads&lt;/em&gt; and &lt;em&gt;dirty writes.&lt;/em&gt;&lt;/p&gt;
&lt;h5 id=&quot;dirty-reads&quot;&gt;Dirty Reads &lt;a class=&quot;direct-link&quot; href=&quot;#dirty-reads&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;If an object has been updated in a transaction that has not yet committed, any other transaction being able to see that uncommitted data is known as a &lt;em&gt;dirty read.&lt;/em&gt; Read committed prevents this: when reading from the database, you will only see data that has been committed.&lt;/p&gt;
&lt;h5 id=&quot;dirty-writes&quot;&gt;Dirty Writes &lt;a class=&quot;direct-link&quot; href=&quot;#dirty-writes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;If an object has been updated in a transaction that has not yet committed, any other transaction being able to overwrite that uncommitted value is a &lt;em&gt;dirty write.&lt;/em&gt; Read committed prevents this: when writing to the database, you will only overwrite data that has been committed.&lt;/p&gt;
&lt;p&gt;While this isolation level is commonly used, it does not prevent certain race conditions. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Imagine one transaction reads a counter as &#39;30&#39; and increments it. If another transaction reads the same counter &lt;em&gt;before&lt;/em&gt; that increment commits, it also sees &#39;30&#39;, and its own increment will be based on that stale value even after the first transaction commits - so one of the increments is lost. Read committed doesn&#39;t prevent this, because its locks only apply to &lt;em&gt;objects that are being modified&lt;/em&gt;; at the time of the second read, the value had not yet been modified.&lt;/li&gt;
&lt;/ul&gt;
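&lt;p&gt;Compare-and-set (covered later in this chapter) is one way to avoid this lost update: the write succeeds only if the value is still the one that was read, and the caller re-reads and retries otherwise. An in-memory sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import threading

class Register:
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expected, new):
        # Succeeds only if no other writer changed the value in between.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

def increment(register):
    while True:  # retry until our read-modify-write wins
        current = register.get()
        if register.compare_and_set(current, current + 1):
            return
&lt;/code&gt;&lt;/pre&gt;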
&lt;h5 id=&quot;implementing-read-committed&quot;&gt;Implementing read committed &lt;a class=&quot;direct-link&quot; href=&quot;#implementing-read-committed&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;This is the default setting in many databases, including Oracle 11g, PostgreSQL, and SQL Server.&lt;/p&gt;
&lt;p&gt;Most databases prevent dirty writes by using row-level locks: when a transaction wants to modify an object, it must first acquire a lock on that object and hold it until the transaction is committed or aborted. Only one transaction can hold the lock at a time.&lt;/p&gt;
&lt;p&gt;Preventing dirty reads can also be implemented in a similar fashion. One can require that any transaction that wants to read an object should briefly acquire the lock and release it again after reading. This way, any write on an object that hasn&#39;t been committed cannot be read since the transaction that performed the write would still hold the lock.&lt;/p&gt;
&lt;p&gt;This approach of requiring locks before reading is inefficient in practice, because one long-running write transaction can force many read-only transactions to wait. Instead, most databases prevent dirty reads using this approach: &lt;em&gt;for every object that is written, the database remembers both the old committed value and the new value set by the transaction which holds the write lock. Any transactions that want to read the object are simply given the old value until the new value is committed.&lt;/em&gt;&lt;/p&gt;
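&lt;p&gt;A toy sketch of that approach, keeping one committed value per object plus the uncommitted value of whichever transaction holds the write lock (the class is made up for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class ReadCommittedStore:
    def __init__(self):
        self._committed = {}
        self._pending = {}  # key -&gt; (writer txid, uncommitted value)

    def read(self, key):
        # Readers always get the old committed value; they never see
        # (and never wait for) an uncommitted write.
        return self._committed.get(key)

    def write(self, txid, key, value):
        writer = self._pending.get(key)
        if writer is not None and writer[0] != txid:
            raise RuntimeError(&#39;write lock held by another transaction&#39;)
        self._pending[key] = (txid, value)

    def commit(self, txid):
        for key in [k for k, (t, _) in self._pending.items() if t == txid]:
            self._committed[key] = self._pending.pop(key)[1]
&lt;/code&gt;&lt;/pre&gt;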
&lt;h4 id=&quot;snapshot-isolation-and-repeatable-read&quot;&gt;Snapshot Isolation and Repeatable Read &lt;a class=&quot;direct-link&quot; href=&quot;#snapshot-isolation-and-repeatable-read&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;With the read committed isolation level, there is still room for concurrency bugs. One of the anomalies that can happen is &lt;em&gt;a non-repeatable read&lt;/em&gt; or a &lt;em&gt;read skew.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;read skew&lt;/em&gt; means that you might read the value of an object in one transaction, then a separate transaction updates that value and commits, so a later read within your transaction sees a different value. This happens because read committed isolation only applies a lock to values that are about to be modified.&lt;/p&gt;
&lt;p&gt;Thus, a long running read-only transaction can have situations where the value of an object or multiple objects changes between when the transaction starts and when it ends, which can lead to inconsistencies.&lt;/p&gt;
&lt;p&gt;Basically, an example flow for this kind of anomaly is this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transaction A begins and reads objects from DB&lt;/li&gt;
&lt;li&gt;Transaction B begins and updates some of those objects (this can happen since Transaction A won&#39;t have a lock on those objects, as it&#39;s read-only).&lt;/li&gt;
&lt;li&gt;If Transaction A re-reads those objects within the same transaction for whatever reason, those values will have changed and the earlier read is &lt;em&gt;non-repeatable.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;A Fuzzy or Non-Repeatable Read occurs when a value that has been read by a still in-flight transaction is overwritten by another transaction. Even without a second read of the value actually occurring, this can still cause database invariants to be violated.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Source: &lt;a href=&quot;https://blog.acolyer.org/2016/02/24/a-critique-of-ansi-sql-isolation-levels/&quot; title=&quot;https://blog.acolyer.org/2016/02/24/a-critique-of-ansi-sql-isolation-levels/&quot;&gt;https://blog.acolyer.org/2016/02/24/a-critique-of-ansi-sql-isolation-levels/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Read skew is considered acceptable under read committed isolation, but some situations cannot tolerate that temporary inconsistency:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Backups:&lt;/strong&gt; A backup requires making a copy of a database which can take long hours. During this time, writes will be made to the database. It&#39;s possible that some parts of the backup will contain an older version of the data, and other parts will have a newer version. These inconsistencies will become permanent on the database level.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analytic queries and integrity checks:&lt;/strong&gt; Long running analytics queries could end up returning incorrect data if the data in the db has changed over the course of the run.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To solve this problem, &lt;em&gt;Snapshot isolation&lt;/em&gt; is commonly used. The main idea is that each transaction reads a &lt;em&gt;consistent snapshot&lt;/em&gt; of the database - that is, &lt;em&gt;a transaction will only see all the data that was committed in the database at the start of the transaction.&lt;/em&gt; Even if another transaction changes the data, it won&#39;t be seen by the current transaction.&lt;/p&gt;
&lt;p&gt;This kind of isolation is especially beneficial for long-running, read only queries like backups and analytics, as the data on which they operate remains the same throughout the transaction.&lt;/p&gt;
&lt;h5 id=&quot;implementing-snapshot-isolation&quot;&gt;Implementing snapshot isolation &lt;a class=&quot;direct-link&quot; href=&quot;#implementing-snapshot-isolation&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;A core principle of snapshot isolation is this:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Readers never block writers, and writers never block readers.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Implementations of snapshot isolation typically use write locks to prevent dirty writes, but have an alternate mechanism for preventing dirty reads.&lt;/p&gt;
&lt;p&gt;Write locks mean that a transaction that makes a write to an object can block the progress of another transaction that makes a write to the same object.&lt;/p&gt;
&lt;p&gt;To implement snapshot isolation, databases potentially keep multiple different committed versions of the same object. Because it maintains several versions of an object side by side, this technique is known as &lt;em&gt;multi-version concurrency control (MVCC).&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For a database providing only read committed isolation, we would only need to keep two versions of an object: the committed version and the overwritten-but-uncommitted version. With snapshot isolation, however, we may need to keep many versions of the same object. The scenario below explains why:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Suppose Transaction A and Transaction B start from the same snapshot of the database. If Transaction A commits before Transaction B finishes, the database still needs to keep the snapshot that Transaction B is using, as well as the new value committed by Transaction A.&lt;/li&gt;
&lt;li&gt;This can continue if there&#39;s a Transaction C, D, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That&#39;s why we could have multiple versions of the same object.&lt;/p&gt;
&lt;p&gt;Note that storage engines that support snapshot isolation typically use MVCC for their read committed isolation level as well.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;MVCC-based snapshot isolation is typically implemented&lt;/em&gt; by giving each transaction a unique, always-increasing transaction ID. Any writes to the database by a transaction are tagged with the transaction ID of the writer. Each row in a table is tagged with a created_by field, and where applicable a deleted_by field, holding the ID of the transaction that performed the creation or deletion.&lt;/p&gt;
&lt;p&gt;The transaction IDs are used as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;At the start of each transaction, the database notes all the other transactions that are in progress (i.e. not committed or aborted yet). Any writes made by these transactions are ignored, even if they commit later.&lt;/li&gt;
&lt;li&gt;Any writes made by transactions with a later transaction ID than the current one are also ignored, regardless of whether they have committed or not.&lt;/li&gt;
&lt;li&gt;All other writes are visible to the application&#39;s queries.&lt;/li&gt;
&lt;/ul&gt;
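&lt;p&gt;The visibility rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any particular database&#39;s implementation: it assumes we already know which transactions were in progress when ours started and which have since committed.&lt;/p&gt;

```python
def visible(version_txid, current_txid, in_progress_at_start, committed):
    """Decide whether a row version written by `version_txid` is visible
    to the transaction `current_txid`. Illustrative names only."""
    if version_txid in in_progress_at_start:
        return False  # the writer was still in progress when we started
    if version_txid > current_txid:
        return False  # the writer started after us
    return version_txid in committed  # aborted writers are also invisible

# Txn 10 can see txn 5's committed write...
assert visible(5, 10, in_progress_at_start={7}, committed={5})
# ...but not txn 7's write (in progress at our start, even though committed)...
assert not visible(7, 10, in_progress_at_start={7}, committed={7})
# ...and not txn 12's write (later transaction ID).
assert not visible(12, 10, in_progress_at_start=set(), committed={12})
```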
&lt;h5 id=&quot;indexes-and-snapshot-isolation&quot;&gt;Indexes and snapshot isolation &lt;a class=&quot;direct-link&quot; href=&quot;#indexes-and-snapshot-isolation&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;One option with indexes and snapshot isolation is to have the index point to all the versions of an object and require any index query to filter out object versions which are not visible to the current transaction.&lt;/p&gt;
&lt;p&gt;Some databases like CouchDB and Datomic use an &lt;em&gt;append-only B-tree&lt;/em&gt; which does not overwrite pages of the tree when they are updated or modified, but instead creates a new copy of each modified page. PostgreSQL has optimizations to avoid index updates if different versions of the same object can fit on the same page.&lt;/p&gt;
&lt;h5 id=&quot;repeatable-read-and-naming-confusion&quot;&gt;Repeatable read and naming confusion &lt;a class=&quot;direct-link&quot; href=&quot;#repeatable-read-and-naming-confusion&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;SQL isolation levels are not standardized, and so some databases refer to an implementation of snapshot isolation as &lt;em&gt;serializable&lt;/em&gt; (e.g. Oracle) or &lt;em&gt;repeatable read&lt;/em&gt; (e.g. PostgreSQL and MySQL).&lt;/p&gt;
&lt;p&gt;DB2 refers to serializability as &amp;quot;repeatable read&amp;quot;.&lt;/p&gt;
&lt;p&gt;In summary, naming here is a shitshow in database town.&lt;/p&gt;
&lt;h4 id=&quot;preventing-lost-updates&quot;&gt;Preventing Lost Updates &lt;a class=&quot;direct-link&quot; href=&quot;#preventing-lost-updates&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;So far, we have only discussed how to prevent dirty writes, but haven&#39;t spoken about another problem that could occur: the &lt;em&gt;lost update&lt;/em&gt; problem. Basically, if two transactions concurrently update the same value, what&#39;s going to happen?&lt;/p&gt;
&lt;p&gt;The lost update problem mainly occurs when an application &lt;em&gt;reads&lt;/em&gt; a value, &lt;em&gt;modifies&lt;/em&gt; it, and &lt;em&gt;writes&lt;/em&gt; back the modified value (called a &lt;em&gt;read-modify-write cycle)&lt;/em&gt;. If two transactions try to do this concurrently, one of the updates can be lost as the second write does not include the first modification.&lt;/p&gt;
&lt;p&gt;The key difference between this and a dirty write: &lt;em&gt;overwriting a value that has already been committed is not a dirty write. A dirty write happens when a transaction overwrites a value that was updated by another, still-uncommitted transaction.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This can happen in different scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Incrementing a counter. This requires reading the current value, calculating the new value, and writing back the updated value. If two transactions do this concurrently, one of the updates will be lost.&lt;/li&gt;
&lt;li&gt;Making a local change to a complex value. E.g. Adding an element to a list within a JSON document.&lt;/li&gt;
&lt;li&gt;Two users editing a wiki page at the same time, where each user&#39;s change is saved by sending the entire page contents to the server. Whichever save happens second overwrites the other user&#39;s changes.&lt;/li&gt;
&lt;/ul&gt;
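&lt;p&gt;The counter scenario can be made concrete with a toy interleaving. This sketch simulates two read-modify-write cycles on a plain in-memory dictionary; there is no real database here, just the race itself:&lt;/p&gt;

```python
counter = {"foo": 42}

# Two transactions each read the counter before either writes it back.
a_read = counter["foo"]      # transaction A reads 42
b_read = counter["foo"]      # transaction B also reads 42

counter["foo"] = a_read + 1  # A writes 43 and commits
counter["foo"] = b_read + 1  # B writes 43, clobbering A's increment

# The counter was incremented twice but only went up by one:
assert counter["foo"] == 43
```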
&lt;p&gt;A variety of solutions have been developed to deal with this scenario:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Atomic Write Operations:&lt;/strong&gt; Many databases support atomic updates, which remove the need for the read-modify-write cycles. Atomic updates look like:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;UPDATE counters SET counter = counter + 1 WHERE key = &#39;foo&#39;&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Atomic operations are usually implemented by taking an exclusive lock on the object when it&#39;s read to prevent any other transaction from reading it until the update has been applied.&lt;/p&gt;
&lt;p&gt;Another option for implementing atomic writes is to ensure that all atomic operations run on the same database thread.&lt;/p&gt;
&lt;p&gt;However, not all database updates fit into this model. Some updates require more complex logic and won&#39;t benefit from atomic writes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Explicit Locking:&lt;/strong&gt; Another option for preventing lost updates is to explicitly lock objects which are going to be updated. You may need to specify in your application&#39;s code through an ORM or directly in SQL that the rows returned from a query should be locked.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
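&lt;p&gt;Here is the atomic-update approach from above, sketched with Python&#39;s built-in sqlite3 module (the table and key names are just the example&#39;s): the read and the write happen inside a single SQL statement, so there is no read-modify-write window for another transaction to slip into.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, counter INTEGER)")
conn.execute("INSERT INTO counters VALUES ('foo', 0)")

# Atomic update: the database applies read + increment + write as one unit.
for _ in range(3):
    conn.execute("UPDATE counters SET counter = counter + 1 WHERE key = 'foo'")

(value,) = conn.execute(
    "SELECT counter FROM counters WHERE key = 'foo'").fetchone()
assert value == 3
```

&lt;p&gt;For explicit locking, the equivalent idea in SQL is &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; (supported in e.g. PostgreSQL and MySQL, though not SQLite), which locks the returned rows until the transaction ends.&lt;/p&gt;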
&lt;h5 id=&quot;automatically-detecting-lost-updates&quot;&gt;Automatically detecting lost updates &lt;a class=&quot;direct-link&quot; href=&quot;#automatically-detecting-lost-updates&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The methods discussed above (atomic operations and locks) are good ways of preventing lost updates as they force the read-modify-write cycles to occur sequentially. One alternative is to allow concurrent cycles to execute in parallel and let the transaction manager detect a lost update.&lt;/p&gt;
&lt;p&gt;When a lost update is detected, the transaction can be aborted and it can be forced to retry its read-modify-write cycle.&lt;/p&gt;
&lt;p&gt;This approach has the advantage that the check for lost updates can be performed efficiently with snapshot isolation.&lt;/p&gt;
&lt;p&gt;PostgreSQL, Oracle and SQL Server automatically detect lost updates.&lt;/p&gt;
&lt;p&gt;Another advantage is that application code does not have to use any special database features like atomic operations or explicit locking to prevent lost updates.&lt;/p&gt;
&lt;h5 id=&quot;compare-and-set&quot;&gt;Compare-and-set &lt;a class=&quot;direct-link&quot; href=&quot;#compare-and-set&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Databases which don&#39;t provide transactions often provide an atomic compare-and-set operation instead.&lt;/p&gt;
&lt;p&gt;What this means is that when an operation wants to update a value, it reads the previous value and only completes if the value at the time of the update is the same as the value it read earlier.&lt;/p&gt;
&lt;p&gt;This is safe as long as the database is not comparing the current value against an old snapshot.&lt;/p&gt;
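&lt;p&gt;Compare-and-set can be expressed in plain SQL by putting the previously read value into the WHERE clause. A minimal sketch with sqlite3, using the wiki-page example from earlier (table and column names are illustrative):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO pages VALUES (1, 'old text')")

def update_if_unchanged(conn, page_id, expected, new):
    # The WHERE clause makes this a compare-and-set: the update only
    # applies if the content still matches what we read earlier.
    cur = conn.execute(
        "UPDATE pages SET content = ? WHERE id = ? AND content = ?",
        (new, page_id, expected))
    return cur.rowcount == 1  # True if the compare-and-set succeeded

assert update_if_unchanged(conn, 1, "old text", "new text")   # succeeds
assert not update_if_unchanged(conn, 1, "old text", "other")  # stale read
```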
&lt;h5 id=&quot;conflict-resolution-and-replication&quot;&gt;Conflict resolution and replication &lt;a class=&quot;direct-link&quot; href=&quot;#conflict-resolution-and-replication&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Locks and compare-and-set operations assume that there&#39;s a single up-to-date copy of the data. However, for databases with multi-leader or leaderless replication, there&#39;s no guarantee of a single up-to-date copy of data.&lt;/p&gt;
&lt;p&gt;A common approach in replicated databases is to allow concurrent writes to create several conflicting versions of a value, and to use application code or special data structures to resolve the conflicts.&lt;/p&gt;
&lt;h4 id=&quot;write-skew-and-phantoms&quot;&gt;Write Skew and Phantoms &lt;a class=&quot;direct-link&quot; href=&quot;#write-skew-and-phantoms&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;To recap the two race conditions we have treated so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Dirty Writes:&lt;/em&gt; One running transaction overwrites an update made by another running, uncommitted transaction. If a transaction updates multiple objects, dirty writes can lead to a bad outcome. The book gives the example of Alice and Bob buying a car in different transactions, where each purchase must update both the listings and invoices tables. If Alice writes to the listings table first but Bob overwrites her entry, while Bob writes to the invoices table first but Alice overwrites his, the two tables end up with inconsistent records: the records for that car should name the same recipient.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Lost Update:&lt;/em&gt; If two transactions run concurrently and one commits first, the later one can overwrite an update made by the earlier transaction, leading to lost changes. E.g. incrementing a counter.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Another anomaly that can happen is &lt;em&gt;write skew.&lt;/em&gt; It occurs when two transactions read the same objects and then update some of those objects, with each transaction possibly updating different objects.&lt;/p&gt;
&lt;p&gt;Imagine a scenario where two transactions running concurrently first make a query, and then update a database object based on the result of the first query. The operations performed by the transaction to commit first may render the result of the query invalid for the later transaction.&lt;/p&gt;
&lt;p&gt;These transactions may update different objects (so it&#39;s neither a dirty write nor a lost update), but they&#39;ll still make the application function incorrectly. E.g. A meeting room booking app where two transactions running concurrently first see that a timespan was not booked, and then add a row each for different meetings. &lt;em&gt;This wouldn&#39;t be a problem if we could somehow have unique constraints on all the time ranges, but it&#39;s a problem if we don&#39;t.&lt;/em&gt;&lt;/p&gt;
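&lt;p&gt;The meeting-room example can be simulated without a database. In this sketch (a simulation of snapshot-isolated transactions, not real database behaviour), both transactions check the same snapshot, both checks pass, and both commits succeed, leaving the room double-booked:&lt;/p&gt;

```python
bookings = []  # committed state: the room starts out free

def book(room, slot, snapshot):
    """A transaction: check the snapshot for a conflict, buffer an insert."""
    if (room, slot) not in snapshot:
        return [(room, slot)]  # writes to apply at commit time
    return []                  # conflict found, book nothing

# Both transactions take their snapshot before either commits...
snapshot = list(bookings)
writes_a = book("room-1", "12:00", snapshot)
writes_b = book("room-1", "12:00", snapshot)

# ...so neither sees the other's insert and both commit.
bookings.extend(writes_a)
bookings.extend(writes_b)
assert len(bookings) == 2  # double-booked: write skew
```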
&lt;p&gt;Database constraints like uniqueness and foreign key constraints may help to enforce this, but they&#39;re not always applicable.&lt;/p&gt;
&lt;p&gt;Serializable isolation helps to prevent this. However, if it&#39;s not available, one way of preventing this is to explicitly lock the rows that a transaction depends on. Unfortunately, if the original query returns no rows (say it&#39;s checking for the absence of rows matching a condition), we can&#39;t attach locks to anything.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Another example:&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Suppose you&#39;re distributing students into classrooms and want to make sure that no two students with the same last name belong to the same class. Two concurrently executing transactions could each check that the condition is met, then insert separate rows for the same surname. Of course, a simple solution to this is a uniqueness constraint.&lt;/p&gt;
&lt;p&gt;The effect, where a write in one transaction changes the result of a search query in another transaction, is called a &lt;strong&gt;&lt;em&gt;phantom.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;h5 id=&quot;materializing-conflicts&quot;&gt;Materializing Conflicts &lt;a class=&quot;direct-link&quot; href=&quot;#materializing-conflicts&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;As described above, we can reduce the effect of phantoms by attaching locks to the rows used in a transaction. However, if there&#39;s no object to which we can attach the locks (say if our initial query is searching for the absence of rows), we can artificially introduce locks.&lt;/p&gt;
&lt;p&gt;The approach of taking a phantom and turning it into a lock conflict on a concrete set of rows introduced in the database is known as &lt;em&gt;materializing conflicts.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This should be a last resort, as a serializable isolation level is much preferable in most cases.&lt;/p&gt;
&lt;h3 id=&quot;serializability&quot;&gt;Serializability &lt;a class=&quot;direct-link&quot; href=&quot;#serializability&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Serializable isolation is regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result will be as if they executed one at a time, without any concurrency.&lt;/p&gt;
&lt;p&gt;Most databases use one of the following three techniques for serializable isolation, which we will explore next:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Actual serial execution&lt;/em&gt; i.e. literally executing transactions in a serial order.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Two-phase locking&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Serializable snapshot isolation&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;actual-serial-execution&quot;&gt;Actual Serial Execution &lt;a class=&quot;direct-link&quot; href=&quot;#actual-serial-execution&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The simplest way to avoid concurrency issues is by removing concurrency entirely, and making sure transactions are executed in serial order, on a single thread.&lt;/p&gt;
&lt;p&gt;This idea only started being used in practice fairly recently, around 2007, when database designers saw that it was feasible to use a single-threaded loop for executing transactions and still get good performance. Two developments led to this revelation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;RAM became cheap enough that it is now possible to fit an entire dataset in memory for many use cases. When the entire dataset needed for a transaction is stored in memory, transactions execute much &lt;em&gt;faster.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;They also realized that OLTP transactions are usually short and make only a small number of reads and writes. Unlike OLAP transactions which are typically run on a consistent snapshot, these transactions are short enough to be run one by one.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach for serial transaction execution is implemented in Datomic, Redis, and others.&lt;/p&gt;
&lt;p&gt;This can even be faster than databases that implement concurrency, as there is no coordination overhead of locking.&lt;/p&gt;
&lt;p&gt;The downside is that its throughput is limited to that of a single CPU core, unless you can structure your transactions to work within partitions (discussed below).&lt;/p&gt;
&lt;h5 id=&quot;encapsulating-transactions-in-stored-procedures&quot;&gt;Encapsulating transactions in stored procedures &lt;a class=&quot;direct-link&quot; href=&quot;#encapsulating-transactions-in-stored-procedures&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Transactions are typically executed in a client/server style, one statement at a time: an application makes a query, reads the result, maybe makes another query depending on the first result etc.&lt;/p&gt;
&lt;p&gt;In this interactive style, a lot of time is spent in network communication between the application and the database.&lt;/p&gt;
&lt;p&gt;In a database with no concurrency that only processes one transaction at a time, the throughput won&#39;t be great, as the transaction will spend most of its time waiting for the next request.&lt;/p&gt;
&lt;p&gt;For this reason, databases which have single-threaded serial transaction processing do not allow interactive multi-statement transactions. Instead, all the requests in a transaction must be submitted at the same time as a &lt;em&gt;stored procedure.&lt;/em&gt; With this approach, there&#39;s only one network hop instead of having multiple.&lt;/p&gt;
&lt;p&gt;Stored procedures and in-memory data together make it feasible to execute transactions on a single thread.&lt;/p&gt;
&lt;h5 id=&quot;partitioning&quot;&gt;Partitioning &lt;a class=&quot;direct-link&quot; href=&quot;#partitioning&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;As mentioned earlier, executing transactions serially makes concurrency control simpler, but limits the throughput to the speed of a single CPU core on a single machine.&lt;/p&gt;
&lt;p&gt;What if we could take advantage of the presence of multiple cores and multiple nodes by partitioning the data?&lt;/p&gt;
&lt;p&gt;If we can partition the data so that each transaction only needs to read and write data within a single partition, then each partition can have its own transaction processing thread running independently from the others.&lt;/p&gt;
&lt;p&gt;This will likely involve some cross-partition coordination though, especially in the presence of secondary indexes. Cross-partition transactions have much lower throughput than single-partition ones.&lt;/p&gt;
&lt;h5 id=&quot;summary-of-serial-execution&quot;&gt;Summary of serial execution &lt;a class=&quot;direct-link&quot; href=&quot;#summary-of-serial-execution&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;It requires that each transaction is small and fast, else one slow transaction can stall all transaction processing.&lt;/li&gt;
&lt;li&gt;It&#39;s limited to use cases where the active dataset can fit in memory. A transaction that needs to access data not in memory can slow down processing.&lt;/li&gt;
&lt;li&gt;Write throughput must be low enough to be handled on a CPU core, or else transactions need to be partitioned without requiring cross-partition coordination.&lt;/li&gt;
&lt;/ul&gt;
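&lt;p&gt;The single-threaded model above can be sketched as a loop draining a queue of whole transactions (the &amp;quot;stored procedures&amp;quot;). This is a toy illustration of the execution model, not how Redis or Datomic are actually implemented:&lt;/p&gt;

```python
from queue import Queue

store = {"x": 0, "y": 0}
txn_queue = Queue()

# Each transaction is submitted whole, as a function over the store,
# rather than as an interactive back-and-forth of statements.
txn_queue.put(lambda db: db.__setitem__("x", db["x"] + 1))
txn_queue.put(lambda db: db.__setitem__("y", db["x"] * 2))

# The single-threaded loop: no locks needed, nothing runs concurrently.
while not txn_queue.empty():
    txn = txn_queue.get()
    txn(store)

assert store == {"x": 1, "y": 2}  # transactions applied strictly in order
```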
&lt;h4 id=&quot;two-phase-locking-(2pl)&quot;&gt;Two-Phase Locking (2PL) &lt;a class=&quot;direct-link&quot; href=&quot;#two-phase-locking-(2pl)&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;For a long time (around 30 years), two-phase locking was the most widely used algorithm for serializability in databases.&lt;/p&gt;
&lt;p&gt;The key ideas behind two-phase locking are these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A transaction cannot write a value that has been read by another transaction. It must wait till the other transaction commits or aborts.&lt;/li&gt;
&lt;li&gt;A transaction cannot read a value that has been written by another transaction. It must wait till the other transaction commits or aborts. Reading an old version of the object (like in snapshot isolation) is not acceptable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Unlike snapshot isolation, readers can block writers here, and writers can block readers. (Recall that snapshot isolation has the mantra: &lt;em&gt;readers never block writers, and writers never block readers)&lt;/em&gt;&lt;/p&gt;
&lt;h5 id=&quot;implementation-of-two-phase-locking&quot;&gt;Implementation of two-phase locking &lt;a class=&quot;direct-link&quot; href=&quot;#implementation-of-two-phase-locking&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The blocking of readers and writers is implemented by having a lock on each object used in a transaction. The lock can either be in &lt;em&gt;shared mode&lt;/em&gt; or in &lt;em&gt;exclusive mode.&lt;/em&gt; The lock is used as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When a transaction wants to read an object, it must first acquire a shared mode lock. Multiple read-only transactions can share the lock on an object.&lt;/li&gt;
&lt;li&gt;When a transaction wants to write an object, it must acquire an exclusive lock on that object.&lt;/li&gt;
&lt;li&gt;If a transaction first reads and then writes to an object, it may upgrade its shared lock to an exclusive lock.&lt;/li&gt;
&lt;li&gt;After a transaction has acquired the lock, it must continue to hold the lock until the end of the transaction.&lt;/li&gt;
&lt;/ul&gt;
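&lt;p&gt;A shared/exclusive lock like the one described above can be sketched with a condition variable. This is a minimal illustration of the two modes only; a real lock manager would also handle lock upgrades, fairness, and deadlock detection:&lt;/p&gt;

```python
import threading

class SharedExclusiveLock:
    """Per-object lock with shared (read) and exclusive (write) modes."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of shared-mode holders
        self._writer = False    # whether an exclusive holder exists

    def acquire_shared(self):
        with self._cond:
            while self._writer:          # wait out any exclusive holder
                self._cond.wait()
            self._readers += 1           # multiple readers may share

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:  # wait for everyone
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```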
&lt;p&gt;The &amp;quot;two-phase&amp;quot; name comes from the two phases: in the first phase, locks are acquired as the transaction executes; in the second phase, at the end of the transaction, all the locks are released.&lt;/p&gt;
&lt;p&gt;If an exclusive lock already exists on an object, a transaction which wants to acquire a shared mode lock must wait for the lock to be released (when the transaction is aborted or committed), and vice versa.&lt;/p&gt;
&lt;p&gt;Since so many locks are in use, a deadlock can easily happen if transaction A is stuck waiting for transaction B to release its lock, and vice versa. E.g. If transaction A has a lock on table A, and needs table B, but transaction B already has a lock on table B but needs table A, we are in a deadlock situation.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Aside: Note that some databases perform row-level locks and others perform table-level locks.&lt;/em&gt;&lt;/p&gt;
&lt;h5 id=&quot;performance-of-two-phase-locking&quot;&gt;Performance of two-phase locking &lt;a class=&quot;direct-link&quot; href=&quot;#performance-of-two-phase-locking&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Transaction throughput and response times of queries are significantly worse under two-phase locking than under weak isolation.&lt;/p&gt;
&lt;p&gt;This is partly a result of the overhead of acquiring and releasing locks, but more importantly due to reduced concurrency, as some transactions have to wait for others to complete. One slow transaction can cause the whole system to slow down.&lt;/p&gt;
&lt;p&gt;Deadlocks also occur more frequently here than in lock-based read committed isolation levels.&lt;/p&gt;
&lt;h5 id=&quot;predicate-locks&quot;&gt;Predicate Locks &lt;a class=&quot;direct-link&quot; href=&quot;#predicate-locks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The idea with predicate locks is to lock all the objects that match a search condition, including objects that do not yet exist in the database. When a transaction runs a query, it acquires a predicate lock covering any objects that could match that query&#39;s condition.&lt;/p&gt;
&lt;h5 id=&quot;index-range-locks&quot;&gt;Index Range Locks &lt;a class=&quot;direct-link&quot; href=&quot;#index-range-locks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Predicate locks don&#39;t perform well, however, and checking for matching locks can become time consuming. Thus, most databases implement index-range locking.&lt;/p&gt;
&lt;p&gt;With index range locks, we don&#39;t just lock the objects which match a condition, we lock a bigger range of objects. For example, if an index is hit in the original query, we could lock any writes to that index entry, even if it doesn&#39;t match the condition.&lt;/p&gt;
&lt;p&gt;These are not as precise as predicate locks, but there&#39;s less overhead with checking for locks as more objects will be locked.&lt;/p&gt;
&lt;h4 id=&quot;serializable-snapshot-isolation&quot;&gt;Serializable Snapshot Isolation &lt;a class=&quot;direct-link&quot; href=&quot;#serializable-snapshot-isolation&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Most of this chapter has been bleak, and has made it look like we either have serializable isolation with poor performance, or weak isolation levels that are prone to lost updates, dirty writes, write skews and phantoms.&lt;/p&gt;
&lt;p&gt;However, in 2008, &lt;a href=&quot;https://dl.acm.org/doi/10.1145/1620585.1620587&quot;&gt;Michael Cahill&#39;s PhD thesis&lt;/a&gt; introduced a new concept known as &lt;em&gt;serializable snapshot isolation.&lt;/em&gt; It provides full serializability at only a small performance penalty compared to snapshot isolation.&lt;/p&gt;
&lt;p&gt;The main idea here is that instead of blocking transactions with locks, it allows transactions to continue executing as normal until they are about to commit, at which point the database decides whether each transaction executed in a serializable manner. This approach is known as &lt;em&gt;an optimistic concurrency control technique&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Optimistic in this sense means that instead of blocking when something potentially dangerous happens, transactions continue anyway, in the hope that everything will turn out fine. The check happens only when a transaction wants to commit; if a conflict is found, the transaction is aborted. This approach differs from the &lt;em&gt;pessimistic technique&lt;/em&gt; used in two-phase locking.&lt;/p&gt;
&lt;p&gt;The pessimistic approach believes that if anything can go wrong, it&#39;s better to wait until the situation is safe again (using locks) before doing anything.&lt;/p&gt;
&lt;p&gt;Optimistic concurrency control is an old idea, but under the right conditions (e.g. contention between transactions is not too high and there&#39;s enough spare capacity), optimistic techniques tend to perform better than pessimistic ones.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;SSI&lt;/em&gt; is based on snapshot isolation and obeys the rules that readers don’t block writers, and writers don&#39;t block readers. The main difference is that SSI adds an algorithm for detecting serialization conflicts among writes and determining which transactions to abort.&lt;/p&gt;
&lt;h5 id=&quot;decisions-based-on-an-outdated-premise&quot;&gt;Decisions based on an outdated premise &lt;a class=&quot;direct-link&quot; href=&quot;#decisions-based-on-an-outdated-premise&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;With write skew in snapshot isolation, the recurring pattern was this: a transaction reads some data from the database, examines the result of the query and takes some action based on the result of that query. However, the result from the original query may no longer be valid as at the time the transaction commits, because the data may have been modified in the meantime.&lt;/p&gt;
&lt;p&gt;The database has to be able to detect situations in which a transaction may have acted on an outdated premise, and abort the transaction in that case. To do this, there are two cases to consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Detecting reads of a stale MVCC object version (uncommitted write occurred before the read).&lt;/li&gt;
&lt;li&gt;Detecting writes that affect prior reads (the write occurs after the read).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Detecting stale MVCC reads&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Basically, with snapshot isolation, there can be multiple versions of an object. When a transaction reads from a consistent snapshot in an MVCC database, it ignores writes that were made by transactions that had not committed at the time the snapshot was taken.&lt;/p&gt;
&lt;p&gt;This means that if Transaction A reads a value while there are uncommitted writes to that value by Transaction B, and Transaction B commits before Transaction A, then Transaction A may have performed operations based on that earlier read which are no longer valid.&lt;/p&gt;
&lt;p&gt;To prevent this anomaly, a database needs to keep track of transactions which ignore another transaction&#39;s writes due to MVCC visibility rules. When the transaction that performed the read wants to commit, the database checks whether any of the ignored writes have been committed. If so, the transaction must be aborted.&lt;/p&gt;
&lt;p&gt;Some of the reasons why it waits for the transaction to commit, rather than aborting a transaction when a stale read is detected are that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The reading transaction might be a read-only transaction, in which case there&#39;s no risk of a write skew. The database has no way of knowing whether the transaction will later perform a write.&lt;/li&gt;
&lt;li&gt;There&#39;s no guarantee that the transaction that performed the uncommitted write will actually commit, so the read may not turn out to be stale after all.&lt;/li&gt;
&lt;/ul&gt;
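&lt;p&gt;The stale-read check can be sketched as bookkeeping: track which writers each reader ignored, and at commit time abort if any of those writers has since committed. A toy sketch (not any real database&#39;s algorithm):&lt;/p&gt;

```python
ignored_writes = {}  # reader txid -> set of writer txids whose writes it ignored
committed = set()

def record_ignored(reader, writer):
    """Note that `reader` skipped `writer`'s write under MVCC visibility rules."""
    ignored_writes.setdefault(reader, set()).add(writer)

def try_commit(txid):
    """Abort if any write this transaction ignored has since committed."""
    if ignored_writes.get(txid, set()) & committed:
        return "abort"   # the premise of our earlier read is now stale
    committed.add(txid)
    return "commit"

record_ignored(reader=2, writer=1)  # txn 2 ignored txn 1's uncommitted write
assert try_commit(1) == "commit"    # txn 1 commits first...
assert try_commit(2) == "abort"     # ...so txn 2's premise is stale
```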
&lt;p&gt;SSI avoids unnecessary aborts and thus preserves snapshot isolation&#39;s support for long-running reads from a consistent snapshot.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Detecting writes that affect prior reads&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The key idea here is to keep track of which values (for example, track the indexes) have been read by what transactions. When a transaction writes to the database, it looks in the indexes for what other transactions have recently read the data. It then notifies the transactions that what they read may be out of date.&lt;/p&gt;
&lt;h5 id=&quot;performance-of-serializable-snapshot-isolation&quot;&gt;Performance of serializable snapshot isolation &lt;a class=&quot;direct-link&quot; href=&quot;#performance-of-serializable-snapshot-isolation&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The advantage this has over two-phase locking is that one transaction does not need to block waiting for locks held by another transaction. Recall that, as under snapshot isolation, readers don&#39;t block writers here and vice versa.&lt;/p&gt;
&lt;p&gt;This also has the advantage over serial execution that it is not limited to the throughput of a single CPU core.&lt;/p&gt;
&lt;p&gt;Because the rate of aborts affects the performance of SSI, it requires that read-write transactions be fairly short, as long-running ones are more likely to run into conflicts. Long-running read-only transactions may be fine.&lt;/p&gt;
&lt;p&gt;However, note that SSI is likely to be less sensitive to slow transactions than two-phase locking or serial execution.&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Chapter 6 - Partitioning</title>
		<link href="https://timilearning.com/posts/ddia/part-two/chapter-6/"/>
		<updated>2019-12-15T01:47:10-00:00</updated>
		<id>https://timilearning.com/posts/ddia/part-two/chapter-6/</id>
		<content type="html">&lt;p&gt;My notes from Chapter 6 of &#39;Designing Data-Intensive Applications&#39; by Martin Kleppmann.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#partitioning-and-replication&quot;&gt;Partitioning and Replication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#partitioning-of-key-value-data&quot;&gt;Partitioning of Key-Value Data&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#partitioning-by-key-range&quot;&gt;Partitioning by Key Range&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#partitioning-by-hash-of-key&quot;&gt;Partitioning by Hash of Key&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#skewed-workloads-and-relieving-hot-spots&quot;&gt;Skewed Workloads and Relieving Hot Spots&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#partitioning-and-secondary-indexes&quot;&gt;Partitioning and Secondary Indexes&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#partitioning-secondary-indexes-by-document&quot;&gt;Partitioning Secondary Indexes by Document&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#partitioning-secondary-indexes-by-term&quot;&gt;Partitioning Secondary Indexes by Term&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#rebalancing-partitions&quot;&gt;Rebalancing Partitions&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#operations-automatic-or-manual-rebalancing&quot;&gt;Operations: Automatic or Manual Rebalancing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#request-routing&quot;&gt;Request Routing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;This refers to breaking up a large data set into &lt;em&gt;partitions.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Partitioning is also known as &lt;em&gt;sharding.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Partitions are known as &lt;em&gt;shards&lt;/em&gt; in MongoDB, Elasticsearch and SolrCloud, &lt;em&gt;regions&lt;/em&gt; in Hbase, &lt;em&gt;tablets&lt;/em&gt; in Bigtable, &lt;em&gt;vnodes&lt;/em&gt; in Cassandra and Riak, and &lt;em&gt;vBucket&lt;/em&gt; in Couchbase.&lt;/li&gt;
&lt;li&gt;Normally, each piece of data belongs to only one partition.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Scalability&lt;/em&gt; is the main reason for partitioning data. It enables a large dataset to be distributed across many disks, and a query load can be distributed across many processors.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;partitioning-and-replication&quot;&gt;Partitioning and Replication &lt;a class=&quot;direct-link&quot; href=&quot;#partitioning-and-replication&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Copies of each partition are usually stored on multiple nodes. Therefore, although a record belongs to only one partition, it may be stored on several nodes for fault tolerance.&lt;/p&gt;
&lt;p&gt;A node may also store more than one partition. If the leader-follower replication model is used for partitions, each node may be the leader for some partitions and a follower for other partitions.&lt;/p&gt;
&lt;h3 id=&quot;partitioning-of-key-value-data&quot;&gt;Partitioning of Key-Value Data &lt;a class=&quot;direct-link&quot; href=&quot;#partitioning-of-key-value-data&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The goal of partitioning is to spread data and query load evenly across nodes.&lt;/p&gt;
&lt;p&gt;If partitioning is unfair, i.e. some partitions have more data or queries than others, we call it &lt;em&gt;skewed.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Skew makes partitioning less effective, as all the load could end up on one partition, effectively leaving the other nodes idle. A partition with a disproportionately high load is called a &lt;em&gt;hot spot.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Assigning records to nodes randomly is the simplest approach for avoiding hot spots, but this has the downside that it will be more difficult to read a particular item since there&#39;s no way of knowing which node the item is on.&lt;/p&gt;
&lt;p&gt;There are other approaches for this though, and we&#39;ll talk about them.&lt;/p&gt;
&lt;h4 id=&quot;partitioning-by-key-range&quot;&gt;Partitioning by Key Range &lt;a class=&quot;direct-link&quot; href=&quot;#partitioning-by-key-range&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;For key-value data where the key could be a primary key, each partition could be assigned a continuous range of keys, say a-d, e-g etc. This way, once we know the boundaries between the ranges, it&#39;s easy to determine which partition contains a given key.&lt;/p&gt;
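&lt;p&gt;As a toy sketch (the boundary keys here are made up), finding the owning partition is just a binary search over the sorted boundaries:&lt;/p&gt;

```python
import bisect

# Illustrative boundaries: partition 0 holds keys before "e", partition 1
# holds "e" up to (but not including) "k", and so on up to partition 4.
PARTITION_BOUNDARIES = ["e", "k", "q", "u"]

def partition_for(key):
    # Binary search over the sorted boundary list.
    return bisect.bisect_right(PARTITION_BOUNDARIES, key)

print(partition_for("apple"))  # 0
print(partition_for("kiwi"))   # 2
print(partition_for("zebra"))  # 4
```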
&lt;p&gt;These ranges do not have to be evenly spaced. Some partitions could cover a wider range of keys than others, because the data may not be evenly distributed. Say we&#39;re partitioning by name: one partition could hold all the keys from u to z, while another holds only a-c.&lt;/p&gt;
&lt;p&gt;To distribute data evenly, the partition boundaries need to adapt to the data.&lt;/p&gt;
&lt;p&gt;HBase, Bigtable, and RethinkDB use this partitioning strategy.&lt;/p&gt;
&lt;p&gt;Keys can also be kept in sorted order within each partition. This makes range scans efficient, e.g. finding all records whose keys start with &#39;a&#39;.&lt;/p&gt;
&lt;p&gt;The downside of this partitioning strategy is that some access patterns can lead to hot spots. E.g. if the key is a timestamp, then partitions correspond to ranges of time, say one partition per day. All the writes then end up going to the partition for the current day, so that partition can be overloaded with writes while the others sit idle.&lt;/p&gt;
&lt;p&gt;The solution for the example above is to use something apart from the timestamp as the primary key, but this is most effective if you know something else about the data. Say for sensor data, we could partition first by the sensor name and then by time. Though with this approach, you would need to perform a separate range query for each sensor name.&lt;/p&gt;
&lt;h4 id=&quot;partitioning-by-hash-of-key&quot;&gt;Partitioning by Hash of Key &lt;a class=&quot;direct-link&quot; href=&quot;#partitioning-by-hash-of-key&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Many distributed datastores use a hash function to determine the partition for a given key.&lt;/p&gt;
&lt;p&gt;When we have a suitable hash function for keys, each partition can be assigned a range of hashes, and every key whose hash falls within a partition&#39;s range will be stored for that partition.&lt;/p&gt;
&lt;p&gt;A good hash function takes skewed data and makes it uniformly distributed. The partition boundaries can be evenly spaced or chosen pseudo-randomly (with the latter sometimes known as &lt;em&gt;consistent hashing&lt;/em&gt;).&lt;/p&gt;
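&lt;p&gt;A minimal sketch of this, using MD5 as a stand-in for a suitable hash function, with evenly spaced boundaries over the hash space:&lt;/p&gt;

```python
import hashlib

HASH_SPACE = 2 ** 128  # md5 produces 128-bit values

def hash_partition(key, num_partitions=8):
    # A stable hash (unlike Python's builtin hash(), which is randomized
    # per process), so every node routes the same key the same way.
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    # Evenly spaced boundaries: partition i owns the i-th slice of the space.
    return h * num_partitions // HASH_SPACE
```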
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Aside:&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Consistent hashing, as defined by Karger et al, is a way of evenly distributing load across an internet-wide system of caches such as a content delivery network (CDN). It uses randomly chosen partition boundaries to avoid the need for central control or distributed consensus. Note that consistent here has nothing to do with replica consistency or ACID consistency, but rather describes a particular approach to rebalancing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;As we shall see in “Rebalancing Partitions”, this particular approach actually doesn’t work very well for databases, so it is rarely used in practice (the documentation of some databases still refers to consistent hashing, but it is often inaccurate). It&#39;s best to avoid the term &lt;em&gt;consistent hashing&lt;/em&gt; and just call it hash partitioning instead.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 5169-5174). O&#39;Reilly Media. Kindle Edition.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;This approach has the downside that we lose the ability to do efficient range queries. Keys that were once adjacent are now scattered across all the partitions, so sort order is lost.&lt;/p&gt;
&lt;p&gt;Note:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Range queries on the primary key are not supported by Riak, Couchbase, or Voldemort.&lt;/li&gt;
&lt;li&gt;MongoDB sends range queries to all the partitions.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Cassandra&lt;/em&gt; has a nice compromise for this. A table in Cassandra can be declared with a compound primary key consisting of several columns. Only the first part of the key is hashed, but the other columns act as a concatenated index. Thus, a query cannot search for a range of values within the first column of a compound key, but if it specifies a fixed value for the first column, it can perform an efficient range scan over the other columns of the key.&lt;/li&gt;
&lt;/ul&gt;
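&lt;p&gt;A rough sketch of the compound-key idea (not Cassandra&#39;s actual implementation): only the first component picks the partition, and the remaining components keep their sort order within it.&lt;/p&gt;

```python
import hashlib

def route_compound_key(partition_col, clustering_cols, num_partitions=8):
    # Hash only the first (partition) column to pick a partition; the
    # remaining columns act as a concatenated, sorted index within it,
    # which is what makes range scans over them efficient.
    h = int(hashlib.md5(partition_col.encode("utf-8")).hexdigest(), 16)
    return h % num_partitions, tuple(clustering_cols)

# All rows for one user land on the same partition, sorted by timestamp there.
p1, _ = route_compound_key("user:42", ["2022-01-01"])
p2, _ = route_compound_key("user:42", ["2022-06-30"])
print(p1 == p2)  # True
```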
&lt;h4 id=&quot;skewed-workloads-and-relieving-hot-spots&quot;&gt;Skewed Workloads and Relieving Hot Spots &lt;a class=&quot;direct-link&quot; href=&quot;#skewed-workloads-and-relieving-hot-spots&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Even though hashing the key helps to reduce skew and the number of hot spots, it can&#39;t avoid them entirely. In an extreme case where reads and writes are for the same key, we still end up with all requests being routed to the same partition.&lt;/p&gt;
&lt;p&gt;This workload seems unusual, but it&#39;s not unheard of. Imagine a social media site where a celebrity has lots of followers. If the celebrity posts something and that post has tons of replies, all the writes for that post could potentially end up at the same partition.&lt;/p&gt;
&lt;p&gt;Most data systems today are not able to automatically compensate for a skewed workload, so it&#39;s up to the application to reduce the skew. E.g. if a key is known to be very hot, one technique is to add a random number to the beginning or end of the key to split the writes across several keys. The downside, though, is that reads now have to do additional work to fetch and combine the split keys.&lt;/p&gt;
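&lt;p&gt;A sketch of the key-splitting technique, with an in-memory dict standing in for the datastore:&lt;/p&gt;

```python
import random

NUM_SPLITS = 10  # illustrative fan-out factor for one known-hot key

def write_hot(store, hot_key, value):
    # Writers append a random suffix, spreading writes over NUM_SPLITS
    # sub-keys (and therefore, typically, over several partitions).
    suffix = random.randrange(NUM_SPLITS)
    store.setdefault(f"{hot_key}#{suffix}", []).append(value)

def read_hot(store, hot_key):
    # The price: reads must now fetch all NUM_SPLITS sub-keys and merge.
    merged = []
    for s in range(NUM_SPLITS):
        merged.extend(store.get(f"{hot_key}#{s}", []))
    return merged
```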
&lt;h3 id=&quot;partitioning-and-secondary-indexes&quot;&gt;Partitioning and Secondary Indexes &lt;a class=&quot;direct-link&quot; href=&quot;#partitioning-and-secondary-indexes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The partition techniques discussed so far rely on a key-value data model, where records are only ever accessed by their primary key. In these situations, we can determine the partition from that key and use it to route read and write requests to the partition responsible for the key.&lt;/p&gt;
&lt;p&gt;If secondary indexes are involved though, this becomes more complex since they usually don&#39;t identify a record uniquely, but are a way of searching for all occurrences of a particular value.&lt;/p&gt;
&lt;p&gt;Many key-value stores like HBase and Voldemort have avoided secondary indexes because of their complexity, but some, like Riak, have started adding them because of their usefulness for data modelling.&lt;/p&gt;
&lt;p&gt;Secondary indexes are the bread &amp;amp; butter of search servers like Elasticsearch and Solr. However, a challenge with secondary indexes is that they do not map neatly to partitions. Two main approaches to partitioning a database with secondary indexes are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Document-based partitioning and&lt;/li&gt;
&lt;li&gt;Term-based partitioning.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;partitioning-secondary-indexes-by-document&quot;&gt;Partitioning Secondary Indexes by Document &lt;a class=&quot;direct-link&quot; href=&quot;#partitioning-secondary-indexes-by-document&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In this partitioning scheme, each document has a unique document ID, and the documents are partitioned by that ID (by key range or by hash of the ID). Each partition then maintains its own secondary index, covering only the documents in that partition; it does not care what data is stored in other partitions.&lt;/p&gt;
&lt;p&gt;A document-partitioned index is also known as a &lt;em&gt;local index.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The downside of this approach is that reads are more complicated. If we want to search by a term in the secondary index (say we&#39;re searching for all the red cars where there&#39;s an index on the color field), we would typically have to search through all the partitions, and then combine all the results we get back.&lt;/p&gt;
&lt;p&gt;This approach to querying a partitioned db is sometimes known as &lt;em&gt;scatter/gather,&lt;/em&gt; and it can make read queries on secondary indexes quite expensive. This approach is also prone to tail latency amplification (amplification in the higher percentiles), but it&#39;s widely used in Elasticsearch, Cassandra, Riak, MongoDB etc.&lt;/p&gt;
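&lt;p&gt;A toy illustration of scatter/gather over local indexes (the data is made up):&lt;/p&gt;

```python
# Each partition keeps a local secondary index over only its own documents.
partitions = [
    {"index": {"red": [191], "silver": [515]}},
    {"index": {"red": [306], "black": [893]}},
]

def scatter_gather(term):
    # Query every partition's local index, then combine the results.
    results = []
    for p in partitions:
        results.extend(p["index"].get(term, []))
    return sorted(results)

print(scatter_gather("red"))  # [191, 306]
```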
&lt;h4 id=&quot;partitioning-secondary-indexes-by-term&quot;&gt;Partitioning Secondary Indexes by Term &lt;a class=&quot;direct-link&quot; href=&quot;#partitioning-secondary-indexes-by-term&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In this approach, we keep a &lt;em&gt;global index&lt;/em&gt; of the secondary terms that covers data in all the partitions. However, we don&#39;t store the global index on one node, since it would likely become a bottleneck, but rather, we partition it. The global index must be partitioned, but it can be partitioned differently from the primary key index.&lt;/p&gt;
&lt;p&gt;The advantage of a global index over document-partitioned indexes is that reads are more efficient: rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term it wants.&lt;/p&gt;
&lt;p&gt;However, the downside is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index.&lt;/p&gt;
&lt;p&gt;In practice, updates to global secondary indexes are often asynchronous (i.e. if you read the index shortly after a write, it may not have reflected a new change).&lt;/p&gt;
&lt;h3 id=&quot;rebalancing-partitions&quot;&gt;Rebalancing Partitions &lt;a class=&quot;direct-link&quot; href=&quot;#rebalancing-partitions&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;How not to do it: hash mod n&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If we partition by &lt;em&gt;hash mod n&lt;/em&gt;, where n is the number of nodes, we run the risk of excessive rebalancing. If the number of nodes n changes, most of the keys will need to be moved from one node to another, making rebalancing excessively expensive.&lt;/p&gt;
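&lt;p&gt;A quick demonstration of the problem: going from 4 to 5 nodes with mod-n assignment relocates the vast majority of keys.&lt;/p&gt;

```python
import hashlib

def stable_hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

keys = [f"user:{i}" for i in range(1000)]
before = {k: stable_hash(k) % 4 for k in keys}  # 4 nodes
after = {k: stable_hash(k) % 5 for k in keys}   # one node added
moved = sum(1 for k in keys if before[k] != after[k])
# A key stays put only if its hash agrees mod 4 and mod 5, i.e. about
# 1 key in 5; the other ~80% must all be moved between nodes.
print(moved)
```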
&lt;p&gt;Next we&#39;ll discuss approaches that don&#39;t move data around more than necessary.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fixed Number of Partitions&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This approach is fairly simple. It involves creating more partitions than nodes, and assigning several partitions to each node.&lt;/p&gt;
&lt;p&gt;If a node is added to the cluster, the new node &lt;em&gt;steals&lt;/em&gt; a few partitions from other nodes until the partitions are fairly distributed again. Only entire partitions are moved between nodes; the number of partitions does not change, nor does the assignment of keys to partitions.&lt;/p&gt;
&lt;p&gt;What changes is the assignment of partitions to nodes. It may take some time to transfer a large amount of data over the network between nodes, so the old assignment of partitions is typically used for any reads/writes that happen while the transfer is in progress.&lt;/p&gt;
&lt;p&gt;This approach is used in Elasticsearch (where the number of primary shards is fixed on index creation), Riak, Couchbase, and Voldemort.&lt;/p&gt;
&lt;p&gt;With this approach, it&#39;s imperative to choose the right number of partitions.&lt;/p&gt;
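&lt;p&gt;A sketch of the idea: the key-to-partition mapping is fixed forever, and rebalancing only changes the partition-to-node assignment. (The round-robin assignment below moves more partitions than a careful implementation would; it&#39;s just for illustration.)&lt;/p&gt;

```python
NUM_PARTITIONS = 16  # fixed when the database/index is created (illustrative)

def key_to_partition(key):
    # This mapping never changes, no matter how many nodes there are.
    return sum(key.encode("utf-8")) % NUM_PARTITIONS

def assign_partitions(nodes):
    # Naive round-robin assignment of whole partitions to nodes; a real
    # system would move as few partitions as possible instead.
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

three_nodes = assign_partitions(["n0", "n1", "n2"])
four_nodes = assign_partitions(["n0", "n1", "n2", "n3"])
# Adding a node changes which node holds a partition,
# never which partition holds a key.
```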
&lt;p&gt;&lt;strong&gt;Dynamic partitioning&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In this rebalancing strategy, the number of partitions dynamically adjusts to fit the size of the data. This is especially useful for databases with key range partitioning, as a fixed number of partitions with fixed boundaries can be inconvenient if not configured correctly at first: all the data could end up on one node, leaving the others empty.&lt;/p&gt;
&lt;p&gt;When a partition grows to exceed a configured size, it is split into two partitions so that approximately half of the data ends up on each side of the split.&lt;/p&gt;
&lt;p&gt;Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition.&lt;/p&gt;
&lt;p&gt;An advantage of this approach is that the number of partitions adapts to the total data volume. If there&#39;s only a small amount of data, a small number of partitions is sufficient, so overheads are small.&lt;/p&gt;
&lt;p&gt;A downside of this approach is that an empty database starts off with a single partition, since there&#39;s no information about where to draw the partition boundaries. While the data set is small, all the writes will be processed by a single node while the others sit idle. Some databases like HBase and MongoDB mitigate this by allowing an initial set of partitions to be configured on an empty database.&lt;/p&gt;
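&lt;p&gt;The split rule above can be sketched as (the threshold is illustrative):&lt;/p&gt;

```python
SPLIT_THRESHOLD = 4  # illustrative maximum number of keys per partition

def maybe_split(partition):
    # When a partition grows past the threshold, split it at the median key
    # so roughly half the data ends up on each side of the split.
    keys = sorted(partition)
    if len(keys) > SPLIT_THRESHOLD:
        mid = len(keys) // 2
        left = {k: partition[k] for k in keys[:mid]}
        right = {k: partition[k] for k in keys[mid:]}
        return [left, right]
    return [partition]
```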
&lt;p&gt;&lt;strong&gt;Partitioning proportionally to nodes&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Some databases (e.g. Cassandra) have a third option of making the number of partitions proportional to the number of nodes. There is a fixed number of partitions &lt;em&gt;per node&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged, but when you increase the number of nodes, the partitions become smaller again.&lt;/p&gt;
&lt;p&gt;When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes one half of the split, leaving the other half in place.&lt;/p&gt;
&lt;p&gt;This randomization can produce an unfair split, but databases like Cassandra have come up with algorithms to mitigate the effect of that.&lt;/p&gt;
&lt;h4 id=&quot;operations%3A-automatic-or-manual-rebalancing&quot;&gt;Operations: Automatic or Manual Rebalancing &lt;a class=&quot;direct-link&quot; href=&quot;#operations%3A-automatic-or-manual-rebalancing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Databases like Couchbase, Riak, and Voldemort strike a good balance between automatic and manual rebalancing: they generate a suggested partition assignment automatically, but require an administrator to commit it before it takes effect.&lt;/p&gt;
&lt;p&gt;Fully automated rebalancing can be unpredictable. Rebalancing is an expensive operation because it requires rerouting requests and moving a large amount of data from one node to another. This process can overload the network or the nodes if not done carefully.&lt;/p&gt;
&lt;p&gt;For example, say one node is overloaded and is temporarily slow to respond to requests. The other nodes can conclude that the overloaded node is dead, and automatically rebalance the cluster to move load away from it. This puts additional load on the overloaded node, other nodes, and the network—making the situation worse and potentially causing a cascading failure.&lt;/p&gt;
&lt;h3 id=&quot;request-routing&quot;&gt;Request Routing &lt;a class=&quot;direct-link&quot; href=&quot;#request-routing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;There&#39;s an open question that remains: when a client makes a request, how does it know which node to connect to? It&#39;s especially important as the assignment of partitions to nodes changes.&lt;/p&gt;
&lt;p&gt;This is an instance of the general problem of &lt;em&gt;service discovery&lt;/em&gt;, i.e. locating things over a network. It isn&#39;t limited to databases: any piece of software that&#39;s accessible over a network has this problem.&lt;/p&gt;
&lt;p&gt;There are different approaches to solving this problem on a high level:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Allow clients to contact any node (e.g. via a round-robin load balancer). If the node happens to own the partition to which the request applies, it can handle the request directly. Otherwise, it&#39;ll forward it to the relevant node.&lt;/li&gt;
&lt;li&gt;Send all requests from clients to a routing tier first, which determines what node should handle what request and forwards it accordingly. This routing tier acts as a partition-aware load balancer.&lt;/li&gt;
&lt;li&gt;Require that clients be aware of the partitioning and assignment of partitions to nodes.&lt;/li&gt;
&lt;/ul&gt;
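&lt;p&gt;A toy sketch of the first approach, where any node accepts a request and forwards it to the partition&#39;s owner if needed:&lt;/p&gt;

```python
class Node:
    """Sketch of the 'contact any node' approach: a node either handles a
    request itself or forwards it to the node that owns the partition."""

    def __init__(self, name, owned_partitions):
        self.name = name
        self.owned = set(owned_partitions)
        self.routing = {}  # partition number to Node; shared cluster metadata

    def handle(self, key, num_partitions=4):
        partition = sum(key.encode("utf-8")) % num_partitions
        if partition in self.owned:
            return self.name  # handled locally
        # Forward to the relevant node.
        return self.routing[partition].handle(key, num_partitions)

a = Node("a", [0, 1])
b = Node("b", [2, 3])
routing = {0: a, 1: a, 2: b, 3: b}
a.routing = routing
b.routing = routing
print(a.handle("z"))  # forwarded to and answered by "b"
```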
&lt;p&gt;In all approaches, it&#39;s important for there to be a &lt;em&gt;consensus&lt;/em&gt; among all the nodes about which partitions belong to which nodes, especially as we tend to rebalance often.&lt;/p&gt;
&lt;p&gt;Many distributed systems rely on a coordination service like ZooKeeper to keep track of this cluster metadata. This way, the other components of the system, such as the routing tier or the partition-aware client, can subscribe to this information in ZooKeeper.&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Chapter 5 - Replication</title>
		<link href="https://timilearning.com/posts/ddia/part-two/chapter-5/"/>
		<updated>2019-12-13T23:30:12-00:00</updated>
		<id>https://timilearning.com/posts/ddia/part-two/chapter-5/</id>
		<content type="html">&lt;p&gt;My notes from the fifth chapter of Martin Kleppmann&#39;s book: Designing Data Intensive Applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#leaders-and-followers&quot;&gt;Leaders and Followers&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#synchronous-versus-asynchronous-replication&quot;&gt;Synchronous Versus Asynchronous Replication&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#synchronous-replication&quot;&gt;Synchronous Replication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#asynchronous-replication&quot;&gt;Asynchronous Replication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#setting-up-new-followers&quot;&gt;Setting Up New Followers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#handling-node-outages&quot;&gt;Handling Node Outages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#scenario-a---follower-failure%3A-catch-up-recovery&quot;&gt;Scenario A - Follower Failure: Catch-up recovery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#scenario-b---leader-failure%3A-failover&quot;&gt;Scenario B - Leader failure: Failover&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#implementation-of-replication-logs&quot;&gt;Implementation of Replication Logs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#problems-with-replication-lag&quot;&gt;Problems with Replication Lag&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#other-consistency-levels&quot;&gt;Other Consistency Levels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#solutions-for-replication-lag&quot;&gt;Solutions for Replication Lag&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#multi-leader-replication&quot;&gt;Multi-Leader Replication&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#use-cases-for-multi-leader-replication&quot;&gt;Use Cases for Multi-Leader Replication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#handling-write-conflicts.&quot;&gt;Handling Write Conflicts.&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#synchronous-versus-asynchronous-conflict-detection&quot;&gt;Synchronous versus asynchronous conflict detection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conflict-avoidance&quot;&gt;Conflict Avoidance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#converging-toward-a-consistent-state&quot;&gt;Converging toward a consistent state&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#custom-conflict-resolution-logic&quot;&gt;Custom Conflict Resolution Logic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#multi-leader-replication-topologies&quot;&gt;Multi-Leader Replication Topologies&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#leaderless-replication&quot;&gt;Leaderless Replication&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#preventing-stale-reads&quot;&gt;Preventing Stale Reads&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#read-repair-and-anti-entropy&quot;&gt;Read repair and anti-entropy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#quorums-for-reading-and-writing&quot;&gt;Quorums for reading and writing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#limitations-of-quorum-consistency&quot;&gt;Limitations of Quorum Consistency&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#monitoring-staleness&quot;&gt;Monitoring Staleness&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sloppy-quorums-and-hinted-handoff&quot;&gt;Sloppy Quorums and Hinted Handoff&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#multi-datacenter-operation&quot;&gt;Multi-datacenter operation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#detecting-concurrent-writes&quot;&gt;Detecting Concurrent Writes&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#last-write-wins-%28discarding-concurrent-writes%29&quot;&gt;Last write wins (discarding concurrent writes)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#the-%22happens-before%22-relationship-and-concurrency&quot;&gt;The &amp;quot;happens-before&amp;quot; relationship and concurrency&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#capturing-the-happens-before-relationship&quot;&gt;Capturing the happens-before relationship&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#merging-concurrently-written-values&quot;&gt;Merging Concurrently Written Values&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#version-vectors&quot;&gt;Version Vectors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Replication involves keeping a copy of the same data on multiple machines connected via a network. Reasons for this include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To increase the number of machines that can serve read queries - increased read throughput.&lt;/li&gt;
&lt;li&gt;To allow the system to continue working even if some of its parts have failed - increased availability.&lt;/li&gt;
&lt;li&gt;To keep data geographically close to users - reduced latency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The challenge with replication lies in handling changes to replicated data. Three algorithms for replicating changes between nodes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single leader&lt;/li&gt;
&lt;li&gt;Multi-leader&lt;/li&gt;
&lt;li&gt;Leaderless&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;leaders-and-followers&quot;&gt;Leaders and Followers &lt;a class=&quot;direct-link&quot; href=&quot;#leaders-and-followers&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Every node that keeps a copy of the data is a &lt;em&gt;replica&lt;/em&gt;. The obvious question is: how do we make sure that the data on all the replicas is the same? The most common approach is &lt;strong&gt;&lt;em&gt;leader-based replication&lt;/em&gt;.&lt;/strong&gt; In this approach:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Only the leader accepts writes.&lt;/li&gt;
&lt;li&gt;The followers read off a replication log and apply all the writes in the same order that they were processed by the leader.&lt;/li&gt;
&lt;li&gt;A client can query either the leader or any of its followers for read requests.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So here, the followers are read-only, while writes are only accepted by the leader.&lt;/p&gt;
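&lt;p&gt;A minimal in-memory sketch of this model:&lt;/p&gt;

```python
class Leader:
    def __init__(self):
        self.log = []   # replication log: writes in the order they happened
        self.data = {}

    def write(self, key, value):
        # Only the leader accepts writes; each one is appended to the log.
        self.log.append((key, value))
        self.data[key] = value

class Follower:
    def __init__(self):
        self.data = {}
        self.applied = 0  # position in the leader's log consumed so far

    def pull(self, leader):
        # Apply every new log entry in the same order the leader processed it.
        for key, value in leader.log[self.applied:]:
            self.data[key] = value
        self.applied = len(leader.log)
```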
&lt;p&gt;This approach is used by MySQL, PostgreSQL etc., as well as non-relational databases like MongoDB, RethinkDB, and Espresso.&lt;/p&gt;
&lt;h4 id=&quot;synchronous-versus-asynchronous-replication&quot;&gt;Synchronous Versus Asynchronous Replication &lt;a class=&quot;direct-link&quot; href=&quot;#synchronous-versus-asynchronous-replication&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;With &lt;em&gt;synchronous&lt;/em&gt; replication, the leader must wait for a positive acknowledgement that the data has been replicated to at least one of the followers before reporting the write as successful. With &lt;em&gt;asynchronous&lt;/em&gt; replication, the leader does not have to wait.&lt;/p&gt;
&lt;h5 id=&quot;synchronous-replication&quot;&gt;Synchronous Replication &lt;a class=&quot;direct-link&quot; href=&quot;#synchronous-replication&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The advantage of synchronous replication is that if the leader suddenly fails, we are guaranteed that the data is available on the follower.&lt;/p&gt;
&lt;p&gt;The disadvantage is that if the synchronous follower does not respond (say it has crashed or there&#39;s a network delay or something else), the write cannot be processed. A leader must block all writes and wait until the synchronous replica is available again. Therefore, it&#39;s impractical for all the followers to be synchronous, since just one node failure can cause the system to become unavailable.&lt;/p&gt;
&lt;p&gt;In practice, enabling synchronous replication on a database usually means that &lt;em&gt;one&lt;/em&gt; of the followers is synchronous, and the others are asynchronous. If the synchronous one is down, one of the asynchronous followers is made synchronous. This configuration is sometimes called &lt;em&gt;semi-synchronous&lt;/em&gt;.&lt;/p&gt;
&lt;h5 id=&quot;asynchronous-replication&quot;&gt;Asynchronous Replication &lt;a class=&quot;direct-link&quot; href=&quot;#asynchronous-replication&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;In this approach, if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost.&lt;/p&gt;
&lt;p&gt;An advantage of this approach though, is that the leader can continue processing writes, even if all its followers have fallen behind.&lt;/p&gt;
&lt;p&gt;There&#39;s some research into how to prevent systems from losing data on leader failure while retaining the good performance and availability of asynchronous replication. One such method, &lt;em&gt;chain replication&lt;/em&gt;, is a variant of synchronous replication that aims to provide good performance and availability without losing data.&lt;/p&gt;
&lt;h5 id=&quot;setting-up-new-followers&quot;&gt;Setting Up New Followers &lt;a class=&quot;direct-link&quot; href=&quot;#setting-up-new-followers&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;New followers can be added to an existing cluster to replace a failed node, or to add an additional replica. The next question is: how do we ensure the new follower has an accurate copy of the leader&#39;s data?&lt;/p&gt;
&lt;p&gt;Two options that are not sufficient are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Just copying data files from one node to another. The data on the leader is constantly changing, so a plain copy would capture different parts of the data at different points in time, producing an inconsistent result.&lt;/li&gt;
&lt;li&gt;Locking the database (hence making it unavailable for writes). This will go against the goal of high availability.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There&#39;s an option that works without downtime, which involves the following steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Take a consistent snapshot of the leader&#39;s db at some point in time. It&#39;s possible to do this without taking a lock on the entire db. Most databases have this feature.&lt;/li&gt;
&lt;li&gt;Copy the snapshot to the follower node.&lt;/li&gt;
&lt;li&gt;The follower then requests all the data changes that happened since the snapshot was taken.&lt;/li&gt;
&lt;li&gt;When the follower has processed the log of changes since the snapshot, we say it has caught up.&lt;/li&gt;
&lt;/ol&gt;
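&lt;p&gt;The four steps can be simulated in a few lines (everything here is an in-memory stand-in for the real snapshot and network copy):&lt;/p&gt;

```python
# The leader, modeled as an ordered replication log plus current state.
leader_log = [("x", 1), ("y", 2)]
leader_data = {"x": 1, "y": 2}

# Step 1: consistent snapshot, remembering its position in the log.
snapshot_position = len(leader_log)
follower_data = dict(leader_data)  # step 2: copy the snapshot across

# The leader keeps accepting writes while the copy is in flight.
leader_log.append(("x", 99))
leader_data["x"] = 99

# Steps 3-4: the follower replays everything after the snapshot position.
for key, value in leader_log[snapshot_position:]:
    follower_data[key] = value

print(follower_data == leader_data)  # True: the follower has caught up
```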
&lt;p&gt;In some systems, this process is fully automated, while in others, it is manually performed by an administrator.&lt;/p&gt;
&lt;h5 id=&quot;handling-node-outages&quot;&gt;Handling Node Outages &lt;a class=&quot;direct-link&quot; href=&quot;#handling-node-outages&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Any node can fail, therefore, we need to keep the system running despite individual node failures, and minimize the impact of a node outage. How do we achieve high availability with leader-based replication?&lt;/p&gt;
&lt;h5 id=&quot;scenario-a---follower-failure%3A-catch-up-recovery&quot;&gt;Scenario A - Follower Failure: Catch-up recovery &lt;a class=&quot;direct-link&quot; href=&quot;#scenario-a---follower-failure%3A-catch-up-recovery&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Each follower typically keeps a local log of the data changes it has received from the leader. If a follower node fails, it can compare its local log to the replication log maintained by the leader, and then process all the data changes that occurred when the follower was disconnected.&lt;/p&gt;
&lt;h5 id=&quot;scenario-b---leader-failure%3A-failover&quot;&gt;Scenario B - Leader failure: Failover &lt;a class=&quot;direct-link&quot; href=&quot;#scenario-b---leader-failure%3A-failover&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;This is trickier: One of the nodes needs to be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader. This whole process is called a &lt;em&gt;failover.&lt;/em&gt; Failover can be handled manually or automatically. An automatic failover consists of:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Determining that the leader has failed&lt;/em&gt;: Many things could go wrong: crashes, power outages, network issues etc. There&#39;s no foolproof way of determining what has gone wrong, so most systems use a timeout. If the leader does not respond within a given interval, it&#39;s assumed to be dead.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Choosing a new leader:&lt;/em&gt; This can be done through an election process (where the new leader is chosen by a majority of the remaining replicas), or a new leader could be appointed by a previously elected &lt;em&gt;controller node.&lt;/em&gt; The best candidate for leadership is typically the one with the most up-to-date data changes from the old leader (to minimize data loss).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Reconfiguring the system to use the new leader:&lt;/em&gt; Clients need to send write requests to the new leader, and followers need to process the replication log from the new leader. The system also needs to ensure that when the old leader comes back, it does not believe that it is still the leader. It must become a follower.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are a number of things that can go wrong during the failover process:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For asynchronous systems, some writes may have to be discarded if they had not been replicated to a follower at the time of the leader failure. This violates clients&#39; durability expectations.&lt;/li&gt;
&lt;li&gt;Discarding writes is especially dangerous if other storage systems are coordinated with the database contents. For example, say an auto-incrementing counter is used both as a MySQL primary key and as a key in a Redis store: if the old leader fails with some writes unreplicated, the new leader could reassign primary keys that have already been used in Redis. This leads to inconsistent data, and it&#39;s what happened to GitHub (&lt;a href=&quot;https://github.blog/2012-09-14-github-availability-this-week/&quot; title=&quot;https://github.blog/2012-09-14-github-availability-this-week/&quot;&gt;https://github.blog/2012-09-14-github-availability-this-week/&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;In fault scenarios, two nodes could both believe that they are the leader: &lt;em&gt;split brain.&lt;/em&gt; Data is likely to be lost or corrupted if both leaders accept writes and there&#39;s no process for resolving conflicts. Some systems have a mechanism to shut down one node if two leaders are detected, but this mechanism needs to be designed carefully, or what happened at GitHub can happen again (&lt;a href=&quot;https://github.blog/2012-12-26-downtime-last-saturday/&quot; title=&quot;https://github.blog/2012-12-26-downtime-last-saturday/&quot;&gt;https://github.blog/2012-12-26-downtime-last-saturday/&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;It&#39;s difficult to determine the right timeout before the leader is declared dead. If it&#39;s too long, it means a longer time to recovery in the case where the leader fails. If it&#39;s too short, we can have unnecessary failovers, since a temporary load spike could cause a node&#39;s response time to increase above the timeout, or a network glitch could cause delayed packets. If the system is already struggling with high load or network problems, unnecessary failover can make the situation worse.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;implementation-of-replication-logs&quot;&gt;Implementation of Replication Logs &lt;a class=&quot;direct-link&quot; href=&quot;#implementation-of-replication-logs&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Several replication methods are used in leader-based replication. These include:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;a) Statement-based replication:&lt;/strong&gt; In this approach, the leader logs every write request (statement) that it executes, and sends the statement log to every follower. Each follower parses and executes the SQL statement as if it had been received from a client.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A problem with this approach is that a statement can have different effects on different followers. A statement that calls a nondeterministic function such as NOW() or RAND() will likely have a different value on each replica.&lt;/li&gt;
&lt;li&gt;If statements use an auto-incrementing column, they must be executed in exactly the same order on each replica, or else they may have a different effect. This is limiting when there are multiple concurrently executing transactions, since even statements without causal dependencies must be serialized.&lt;/li&gt;
&lt;li&gt;Statements with side effects (e.g. triggers, stored procedures) may result in different side effects occurring on each replica, unless the side effects are deterministic.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some databases work around these issues by requiring transactions to be deterministic, or by configuring the leader to replace nondeterministic function calls with a fixed return value.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;b) Write-ahead log (WAL) shipping:&lt;/strong&gt; The log is an append-only sequence of bytes containing all writes to the db. Besides writing the log to disk, the leader can also send the log to its followers across the network.&lt;/p&gt;
&lt;p&gt;The main disadvantage of this approach is that the log describes the data at a very low level: it details which bytes were changed in which disk blocks. This couples the replication closely to the storage engine, so if the storage format changes in another version of the database, the leader and followers typically cannot run different versions, which prevents zero-downtime upgrades.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;c) Logical (row-based) log replication:&lt;/strong&gt; This logs the changes that have occurred at the granularity of a row. Meaning that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For an inserted row, the log contains the new values of all columns.&lt;/li&gt;
&lt;li&gt;For a deleted row, the log contains enough information to identify the deleted row. Typically the primary key, but it could also log the old values of all columns.&lt;/li&gt;
&lt;li&gt;For an updated row, it contains enough information to identify the updated row, and the new values of all columns.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This decouples the logical log from the storage engine internals. Thus, it makes it easier for external applications (say a data warehouse for offline analysis, or for building custom indexes and caches) to parse. This technique is called &lt;em&gt;change data capture.&lt;/em&gt;&lt;/p&gt;
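The three record types above can be sketched as a tiny logical-log applier. This is purely illustrative: the RowChange structure and field names are assumptions for the sketch, not any real database's log format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RowChange:
    op: str                         # "insert", "update", or "delete"
    table: str
    primary_key: int
    columns: Optional[dict] = None  # new column values (insert/update only)

def apply_change(rows: dict, change: RowChange) -> None:
    """Apply one logical log record to a replica's in-memory table."""
    if change.op in ("insert", "update"):
        rows[change.primary_key] = change.columns
    elif change.op == "delete":
        # A delete record only needs to identify the row, here by primary key.
        rows.pop(change.primary_key, None)
```

Because each record is self-describing at the row level, an external consumer (a cache, a search index, a data warehouse) could apply the same records without knowing anything about the storage engine.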
&lt;p&gt;&lt;strong&gt;d) Trigger-based replication:&lt;/strong&gt; This involves handling replication within the application code. It provides flexibility in dealing with things like replicating only a subset of the data, conflict resolution logic, or replicating from one kind of database to another. &lt;em&gt;Triggers&lt;/em&gt; and &lt;em&gt;stored procedures&lt;/em&gt; provide this functionality. This method has more overhead than the other replication methods, and is more prone to bugs and limitations than the database&#39;s built-in replication.&lt;/p&gt;
&lt;h3 id=&quot;problems-with-replication-lag&quot;&gt;Problems with Replication Lag &lt;a class=&quot;direct-link&quot; href=&quot;#problems-with-replication-lag&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Eventual Consistency&lt;/strong&gt;: If an application reads from an asynchronous follower, it may see outdated information if the follower has fallen behind the leader. This inconsistency is a temporary state, and the followers will eventually catch up. That&#39;s &lt;em&gt;eventual consistency.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The delay between when a write happens on a leader and gets reflected on a follower is &lt;em&gt;replication lag&lt;/em&gt;.&lt;/p&gt;
&lt;h4 id=&quot;other-consistency-levels&quot;&gt;Other Consistency Levels &lt;a class=&quot;direct-link&quot; href=&quot;#other-consistency-levels&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;There are a number of issues that can occur as a result of replication lag. In this section, I&#39;ll summarize them under the minimum consistency level needed to prevent it from happening.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;a) Reading Your Own Writes:&lt;/strong&gt; If a client writes a value to the leader and then tries to read that same value, the read request might go to an asynchronous follower that has not yet received the write because of replication lag. The user might think the data was lost, when it really wasn&#39;t. The consistency level needed to prevent this situation is known as &lt;em&gt;read-after-write consistency&lt;/em&gt; or &lt;em&gt;read-your-writes consistency.&lt;/em&gt; It guarantees that a user will always see their own writes. There are various techniques for implementing this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When reading a field that a user might have modified, read it from the leader, else read it from a follower. E.g. A user&#39;s profile on a social network can only be modified by the owner. A simple rule could be that user&#39;s profiles are always read from the leader, and other users&#39; profiles are read from a follower.&lt;/li&gt;
&lt;li&gt;Of course this won&#39;t be effective if most things are editable by the user, since it would drive most reads to the leader. Another option is to keep track of the time of the user&#39;s last update, and only read from followers that have caught up to at least that point. The timestamp could be a logical one, like a sequence number of writes.&lt;/li&gt;
&lt;/ul&gt;
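The timestamp-based technique above can be sketched roughly as follows. This is a hypothetical router: the function names, the per-user write log, and the 60-second fallback window are all assumptions made for the sketch.

```python
# Read-after-write routing sketch: after a user writes, route their reads
# to the leader until a follower has replicated past that write.
READ_FROM_LEADER_WINDOW = 60.0   # seconds; an illustrative choice

last_write_at: dict = {}         # user id -> timestamp of their last write

def record_write(user_id: str, now: float) -> None:
    last_write_at[user_id] = now

def choose_replica(user_id: str, follower_replicated_up_to: float,
                   now: float) -> str:
    wrote_at = last_write_at.get(user_id)
    if wrote_at is None:
        return "follower"        # user has no recent writes to miss
    # Within the window, only use the follower if it has caught up
    # to the user's last write; otherwise fall back to the leader.
    if READ_FROM_LEADER_WINDOW > now - wrote_at and \
       wrote_at > follower_replicated_up_to:
        return "leader"
    return "follower"
```

For example, a user who wrote ten seconds ago would be routed to the leader if the candidate follower's replication position is still behind that write, and back to a follower once it has caught up.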
&lt;p&gt;There&#39;s an extra complication if the same user accesses the service from multiple devices, say a desktop browser and a mobile app. They might be connected through different networks, yet we need to make sure they&#39;re in sync. This is known as &lt;em&gt;cross-device&lt;/em&gt; read-after-write consistency. It is more complicated because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We can&#39;t use the last update time as suggested earlier, since the code on one device will not know about what updates have happened on the other device.&lt;/li&gt;
&lt;li&gt;If replicas are distributed across different datacenters, each device might hit a different datacenter, whose followers may or may not have received the write. A solution is to force all reads from a user to be routed to the leader, which introduces the complexity of routing requests from all of a user&#39;s devices to the same datacenter.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;b) Monotonic Reads:&lt;/strong&gt; An anomaly that can occur when reading from asynchronous followers is that it&#39;s possible for a user to see things &lt;em&gt;moving backward in time.&lt;/em&gt; Imagine a scenario where a user makes the same read multiple times, and each read request goes to a different follower. It&#39;s possible that a write has appeared on some followers, and not on others. Time might seem to go backwards sometimes when the user sees old data, after having read newer data.&lt;/p&gt;
&lt;p&gt;Monotonic reads is a consistency level that guarantees that a user will not read older data after having previously read newer data. This guarantee is stronger than eventual consistency, but weaker than strong consistency.&lt;/p&gt;
&lt;p&gt;A solution to this is that every read from a user should go to the same replica. The hash of a user&#39;s id could be used to determine what replica to go to.&lt;/p&gt;
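Pinning each user to one replica by hashing their id can be sketched like this. It is a minimal illustration of the idea, not any particular database's routing scheme.

```python
import hashlib

def replica_for(user_id: str, num_replicas: int) -> int:
    """Deterministically map a user to a replica index.

    The same user id always hashes to the same replica, so a user's
    successive reads all hit one replica and cannot go backward in time
    relative to each other (monotonic reads)."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas
```

One caveat the chapter implies: if that replica fails, the user's reads must be rerouted to another replica, and the monotonic-reads guarantee needs re-establishing there.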
&lt;p&gt;&lt;strong&gt;c) Consistent Prefix Reads:&lt;/strong&gt; Another anomaly that can occur as a result of replication lag is a violation of causality. Meaning that a sequence of writes that occur in one order might be read in another order. This can especially happen in distributed databases where different partitions operate independently and there&#39;s no global ordering of writes. &lt;em&gt;Consistent prefix reads&lt;/em&gt; is a guarantee that prevents this kind of problem.&lt;/p&gt;
&lt;p&gt;One solution is to ensure that causally related writes are always written to the same partition, but this cannot always be done efficiently.&lt;/p&gt;
&lt;h4 id=&quot;solutions-for-replication-lag&quot;&gt;Solutions for Replication Lag &lt;a class=&quot;direct-link&quot; href=&quot;#solutions-for-replication-lag&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Application developers should ideally not have to worry about subtle replication issues and should trust that their databases &amp;quot;do the right thing&amp;quot;. This is why &lt;em&gt;transactions&lt;/em&gt; exist. They allow databases to provide stronger guarantees about things like consistency. However, many distributed databases have abandoned transactions because of the complexity, and have asserted that eventual consistency is inevitable. Martin discusses these claims later in the chapter.&lt;/p&gt;
&lt;h3 id=&quot;multi-leader-replication&quot;&gt;Multi-Leader Replication &lt;a class=&quot;direct-link&quot; href=&quot;#multi-leader-replication&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The downside of single-leader replication is that all writes must go through that leader. If the leader is down, or a connection can&#39;t be made for whatever reason, you can&#39;t write to the database.&lt;/p&gt;
&lt;p&gt;Multi-leader/Master-master/Active-Active replication allows more than one node to accept writes. Each leader accepts writes from a client, and acts as a follower by accepting the writes on other leaders.&lt;/p&gt;
&lt;h4 id=&quot;use-cases-for-multi-leader-replication&quot;&gt;Use Cases for Multi-Leader Replication &lt;a class=&quot;direct-link&quot; href=&quot;#use-cases-for-multi-leader-replication&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-datacenter operation:&lt;/strong&gt; Here, each datacenter can have its own leader. This gives better write performance, since every write can be processed in its local datacenter (as opposed to being transmitted to a remote datacenter) and replicated asynchronously to the other datacenters. It also means that if one datacenter is down, the other datacenters can continue operating independently of it.&lt;/p&gt;
&lt;p&gt;Multi-leader replication has the disadvantage that the same data may be concurrently modified in two different datacenters, and so there needs to be a way to handle conflicts.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clients with offline operation:&lt;/strong&gt; Some applications need to work even when offline. Mobile apps like Google Calendar, for example, need to accept writes even when the user is not connected to the internet. These writes are then asynchronously replicated to other nodes when the user reconnects. In this setup, each device stores data in its local database, meaning that each device essentially acts as a leader. CouchDB is designed for this mode of operation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Collaborative Editing:&lt;/strong&gt; Real-time collaborative editing applications like Confluence and Google Docs allow several people to edit a document at the same time. This is also a database replication problem: each user that edits a document has their changes saved to a local replica, from which they are then replicated asynchronously.&lt;/p&gt;
&lt;p&gt;For faster collaboration, the unit of change can be a single keystroke. That is, after a keystroke is saved, it should be replicated.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;handling-write-conflicts.&quot;&gt;Handling Write Conflicts. &lt;a class=&quot;direct-link&quot; href=&quot;#handling-write-conflicts.&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Multi-leader replication has the big disadvantage that write conflicts can occur, which requires conflict resolution.&lt;/p&gt;
&lt;p&gt;If two users change the same record, the writes may be successfully applied to their local leader. However, when the writes are asynchronously replicated, a conflict will be detected. This does not happen in a single-leader database.&lt;/p&gt;
&lt;h5 id=&quot;synchronous-versus-asynchronous-conflict-detection&quot;&gt;Synchronous versus asynchronous conflict detection &lt;a class=&quot;direct-link&quot; href=&quot;#synchronous-versus-asynchronous-conflict-detection&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;In theory, we could make conflict detection synchronous, meaning that we wait for the write to be replicated to all replicas before telling the user that the write was successful. Doing this will make one lose the main advantage of multi-leader replication though, which is allowing each replica to accept writes independently. Use single-leader replication if you want synchronous conflict detection.&lt;/p&gt;
&lt;h4 id=&quot;conflict-avoidance&quot;&gt;Conflict Avoidance &lt;a class=&quot;direct-link&quot; href=&quot;#conflict-avoidance&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Conflict avoidance is the simplest strategy for dealing with conflicts. Conflicts can be avoided by ensuring that all the writes for a particular record go through the same leader. For example, you can make all the writes for a given user go to the same datacenter, and use the leader there for reading and writing. The downside is that if a datacenter fails, traffic needs to be rerouted to another datacenter, which opens up the possibility of concurrent writes on different leaders and breaks down conflict avoidance.&lt;/p&gt;
&lt;h4 id=&quot;converging-toward-a-consistent-state&quot;&gt;Converging toward a consistent state &lt;a class=&quot;direct-link&quot; href=&quot;#converging-toward-a-consistent-state&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A database must resolve conflicts in a convergent way, meaning that all the replicas must arrive at the same final value when all changes have been replicated.&lt;/p&gt;
&lt;p&gt;Various ways of achieving this are by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Giving each write a unique ID (e.g. a timestamp, a UUID, etc.), picking the write with the highest ID as the winner, and throwing away the other writes. If a timestamp is used, this is known as &lt;em&gt;last write wins&lt;/em&gt;; it&#39;s popular, but also prone to data loss.&lt;/li&gt;
&lt;li&gt;Giving each replica a unique ID, and letting writes from the higher-number replica always take precedence over writes from a lower-number replica. This is also prone to data loss.&lt;/li&gt;
&lt;li&gt;Recording the conflict in an explicit data structure that preserves the information, and writing application code that resolves the conflict at some later time (e.g. by prompting the user).&lt;/li&gt;
&lt;li&gt;Merging the values together, e.g. ordering them alphabetically.&lt;/li&gt;
&lt;/ul&gt;
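The first and last strategies in the list can be sketched as follows. These are illustrative helper functions, assuming writes arrive as (unique ID, value) pairs; they are not any database's actual API.

```python
def last_write_wins(writes):
    """writes: list of (unique_id, value) pairs.

    The write with the highest ID wins; every other write is silently
    discarded, which is why this approach is prone to data loss."""
    return max(writes, key=lambda w: w[0])[1]

def merge_values(writes):
    """Deterministically merge all conflicting values, e.g. by sorting
    them alphabetically and joining. Every replica that sees the same
    set of writes converges on the same merged result."""
    return "/".join(sorted(value for _, value in writes))
```

Both functions are convergent: any replica that has received the same set of writes, in any order, computes the same final value.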
&lt;h4 id=&quot;custom-conflict-resolution-logic&quot;&gt;Custom Conflict Resolution Logic &lt;a class=&quot;direct-link&quot; href=&quot;#custom-conflict-resolution-logic&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The most appropriate conflict resolution method may depend on the application, and thus, multi-leader replication tools often let users write conflict resolution logic using application code. The code may be executed on read or on write:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;On write:&lt;/em&gt; When the database detects a conflict in the log of replicated changes, it calls the conflict handler. The handler typically runs in a background process and must execute quickly. It has no user interaction.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;On Read:&lt;/em&gt; Conflicting writes are stored. However, when the data is read, the multiple versions of the data are returned to the user, either for the user to resolve them or for automatic resolution.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Automatic conflict resolution is a difficult problem, but there are some research ideas being used today:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Conflict-free replicated datatypes (CRDTs) - Used in Riak 2.0&lt;/li&gt;
&lt;li&gt;Mergeable persistent data structure - Similar to Git. Tracks history explicitly&lt;/li&gt;
&lt;li&gt;Operational transformation: Algorithm behind Google Docs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It&#39;s still an open area of research though.&lt;/p&gt;
&lt;h4 id=&quot;multi-leader-replication-topologies&quot;&gt;Multi-Leader Replication Topologies &lt;a class=&quot;direct-link&quot; href=&quot;#multi-leader-replication-topologies&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A &lt;em&gt;replication topology&lt;/em&gt; is the path through which writes are propagated from one node to another. The most general topology is &lt;em&gt;all-to-all,&lt;/em&gt; where each leader sends its writes to every other leader. Other types are &lt;em&gt;circular topology&lt;/em&gt; and &lt;em&gt;star topology.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;All-to-all topology is more fault tolerant than the circular and star topologies because in those topologies, one node failing can interrupt the flow of replication messages across other nodes, making them unable to communicate until the node is fixed.&lt;/p&gt;
&lt;h3 id=&quot;leaderless-replication&quot;&gt;Leaderless Replication &lt;a class=&quot;direct-link&quot; href=&quot;#leaderless-replication&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In this replication style, the concept of a leader is abandoned, and any replica can typically accept writes from clients directly.&lt;/p&gt;
&lt;p&gt;This style is used by Amazon for its in-house &lt;em&gt;Dynamo&lt;/em&gt; system. Riak, Cassandra and Voldemort also use this model. These are called &lt;em&gt;Dynamo-style&lt;/em&gt; systems.&lt;/p&gt;
&lt;p&gt;In some leaderless implementations, the client writes directly to several replicas, while in others there&#39;s a coordinator node that does this on behalf of the client. Unlike a leader database though, this coordinator does not enforce any ordering of the writes.&lt;/p&gt;
&lt;h4 id=&quot;preventing-stale-reads&quot;&gt;Preventing Stale Reads &lt;a class=&quot;direct-link&quot; href=&quot;#preventing-stale-reads&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Say there are 3 replicas and one of the replicas goes down. A client could write to the system and have 2 of the replicas successfully acknowledge the write. However, when the offline node gets back up, anyone who reads from it may get stale responses.&lt;/p&gt;
&lt;p&gt;To prevent stale reads, as well as writing to multiple replicas, the client may also read from multiple replicas in parallel. Version numbers are attached to the result to determine which value is newer.&lt;/p&gt;
&lt;h4 id=&quot;read-repair-and-anti-entropy&quot;&gt;Read repair and anti-entropy &lt;a class=&quot;direct-link&quot; href=&quot;#read-repair-and-anti-entropy&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When offline nodes come back up, the replication system must ensure that all data is eventually copied to every replica. Two mechanisms used in Dynamo-style datastores are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Read repair&lt;/em&gt;: When data is read from multiple replicas and the system detects that one of the replicas has a lower version number, the newer data can be copied to it immediately. This works well for frequently read values, but has the downside that data that is rarely read may be missing from some replicas and thus have reduced durability.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Anti-entropy process&lt;/em&gt;: In addition to the above, some databases have a background process that looks for differences in data between replicas and copies any missing data from one replica to another. This process does not copy writes in any particular order, and there may be a notable delay before data is copied.&lt;/li&gt;
&lt;/ul&gt;
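Read repair can be sketched roughly like this, assuming each replica stores (version, value) pairs and that the read reaches every replica; this is a toy model, not a real client implementation.

```python
def read_with_repair(replicas, key):
    """replicas: list of dicts, each mapping key -> (version, value).

    Return the newest value for `key`, and copy it back to any replica
    that is stale or missing it (read repair)."""
    # Tuples compare version-first, so max() picks the newest write.
    newest = max(r[key] for r in replicas if key in r)
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest   # repair the stale or missing copy
    return newest[1]                # the value itself
```

Note how repair only happens for keys that actually get read, which is exactly why an anti-entropy background process is still needed for rarely read data.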
&lt;h4 id=&quot;quorums-for-reading-and-writing&quot;&gt;Quorums for reading and writing &lt;a class=&quot;direct-link&quot; href=&quot;#quorums-for-reading-and-writing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Quorum reads and writes refer to the minimum number of votes for a read or a write to be valid. If there are &lt;em&gt;n&lt;/em&gt; replicas, every write must be confirmed by at least &lt;em&gt;w&lt;/em&gt; nodes to be considered successful, and every read must be confirmed by at least &lt;em&gt;r&lt;/em&gt; nodes to be successful. The general rule that the number chosen for &lt;em&gt;r&lt;/em&gt; and &lt;em&gt;w&lt;/em&gt; should obey is that:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;w + r &amp;gt; n.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This way, we can typically expect an up-to-date value when reading, because at least one of the &lt;em&gt;r&lt;/em&gt; nodes we&#39;re reading from must overlap with the &lt;em&gt;w&lt;/em&gt; nodes that acknowledged the write (barring sloppy quorums, discussed below).&lt;/p&gt;
&lt;p&gt;The parameters &lt;em&gt;n, w,&lt;/em&gt; and &lt;em&gt;r&lt;/em&gt; are typically configurable. A common choice is to make n an odd number such that w = r = (n + 1)/2. These numbers can be varied though. For a workload with few writes and many reads, it may make sense to set w = n and r = 1. Of course this has the disadvantage of reduced availability for writes if just one node fails.&lt;/p&gt;
&lt;p&gt;Note that &lt;em&gt;n&lt;/em&gt; does not always refer to the number of nodes in the cluster, it may just be the number of nodes that any given value must be stored on. This allows datasets to be partitioned. I cover Partitioning in the &lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-6/&quot;&gt;next post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Note:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;With &lt;em&gt;w&lt;/em&gt; and &lt;em&gt;r&lt;/em&gt; being less than &lt;em&gt;n,&lt;/em&gt; we can still process writes if a node is unavailable.&lt;/li&gt;
&lt;li&gt;Reads and writes are always sent to all n replicas in parallel, &lt;em&gt;w&lt;/em&gt; and &lt;em&gt;r&lt;/em&gt; determine how many nodes we wait for i.e., how many nodes need to report success before we consider the read or write to be successful.&lt;/li&gt;
&lt;/ul&gt;
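As a quick sketch of the quorum condition and the common parameter choice described above (an illustrative helper, not a real client API):

```python
def quorum_overlap_guaranteed(n: int, w: int, r: int) -> bool:
    """True if every read set must share at least one node with every
    write set: the w + r > n condition."""
    return w + r > n

# The common choice: n odd, w = r = (n + 1) // 2.
n = 5
w = r = (n + 1) // 2   # 3 of 5 for both reads and writes
```

With n = 5 and w = r = 3, any two nodes can be unavailable and both reads and writes can still succeed, while every read set of 3 must intersect every write set of 3.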
&lt;h4 id=&quot;limitations-of-quorum-consistency&quot;&gt;Limitations of Quorum Consistency &lt;a class=&quot;direct-link&quot; href=&quot;#limitations-of-quorum-consistency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Quorums don&#39;t necessarily have to be majorities. What matters is that the sets of nodes used by the read and write operations overlap in at least one node.&lt;/p&gt;
&lt;p&gt;We could also set &lt;em&gt;w&lt;/em&gt; and &lt;em&gt;r&lt;/em&gt; to smaller numbers, so that &lt;em&gt;w + r ≤ n.&lt;/em&gt; With this, reads and writes are still sent to n nodes, but a smaller number of successful responses is required for the operation to succeed. However, you are also more likely to read stale values, as it&#39;s more likely that a read did not include the node with the latest value.&lt;/p&gt;
&lt;p&gt;The upside of the approach though is that it allows lower latency and higher availability: if there&#39;s a network interruption and many replicas become unreachable, there&#39;s a higher chance that reads and writes can still be processed.&lt;/p&gt;
&lt;p&gt;Even if we configure our database such that &lt;em&gt;w + r &amp;gt; n&lt;/em&gt; , there are still edge cases where stale values may be returned. Possible scenarios are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If a sloppy quorum is used, the nodes for reading and writing may not overlap. Sloppy quorums are discussed further down.&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;If two writes occur concurrently, it&#39;s not clear which happened first. Therefore, the database may wrongly return the more stale one. If we pick a winner based on a timestamp (last write wins), writes can be lost due to clock skew.&lt;/li&gt;
&lt;li&gt;If a write happens concurrently with a read, the write may be reflected on only some of the replicas. It&#39;s unclear whether the read will return the old or the new value.&lt;/li&gt;
&lt;li&gt;In a non-transaction model, if a write succeeds on some replicas but fails on others, it is not rolled back on the replicas where it succeeded.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From these points and others not listed, there is no absolute guarantee that quorum reads return the latest written value. This style of database is optimized for use cases that can tolerate eventual consistency. Stronger guarantees require transactions or consensus.&lt;/p&gt;
&lt;h5 id=&quot;monitoring-staleness&quot;&gt;Monitoring Staleness &lt;a class=&quot;direct-link&quot; href=&quot;#monitoring-staleness&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;It&#39;s important to monitor whether databases are returning up-to-date results, even if the application can tolerate stale reads. If a replica falls behind significantly, the database should alert you so that you can investigate the cause.&lt;/p&gt;
&lt;p&gt;For leader-based replication, databases expose metrics for the replication lag. This is possible because writes are applied to the leader and followers in the same order: we can determine how far a follower has fallen behind by subtracting its position in the log from the leader&#39;s current position.&lt;/p&gt;
&lt;p&gt;This is more difficult in leaderless replication systems as there is no fixed order in which writes are applied. There&#39;s some research into this, but it&#39;s not common practice yet.&lt;/p&gt;
&lt;h4 id=&quot;sloppy-quorums-and-hinted-handoff&quot;&gt;Sloppy Quorums and Hinted Handoff &lt;a class=&quot;direct-link&quot; href=&quot;#sloppy-quorums-and-hinted-handoff&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Databases with leaderless replication are appealing for use cases that require high availability and low latency, and that can tolerate occasional stale reads. Because they don&#39;t rely on a single node, they can tolerate the failure of individual nodes without needing a failover. They can also tolerate individual nodes going slow, as long as &lt;em&gt;w&lt;/em&gt; or &lt;em&gt;r&lt;/em&gt; nodes have responded.&lt;/p&gt;
&lt;p&gt;Note that the quorums described so far are not as fault tolerant as they can be. If any of the designated &lt;em&gt;n&lt;/em&gt; nodes is unavailable for whatever reason, it&#39;s less likely that you&#39;ll be able to have &lt;em&gt;w&lt;/em&gt; or &lt;em&gt;r&lt;/em&gt; nodes reachable, making the system unavailable. Nodes being unavailable can be caused by anything, even something as simple as a network interruption.&lt;/p&gt;
&lt;p&gt;To make the system more fault tolerant, instead of returning errors for all requests that can&#39;t reach a quorum of &lt;em&gt;w&lt;/em&gt; or &lt;em&gt;r&lt;/em&gt; nodes, the system could accept reads and writes on nodes that are reachable, even if they are not among the designated &lt;em&gt;n&lt;/em&gt; nodes on which the value usually lives. This concept is known as a &lt;em&gt;sloppy quorum.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;With a sloppy quorum, during network interruptions, reads and writes still require &lt;em&gt;r&lt;/em&gt; and &lt;em&gt;w&lt;/em&gt; successful responses, but they do not have to be among the designated &lt;em&gt;n&lt;/em&gt; &amp;quot;home&amp;quot; nodes for a value. These are like temporary homes for the value.&lt;/p&gt;
&lt;p&gt;When the network interruption is fixed, the writes that were temporarily accepted on behalf of another node are sent to the appropriate &amp;quot;home&amp;quot; node. This is &lt;em&gt;hinted handoff.&lt;/em&gt;&lt;/p&gt;
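Hinted handoff can be sketched roughly as follows. This is an in-memory toy, assuming hints live in a simple list; real systems persist hints locally so they survive restarts.

```python
# Each hint records a write accepted on behalf of an unreachable home node.
hints = []   # list of (home_node, key, value)

def write_with_hint(fallback_store: dict, home_node: str, key, value):
    """Accept a write on a reachable fallback node during an outage,
    remembering which "home" node it really belongs to."""
    fallback_store[key] = value
    hints.append((home_node, key, value))

def handoff(recovered_node: str, home_store: dict):
    """Once the home node is back, forward its hinted writes and drop
    the delivered hints."""
    remaining = []
    for home, key, value in hints:
        if home == recovered_node:
            home_store[key] = value
        else:
            remaining.append((home, key, value))
    hints[:] = remaining
```

The fallback node acts as the value's temporary home: the write still counts toward the sloppy quorum, and the handoff restores it to the designated nodes afterwards.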
&lt;p&gt;Sloppy quorums are particularly useful for increasing write availability. However, they also mean that even when &lt;em&gt;w + r &amp;gt; n,&lt;/em&gt; there is a possibility of reading stale data, as the latest value may have been temporarily written to some nodes outside of the designated &lt;em&gt;n.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A sloppy quorum is more of an assurance of durability than an actual quorum.&lt;/p&gt;
&lt;h5 id=&quot;multi-datacenter-operation&quot;&gt;Multi-datacenter operation &lt;a class=&quot;direct-link&quot; href=&quot;#multi-datacenter-operation&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;For datastores like Cassandra and Voldemort, which implement leaderless replication across multiple datacenters, the number of replicas &lt;em&gt;n&lt;/em&gt; includes replicas in all datacenters.&lt;/p&gt;
&lt;p&gt;Each write is also sent to all datacenters, but it only waits for acknowledgement from a quorum of nodes within its local datacenter so that it&#39;s not affected by delays and interruptions on the link between multiple datacenters.&lt;/p&gt;
&lt;h4 id=&quot;detecting-concurrent-writes&quot;&gt;Detecting Concurrent Writes &lt;a class=&quot;direct-link&quot; href=&quot;#detecting-concurrent-writes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In Dynamo-style databases, several clients can concurrently write to the same key. When this happens, we have a conflict. We&#39;ve briefly touched on conflict resolution techniques already, but we&#39;ll now discuss them in more detail.&lt;/p&gt;
&lt;h5 id=&quot;last-write-wins-(discarding-concurrent-writes)&quot;&gt;Last write wins (discarding concurrent writes) &lt;a class=&quot;direct-link&quot; href=&quot;#last-write-wins-(discarding-concurrent-writes)&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;One approach to conflict resolution is the last write wins approach. It involves forcing an arbitrary ordering on concurrent writes (e.g. by attaching a timestamp to each write), picking the most &amp;quot;recent&amp;quot; value, and discarding writes with earlier timestamps.&lt;/p&gt;
&lt;p&gt;This achieves the goal of eventual convergence across the data in replicas, at the cost of durability. If there were several concurrent writes to the same key, only one of the writes will survive and the others will be discarded, even if all the writes were reported as successful.&lt;/p&gt;
&lt;p&gt;Last write wins (LWW) is the only conflict resolution method supported by Apache Cassandra.&lt;/p&gt;
&lt;p&gt;If losing data is not acceptable, LWW is not a good choice for conflict resolution.&lt;/p&gt;
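&lt;p&gt;A minimal sketch of last write wins (my own illustration, not Cassandra&#39;s actual implementation) makes the durability cost visible: every write was acknowledged, yet only the one with the greatest timestamp survives:&lt;/p&gt;

```python
def last_write_wins(writes):
    """Resolve conflicting writes by keeping only the one with the
    greatest timestamp; all other acknowledged writes are discarded."""
    return max(writes, key=lambda w: w["ts"])

concurrent = [
    {"key": "cart", "value": ["milk"], "ts": 1001},
    {"key": "cart", "value": ["eggs"], "ts": 1002},  # "latest" by timestamp
]
survivor = last_write_wins(concurrent)
assert survivor["value"] == ["eggs"]  # the write of ["milk"] is silently lost
```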
&lt;h5 id=&quot;the-%22happens-before%22-relationship-and-concurrency&quot;&gt;The &amp;quot;happens-before&amp;quot; relationship and concurrency &lt;a class=&quot;direct-link&quot; href=&quot;#the-%22happens-before%22-relationship-and-concurrency&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;Whenever we have two operations A and B, there are three possibilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Either A happened before B&lt;/li&gt;
&lt;li&gt;Or B happened before A&lt;/li&gt;
&lt;li&gt;Or A and B are concurrent.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We say that an operation A happened before operation B if either of the following applies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;B knows about A&lt;/li&gt;
&lt;li&gt;B depends on A&lt;/li&gt;
&lt;li&gt;B builds upon A&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thus, if we cannot capture this relationship between A and B, we say that they are concurrent. If they are concurrent, we have a conflict that needs to be resolved.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Exact time does not matter for defining concurrency: two operations are concurrent if they are both unaware of each other, regardless of the physical time at which they occurred. Two operations can happen some time apart and still be concurrent, as long as they are unaware of each other.&lt;/p&gt;
&lt;h5 id=&quot;capturing-the-happens-before-relationship&quot;&gt;Capturing the happens-before relationship &lt;a class=&quot;direct-link&quot; href=&quot;#capturing-the-happens-before-relationship&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;In a single database replica, version numbers are used to determine concurrency.&lt;/p&gt;
&lt;p&gt;It works like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each key is assigned a version number, and that version number is &lt;em&gt;incremented&lt;/em&gt; every time that key is written, and the database stores the version number along with the value written. That version number is returned to a client.&lt;/li&gt;
&lt;li&gt;A client must read a key before writing. &lt;em&gt;When it reads a key, the server returns the latest version number together with the values that have not been overwritten.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;When a client wants to write a new value, it returns the last version number it received in the prior step alongside the write.&lt;/li&gt;
&lt;li&gt;If the version number being passed with a write is higher than the version number of other values in the db, it means the new write is aware of those values at the time of the write (since it was returned from the prior read), and can overwrite all values with that version number or below.&lt;/li&gt;
&lt;li&gt;If there are higher version numbers, the database must keep all values with those higher version numbers (because those values are concurrent with the incoming write: it did not know about them).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example scenario:&lt;/p&gt;
&lt;p&gt;If two clients are trying to write a value for the same key at the same time, both would first read the data for that key and get the latest version number of, say, 3. If one of them writes first, the database will update the version number to 4. However, since the slower one will pass a version number of 3, it is concurrent with the other write, as it was not aware of the higher version number of 4.&lt;/p&gt;
&lt;p&gt;When a write includes the version number from a prior read, that tells us which previous state the write is based on.&lt;/p&gt;
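&lt;p&gt;The single-replica algorithm above can be sketched in a few lines (an illustrative toy, with names of my own invention): a write carries the version the client last read, values at or below that version are overwritten, and anything newer is kept as a sibling:&lt;/p&gt;

```python
class SingleReplicaStore:
    """One replica that detects concurrent writes with a per-key version.

    A write carries the version the client last read; values written at or
    below that version are overwritten, anything newer is kept as a sibling.
    """
    def __init__(self):
        self.version = {}   # key -> latest version number
        self.values = {}    # key -> list of (version, value) siblings

    def read(self, key):
        return self.version.get(key, 0), [v for _, v in self.values.get(key, [])]

    def write(self, key, value, based_on_version):
        new_version = self.version.get(key, 0) + 1
        self.version[key] = new_version
        # Keep only siblings the client had not seen (higher version numbers).
        siblings = [(ver, val) for ver, val in self.values.get(key, [])
                    if ver > based_on_version]
        self.values[key] = siblings + [(new_version, value)]
        return new_version

store = SingleReplicaStore()
v, _ = store.read("cart")            # both clients read version 0
store.write("cart", ["milk"], v)     # client A writes; db version becomes 1
store.write("cart", ["eggs"], v)     # client B still passes 0: concurrent!
_, siblings = store.read("cart")
assert siblings == [["milk"], ["eggs"]]  # both kept for the client to merge
```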
&lt;h5 id=&quot;merging-concurrently-written-values&quot;&gt;Merging Concurrently Written Values &lt;a class=&quot;direct-link&quot; href=&quot;#merging-concurrently-written-values&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;With the algorithm described above, clients have to do the work of merging concurrently written values. Riak calls these values &lt;em&gt;siblings.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A simple merging approach is to take a union of the values. However, this can be faulty if one operation deleted a value but that value is still present in a sibling. To prevent this problem, the system must leave a marker &lt;em&gt;(tombstone)&lt;/em&gt; to indicate that an item has been removed when merging siblings.&lt;/p&gt;
&lt;p&gt;CRDTs are data structures that can automatically merge siblings in sensible ways, including preserving deletions.&lt;/p&gt;
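&lt;p&gt;A deliberately simplified sketch of merging with tombstones (my own illustration; a real system needs more bookkeeping about which deletions each sibling has seen) shows why the plain union is faulty:&lt;/p&gt;

```python
def merge_siblings(siblings):
    """Union the siblings' items, honouring tombstones so that a deletion
    recorded in one sibling is not resurrected by the others."""
    deleted = set()
    for s in siblings:
        deleted |= s["tombstones"]
    merged = set()
    for s in siblings:
        merged |= s["items"]
    return merged - deleted

a = {"items": {"milk", "eggs"}, "tombstones": set()}
b = {"items": {"milk"}, "tombstones": {"eggs"}}  # this sibling deleted "eggs"
assert merge_siblings([a, b]) == {"milk"}  # a naive union would bring eggs back
```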
&lt;h5 id=&quot;version-vectors&quot;&gt;Version Vectors &lt;a class=&quot;direct-link&quot; href=&quot;#version-vectors&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The algorithm described above used only a single replica. When we have multiple replicas, we use a version number &lt;em&gt;per replica&lt;/em&gt; &lt;strong&gt;and&lt;/strong&gt; &lt;em&gt;per key&lt;/em&gt; and follow the same algorithm. Note that each replica also keeps track of the version numbers seen from each of the other replicas. With this information, we know which values to overwrite and which values to keep as siblings.&lt;/p&gt;
&lt;p&gt;The collection of version numbers from all the replicas is called a &lt;em&gt;version vector.&lt;/em&gt; Dotted version vectors are a nice variant of this used in Riak: &lt;a href=&quot;https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/&quot; title=&quot;https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/&quot;&gt;https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Version vectors are also sent to clients when values are read, and need to be sent back to the database when a value is written.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Version vectors enable us to distinguish between overwrites and concurrent writes.&lt;/em&gt;&lt;/p&gt;
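&lt;p&gt;That distinction comes down to a simple comparison rule, sketched here with version vectors represented as plain dicts of replica-to-counter (an illustration, not Riak&#39;s actual encoding): one vector overwrites another only if it is greater than or equal in every component; otherwise the two writes are concurrent and both values are kept as siblings:&lt;/p&gt;

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dicts of replica -> counter).

    Returns "descends" if a has seen everything in b (a may overwrite b),
    "descended-by" for the reverse, "equal" if identical, and "concurrent"
    if neither dominates the other (siblings must be kept).
    """
    replicas = set(vv_a) | set(vv_b)
    a_ge = all(vv_a.get(r, 0) >= vv_b.get(r, 0) for r in replicas)
    b_ge = all(vv_b.get(r, 0) >= vv_a.get(r, 0) for r in replicas)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "descends"
    if b_ge:
        return "descended-by"
    return "concurrent"

assert compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}) == "descends"    # overwrite
assert compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 1}) == "concurrent"  # siblings
```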
&lt;p&gt;We also have Vector clocks, which are different from Version Vectors apparently: &lt;a href=&quot;https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/&quot; title=&quot;https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/&quot;&gt;https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/&lt;/a&gt;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Chapter 4 - Encoding and Evolution</title>
		<link href="https://timilearning.com/posts/ddia/part-one/chapter-4/"/>
		<updated>2019-12-07T16:23:02-00:00</updated>
		<id>https://timilearning.com/posts/ddia/part-one/chapter-4/</id>
		<content type="html">&lt;p&gt;These are my notes from the fourth chapter of Martin Kleppmann&#39;s Designing Data Intensive Applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#formats-for-encoding-data&quot;&gt;Formats for Encoding Data&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#language-specific-formats&quot;&gt;Language-Specific Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#json-xml-and-binary-variants&quot;&gt;JSON, XML, and Binary Variants&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#binary-encoding&quot;&gt;Binary Encoding&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#modes-of-dataflow&quot;&gt;Modes of Dataflow&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#dataflow-through-databases&quot;&gt;Dataflow Through Databases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#dataflow-through-services-rest-and-rpc&quot;&gt;Dataflow Through Services: REST and RPC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#message-passing-dataflow&quot;&gt;Message-Passing Dataflow&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#advantages-of-a-message-broker&quot;&gt;Advantages of a message broker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#message-brokers&quot;&gt;Message brokers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#distributed-actor-frameworks&quot;&gt;Distributed actor frameworks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;We should aim to build systems that make it easy to adapt to change: &lt;strong&gt;Evolvability.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rolling upgrade:&lt;/strong&gt; Deploying a new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all the nodes.&lt;/p&gt;
&lt;p&gt;With rolling upgrades, new and old versions of the code, and old and new data formats may potentially all coexist in the system at the same time. For a system to run smoothly, compatibility needs to be in both directions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backward compatibility:&lt;/strong&gt; Newer code can read data that was written by older code. Simpler to achieve.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Forward compatibility:&lt;/strong&gt; Older code can read data written by newer code. This is trickier because it requires older code to ignore additions made by a newer version of the code.&lt;/p&gt;
&lt;h1 id=&quot;formats-for-encoding-data&quot;&gt;Formats for Encoding Data &lt;a class=&quot;direct-link&quot; href=&quot;#formats-for-encoding-data&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Programs work with data that have at least 2 different representations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In memory data structures: optimized for efficient access and CPU manipulation&lt;/li&gt;
&lt;li&gt;Sequence of bytes (e.g. JSON) for transmitting over the network.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We need some kind of translation between the two representations:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Encoding/Serialization/Marshalling&lt;/strong&gt; - Translation from in-memory representation to a byte sequence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Decoding/Deserialization&lt;/strong&gt; - Byte sequence to in-memory representation.&lt;/p&gt;
&lt;p&gt;There are a number of different libraries and encoding formats to choose from which we&#39;ll discuss next.&lt;/p&gt;
&lt;h2 id=&quot;language-specific-formats&quot;&gt;Language-Specific Formats &lt;a class=&quot;direct-link&quot; href=&quot;#language-specific-formats&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Different programming languages have built-in support for encoding in-memory objects into byte sequences. Java has Serializable, Ruby has Marshal, and so on. However, these language-specific encodings have their own problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The encoding is tied to a particular programming language, and reading the data in another language is difficult.&lt;/li&gt;
&lt;li&gt;Versioning data is often an afterthought. They often neglect the problems of backward and forward compatibility since they&#39;re intended for quick and easy use.&lt;/li&gt;
&lt;li&gt;Efficiency is also an afterthought. Java&#39;s serialization is notorious for bad performance and bloated encoding.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;json%2C-xml%2C-and-binary-variants&quot;&gt;JSON, XML, and Binary Variants &lt;a class=&quot;direct-link&quot; href=&quot;#json%2C-xml%2C-and-binary-variants&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;JSON and XML are the obvious contenders for standard encodings. CSV is another option. These formats are widely known, widely supported, and almost as widely disliked.&lt;/p&gt;
&lt;p&gt;XML is often criticized for its verbose syntax. Apart from superficial syntactic issues, these formats also have subtle problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;There&#39;s ambiguity around how numbers are encoded.&lt;/em&gt; In XML and CSV, there&#39;s no distinction between a number and a string that happens to have digits. JSON distinguishes both, but does not distinguish integers and floating-point numbers, and does not specify a precision.&lt;/li&gt;
&lt;li&gt;No support for binary strings (sequences of bytes without a character encoding)&lt;/li&gt;
&lt;li&gt;Optional schema support for XML and JSON&lt;/li&gt;
&lt;/ul&gt;
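&lt;p&gt;The number ambiguity is easy to demonstrate (a small Python illustration of my own): JSON itself does not separate integers from floats, and parsers that map every JSON number to an IEEE 754 double (as JavaScript does) cannot represent integers above 2&lt;sup&gt;53&lt;/sup&gt; exactly:&lt;/p&gt;

```python
import json

# The same digits come back as different types depending on how they are
# written; JSON has no separate integer and float types.
assert type(json.loads("1")) is int
assert type(json.loads("1.0")) is float

# An IEEE 754 double cannot represent integers above 2**53 exactly.
big = 2**53 + 1
assert float(big) == float(2**53)  # the +1 is silently lost as a double

# Python's json keeps large integers exact, so the same document round-trips
# differently across languages: a subtle interoperability hazard.
assert json.loads(json.dumps(big)) == big
```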
&lt;h3 id=&quot;binary-encoding&quot;&gt;Binary Encoding &lt;a class=&quot;direct-link&quot; href=&quot;#binary-encoding&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The choice of data format can have a big impact especially when the dataset is in the order of terabytes.&lt;/p&gt;
&lt;p&gt;JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This has led to a number of binary encodings for JSON (BSON, BJSON, etc) and XML (WBXML, etc). BSON is used as the primary data representation in MongoDB for example.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Thrift and Protocol Buffers:&lt;/strong&gt; These are binary encoding libraries. Protocol Buffers were developed at Google, while Thrift was developed at Facebook.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Avro:&lt;/strong&gt; Another binary encoding format different from the two above. This started out as a sub project of Hadoop.&lt;/p&gt;
&lt;p&gt;These encoding libraries have some interesting encoding rules which I skipped: &lt;a href=&quot;http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html&quot; title=&quot;http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html&quot;&gt;http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&quot;modes-of-dataflow&quot;&gt;Modes of Dataflow &lt;a class=&quot;direct-link&quot; href=&quot;#modes-of-dataflow&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Recall that it was stated earlier that to send data from one process to another with which you don&#39;t share memory, it needs to be encoded as a sequence of bytes.&lt;/p&gt;
&lt;p&gt;We also said that forward and backward compatibility are important.&lt;/p&gt;
&lt;p&gt;Here, we&#39;ll explore how data flows between processes:&lt;/p&gt;
&lt;h4 id=&quot;dataflow-through-databases&quot;&gt;Dataflow Through Databases &lt;a class=&quot;direct-link&quot; href=&quot;#dataflow-through-databases&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The process that writes to a database encodes the data, while the process the reads from it decodes it. It could be the same process doing both, or different processes.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Forward compatibility&lt;/em&gt; is required in databases: If different processes are accessing the database, and one of the processes is from a newer version of the application ( say during a rolling upgrade), the newer code might write a value to the database. Forward compatibility is the ability of the process running the &lt;em&gt;old&lt;/em&gt; code to be able to read the data written by the new code.&lt;/p&gt;
&lt;p&gt;We also need &lt;em&gt;backward compatibility&lt;/em&gt; so that code from a newer version of the app can read data written by an older version.&lt;/p&gt;
&lt;p&gt;Data outlives code, and oftentimes there&#39;s a need to migrate data to a new schema. Avro has sophisticated schema evolution rules that can allow a database to appear as if it was encoded with a single schema, even though the underlying storage may contain records encoded with previous schema versions.&lt;/p&gt;
&lt;h4 id=&quot;dataflow-through-services%3A-rest-and-rpc&quot;&gt;Dataflow Through Services: REST and RPC &lt;a class=&quot;direct-link&quot; href=&quot;#dataflow-through-services%3A-rest-and-rpc&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;When there&#39;s communication between processes over a network, a common arrangement is to have two roles: &lt;em&gt;clients&lt;/em&gt; (e.g. web browser) and &lt;em&gt;servers.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The server typically exposes an API over the network for the client to make requests. This API is known as a &lt;em&gt;service.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A server can also be a client to another service. E.g. a web app server is usually a client to a database.&lt;/p&gt;
&lt;p&gt;A difference between a web app service and a database service is that there are usually tighter restrictions on the former.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Service-oriented architecture (SOA)&lt;/em&gt;: Decomposing a large application into smaller components by area of functionality.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Web Services&lt;/strong&gt;: If a service is communicated with using HTTP as the underlying protocol, it is called a &lt;em&gt;web service.&lt;/em&gt; Two approaches to web services are &lt;em&gt;REST&lt;/em&gt; and &lt;em&gt;SOAP&lt;/em&gt; (Simple Object Access Protocol).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;RPC:&lt;/strong&gt; The RPC model tries to make a request to a remote network service look the same as calling a function or method within the same process (&lt;em&gt;location transparency&lt;/em&gt; - in computer networks, location transparency is the use of names to identify network resources, rather than their actual location).&lt;/p&gt;
&lt;p&gt;There are certain problems with this approach though, which can be summarized under the fundamental fact that network calls are different from function calls. E.g.:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A local function call is predictable and succeeds or fails depending on parameters under my control. A network call is unpredictable - the request or response packets can get lost, the remote machine may be slow etc.&lt;/li&gt;
&lt;li&gt;A local function call either returns a result, or throws an exception, or never returns (infinite loop). A network request has another possible outcome, it may return without a result, due to a timeout. There&#39;s no way of knowing whether the request got through or not.&lt;/li&gt;
&lt;li&gt;Retrying a failed network request could cause the action to be performed multiple times if the request actually got through, but the response was lost. Building a system for idempotence could prevent this though.&lt;/li&gt;
&lt;li&gt;The client and service may be implemented in different languages, so the RPC library would need to translate datatypes from one language to another. This is tricky because not all languages have the same types. This problem does not exist in a single process written in a single language.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Despite these problems, RPC isn&#39;t going away. The new generation of RPC frameworks, such as Finagle, &lt;a href=&quot;http://rest.li/&quot;&gt;Rest.li&lt;/a&gt; and gRPC, are explicit about the difference between a remote request and a local function call.&lt;/p&gt;
&lt;p&gt;The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same datacenter.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Q - How exactly do RPCs differ from REST? Is it just the way the endpoints look?&lt;/em&gt;&lt;/p&gt;
&lt;h4 id=&quot;message-passing-dataflow&quot;&gt;Message-Passing Dataflow &lt;a class=&quot;direct-link&quot; href=&quot;#message-passing-dataflow&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Asynchronous message-passing systems are somewhere between RPC and databases.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Similar to RPCs because a client&#39;s request is delivered to another process with low latency.&lt;/li&gt;
&lt;li&gt;Similar to databases in that a message is not sent via a direct network connection, but via an intermediary called a &lt;em&gt;message broker&lt;/em&gt; or &lt;em&gt;message queue.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;advantages-of-a-message-broker&quot;&gt;Advantages of a message broker &lt;a class=&quot;direct-link&quot; href=&quot;#advantages-of-a-message-broker&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;It can act as a buffer if the recipient is down or unable to receive messages due to overloading.&lt;/li&gt;
&lt;li&gt;The sender can act without knowing the IP address and port number of the recipient (which can change quite often - especially in a cloud deployment where VMs come and go)&lt;/li&gt;
&lt;li&gt;A message can be delivered to multiple recipients.&lt;/li&gt;
&lt;li&gt;It can retry message delivery to a crashed process and prevent messages from being lost.&lt;/li&gt;
&lt;li&gt;It decouples the sender from the recipient. The sender does not need to know anything about the recipient.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The communication pattern here is usually asynchronous - the sender does not wait for the message to be delivered, but simply sends it and forgets about it.&lt;/p&gt;
&lt;h5 id=&quot;message-brokers&quot;&gt;Message brokers &lt;a class=&quot;direct-link&quot; href=&quot;#message-brokers&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The configuration of message brokers varies, but in general they&#39;re used as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A process sends a message to a named &lt;em&gt;queue&lt;/em&gt; or &lt;em&gt;topic&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;The broker ensures that the message is delivered to one or more &lt;em&gt;consumers&lt;/em&gt; or &lt;em&gt;subscribers&lt;/em&gt; to that queue or topic.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A topic can have many producers and many consumers.&lt;/p&gt;
&lt;h5 id=&quot;distributed-actor-frameworks&quot;&gt;Distributed actor frameworks &lt;a class=&quot;direct-link&quot; href=&quot;#distributed-actor-frameworks&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;The &lt;em&gt;actor&lt;/em&gt; model is a programming model for concurrency in a single process. Each part of the system is represented as an actor. An actor is usually a client or an entity which communicates with other actors by sending and receiving asynchronous messages.&lt;/p&gt;
&lt;p&gt;In distributed actor frameworks, this model is especially useful for scaling an application across multiple nodes as the same message-passing mechanism is used, regardless of whether the sender and recipient are on the same or different nodes.&lt;/p&gt;
&lt;p&gt;This framework integrates the actor programming model and the message broker into a single framework. 3 popular distributed actor frameworks are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Akka&lt;/li&gt;
&lt;li&gt;Orleans&lt;/li&gt;
&lt;li&gt;Erlang OTP&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>Chapter 3 - Storage and Retrieval</title>
		<link href="https://timilearning.com/posts/ddia/part-one/chapter-3/"/>
		<updated>2019-12-07T16:21:35-00:00</updated>
		<id>https://timilearning.com/posts/ddia/part-one/chapter-3/</id>
		<content type="html">&lt;p&gt;These are my notes from the third chapter of Martin Kleppmann&#39;s Designing Data Intensive Applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#storage-engines&quot;&gt;Storage Engines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#log-structured-storage-engines&quot;&gt;Log-Structured Storage Engines&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#indexing&quot;&gt;Indexing&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#hash-index&quot;&gt;Hash Index&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sstables-and-lsm-trees&quot;&gt;SSTables and LSM-Trees&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#constructing-and-maintaining-sstables&quot;&gt;Constructing and maintaining SSTables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#making-an-lsm-tree-out-of-sstables&quot;&gt;Making an LSM-tree out of SSTables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#performance-optimizations&quot;&gt;Performance Optimizations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#b-trees&quot;&gt;B-Trees&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#making-b--trees-reliable&quot;&gt;Making B- Trees reliable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#b-tree-optimizations&quot;&gt;B-tree optimizations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#comparing-b-trees-and-lsm-trees&quot;&gt;Comparing B-Trees and LSM-Trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#advantages-of-lsm-trees&quot;&gt;Advantages of LSM Trees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#downsides-of-lsm-trees&quot;&gt;Downsides of LSM Trees&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#other-indexing-structures&quot;&gt;Other Indexing Structures&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#storing-values-within-the-index&quot;&gt;Storing values within the index&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#approach-2---heap-file&quot;&gt;Approach 2 - Heap file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#approach-1---actual-row&quot;&gt;Approach 1 - Actual row&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#covering-index&quot;&gt;Covering Index&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#multi-column-indexes&quot;&gt;Multi-column indexes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#full-text-search-and-fuzzy-indexes&quot;&gt;Full-text search and fuzzy indexes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#keep-everything-in-memory&quot;&gt;Keep everything in memory&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#transaction-processing-vs-analytics&quot;&gt;Transaction Processing vs Analytics&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#data-warehousing&quot;&gt;Data Warehousing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#stars-and-snowflakes-schemas-for-analytics&quot;&gt;Stars and Snowflakes: Schemas for Analytics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#column-oriented-storage&quot;&gt;Column Oriented Storage&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#column-compression&quot;&gt;Column Compression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#column-oriented-storage-and-column-families&quot;&gt;Column-oriented storage and column families&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#sort-order-in-column-storage&quot;&gt;Sort Order in Column Storage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further reading:&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;This chapter is about how databases work under the hood.&lt;/p&gt;
&lt;p&gt;There&#39;s a difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics.&lt;/p&gt;
&lt;h1 id=&quot;storage-engines&quot;&gt;Storage Engines &lt;a class=&quot;direct-link&quot; href=&quot;#storage-engines&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;There are two families of storage engines: log-structured storage engines (log structured merge trees), and page-oriented storage engines (b-trees). A storage engine’s job is to write things to disk on a single node.&lt;/p&gt;
&lt;h1 id=&quot;log-structured-storage-engines&quot;&gt;Log-Structured Storage Engines &lt;a class=&quot;direct-link&quot; href=&quot;#log-structured-storage-engines&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Many databases internally use a &lt;em&gt;log,&lt;/em&gt; an append-only data file to which records are added. Each line in the log contains a key-value pair, separated by a comma (similar to a CSV file, ignoring escaping issues). The log does not have to be human-readable; it might be binary and intended only for other programs to read.&lt;/p&gt;
&lt;h2 id=&quot;indexing&quot;&gt;Indexing &lt;a class=&quot;direct-link&quot; href=&quot;#indexing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;An index is an additional structure derived from the primary data. Any kind of index usually slows down writes, since the index has to be updated every time data is written.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Well-chosen indexes speed up read queries, but every index slows down writes.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&quot;hash-index&quot;&gt;Hash Index &lt;a class=&quot;direct-link&quot; href=&quot;#hash-index&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;These are indexes for key-value data. For a data storage that consists of only appending to a file, a simple indexing strategy is to keep an in-memory hash map where the value for every key is a byte offset, which indicates where the key is located in the file.&lt;/p&gt;
&lt;p&gt;Bitcask (the default storage engine in Riak - Riak is a distributed datastore similar to Cassandra) uses the approach above. The only requirement is that all the keys fit in the available RAM as the hash map is kept completely in memory. The values don&#39;t have to fit in memory since they can be loaded from disk with a simple disk seek. Something like Bitcask is suitable for situations where the value for a key is updated frequently.&lt;/p&gt;
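&lt;p&gt;A toy version of this idea, in the spirit of Bitcask but purely my own sketch (using an in-memory buffer as a stand-in for the append-only file on disk): the hash map holds only byte offsets, and a read is one hash lookup plus one seek:&lt;/p&gt;

```python
import io

class HashIndexedLog:
    """Append-only log with an in-memory hash index of byte offsets:
    keys live in RAM, values live on 'disk' (an in-memory stand-in here)."""
    def __init__(self):
        self.log = io.BytesIO()       # stand-in for an append-only file
        self.index = {}               # key -> byte offset of latest record

    def set(self, key, value):
        self.log.seek(0, io.SEEK_END)          # always append at the end
        self.index[key] = self.log.tell()      # remember where this record starts
        self.log.write(f"{key},{value}\n".encode())

    def get(self, key):
        offset = self.index[key]               # one hash lookup...
        self.log.seek(offset)                  # ...and one disk seek
        line = self.log.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]           # assumes no comma in the key

db = HashIndexedLog()
db.set("name", "timi")
db.set("name", "adeniran")   # the old record stays in the log; the index moves on
assert db.get("name") == "adeniran"
```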
&lt;p&gt;The obvious challenge in appending to a file is that the file can grow too large and then we run out of disk space. A solution to this is to break the log into segments of a certain size. A segment file is closed when it reaches that size, and subsequent writes are made to a new segment.&lt;/p&gt;
&lt;p&gt;We can then perform &lt;em&gt;compaction&lt;/em&gt; on these segments. Compaction means keeping the most recent update for each key and throwing away duplicate keys. Compaction often makes segments smaller (relies on the assumption that a key is overwritten several times on average within one segment), and so we can &lt;em&gt;merge&lt;/em&gt; several segments together at the same time as performing the compaction.&lt;/p&gt;
&lt;p&gt;Basically, we compact and merge segment files together. The merged segment is written to a new file. This can happen as a background process, so the old segment files can still serve read and write requests until the merging process is complete.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Each segment will have its own in-memory hash table.&lt;/strong&gt; To find a value for a key, we&#39;ll check the most recent segment. If it&#39;s not there, we&#39;ll check the second-most-recent segment, and so on.&lt;/p&gt;
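&lt;p&gt;The compact-and-merge step can be sketched as follows (an illustration of the idea, not any engine&#39;s actual code): walk the segments oldest-first so that later records override earlier ones, and keep only the most recent value per key:&lt;/p&gt;

```python
def compact_and_merge(segments):
    """Merge several log segments into one compacted segment, keeping only
    the most recent value for each key. Segments are ordered oldest-first;
    each one is a list of (key, value) records in write order."""
    latest = {}
    for segment in segments:          # later segments override earlier ones
        for key, value in segment:    # later records override earlier ones
            latest[key] = value
    return list(latest.items())

seg1 = [("a", 1), ("b", 2), ("a", 3)]   # "a" was overwritten within the segment
seg2 = [("b", 4), ("c", 5)]             # newer segment overrides "b"
assert dict(compact_and_merge([seg1, seg2])) == {"a": 3, "b": 4, "c": 5}
```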
&lt;p&gt;There are certain practical issues that must be considered in a real-life implementation of this hash index on a log structure. Some of them are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;File Format&lt;/em&gt;: As opposed to using a CSV format, it&#39;s faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Deleting Records&lt;/em&gt;: To delete a key and its value, it&#39;s not practical to search for all the occurrences of that key in the segments. What happens is that a special deletion record is appended to the data file (sometimes called a &lt;em&gt;tombstone).&lt;/em&gt; When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Crash Recovery&lt;/em&gt;: If the database is restarted, the in-memory hash maps will be lost. In principle, each segment&#39;s hash map can be restored by reading the entire segment file and constructing the hash map from scratch. This might take a while, though, and could make server restarts painful. &lt;strong&gt;Bitcask&#39;s&lt;/strong&gt; approach to recovery is to store a snapshot of each segment&#39;s hash map on disk, which can be loaded into memory more quickly.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Concurrency Control:&lt;/em&gt; Since writes are appended in a sequential order, a common implementation is to have only one writer thread. Data files are append-only and otherwise immutable, so they can be read concurrently by multiple threads.&lt;/li&gt;
&lt;/ul&gt;
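The length-prefixed binary format from the first bullet can be sketched like this: each field is written as a 4-byte big-endian length followed by the raw bytes, so the reader never has to search for delimiters the way a CSV parser does.

```python
import struct

def encode_record(key, value):
    """Encode one key-value pair as [len(key)][key][len(value)][value]."""
    k, v = key.encode(), value.encode()
    return struct.pack(">I", len(k)) + k + struct.pack(">I", len(v)) + v

def decode_record(buf, offset=0):
    """Decode one record starting at `offset`; return it plus the next offset."""
    (klen,) = struct.unpack_from(">I", buf, offset)
    offset += 4
    key = buf[offset:offset + klen].decode()
    offset += klen
    (vlen,) = struct.unpack_from(">I", buf, offset)
    offset += 4
    value = buf[offset:offset + vlen].decode()
    return key, value, offset + vlen
```

Because each record announces its own length, records can be concatenated back to back in a segment file and read sequentially without ambiguity.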
&lt;p&gt;There are good reasons to prefer an append-only log over a storage scheme that updates files in place, overwriting the old value with the new one. Some of those reasons are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Appending and segment merging are sequential write operations, which are generally faster than random writes, especially on magnetic spinning-disk hard drives.&lt;/li&gt;
&lt;li&gt;Concurrency and crash recovery are simpler if segment files are append-only or immutable. For crash recovery, you don&#39;t need to worry if a crash happened while a value was being overwritten, leaving you with partial data.&lt;/li&gt;
&lt;li&gt;Merging old segments avoids the problem of data files becoming fragmented over time. Fragmentation occurs on a hard drive, a memory module, or other media when related data is not stored physically close together; the scattered individual pieces of data are referred to as fragments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Basically, when data files are scattered far apart from each other on disk, that&#39;s a form of fragmentation.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;There are limitations to the hash table index, though. Some of them are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The hash table must fit in memory, so it&#39;s not efficient if there are a large number of keys. An option is to maintain the map on disk, but that doesn&#39;t perform well: it requires a lot of random-access I/O, it is expensive to grow when it becomes full, and hash collisions require fiddly logic.&lt;/li&gt;
&lt;li&gt;Range queries are not efficient. You have to look up each key individually in the map.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So, in this approach, writes are made to segments on disk while the hash table index is kept in memory.&lt;/p&gt;
&lt;h3 id=&quot;sstables-and-lsm-trees&quot;&gt;SSTables and LSM-Trees &lt;a class=&quot;direct-link&quot; href=&quot;#sstables-and-lsm-trees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In log segments with hash indexes, each key-value pair appears in the order that it was written, and values later in the log take precedence over values for the same key earlier in the log. Apart from that, the order of key-value pairs in the file is irrelevant.&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;Sorted String Table&lt;/em&gt; format, or SSTable for short, changes this approach: it requires that the sequence of key-value pairs be sorted by key. As a result, new key-value pairs can no longer be appended to the segment immediately. SSTables have several advantages over log segments with hash indexes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Merging segments is simple and efficient. The approach is similar to the merge step of the mergesort algorithm. If a key is duplicated across several segments, the same principle as the log with hash indexes applies: we keep the most recent value and discard the others.&lt;/li&gt;
&lt;li&gt;You don&#39;t need to keep an index of all the keys in memory. Because the file is sorted by key, to find the offset of a particular key it&#39;s enough to know the offsets of nearby keys that are smaller and larger than it in the ordering; you can then scan between them.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You still need an in-memory index to tell you the offsets of some keys, but it can be sparse.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read requests often need to scan over several key-value pairs in a range, so it is possible to group records into a block and compress it before writing it to disk. Each entry of the sparse index then points to the start of a compressed block. This has the advantage of saving disk space and reducing the I/O bandwidth used.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;SSTables store their keys in blocks and have an internal index, so even though a single SSTable may be very large (gigabytes in size), only the index and the relevant block need to be loaded into memory.&lt;/p&gt;
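A sparse index lookup over a sorted segment can be sketched like this: only some keys are indexed, and to find any key we binary-search for the greatest indexed key at or below the target, then scan forward from that position.

```python
import bisect

def find(key, sorted_records, sparse_index):
    """Look up `key` in a key-sorted segment using a sparse index.

    sorted_records: list of (key, value) sorted by key.
    sparse_index:   sorted list of (key, position-in-sorted_records),
                    covering only some of the keys.
    """
    indexed_keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(indexed_keys, key) - 1
    if i < 0:
        return None                    # key sorts before every indexed key
    start = sparse_index[i][1]
    for k, v in sorted_records[start:]:
        if k == key:
            return v
        if k > key:                    # passed where it would be: absent
            return None
    return None
```

The denser the sparse index, the shorter the forward scan; in a real engine each index entry would point at the start of a (possibly compressed) block rather than a list position.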
&lt;h4 id=&quot;constructing-and-maintaining-sstables&quot;&gt;Constructing and maintaining SSTables &lt;a class=&quot;direct-link&quot; href=&quot;#constructing-and-maintaining-sstables&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;It&#39;s possible to maintain a sorted structure on disk (see B-Trees), but maintaining it in memory is easier and is the approach described here: use a well-known tree data structure, such as a red-black tree or AVL tree, into which keys can be inserted in any order and read back in sorted order.&lt;/p&gt;
&lt;p&gt;So the storage engine works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When a write comes in, it is written to an in-memory balanced tree data structure. The in-memory tree is sometimes called a &lt;em&gt;memtable.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;When the memtable exceeds a threshold, write it out to disk as an SSTable file. This operation is efficient because the tree already maintains the key-value pairs sorted by key. The new SSTable file then becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.&lt;/li&gt;
&lt;li&gt;To serve a read request, first check for the key in the memtable. If it&#39;s not there, check the most recent segment, then the next-older segment, and so on.&lt;/li&gt;
&lt;li&gt;From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;An obvious problem with this approach is that if the database crashes, the most recent writes (which are in the memtable but not yet written to disk) will disappear. To avoid that problem, one approach is to keep a separate log on disk to which every write is immediately appended. This separate log is not in sorted order, but that&#39;s irrelevant because the content can easily be sorted in a memtable. The corresponding log can be discarded every time the memtable is written out to an SSTable.&lt;/p&gt;
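The whole workflow (memtable, write-ahead log, flush, newest-first reads) can be sketched as a toy class. Assumptions: a plain dict stands in for the balanced tree, an in-memory list stands in for the on-disk log, and the flush threshold is a key count rather than a byte size.

```python
class MiniLSM:
    """A toy model of the memtable + SSTable workflow (illustrative only)."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.memtable = {}       # dict stands in for a balanced tree
        self.wal = []            # stands in for the on-disk write-ahead log
        self.segments = []       # flushed key-sorted segments, oldest first

    def put(self, key, value):
        self.wal.append((key, value))       # append to the log first
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            # Flush: the tree already keeps pairs sorted, so writing the
            # segment is just emitting them in order; then start fresh.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}
            self.wal = []                   # log for flushed data discarded

    def get(self, key):
        if key in self.memtable:            # memtable first
            return self.memtable[key]
        for segment in reversed(self.segments):   # then newest segment first
            for k, v in segment:
                if k == key:
                    return v
        return None
```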
&lt;h4 id=&quot;making-an-lsm-tree-out-of-sstables&quot;&gt;Making an LSM-tree out of SSTables &lt;a class=&quot;direct-link&quot; href=&quot;#making-an-lsm-tree-out-of-sstables&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The algorithm described above is used in LevelDB and RocksDB, key-value storage engine libraries designed to be embedded into other applications. Among other things, LevelDB can be used in Riak as an alternative to Bitcask as its storage engine.&lt;/p&gt;
&lt;p&gt;This indexing structure was originally described under the name &lt;em&gt;Log-Structured Merge-Tree.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Lucene, an indexing engine for full-text search, uses a similar method for storing its &lt;em&gt;term dictionary&lt;/em&gt;. A full-text index is more complex than a key-value index but is based on a similar idea: given a word in a search query, find all the documents that mention it. It&#39;s usually implemented with a key-value structure where the key is a word (a &lt;em&gt;term&lt;/em&gt;) and the value is the list of IDs of all the documents that contain the word (the &lt;em&gt;postings list&lt;/em&gt;). In Lucene, the mapping from term to postings list is kept in SSTable-like sorted files that are merged in the background as needed.&lt;/p&gt;
&lt;h4 id=&quot;performance-optimizations&quot;&gt;Performance Optimizations &lt;a class=&quot;direct-link&quot; href=&quot;#performance-optimizations&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The LSM-tree algorithm can be slow when looking up keys that do not exist in the database: you first have to check the memtable, then all the segments all the way up to the oldest (possibly having to read from disk for each one) to be certain that the key does not exist. In order to optimize this access, storage engines often make use of &lt;em&gt;Bloom filters.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;Bloom filter&lt;/em&gt; is a memory-efficient data structure for approximating the contents of a set. It can tell you if a key does not appear in a database, thus saving you from unnecessary disk reads for nonexistent keys.&lt;/p&gt;
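A minimal Bloom filter sketch: k hash functions set k bits per key, and a membership test checks whether all k bits are set. It can return false positives but never false negatives, so a "no" answer safely skips the disk reads. The sizes here are arbitrary illustrative choices.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive `hashes` bit positions by salting one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(key))
```

An LSM storage engine would keep one such filter per SSTable and consult it before touching the file at all.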
&lt;p&gt;There are also strategies to determine the order and timing of how SSTables are compacted and merged. The two most common options are &lt;em&gt;size-tiered&lt;/em&gt; and &lt;em&gt;leveled&lt;/em&gt; compaction. LevelDB and RocksDB use leveled compaction, HBase uses size-tiered, and Cassandra supports both.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Size-Tiered Compaction&lt;/em&gt;: Here, newer and smaller SSTables are successively merged into older and larger SSTables.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Leveled Compaction&lt;/em&gt;: The key range is split into smaller SSTables and older data is moved into separate &amp;quot;levels&amp;quot;. This allows compaction to proceed more incrementally and use less disk space. The levels are structured roughly so that each level is, in total, 10x as large as the level above it. New keys arrive at the highest level, and as that level grows and hits a threshold, some SSTables at that level are compacted into fewer (but larger) SSTables one level down.&lt;/p&gt;
&lt;p&gt;Within a single level, SSTables are non-overlapping: one SSTable might contain keys covering the range (a,b), the next (c,d), and so on. The key-space does overlap between levels: if you have two levels, the first might have two SSTables (covering the ranges above), but the second level might have a single SSTable over the key space (a,e). Looking for the key &#39;aardvark&#39; may require looking in two SSTables: the (a,b) SSTable in Level 1, and the (a,e) SSTable in Level 2.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Basically, a level has many SSTables.&lt;/em&gt;&lt;/p&gt;
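The 'aardvark' example above can be sketched as a small lookup helper. Since SSTables within a level cover non-overlapping key ranges, at most one table per level can hold a given key, so a lookup touches at most one table per level.

```python
def tables_to_check(key, levels):
    """Return the SSTables that could hold `key`, newest level first.

    levels: list of levels, each a list of (lo, hi, name) key ranges
            that do not overlap within a level.
    """
    hits = []
    for level in levels:
        for lo, hi, name in level:
            if lo <= key <= hi:
                hits.append(name)
                break              # ranges in one level don't overlap
    return hits
```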
&lt;h3 id=&quot;b-trees&quot;&gt;B-Trees &lt;a class=&quot;direct-link&quot; href=&quot;#b-trees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;B-trees are a popular indexing structure. Like SSTables, they keep key-value pairs sorted by key, but the similarity ends there.&lt;/p&gt;
&lt;p&gt;Log-structured indexes break the database down into segments, whereas B-trees break it down into fixed-size &lt;em&gt;blocks&lt;/em&gt; or &lt;em&gt;pages&lt;/em&gt;. Each page can be identified by its address or location on disk, which allows one page to refer to another. Pages are usually small, typically 4 KB, compared to segments, which can be several megabytes. Pages are stored on disk.&lt;/p&gt;
&lt;p&gt;One page is designated as the root of the B-tree; whenever you want to look up a key in the index, you start here. The page contains several keys and references to child pages. Each child is responsible for a continuous range of keys, and the keys between the references indicate where the boundaries between those ranges lie.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Branching factor:&lt;/em&gt; The number of references to child pages in one page of the B-tree.&lt;/p&gt;
&lt;p&gt;To update the value of an existing key in a B-tree, you search for the leaf page containing that key, change the value in that page, and write the page back to disk (any references to that page remain valid). To add a new key, find the page whose range encompasses the new key and add it to that page. If there&#39;s no free space on that page, split the page into two half-full pages, and update the parent page to account for the new subdivision of key ranges.&lt;/p&gt;
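The descent from root to leaf can be sketched as follows. An internal page holds sorted boundary keys and one more child reference than keys; the lookup descends to the child whose range covers the key until it reaches a leaf. This is a minimal in-memory model, ignoring disk pages, splits, and balancing.

```python
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted boundary keys (or leaf keys)
        self.children = children  # child pages: len(keys) + 1 of them
        self.values = values      # leaf values, aligned with keys

def search(page, key):
    """Descend from the root page to a leaf, then look up the key."""
    while page.children is not None:               # internal page: descend
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)         # leaf page: exact match?
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None
```

With a branching factor of several hundred, this descent touches only a handful of pages even for very large trees, which is why B-tree lookups need few disk reads.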
&lt;h4 id=&quot;making-b--trees-reliable&quot;&gt;Making B-Trees reliable &lt;a class=&quot;direct-link&quot; href=&quot;#making-b--trees-reliable&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The main write operation of a B-tree is to overwrite a page on disk with new data. The assumption is that an overwrite does not change where the page is located, i.e. all the references to a page remain intact when it is overwritten. This differs from LSM-trees, which never update files in place and are append-only.&lt;/p&gt;
&lt;p&gt;Some operations require several pages to be overwritten, e.g. when a page is split because an insertion made it overfull: we need to write the two pages that were split and update the parent page with references to both. This is dangerous, because if the database crashes after only some of the pages have been written, the index can be left corrupted.&lt;/p&gt;
&lt;p&gt;A solution used to make databases resilient to crashes is to keep a &lt;strong&gt;&lt;em&gt;write-ahead log&lt;/em&gt;&lt;/strong&gt; on disk. It is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. It&#39;s used to restore the DB when it comes back from a crash.&lt;/p&gt;
&lt;p&gt;There are also concurrency issues associated with updating pages in place. If multiple threads access a B-tree at the same time, a thread may see the tree in an inconsistent state. The solution is usually implemented by protecting the tree&#39;s data structures with &lt;em&gt;latches&lt;/em&gt; (lightweight locks). This is not an issue with log structured approaches since all the merging happens in the background without interfering with incoming queries.&lt;/p&gt;
&lt;h4 id=&quot;b-tree-optimizations&quot;&gt;B-tree optimizations &lt;a class=&quot;direct-link&quot; href=&quot;#b-tree-optimizations&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Different optimizations have been made with B-trees:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Additional pointers have been added to the tree. E.g. a leaf page may have references to its sibling pages to the left and right, which allows scanning keys in order without jumping back to parent pages.&lt;/li&gt;
&lt;li&gt;Some databases use a copy-on-write scheme instead of overwriting pages and maintaining a WAL for crash recovery. What this means is that a modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;So, page-structured storage engines are organized into fixed-size pages. These pages are all part of a tree called a B-tree.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;SQLite, for example, has a B-tree for every table in the database, as well as one for every index. For the index B-trees, the key stored on a page is the indexed column value, while the value is the rowid of the row where it can be found. For the table B-tree, the key is the rowid, while the value (I believe) is all the data in that row: &lt;a href=&quot;https://jvns.ca/blog/2014/10/02/how-does-sqlite-work-part-2-btrees/&quot; title=&quot;https://jvns.ca/blog/2014/10/02/how-does-sqlite-work-part-2-btrees/&quot;&gt;https://jvns.ca/blog/2014/10/02/how-does-sqlite-work-part-2-btrees/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://hackernoon.com/fundamentals-of-system-design-part-3-8da61773a631&quot; title=&quot;https://hackernoon.com/fundamentals-of-system-design-part-3-8da61773a631&quot;&gt;https://hackernoon.com/fundamentals-of-system-design-part-3-8da61773a631&lt;/a&gt;&lt;/p&gt;
&lt;h4 id=&quot;comparing-b-trees-and-lsm-trees&quot;&gt;Comparing B-Trees and LSM-Trees &lt;a class=&quot;direct-link&quot; href=&quot;#comparing-b-trees-and-lsm-trees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;As a rule of thumb, LSM trees are typically faster for writes, while B-trees are thought to be faster for reads. Reads are slower on LSM-trees because they have to check different data structures and SSTables at different stages of compaction.&lt;/p&gt;
&lt;h4 id=&quot;advantages-of-lsm-trees&quot;&gt;Advantages of LSM Trees &lt;a class=&quot;direct-link&quot; href=&quot;#advantages-of-lsm-trees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once to the page itself (and perhaps again as pages are split). There&#39;s also overhead from having to write an entire page at a time, even if only a few bytes in the page change. Log-structured indexes also rewrite data multiple times due to the repeated compaction and merging of SSTables.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Write amplification:&lt;/em&gt;&lt;/strong&gt; When one write to the database results in multiple writes to the disk over the course of the database&#39;s lifetime. This is of particular concern on SSDs, which can only overwrite blocks a limited number of times before wearing out.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LSM-trees are typically able to sustain higher write throughput than B-trees partly because they sometimes have lower write amplification, and also because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree. This is important on magnetic hard drives, where sequential writes are faster than random writes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;LSM-trees can be compressed better, and this often produces smaller files on disk than B-trees. B-trees leave some disk space unused due to fragmentation: when a row cannot fit into an existing page, or when a page is split, some space in a page remains unused (basically, if the existing space on a page cannot fit a new row, the row is moved to another page). Sending and receiving smaller files over I/O is useful if your bandwidth is limited.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On many SSDs, the firmware internally uses a log-structured algorithm to turn random writes into sequential writes on the underlying storage chips, so the impact of the storage engine&#39;s write pattern is less pronounced (see the second advantage above). Note that lower write amplification and reduced fragmentation are still advantageous on SSDs: representing data more compactly allows more read and write requests within the available I/O bandwidth.&lt;/p&gt;
&lt;h4 id=&quot;downsides-of-lsm-trees&quot;&gt;Downsides of LSM Trees &lt;a class=&quot;direct-link&quot; href=&quot;#downsides-of-lsm-trees&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A downside of log-structured storage is that the background compaction process can interfere with the performance of ongoing reads and writes. Storage engines typically try to perform compaction incrementally and without affecting concurrent access, but it can easily happen that a request needs to wait while the disk finishes an expensive compaction operation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Another issue with compaction arises at high write throughput: the disk needs to share its finite write bandwidth between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background.&lt;/p&gt;
&lt;p&gt;Compaction has to keep up with the rate of incoming writes, even at high write throughput.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A key can exist in multiple places across different segments in an LSM-tree, whereas in a B-tree each key exists in exactly one place. This makes B-trees more appealing for strong transactional semantics: e.g. in B-trees, transaction isolation can be implemented by attaching locks to a range of keys in the tree.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;other-indexing-structures&quot;&gt;Other Indexing Structures &lt;a class=&quot;direct-link&quot; href=&quot;#other-indexing-structures&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We&#39;ve mainly covered key-value indexes which are like a primary key index, but we can also have secondary indexes. You can typically create several secondary indexes on the same table in relational databases. A secondary index can be constructed from a key-value index.&lt;/p&gt;
&lt;p&gt;With secondary indexes, note that the indexed values are not necessarily unique. Several rows (documents, vertices) may exist under the same index entry. This can be expressed in two ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Making each value in the index a list of matching row identifiers.&lt;/li&gt;
&lt;li&gt;Making each key unique by appending a row identifier to it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both B-trees and log-structured indexes can be used as secondary indexes.&lt;/p&gt;
&lt;h4 id=&quot;storing-values-within-the-index&quot;&gt;Storing values within the index &lt;a class=&quot;direct-link&quot; href=&quot;#storing-values-within-the-index&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The key in an index is the column value that queries search for, but the value can be either:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The actual row (document, vertex) in question&lt;/li&gt;
&lt;li&gt;A reference to the row stored elsewhere. The rows in this case are stored somewhere known as a &lt;strong&gt;&lt;em&gt;heap file&lt;/em&gt;&lt;/strong&gt;, which stores data in no particular order (could be append-only, or may keep track of deleted rows in order to overwrite them with new data later). This approach is common because it avoids duplicating data in the presence of several secondary indexes. Each index just references a location in the heap file.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 id=&quot;approach-2---heap-file&quot;&gt;Approach 2 - Heap file &lt;a class=&quot;direct-link&quot; href=&quot;#approach-2---heap-file&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The heap file approach can be efficient when updating a value without changing the key, provided the new value is not larger than the old value. If it is larger, the record might need to be moved to a new location in the heap where there is enough space. When this happens, either all indexes need to be updated to point to the new heap location of the record, or a &lt;strong&gt;forwarding pointer&lt;/strong&gt; is left behind in the old heap location.&lt;/p&gt;
&lt;h4 id=&quot;approach-1---actual-row&quot;&gt;Approach 1 - Actual row &lt;a class=&quot;direct-link&quot; href=&quot;#approach-1---actual-row&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In some cases, the hop from the index to the heap file is too much of a performance penalty for reads, so the indexed row is stored directly within the index. This is known as a &lt;strong&gt;&lt;em&gt;clustered index.&lt;/em&gt;&lt;/strong&gt; In MySQL&#39;s InnoDB storage engine, the primary key of a table is always a clustered index, and secondary indexes refer to the primary key (rather than a heap file).&lt;/p&gt;
&lt;h4 id=&quot;covering-index&quot;&gt;Covering Index &lt;a class=&quot;direct-link&quot; href=&quot;#covering-index&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;There&#39;s a compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index), known as a &lt;em&gt;covering index&lt;/em&gt; or &lt;em&gt;index with included columns&lt;/em&gt;, which stores &lt;em&gt;some&lt;/em&gt; of a table&#39;s columns within the index. With this approach, some queries can be answered using the index alone.&lt;/p&gt;
&lt;h4 id=&quot;multi-column-indexes&quot;&gt;Multi-column indexes &lt;a class=&quot;direct-link&quot; href=&quot;#multi-column-indexes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;We&#39;ve only dealt with indexes that map a single key to a value so far. We need more than that if we want to query multiple columns of a table (or multiple fields of a document) simultaneously.&lt;/p&gt;
&lt;p&gt;The most common type of multi-column index is a &lt;strong&gt;concatenated index.&lt;/strong&gt; This type of index combines several fields into one key by appending the columns. It is useless if you only want to search by one of the trailing columns alone. The columns should be ordered to match common search patterns, because the index sorts by the first column first.&lt;/p&gt;
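A concatenated index can be sketched as a list sorted by a composite key. The example data below is made up for illustration: the index answers queries on a prefix of the columns (last name, or last name plus first name), but not on first name alone.

```python
rows = [
    {"last": "Smith", "first": "Anna", "id": 1},
    {"last": "Smith", "first": "Zoe", "id": 2},
    {"last": "Adams", "first": "Ben", "id": 3},
]

# Build the concatenated index: entries sorted by (last, first).
index = sorted((r["last"], r["first"], r["id"]) for r in rows)

def lookup_prefix(last, first=None):
    """Find row ids by last name, optionally narrowed by first name."""
    return [rid for l, f, rid in index
            if l == last and (first is None or f == first)]
```

A query on `first` alone would have to scan the whole index, which is exactly the limitation described above.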
&lt;p&gt;Another approach is a &lt;strong&gt;multi-dimensional index.&lt;/strong&gt; These are a more general way of querying several columns at once, which is useful for geospatial data, for example. Say you want to search for records within both a longitude range &lt;strong&gt;and&lt;/strong&gt; a latitude range: an LSM-tree or B-tree cannot answer that efficiently. It can give you all the records within a range of latitudes (but at any longitude), or within a range of longitudes, but not both simultaneously.&lt;/p&gt;
&lt;p&gt;An option is to translate a two-dimensional location into a single number using a space-filling curve, and then use a regular B-tree index.&lt;/p&gt;
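One concrete space-filling curve, used here only as an example, is the Z-order (Morton) curve: it interleaves the bits of the two coordinates into a single integer, so points that are close together in 2-D tend to be close in the resulting 1-D key, which a regular B-tree can index.

```python
def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits go in even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go in odd positions
    return key
```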
&lt;h4 id=&quot;full-text-search-and-fuzzy-indexes&quot;&gt;Full-text search and fuzzy indexes &lt;a class=&quot;direct-link&quot; href=&quot;#full-text-search-and-fuzzy-indexes&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The indexes we&#39;ve discussed so far assume that we have exact data, and we know the exact values of a key, or a range of values of a key with a sort order. For dealing with things like searching &lt;em&gt;similar&lt;/em&gt; keys, such as misspelled words, we look at &lt;em&gt;fuzzy&lt;/em&gt; querying techniques.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Levenshtein automaton&lt;/em&gt;: Supports efficient search for words within a given edit distance.&lt;/p&gt;
&lt;h4 id=&quot;keep-everything-in-memory&quot;&gt;Keep everything in memory &lt;a class=&quot;direct-link&quot; href=&quot;#keep-everything-in-memory&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;The data structures discussed so far are answers to the limitations of disks. Disks are awkward to deal with compared to main memory: with both magnetic disks and SSDs, data on disk must be laid out carefully to get good read and write performance. Disks have two significant advantages over main memory, though:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;They are durable. Content is not lost if power is turned off.&lt;/li&gt;
&lt;li&gt;They have a lower cost per gigabyte than RAM.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In-memory databases have been developing lately, especially since RAM has become cheaper and many datasets are simply not that big, making it feasible to keep them entirely in memory.&lt;/p&gt;
&lt;p&gt;In-memory databases aim for durability by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using special hardware (battery-powered RAM)&lt;/li&gt;
&lt;li&gt;Writing a log of changes to disk&lt;/li&gt;
&lt;li&gt;Writing periodic snapshots to disk&lt;/li&gt;
&lt;li&gt;Replicating the in-memory state to other machines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;VoltDB, MemSQL and Oracle TimesTen are in-memory databases with a relational model.&lt;/p&gt;
&lt;p&gt;Interestingly, the performance advantage of in-memory databases is not due to the fact that they don&#39;t need to read from disk. A disk-based storage engine may never need to read from disk if there&#39;s enough memory, because the OS caches recently used data blocks in memory anyway. Rather, they can be faster because there&#39;s no overhead of encoding in-memory data structures in a form that can be written to disk.&lt;/p&gt;
&lt;p&gt;Besides performance, an interesting area for in-memory databases is that they allow for the use of data models that are difficult to implement with disk-based indexes. E.g. Redis offers a db-like interface to data structures such as priority queues and sets.&lt;/p&gt;
&lt;p&gt;Recent research indicates that in-memory database architecture can be extended to support datasets larger than the available memory, without bringing back the overheads of a disk-centric architecture. This approach works by evicting the least recently used data from memory to disk when there&#39;s not enough memory, and loading it back into memory when it&#39;s accessed again in the future.&lt;/p&gt;
&lt;h3 id=&quot;transaction-processing-vs-analytics&quot;&gt;Transaction Processing vs Analytics &lt;a class=&quot;direct-link&quot; href=&quot;#transaction-processing-vs-analytics&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Transaction: A group of reads and writes that form a logical unit.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OLAP&lt;/strong&gt;: Online Analytics Processing. Refers to queries, generally performed by business analysts, that scan over a huge number of records and calculate aggregate statistics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OLTP:&lt;/strong&gt; Online Transaction Processing. Interactive queries which typically return a small number of records.&lt;/p&gt;
&lt;p&gt;In the past, OLTP-type queries and OLAP-type queries were performed on the same databases. However, there&#39;s been a push for OLAP-type queries to be run on &lt;strong&gt;data warehouses.&lt;/strong&gt;&lt;/p&gt;
&lt;h4 id=&quot;data-warehousing&quot;&gt;Data Warehousing &lt;a class=&quot;direct-link&quot; href=&quot;#data-warehousing&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;A data warehouse is a separate DB that analysts can query without affecting OLTP operations. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company. Data is extracted from OLTP databases and loaded into the warehouse using an ETL (&lt;em&gt;Extract-Transform-Load)&lt;/em&gt; process.&lt;/p&gt;
&lt;p&gt;It turns out that the indexing algorithms discussed so far work well for OLTP, but not so much for answering analytics queries.&lt;/p&gt;
&lt;p&gt;Transaction processing and data warehousing databases look similar, but the latter is optimized for analytics queries. They are both often accessible through a common SQL interface though.&lt;/p&gt;
&lt;p&gt;A number of SQL-on-Hadoop data warehouses have emerged such as Apache Hive, Spark SQL, Cloudera Impala etc.&lt;/p&gt;
&lt;h4 id=&quot;stars-and-snowflakes%3A-schemas-for-analytics&quot;&gt;Stars and Snowflakes: Schemas for Analytics &lt;a class=&quot;direct-link&quot; href=&quot;#stars-and-snowflakes%3A-schemas-for-analytics&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Many data warehouses are used in a fairly formulaic style: the &lt;em&gt;star schema&lt;/em&gt; (or dimensional modelling). The name &amp;quot;star schema&amp;quot; comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables (which represent the who, what, where, when, how, and why of each event); the connections to these tables are like the rays of a star.&lt;/p&gt;
&lt;p&gt;We also have the &lt;em&gt;snowflake schema,&lt;/em&gt; where dimensions are further broken down into subdimensions.&lt;/p&gt;
&lt;h3 id=&quot;column-oriented-storage&quot;&gt;Column Oriented Storage &lt;a class=&quot;direct-link&quot; href=&quot;#column-oriented-storage&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;In most OLTP databases, storage is laid out in a &lt;em&gt;row-oriented&lt;/em&gt; fashion: all the values from one row of a table are stored next to each other. Document databases are similar: an entire document is typically stored as one contiguous sequence of bytes.&lt;/p&gt;
&lt;p&gt;Analytics queries often access millions of rows, but few columns.&lt;/p&gt;
&lt;p&gt;The idea behind &lt;em&gt;column-oriented storage&lt;/em&gt; is straightforward: don&#39;t store all the values from one row together, but store all the values from each &lt;em&gt;column&lt;/em&gt; together instead. If each column is stored in a separate file, a query only needs to read and parse the columns that it is interested in, which can save work.&lt;/p&gt;
&lt;p&gt;The column-oriented storage layout relies on each column file containing the rows in the same order.&lt;/p&gt;
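&lt;p&gt;As a toy illustration (my own example with hypothetical data, not from the book), the same table can be laid out row-wise or column-wise; in the column layout, every column &#39;file&#39; keeps its values in the same row order:&lt;/p&gt;

```python
# Hypothetical three-row table.
rows = [
    {"id": 1, "product": "apple", "quantity": 10},
    {"id": 2, "product": "pear", "quantity": 5},
    {"id": 3, "product": "apple", "quantity": 7},
]

# Row-oriented layout: each row's values are stored together.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout: one "file" per column, all in the same row order.
column_store = {col: [r[col] for r in rows] for col in ["id", "product", "quantity"]}

# A query touching only one column reads only that column's file.
total_quantity = sum(column_store["quantity"])
print(total_quantity)  # 22
```

&lt;p&gt;Row &lt;em&gt;i&lt;/em&gt; of the table can always be reconstructed by taking the &lt;em&gt;i&lt;/em&gt;-th entry of every column file, which is exactly why the shared row order matters.&lt;/p&gt;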
&lt;h4 id=&quot;column-compression&quot;&gt;Column Compression &lt;a class=&quot;direct-link&quot; href=&quot;#column-compression&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;In addition to loading only the columns from disk that are required for a query, we can reduce the demands on disk throughput by compressing data. Column-oriented storage lends itself well to compression, and different compression techniques can be used, such as &lt;em&gt;bitmap encoding&lt;/em&gt;. The number of distinct values in a column is often small compared to the number of rows. Therefore, we can take a column with &lt;em&gt;n&lt;/em&gt; distinct values and turn it into &lt;em&gt;n&lt;/em&gt; separate bitmaps: one bitmap for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if not.&lt;/p&gt;
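&lt;p&gt;Here is a minimal sketch of bitmap encoding in Python (the column values are hypothetical, and the bitmaps are modelled as 0/1 lists rather than packed bits):&lt;/p&gt;

```python
def bitmap_encode(column):
    """Turn a column with few distinct values into one bitmap per value.

    Each bitmap has one entry per row: 1 if the row holds that value,
    0 otherwise.
    """
    bitmaps = {}
    for value in set(column):
        bitmaps[value] = [1 if v == value else 0 for v in column]
    return bitmaps

# Hypothetical product_sk column from a fact table.
column = [69, 69, 74, 31, 31, 31, 69]
bitmaps = bitmap_encode(column)
print(bitmaps[69])  # [1, 1, 0, 0, 0, 0, 1]

# A query like "WHERE product_sk IN (31, 69)" becomes a per-row
# bitwise OR of the two bitmaps (sketched here with max).
matches = [max(a, b) for a, b in zip(bitmaps[31], bitmaps[69])]
print(matches)  # [1, 1, 0, 1, 1, 1, 1]
```

&lt;p&gt;In a real system the bitmaps would be packed into machine words and often run-length encoded on top, so these OR/AND operations become very fast.&lt;/p&gt;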
&lt;p&gt;So, as I understand it, in column-oriented storage each column &#39;file&#39; is effectively one long sequence of values, with as many entries as there are rows in the equivalent row-wise table, and every column file holds those entries in the same row order.&lt;/p&gt;
&lt;h4 id=&quot;column-oriented-storage-and-column-families&quot;&gt;Column-oriented storage and column families &lt;a class=&quot;direct-link&quot; href=&quot;#column-oriented-storage-and-column-families&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Cassandra and Hbase have a concept of &lt;em&gt;column families&lt;/em&gt;, which differs from being column oriented. Within each column family, they store all the columns from a row together, along with a row key, and do not use column compression.&lt;/p&gt;
&lt;h4 id=&quot;sort-order-in-column-storage&quot;&gt;Sort Order in Column Storage &lt;a class=&quot;direct-link&quot; href=&quot;#sort-order-in-column-storage&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;It doesn&#39;t really matter in which order the rows are stored in a column store. It&#39;s easiest to store them in the order of insertion, but we can choose to impose an order.&lt;/p&gt;
&lt;p&gt;It won&#39;t make sense to sort each column individually though, because we&#39;d lose track of which values belong to the same row. Rather, we sort entire rows at a time, even though the data is stored by column. We can choose the &lt;em&gt;columns&lt;/em&gt; by which the table should be sorted.&lt;/p&gt;
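&lt;p&gt;A small sketch of this (my own illustration): compute the row permutation from the chosen sort key once, then apply that same permutation to every column file so the rows stay aligned:&lt;/p&gt;

```python
# Hypothetical column store: one list per column, same row order everywhere.
columns = {
    "date": ["2024-03-02", "2024-03-01", "2024-03-01"],
    "product": ["pear", "apple", "banana"],
    "quantity": [5, 10, 7],
}

# Choose a sort key (e.g. date first, then product), compute the row
# permutation once, and apply it to every column file.
n = len(columns["date"])
order = sorted(range(n), key=lambda i: (columns["date"][i], columns["product"][i]))
sorted_columns = {name: [values[i] for i in order] for name, values in columns.items()}

print(sorted_columns["product"])   # ['apple', 'banana', 'pear']
print(sorted_columns["quantity"])  # [10, 7, 5]
```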
&lt;p&gt;&lt;strong&gt;Several different sort orders&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf&quot;&gt;&lt;strong&gt;C-Store&lt;/strong&gt;&lt;/a&gt; provides an extension to sorting in column stores: different queries benefit from different sort orders, so why not store the same data sorted in several different ways?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Writing to Column-Oriented Storage&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Writes are more difficult with column-oriented storage. An update-in-place approach, like B-trees use, is not possible with compressed columns. To insert a row in the middle of a sorted table, you would likely have to rewrite all the column files.&lt;/p&gt;
&lt;p&gt;Fortunately, a good approach for writing has been discussed earlier: LSM-trees. All writes go to an in-memory store first, where they are added to a sorted structure and prepared for writing to disk. It doesn&#39;t matter whether the in-memory store is row-oriented or column-oriented.&lt;/p&gt;
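&lt;p&gt;A rough sketch of that write path (my own simplification, with invented names; a real LSM-backed column store is far more involved): new rows land in a sorted in-memory buffer, and a flush merges them into the sorted column files on disk:&lt;/p&gt;

```python
import bisect

class TinyColumnStoreWriter:
    """Illustrative LSM-style writes for a column store (not production code).

    New rows go into a sorted in-memory buffer; once the buffer is big
    enough, it is merged into the on-disk column files and cleared.
    """

    def __init__(self, column_names, flush_threshold=3):
        self.column_names = column_names
        self.buffer = []  # sorted list of row tuples (the in-memory store)
        self.disk = {name: [] for name in column_names}  # the "column files"
        self.flush_threshold = flush_threshold

    def insert(self, row):
        # Keep the buffer sorted as rows arrive.
        bisect.insort(self.buffer, tuple(row[name] for name in self.column_names))
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Merge buffered rows with existing rows, keeping the sort order,
        # then rewrite each column file in the new order.
        existing = list(zip(*(self.disk[n] for n in self.column_names)))
        merged = sorted(existing + self.buffer)
        for i, name in enumerate(self.column_names):
            self.disk[name] = [row[i] for row in merged]
        self.buffer = []

store = TinyColumnStoreWriter(["date", "product"], flush_threshold=2)
store.insert({"date": "2024-03-02", "product": "pear"})
store.insert({"date": "2024-03-01", "product": "apple"})  # triggers a flush
print(store.disk["date"])  # ['2024-03-01', '2024-03-02']
```

&lt;p&gt;Queries would need to examine both the disk data and the recent writes in the buffer, but that merging is hidden from the user.&lt;/p&gt;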
&lt;h1 id=&quot;further-reading%3A&quot;&gt;Further reading: &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading%3A&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h1&gt;
&lt;p&gt;Oracle internals: &lt;a href=&quot;https://stackoverflow.com/a/40740893&quot; title=&quot;https://stackoverflow.com/a/40740893&quot;&gt;https://stackoverflow.com/a/40740893&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;How indexes work: &lt;a href=&quot;https://stackoverflow.com/questions/1108/how-does-database-indexing-work&quot; title=&quot;https://stackoverflow.com/questions/1108/how-does-database-indexing-work&quot;&gt;https://stackoverflow.com/questions/1108/how-does-database-indexing-work&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;SQL Server indexes: &lt;a href=&quot;https://sqlity.net/en/2445/b-plus-tree/&quot; title=&quot;https://sqlity.net/en/2445/b-plus-tree/&quot;&gt;https://sqlity.net/en/2445/b-plus-tree/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;How data is stored on disk: &lt;a href=&quot;https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/&quot; title=&quot;https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/&quot;&gt;https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Last updated on 03-03-2021&lt;/em&gt;&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Chapter 2 - Data Models and Query Languages</title>
		<link href="https://timilearning.com/posts/ddia/part-one/chapter-2/"/>
		<updated>2019-12-07T16:20:45-00:00</updated>
		<id>https://timilearning.com/posts/ddia/part-one/chapter-2/</id>
		<content type="html">&lt;p&gt;These are my notes from the second chapter of Martin Kleppmann&#39;s Designing Data Intensive Applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#relational-model-versus-document-model&quot;&gt;Relational Model Versus Document Model&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#relational-versus-document-databases-today&quot;&gt;Relational Versus Document Databases Today&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#schema-flexibility-in-the-document-model&quot;&gt;Schema Flexibility in the document model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#data-locality-for-queries&quot;&gt;Data locality for queries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#convergence-of-document-and-relational-databases&quot;&gt;Convergence of document and relational databases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#query-languages-for-data&quot;&gt;Query Languages For Data&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#mapreduce-querying&quot;&gt;MapReduce Querying&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#graph-like-data-models&quot;&gt;Graph-Like Data Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This chapter surveys a couple of data models for storing and querying data, as well as different query languages.&lt;/p&gt;
&lt;h2 id=&quot;relational-model-versus-document-model&quot;&gt;Relational Model Versus Document Model &lt;a class=&quot;direct-link&quot; href=&quot;#relational-model-versus-document-model&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;SQL is the best-known data model today. Two of the earlier competitors of this model were the &lt;em&gt;network model&lt;/em&gt; and the &lt;em&gt;hierarchical&lt;/em&gt; model. NoSQL is the latest attempt to overthrow SQL&#39;s dominance. Some of the driving forces behind the adoption of NoSQL are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput.&lt;/li&gt;
&lt;li&gt;Preference for free &amp;amp; open source software over commercial DB products.&lt;/li&gt;
&lt;li&gt;Some specialized query operations not well supported by the relational model.&lt;/li&gt;
&lt;li&gt;Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;NoSQL Databases are in two forms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Document databases&lt;/strong&gt;: Targets use cases where data comes in self-contained documents and relationships between one document and another are rare.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graph databases:&lt;/strong&gt; These go in the opposite direction: they target use cases where anything is potentially related to everything.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Impedance Mismatch&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;An awkward translation layer is required between objects stored in relational tables and the application code. The disconnect between the models is called an &lt;em&gt;impedance mismatch.&lt;/em&gt; ORMs usually help with this but they don&#39;t hide all the differences between both models.&lt;/p&gt;
&lt;p&gt;Some developers feel that the JSON model reduces impedance mismatch between the application layer and the database layer.&lt;/p&gt;
&lt;p&gt;While relational models refer to a related item by a unique identifier called the &lt;em&gt;foreign key,&lt;/em&gt; document models use the &lt;em&gt;document reference.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&quot;relational-versus-document-databases-today&quot;&gt;Relational versus document databases today &lt;a class=&quot;direct-link&quot; href=&quot;#relational-versus-document-databases-today&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Both data models have arguments going for them. The relational model provides better support for joins, and for many-to-one and many-to-many relationships.&lt;/p&gt;
&lt;p&gt;The document data model has the advantages of schema flexibility, better performance due to locality, and for some applications, it&#39;s closer to the data structures used by the application.&lt;/p&gt;
&lt;p&gt;If the data in the application has a document-like structure (i.e. a tree of one-to-many relationships, where typically the entire tree is loaded at once), the document model is a good idea. Document databases typically have poor support for many-to-many relationships. In an analytics application, for example, many-to-many relationships may never be needed.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Basically, for highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models are the most natural.&lt;/em&gt;&lt;/p&gt;
&lt;h4 id=&quot;schema-flexibility-in-the-document-model&quot;&gt;Schema flexibility in the document model &lt;a class=&quot;direct-link&quot; href=&quot;#schema-flexibility-in-the-document-model&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Document databases have the advantage of an implicit schema that is not enforced by the database, also known as &lt;em&gt;schema-on-read&lt;/em&gt;: the structure of the data is implicit and only interpreted when the data is read. This contrasts with &lt;em&gt;schema-on-write&lt;/em&gt;, where the schema is explicit and the database ensures all written data conforms to it.&lt;/p&gt;
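&lt;p&gt;A toy sketch of schema-on-read (my own example with made-up documents): the database stores whatever shape it is given, and the application code interprets the structure at read time:&lt;/p&gt;

```python
# Two documents written at different times, with no schema enforced on write.
docs = [
    {"user_id": 1, "name": "Ada Lovelace"},                       # old format
    {"user_id": 2, "first_name": "Alan", "last_name": "Turing"},  # new format
]

def first_name(doc):
    """Schema-on-read: the structure is interpreted when the data is read."""
    if "first_name" in doc:
        return doc["first_name"]
    # Fall back to the old single-field format.
    return doc["name"].split(" ")[0]

print([first_name(d) for d in docs])  # ['Ada', 'Alan']
```

&lt;p&gt;With schema-on-write, the same format change would instead require a migration before the new shape could be stored at all.&lt;/p&gt;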
&lt;h4 id=&quot;data-locality-for-queries&quot;&gt;Data locality for queries &lt;a class=&quot;direct-link&quot; href=&quot;#data-locality-for-queries&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;There is a performance advantage to data locality in the document model if you need to access large parts of the document at the same time. Compared with many relational databases, where data is usually spread across multiple tables, this requires fewer disk seeks and takes less time.&lt;/p&gt;
&lt;p&gt;There are a couple of tools nowadays that offer this locality in a relational model, e.g. Google Spanner, Oracle (multi-table index cluster tables), and in a BigTable model, e.g. Cassandra and HBase (column-family concept).&lt;/p&gt;
&lt;h4 id=&quot;convergence-of-document-and-relational-databases&quot;&gt;Convergence of document and relational databases &lt;a class=&quot;direct-link&quot; href=&quot;#convergence-of-document-and-relational-databases&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Relational databases like PostgreSQL, MySQL and IBM DB2 have added support for JSON documents in recent times.&lt;/p&gt;
&lt;p&gt;Document databases like RethinkDB also support relational-like joins in their query languages.&lt;/p&gt;
&lt;p&gt;These models are becoming more similar over time.&lt;/p&gt;
&lt;h2 id=&quot;query-languages-for-data&quot;&gt;Query Languages For Data &lt;a class=&quot;direct-link&quot; href=&quot;#query-languages-for-data&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;There is a distinction here between imperative and declarative query languages.&lt;/p&gt;
&lt;p&gt;An imperative language tells the computer to perform certain operations in a certain order. A declarative language specifies the pattern of the data wanted, and how the data should be transformed, but not &lt;em&gt;how&lt;/em&gt; to achieve that goal.&lt;/p&gt;
&lt;p&gt;An advantage of this declarative approach to query languages is that it hides the implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;HTML and CSS are also declarative languages.&lt;/em&gt;&lt;/p&gt;
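&lt;p&gt;To make the contrast concrete (a toy example of mine, loosely in the spirit of the book&#39;s sharks example): imperative code spells out exactly how to walk the data, while a declarative query only states the pattern of data wanted:&lt;/p&gt;

```python
animals = [
    {"name": "shark", "family": "Lamniformes"},
    {"name": "ostrich", "family": "Struthioniformes"},
    {"name": "mako", "family": "Lamniformes"},
]

# Imperative: say exactly how to iterate, in what order, building the result.
sharks = []
for animal in animals:
    if animal["family"] == "Lamniformes":
        sharks.append(animal["name"])

# Declarative in spirit: state the pattern wanted and leave the "how" to
# the engine. The equivalent SQL would be:
#   SELECT name FROM animals WHERE family = 'Lamniformes';
sharks_declarative = [a["name"] for a in animals if a["family"] == "Lamniformes"]

print(sharks == sharks_declarative)  # True
```

&lt;p&gt;The SQL engine is free to pick an index, reorder operations, or parallelize the scan - none of which the query text needs to change for.&lt;/p&gt;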
&lt;h3 id=&quot;mapreduce-querying&quot;&gt;MapReduce Querying &lt;a class=&quot;direct-link&quot; href=&quot;#mapreduce-querying&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;MapReduce is a paradigm for querying large amounts of data in bulk across many machines. Basically, map jobs are run on all the machines in parallel, and then the results are reduced.&lt;/p&gt;
&lt;p&gt;It&#39;s neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework.&lt;/p&gt;
&lt;p&gt;Both the &lt;em&gt;map&lt;/em&gt; and the &lt;em&gt;reduce&lt;/em&gt; functions must be pure functions: they only use the data passed to them as input, they cannot perform additional database queries, and they must not have any side effects.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;MongoDB has implemented aggregation pipelines, which are similar to MapReduce jobs.&lt;/em&gt;&lt;/p&gt;
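&lt;p&gt;A single-machine toy sketch of the paradigm (my own illustration, with hypothetical observation records; real MapReduce distributes the map and reduce steps across many machines):&lt;/p&gt;

```python
from collections import defaultdict

# Pure map function: emits (key, value) pairs from one record,
# without touching anything other than its input.
def map_observation(record):
    month = record["date"][:7]  # e.g. '1995-12'
    yield (month, record["num_animals"])

# Pure reduce function: combines all values emitted for one key.
def reduce_counts(key, values):
    return (key, sum(values))

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:              # the "map" phase
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())  # the "reduce" phase

observations = [
    {"date": "1995-12-25", "num_animals": 3},
    {"date": "1995-12-12", "num_animals": 4},
    {"date": "1996-01-02", "num_animals": 1},
]
print(map_reduce(observations, map_observation, reduce_counts))
# {'1995-12': 7, '1996-01': 1}
```

&lt;p&gt;Because both functions are pure, the framework can run them anywhere, in any order, and retry them on failure.&lt;/p&gt;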
&lt;h2 id=&quot;graph-like-data-models&quot;&gt;Graph-Like Data Models &lt;a class=&quot;direct-link&quot; href=&quot;#graph-like-data-models&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;If many-to-many relationships are common, the Graph model is probably best suited for it.&lt;/p&gt;
&lt;p&gt;Graphs are not limited to homogeneous data. Facebook, for example, maintains a single graph with many different types of vertices and edges. Its vertices represent people, locations, events, check-ins, and comments made by users. Its edges indicate which people are friends with each other, which check-in happened in which location, who commented on what, who attended which event, etc.&lt;/p&gt;
&lt;p&gt;There are a couple of different but related ways for representing graphs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Property graph model: Neo4j, Titan and InfiniteGraph&lt;/li&gt;
&lt;li&gt;Triple-store model: Datomic, AllegroGraph&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some declarative languages for querying graphs are: Cypher, SPARQL and Datalog.&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Chapter 1 - Reliable, Scalable and Maintainable Applications</title>
		<link href="https://timilearning.com/posts/ddia/part-one/chapter-1/"/>
		<updated>2019-12-07T16:18:56-00:00</updated>
		<id>https://timilearning.com/posts/ddia/part-one/chapter-1/</id>
		<content type="html">&lt;p&gt;These are my notes from the first chapter of Martin Kleppmann&#39;s Designing Data Intensive Applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#reliability&quot;&gt;Reliability&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#hardware-faults&quot;&gt;Hardware Faults&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#software-errors&quot;&gt;Software Errors&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#human-errors&quot;&gt;Human Errors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#scalability&quot;&gt;Scalability&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#describing-load&quot;&gt;Describing Load&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#maintainability&quot;&gt;Maintainability&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Three important concerns in most software systems are reliability, scalability, and maintainability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reliability:&lt;/strong&gt; The system should work correctly (performing the correct function at the desired level of performance) even in the face of adversity.
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; As the system grows (in data, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maintainability:&lt;/strong&gt; People should be able to work on the system productively in the future.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;reliability&quot;&gt;Reliability &lt;a class=&quot;direct-link&quot; href=&quot;#reliability&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Typical expectations for software to be termed as reliable are that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The application performs as expected.&lt;/li&gt;
&lt;li&gt;It can tolerate user mistakes or unexpected usage of the product.&lt;/li&gt;
&lt;li&gt;Performance is good enough for the required use case, under the expected load and data volume.&lt;/li&gt;
&lt;li&gt;The system prevents any unauthorized access and abuse.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Basically, a reliable system is fault-tolerant or resilient. A fault is different from a failure: a fault is one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.&lt;/p&gt;
&lt;p&gt;It&#39;s impossible to reduce the probability of faults to zero, thus, it&#39;s useful to design fault-tolerance mechanisms that prevent faults from causing failures.&lt;/p&gt;
&lt;h4 id=&quot;hardware-faults&quot;&gt;Hardware Faults &lt;a class=&quot;direct-link&quot; href=&quot;#hardware-faults&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Traditionally, the standard approach for dealing with hardware faults is to add redundancy to the individual hardware components so that if one fails, it can be replaced: &lt;a href=&quot;https://en.wikipedia.org/wiki/Standard_RAID_levels&quot;&gt;RAID configuration&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As data volumes and applications&#39; computing demands have increased, there&#39;s been a shift towards using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy. This offers a number of advantages. For example, a single-server system requires planned downtime if the machine needs to be rebooted (e.g. to apply operating system security patches), but a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade).&lt;/p&gt;
&lt;h4 id=&quot;software-errors&quot;&gt;Software Errors &lt;a class=&quot;direct-link&quot; href=&quot;#software-errors&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Software failures are more correlated than hardware failures: a fault in one node is likely to trigger failures across many other parts of the system, whereas hardware faults tend to strike independently.&lt;/p&gt;
&lt;p&gt;There is no quick solution to the problem of faults in software, but a few things can help:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Thorough testing&lt;/li&gt;
&lt;li&gt;Process isolation&lt;/li&gt;
&lt;li&gt;Measuring, monitoring, and analyzing system behavior in production.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;human-errors&quot;&gt;Human Errors &lt;a class=&quot;direct-link&quot; href=&quot;#human-errors&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Humans are known to be unreliable. How do we make systems reliable, in spite of unreliable humans? Through a combination of several approaches:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Designing systems in a way that minimizes opportunities for error, through well-designed abstractions, APIs, and admin interfaces.
&lt;/li&gt;
&lt;li&gt;Decoupling the places where people make the most mistakes from the places where they can cause failures. E.g. by providing a fully-featured non-production sandbox environment where people can explore and experiment safely, using real data, without affecting real users.
&lt;/li&gt;
&lt;li&gt;Testing thoroughly at all levels: from unit tests to whole-system integration tests and manual tests.
&lt;/li&gt;
&lt;li&gt;Allowing quick and easy recovery from human errors, to minimize the impact of failure, e.g. by making it easy to roll back configuration changes and to roll out new code gradually (so bugs affect only a small subset of users).
&lt;/li&gt;
&lt;li&gt;Setting up detailed and clear monitoring, such as performance metrics and error rates.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We have a responsibility to our users, and hence reliability is very important.&lt;/p&gt;
&lt;h3 id=&quot;scalability&quot;&gt;Scalability &lt;a class=&quot;direct-link&quot; href=&quot;#scalability&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Scalability describes the ability to cope with increased load. It is meaningless to say &amp;quot;X is scalable&amp;quot; or &amp;quot;Y doesn&#39;t scale&amp;quot;. Discussing scalability is about answering the question of &amp;quot;If the system grows in a particular way, what are our options for coping with the growth?&amp;quot; and &amp;quot;How can we add computing resources to handle the additional load?&amp;quot;&lt;/p&gt;
&lt;h4 id=&quot;describing-load&quot;&gt;Describing Load &lt;a class=&quot;direct-link&quot; href=&quot;#describing-load&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Load can be described by the &lt;em&gt;load parameters.&lt;/em&gt; The choice of parameters depends on the system architecture. It may be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requests per second to a web server&lt;/li&gt;
&lt;li&gt;Ratio of reads to writes in a database&lt;/li&gt;
&lt;li&gt;Number of simultaneously active users in a chat room&lt;/li&gt;
&lt;li&gt;Hit rate on a cache.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Twitter Case-study&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Twitter has two main operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Posting a tweet: 4.6k requests/sec on average, 12k requests/sec at peak.&lt;/li&gt;
&lt;li&gt;Home timeline: A user can view tweets posted by the people they follow (300k requests/sec)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(&lt;em&gt;Based on data published in 2012:&lt;/em&gt; &lt;a href=&quot;https://www.infoq.com/presentations/Twitter-Timeline-Scalability/&quot; title=&quot;https://www.infoq.com/presentations/Twitter-Timeline-Scalability/&quot;&gt;https://www.infoq.com/presentations/Twitter-Timeline-Scalability/&lt;/a&gt;)&lt;/p&gt;
&lt;p&gt;Twitter&#39;s scaling challenge is primarily due to &lt;em&gt;fan-out&lt;/em&gt;. In electronics, fan-out is the number of input gates that are attached to another gate&#39;s output. For Twitter, each user follows many people, and each user is followed by many people. What is an optimal way of loading all the tweets posted by the people a user follows? Two approaches are possible:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Posting a tweet inserts the new tweet into a global collection of tweets. This collection could be a relational database table, which could then have billions of rows - &lt;em&gt;not ideal.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Maintain a cache for each user&#39;s home timeline - like a mailbox of tweets for each recipient user. When a user posts a tweet, look up all the people who follow the user, and insert the new tweet into each of their home timeline caches. Reading the home timeline is cheap.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Twitter used the first approach initially but they now use approach 2. The downside to approach 2 is that posting a tweet requires a lot of extra work.&lt;/p&gt;
&lt;p&gt;For Twitter, the distribution of followers per user (possibly weighted by how often those users tweet) is a key load parameter for discussing scalability, since it determines the fan-out load.&lt;/p&gt;
&lt;p&gt;Note that Twitter now implements a hybrid of both approaches. For most users, tweets continue to be fanned out to home timelines at the time when they are posted. However, a small number of users with millions of followers (celebrities) are exempted from the fan-out.&lt;/p&gt;
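&lt;p&gt;Approach 2 (fan-out on write) can be sketched in a few lines (a toy model of mine, with invented users; the real system shards these caches across many machines): posting does the expensive work once, so reading a timeline becomes a cheap cache lookup:&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical follower graph: author -> list of followers.
followers = {"alice": ["bob", "carol"], "bob": ["carol"]}
timelines = defaultdict(list)  # per-user home timeline cache ("mailbox")

def post_tweet(author, text):
    # Fan-out on write: insert the tweet into every follower's cache now.
    for follower in followers.get(author, []):
        timelines[follower].append((author, text))

def home_timeline(user):
    # Reading the home timeline is a cheap lookup.
    return timelines[user]

post_tweet("alice", "hello")
post_tweet("bob", "hi")
print(home_timeline("carol"))  # [('alice', 'hello'), ('bob', 'hi')]
```

&lt;p&gt;The cost asymmetry is visible here: &lt;code&gt;post_tweet&lt;/code&gt; does work proportional to the author&#39;s follower count, which is exactly why celebrities are exempted from the fan-out.&lt;/p&gt;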
&lt;p&gt;&lt;strong&gt;Describing Performance&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Once the load on the system has been described, you can investigate what happens when the load increases in the following two ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When you increase a load parameter and keep system resources (CPU, memory, network, bandwidth, etc.) unchanged, how is the performance of the system affected?&lt;/li&gt;
&lt;li&gt;When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both questions require numbers.&lt;/p&gt;
&lt;p&gt;For batch processing systems like Hadoop, we care about &lt;em&gt;throughput&lt;/em&gt; - the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. For online systems, we care about &lt;em&gt;response time&lt;/em&gt; - the time between a client sending a request and receiving a response.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Aside&lt;/em&gt; &lt;strong&gt;-&lt;/strong&gt; &lt;em&gt;Latency vs response time:&lt;/em&gt; These two words are often used as synonyms, but they are not the same. Response time is what the client sees: besides the actual time to process a request (the &lt;em&gt;service time&lt;/em&gt;), it includes network delays and queuing delays. Latency is the duration that a request is waiting to be handled - during which it is &lt;em&gt;latent&lt;/em&gt;, awaiting service.&lt;/p&gt;
&lt;p&gt;Response time can vary a lot, so it&#39;s important to think of response time not as a single value, but as a distribution of values. Even in a scenario where you&#39;d expect all requests to take the same time, you get variation: random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, etc.&lt;/p&gt;
&lt;p&gt;People often report the average response time of a service, but it is not a very good metric if you want to know your &amp;quot;typical&amp;quot; response time, because it doesn&#39;t tell you how many users actually experienced that delay. A better approach is to use &lt;em&gt;percentiles.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Percentiles&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you take your list of response times and sort from fastest to slowest, then the &lt;em&gt;median&lt;/em&gt; is the halfway point: e.g. if median response time is 200ms, it means half the requests return in less than 200ms and half take longer than that. Thus, median is a good metric if you want to know how long users typically wait. Median is known as the &lt;em&gt;50th percentile,&lt;/em&gt; and sometimes abbreviated as &lt;em&gt;p50.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Occasionally slow requests are known as outliers. To figure out how bad they are, you look at higher percentiles: the &lt;em&gt;95th, 99th, and 99.9th&lt;/em&gt; percentiles are common. These are the response time thresholds that 95%, 99%, or 99.9% of requests are faster than. For example, if the 95th percentile response time is 1.5 seconds, 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 take 1.5 seconds or more.&lt;/p&gt;
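&lt;p&gt;A quick nearest-rank sketch of computing percentiles from a window of response times (my own example with made-up measurements in milliseconds; production systems use streaming quantile estimators rather than sorting everything):&lt;/p&gt;

```python
def percentile(response_times_ms, p):
    """Nearest-rank percentile: the value that p% of requests are at or below.

    A rough sketch, not a production quantile estimator.
    """
    ordered = sorted(response_times_ms)
    rank = -(-p * len(ordered) // 100)  # ceiling of p/100 * n
    return ordered[max(rank - 1, 0)]

# Hypothetical response times for ten requests.
times = [35, 40, 42, 50, 51, 60, 80, 120, 200, 1500]
print(percentile(times, 50))  # 51   (the median: half the requests are faster)
print(percentile(times, 95))  # 1500 (the one outlier dominates the tail)
```

&lt;p&gt;Notice how the mean of this list (about 218 ms) describes almost nobody&#39;s experience, while p50 and p95 tell a much clearer story.&lt;/p&gt;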
&lt;p&gt;High percentiles of response times (also known as &lt;em&gt;tail latencies&lt;/em&gt;) are important because they directly affect a user&#39;s experience.&lt;/p&gt;
&lt;p&gt;Percentiles are often used in SLOs (Service Level Objectives) and SLAs (Service Level Agreements). They set the expectations of a user, and a refund may be demanded if the expectation is not met.&lt;/p&gt;
&lt;p&gt;A large part of the response time at high percentiles can be accounted for by queuing delays, which refers to how a number of slow requests on the server-side can hold up the processing of subsequent requests. This effect is called &lt;em&gt;Head-of-Line blocking.&lt;/em&gt; Those requests may be fast to process on the server, but the client will see a slow overall response time due to the time waiting for the prior request to complete. &lt;em&gt;This is why it is important to measure response times on the client side.&lt;/em&gt; Basically, requests could be fast individually but one slow request could slow down all the other requests.&lt;/p&gt;
&lt;p&gt;It takes just one slow call to make the entire end-user request slow, an effect known as tail latency amplification.&lt;/p&gt;
&lt;p&gt;If you want to monitor response times for a service on a dashboard, you need to monitor it on an ongoing basis. A good idea is to keep a rolling window of response times of requests in the last 10 minutes. So there could be a graph of the median and various percentiles over that window.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; Averaging percentiles is useless; the right way of aggregating response time data is by adding histograms.&lt;/p&gt;
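&lt;p&gt;To sketch what &amp;quot;adding histograms&amp;quot; means (my own toy example with invented measurements): each machine buckets its response times, and aggregation is then just a per-bucket sum of counts, from which any percentile can be re-derived:&lt;/p&gt;

```python
from collections import Counter

def to_histogram(times_ms, bucket_ms=100):
    """Bucket response times; histograms from different machines can then
    be aggregated by simply adding their per-bucket counts."""
    return Counter((t // bucket_ms) * bucket_ms for t in times_ms)

# Hypothetical per-server measurements.
server_a = to_histogram([30, 40, 250])
server_b = to_histogram([35, 900])

combined = server_a + server_b  # adding histograms: per-bucket counts sum
print(sorted(combined.items()))  # [(0, 3), (200, 1), (900, 1)]
```

&lt;p&gt;Averaging each server&#39;s p99 would weight both machines equally regardless of traffic; summing the histograms preserves every request&#39;s contribution.&lt;/p&gt;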
&lt;p&gt;&lt;strong&gt;Approaches for Coping with Load&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. Different terms that come up for dealing with this include vertical scaling, which is moving to a more powerful machine, and  scaling out/horizontal scaling, which involves distributing the load across multiple smaller machines. Distributing load across multiple machines is also known as &lt;em&gt;shared-nothing&lt;/em&gt; architecture.&lt;/p&gt;
&lt;p&gt;In practice, the choice is less of a dichotomy and more of a pragmatic mixture of both approaches.&lt;/p&gt;
&lt;p&gt;There&#39;s no generic, one-size-fits-all approach to the architecture of large-scale data systems; it&#39;s usually highly specific to the application.&lt;/p&gt;
&lt;p&gt;An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare: the load parameters.&lt;/p&gt;
&lt;p&gt;Note that though they are specific to a particular application, scalable architectures are nevertheless built from general-purpose building blocks, arranged in familiar patterns.&lt;/p&gt;
&lt;h3 id=&quot;maintainability&quot;&gt;Maintainability &lt;a class=&quot;direct-link&quot; href=&quot;#maintainability&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;We should design software in a way that minimizes pain during maintenance, so that we avoid creating legacy software ourselves. Three design principles for software systems are:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Operability:&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Make it easy for operations teams to keep the system running smoothly.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Simplicity&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Evolvability&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operability: Making Life Easy For Operations&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Operations teams are vital to keeping a software system running smoothly. A system is said to have good operability if it makes routine tasks easy, allowing the operations team to focus its efforts on high-value activities. Data systems can make routine tasks easy by, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Providing visibility into the runtime behavior and internals of the system, with good monitoring.&lt;/li&gt;
&lt;li&gt;Providing good support for automation and integration with standard tools.&lt;/li&gt;
&lt;li&gt;Providing good documentation and an easy-to-understand operational model (&amp;quot;If I do X, Y will happen&amp;quot;).&lt;/li&gt;
&lt;li&gt;Self-healing where appropriate, but also giving administrators manual control over the system state when needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Simplicity: Managing Complexity&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Reducing complexity improves software maintainability, which is why simplicity should be a key goal for the systems we build.&lt;/p&gt;
&lt;p&gt;This does not necessarily mean reducing the functionality of a system; it can also mean removing &lt;em&gt;accidental&lt;/em&gt; complexity. Complexity is accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Abstraction&lt;/em&gt; is one of the best tools that we have for dealing with accidental complexity. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Evolvability: Making Change Easy&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;System requirements change constantly and we must ensure that we&#39;re able to deal with those changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;An application has to meet functional and non-functional requirements.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;Functional -&lt;/em&gt;&lt;/strong&gt; What it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;em&gt;Nonfunctional -&lt;/em&gt;&lt;/strong&gt; General properties like security, reliability, compliance, scalability, compatibility, and maintainability.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Reliability&lt;/strong&gt; means making systems work correctly, even when faults occur.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt; means having strategies for keeping performance good, even when load increases.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Maintainability&lt;/strong&gt; is in essence about making life better for the engineering and operations teams who need to work with the system.&lt;/p&gt;
</content>
	</entry>
	
	<entry>
		<title>Learning Diary: Designing Data Intensive Applications by Martin Kleppmann</title>
		<link href="https://timilearning.com/posts/ddia/notes/"/>
		<updated>2019-12-07T15:44:17-00:00</updated>
		<id>https://timilearning.com/posts/ddia/notes/</id>
		<content type="html">&lt;h3 id=&quot;background&quot;&gt;Background &lt;a class=&quot;direct-link&quot; href=&quot;#background&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;I tend to read a technical book twice before I can convince myself that I&#39;ve actually read the book. The first time is typically during my commute to work, and the second time is when I&#39;m home and try to take notes from the book.&lt;/p&gt;
&lt;p&gt;This post is to share the notes I&#39;ve taken while reading Martin Kleppmann&#39;s book: &lt;a href=&quot;http://dataintensive.net/&quot;&gt;Designing Data-Intensive Applications.&lt;/a&gt; This was inspired by &lt;a href=&quot;https://jasdev.me/notes/&quot;&gt;Jasdev&lt;/a&gt;&#39;s attempt to learn in public and I hope that you can learn a thing or two from my notes.&lt;/p&gt;
&lt;p&gt;A few disclaimers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This is not meant to be a substitute for the book, and you will rob yourself of a lot of useful knowledge if you use these notes as a replacement for the book.&lt;/li&gt;
&lt;li&gt;The first seven chapter notes I&#39;ll share were written without a public audience in mind. I had no intention of sharing these notes at the beginning, so you might find some mistakes. I&#39;ve tried to get rid of those, but it&#39;s possible I missed some. If you find any that you want to correct, please reach out via email.&lt;/li&gt;
&lt;li&gt;Lastly, these posts will differ from the more expository deep-dive posts like &lt;a href=&quot;https://timilearning.com/posts/data-storage-on-disk/part-one/&quot;&gt;this one&lt;/a&gt; that I&#39;ll also post on this blog. If you manage to gain from my learning diary, that&#39;s great! If you come across a topic that you want to learn more about, you can leverage the numerous resources online about that topic, or reach out to me to write about it.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Part One - Foundations of Data Systems&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-one/chapter-1&quot;&gt;Chapter 1 - Reliable, Scalable and Maintainable Applications.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-one/chapter-2&quot;&gt;Chapter 2 - Data Models and Query Languages.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-one/chapter-3&quot;&gt;Chapter 3 - Storage and Retrieval.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-one/chapter-4&quot;&gt;Chapter 4 - Encoding and Evolution.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Part Two - Distributed Data&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-5&quot;&gt;Chapter 5 - Replication&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-6&quot;&gt;Chapter 6 - Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-7&quot;&gt;Chapter 7 - Transactions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-8&quot;&gt;Chapter 8 - The Trouble with Distributed Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-1&quot;&gt;Chapter 9 - Consistency and Consensus (Part One)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://timilearning.com/posts/ddia/part-two/chapter-9-2&quot;&gt;Chapter 9 - Consistency and Consensus (Part Two)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	
	<entry>
		<title>Data Storage on Your Computer&#39;s Disk - Part 1</title>
		<link href="https://timilearning.com/posts/data-storage-on-disk/part-one/"/>
		<updated>2019-10-22T21:36:52-00:00</updated>
		<id>https://timilearning.com/posts/data-storage-on-disk/part-one/</id>
		<content type="html">&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#background&quot;&gt;Background&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#a-bit-about-data-storage-on-disk&quot;&gt;A bit about data storage on disk&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#records&quot;&gt;Records&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#pages&quot;&gt;Pages&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#writeback&quot;&gt;Writeback&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-ahead-logs&quot;&gt;Write-ahead Logs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#transactions&quot;&gt;Transactions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion%2Fnext-steps&quot;&gt;Conclusion/Next Steps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#open-question-i-still-have&quot;&gt;Open Question I Still Have&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#further-reading&quot;&gt;Further Reading&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#general-overview-of-distributed-systems&quot;&gt;General Overview of Distributed Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#records-%26-pages&quot;&gt;Records &amp;amp; Pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#write-ahead-logs-1&quot;&gt;Write-ahead Logs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;background&quot;&gt;Background &lt;a class=&quot;direct-link&quot; href=&quot;#background&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;You may skip this story and go straight to the main post.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I&#39;ve been reading Martin Kleppmann&#39;s great book on &lt;a href=&quot;http://dataintensive.net/&quot;&gt;&amp;quot;Designing Data-Intensive Applications&amp;quot;&lt;/a&gt; for some time now.&lt;/p&gt;
&lt;p&gt;In the third chapter of the book which deals with the storage and retrieval of data, Martin explains a lot about what database indexes are, how they work, the different types of indexing structures we have, and so on. I found this interesting to read, but I kept wondering &lt;em&gt;how&lt;/em&gt; or &lt;em&gt;if&lt;/em&gt; those database indexes differed from the underlying data being stored. He did a good job of explaining this, but I found it difficult to visualize how these indexing structures are actually laid out on disk with data.&lt;/p&gt;
&lt;p&gt;This led me on a journey of naive Google searches like: &amp;quot;&lt;em&gt;is the lsm tree an index or a storage engine&lt;/em&gt;&amp;quot; and &amp;quot;&lt;em&gt;is the b tree simply an index&lt;/em&gt;&amp;quot;. In the end, I think I have a decent understanding of how database indexes are related to the data they store, and of how both get represented on disk. I aim to detail what I have learned in this series of posts.&lt;/p&gt;
&lt;p&gt;This post is an introduction and will focus on how data is represented on your computer&#39;s disk without indexes.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;Starting off with a few disclaimers:&lt;/em&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A lot of the concepts I&#39;ll discuss here are based on my knowledge of how SQLServer works. Don&#39;t be discouraged if you haven&#39;t used SQLServer though, I haven&#39;t either! I just found that the resources for SQLServer were more accessible to me than any other set of resources. My guess (and hope) is that though some of the implementation details may differ from other Relational Database Management Systems (think MySQL, PostgreSQL, etc), the underlying concepts are similar.&lt;/li&gt;
&lt;li&gt;I&#39;m no expert in any of these topics, so please reach out to me via the feedback form if you spot any errors in my understanding that you would like to correct.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&quot;a-bit-about-data-storage-on-disk&quot;&gt;A bit about data storage on disk &lt;a class=&quot;direct-link&quot; href=&quot;#a-bit-about-data-storage-on-disk&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I&#39;ll be using some words repeatedly in this section so I thought it&#39;ll be a good idea to list them here as a reference. I&#39;ll explain how they fit together in more detail later on, so stay with me!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Disk I/O Operations:&lt;/strong&gt; These are read and write operations that involve accessing your computer&#39;s physical disk.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Records:&lt;/strong&gt; A row in your database table maps to a &lt;em&gt;record&lt;/em&gt; on your computer&#39;s disk. Records can be of various sizes depending on the data contained in a row. There are different types of records, such as &lt;em&gt;data records&lt;/em&gt; - say for storing an individual book in a database of books, &lt;em&gt;index records,&lt;/em&gt; and records for other metadata about the database.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pages:&lt;/strong&gt; A &lt;em&gt;page&lt;/em&gt; consists of many records. Pages have a fixed size which can be 4KB, 8KB, 16KB, etc. Fixed here means that a 4KB page cannot be expanded to fit more data when full. SQLServer pages are typically 8KB in size. Two of the most popular page types are &lt;em&gt;Data pages&lt;/em&gt; and &lt;em&gt;Index pages.&lt;/em&gt; As you can imagine, data pages store data records and index pages store index records. We&#39;ll learn more about these page types in this series.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;records&quot;&gt;Records &lt;a class=&quot;direct-link&quot; href=&quot;#records&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When you add a new row to a database table using SQL, that row is converted internally to a &lt;em&gt;record&lt;/em&gt;. Think of a record as an array of bytes, where each byte or group of bytes stores information about the record. The information stored in a group of bytes could be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The type of the record: data, index, or other metadata.&lt;/li&gt;
&lt;li&gt;Whether the inserted row has columns with null values, and whether its columns have fixed-length or variable-length data types.&lt;/li&gt;
&lt;li&gt;Where each column&#39;s data starts and ends in the array.&lt;/li&gt;
&lt;li&gt;The actual data encoded in bytes, divided into separate sections depending on whether the data type used for storing the data has a fixed length or a variable length.&lt;/li&gt;
&lt;/ul&gt;
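&lt;p&gt;As a rough sketch of the layout described above, the following Python packs a row into a byte array with a record-type byte, a null bitmap, and per-column offsets. This is purely illustrative and is not SQLServer&#39;s actual record format:&lt;/p&gt;

```python
# A toy record format, loosely modelled on the layout described above:
#   [1 byte record type][1 byte null bitmap]
#   [2 bytes per column: offset where that column's data starts][data...]

def pack_record(record_type, columns):
    """columns: a list of byte strings, or None for a null column."""
    null_bitmap = 0
    offsets = []
    data = b""
    header_size = 2 + 2 * len(columns)
    for i, col in enumerate(columns):
        if col is None:
            null_bitmap |= 2 ** i        # mark column i as null
            offsets.append(0)
        else:
            offsets.append(header_size + len(data))
            data += col
    header = bytes([record_type, null_bitmap])
    header += b"".join(o.to_bytes(2, "little") for o in offsets)
    return header + data

# A book row: (title, subtitle=null, year encoded as 4 little-endian bytes).
rec = pack_record(1, [b"The Hobbit", None, (1937).to_bytes(4, "little")])
print(len(rec))   # 8-byte header plus 14 bytes of column data: 22
```

&lt;p&gt;The offset array is what makes it cheap to jump straight to a particular column&#39;s data without scanning the whole record.&lt;/p&gt;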
&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://timilearning.com/uploads/record-2.png&quot; alt=&quot;Inexact representation of a database record&quot;&gt;
&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
Figure 1 - Inexact representation of a database record.
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Notes:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A more precise representation of what a record looks like can be found &lt;a href=&quot;https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/&quot;&gt;here&lt;/a&gt;. The purpose of the diagram above is to help visualize the structure on a high level.&lt;/li&gt;
&lt;li&gt;The &amp;quot;Information about columns...&amp;quot; here typically includes: &lt;ol type=&quot;a&quot; style=&quot;margin-top:0px&quot;&gt; &lt;li&gt; The number of columns that match the condition i.e. fixed length, null or variable length.&lt;/li&gt; &lt;li&gt;The positions of those columns; where those columns start and end. This makes it easier to query the data for a particular column.&lt;/li&gt; &lt;/ol&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;pages&quot;&gt;Pages &lt;a class=&quot;direct-link&quot; href=&quot;#pages&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;When this record is created, it is added to a &lt;em&gt;page.&lt;/em&gt; Databases tend to work with pages. A page is the smallest unit of data that a database will load into a memory cache. Let&#39;s back up a little and talk about why pages are important.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A database table could have billions of records or more. Storing these records side by side in a file could be a nightmare to manage. There will be some added complexity around storing and retrieving a record. Records are organized into pages to make them easier to work with.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Imagine that a database needs to retrieve millions of records from disk and load them into memory. It would be difficult to determine how much space to allocate in memory for this operation. With pages, this becomes easier to manage since they are of a fixed length.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;Think of pages in this case like pages in a notebook. A notebook page is made up of several lines of text. This organization makes it easier to find a line on a page once you have the page number. It also means that if you tear out a page from a notebook, you can read all the lines on that page without having to refer to the notebook multiple times. Pages in the context of disk storage minimize the number of disk I/O operations that will be needed to retrieve a record or a set of records.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Two key things to note are these:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When you add a new record (row) or update an existing record in your database table, that change is reflected on a data page which exists in an in-memory cache.&lt;/li&gt;
&lt;li&gt;When you want to read a record from your database table, the system first checks if the page exists in the cache before hitting the disk otherwise.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The reason for this is simple: it is faster and less resource-intensive for the CPU to access data in main memory (the in-memory cache) than it is to go to the disk to fetch data. However, this is not without its drawbacks. Unlike data stored in memory, data stored on your hard disk is durable: when you shut down your laptop, you still have the data on your hard disk. That&#39;s unfortunately not the case for data in main memory.&lt;/p&gt;
&lt;p&gt;You may then wonder: &lt;em&gt;If a record that I insert or update is stored on a page located in the main memory, and data is lost if the computer is switched off, how are my changes prevented from being lost if something bad happens?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Good question! I&#39;ll address this soon, but not before I talk about some principles that guide how pages are loaded from the disk into the in-memory cache. These include:&lt;/p&gt;
&lt;p&gt;a) &lt;strong&gt;Temporal Locality:&lt;/strong&gt; Temporal locality means that pages are loaded in-memory based on the likelihood that a recently accessed page will be accessed again soon. It means that if you make a query that happens to fetch some data pages from disk, those data pages will be stored in memory for as long as possible to prevent having to go to the disk to fetch them again.&lt;/p&gt;
&lt;p&gt;b) &lt;strong&gt;Spatial Locality:&lt;/strong&gt; This works based on the prediction that if a page is loaded in memory, the pages that are stored physically close to that page on disk will likely be accessed soon. As a result, some pages are &#39;pre-fetched&#39; ahead of when they are actually used. It means that if you run a query that fetches a page which has records with IDs that range from 1-30, the page which has records from 31-60 will also likely be loaded alongside it to prevent a subsequent trip to the disk.&lt;/p&gt;
&lt;p&gt;These principles help to minimize the number of disk I/O operations needed.&lt;/p&gt;
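&lt;p&gt;A toy model of these two principles in Python: an LRU cache approximates temporal locality (recently used pages are kept, and the least recently used page is evicted), while prefetching the next page number approximates spatial locality. The page numbering and fetch function are invented for illustration:&lt;/p&gt;

```python
from collections import OrderedDict

class PageCache:
    def __init__(self, capacity, fetch_from_disk):
        self.capacity = capacity
        self.fetch = fetch_from_disk     # simulates a disk I/O operation
        self.pages = OrderedDict()       # page_no -> page contents, LRU order
        self.disk_reads = 0

    def _load(self, page_no):
        if page_no not in self.pages:
            self.disk_reads += 1
            self.pages[page_no] = self.fetch(page_no)
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)   # evict least recently used

    def get(self, page_no):
        self._load(page_no)
        self.pages.move_to_end(page_no)          # temporal locality
        self._load(page_no + 1)                  # spatial locality: prefetch
        return self.pages[page_no]

cache = PageCache(capacity=4, fetch_from_disk=lambda n: f"page-{n}")
cache.get(1)              # reads pages 1 and 2 from "disk"
cache.get(2)              # already prefetched: only page 3 is read
print(cache.disk_reads)   # 3
```

&lt;p&gt;Reading page 2 right after page 1 costs no extra disk read, because the prefetch already brought it into the cache.&lt;/p&gt;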
&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://timilearning.com/uploads/Page-actual-size.png&quot; alt=&quot;Structure of a database page&quot;&gt;
&lt;/p&gt;
&lt;p align=&quot;center&quot;&gt;
Figure 2 - Structure of a Database Page
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Notes:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Page Header contains information about the page like: how many records it contains, how much space it has left, what table a page belongs to etc.&lt;/li&gt;
&lt;li&gt;The Record Offset array helps to manage the location of the records on a page. Each &#39;slot&#39; in the array points to the beginning of a record, and helps to locate where the record is physically stored within the page.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;writeback&quot;&gt;Writeback &lt;a class=&quot;direct-link&quot; href=&quot;#writeback&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Recall that saved records are first added to a page located in an in-memory cache. Writeback is the process by which pages located in memory are written back to disk. When a page has two copies, one in memory and one on disk, and the more recent in-memory copy has been modified, the page is referred to as being &lt;em&gt;dirty&lt;/em&gt;: the content in memory differs from what is on disk, and the copy on disk needs to be updated since it is the durable data store.&lt;/p&gt;
&lt;p&gt;Now, I do not know all the details about when or how this writeback process is triggered. My understanding is that it happens periodically, but I believe there&#39;s more to that.&lt;/p&gt;
&lt;p&gt;I&#39;m aware that I haven&#39;t answered the question I teased earlier about how databases prevent data from being lost, if data is initially saved in a temporary location. This section might also provoke a similar question from you:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;What if the computer is switched off before writeback happens?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&#39;ve teased you enough, so I&#39;ll address that in the next section.&lt;/p&gt;
&lt;h4 id=&quot;write-ahead-logs&quot;&gt;Write-ahead Logs &lt;a class=&quot;direct-link&quot; href=&quot;#write-ahead-logs&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Write-ahead logs are used in a number of database systems, ranging from relational databases like PostgreSQL to in-memory datastores like Redis. A log is an &lt;em&gt;append-only&lt;/em&gt; file: when you add a record to a log, it is placed at the end, so writes are sequential. This makes writes faster than they would be if the records in the file were kept sorted, since you only need to keep track of the end of the log to insert something. The drawback is that finding a specific record in an unsorted log is less efficient, as you may end up checking each line in a log of a billion lines! Logs are not typically used this way though; there is usually an additional structure involved for that purpose (&lt;em&gt;hint:&lt;/em&gt; it&#39;s called an index).&lt;/p&gt;
&lt;p&gt;I&#39;m not giving you the full story about logs and other contexts in which they are used, but for the purpose of this post, the key thing to note is that they are &lt;em&gt;append-only&lt;/em&gt; files, and writes are sequential.&lt;/p&gt;
&lt;p&gt;Now, recall that I said that database writes (which involve creating, updating, or deleting a record) are first written to a page that exists in memory, making them likely to be lost in the event of a system crash. Well, that&#39;s not quite the full story. In many database systems, writes are also logged to a special file known as a &lt;em&gt;write-ahead log&lt;/em&gt;, which describes what changes happened on what pages. The log is persisted to disk (i.e. permanent storage) before the database writes are considered complete. This way, if your computer crashes before the in-memory pages have been written back to disk, the lost changes can be restored: any change that has been recorded in the log file but is not yet reflected in the data pages on disk can be replayed.&lt;/p&gt;
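&lt;p&gt;Here is a minimal sketch of the idea in Python. The entry format and file name are made up for illustration (real write-ahead logs use compact binary records and log sequence numbers): changes are appended and synced to disk, and recovery replays the log to rebuild the changes that were lost from memory:&lt;/p&gt;

```python
import json
import os
import tempfile

def wal_append(wal_path, page_no, change):
    """Append one change description to the log and force it to disk."""
    with open(wal_path, "a") as wal:
        wal.write(json.dumps({"page": page_no, "change": change}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())   # durable before the write is acknowledged

def wal_replay(wal_path):
    """Rebuild the per-page changes from the log, e.g. after a crash."""
    pages = {}
    with open(wal_path) as wal:
        for line in wal:
            entry = json.loads(line)
            pages.setdefault(entry["page"], []).append(entry["change"])
    return pages

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal_append(wal_path, 1, "insert record 42")
wal_append(wal_path, 1, "update record 42")
wal_append(wal_path, 2, "insert record 43")

# Pretend we crashed before writeback: replay the log to recover.
print(wal_replay(wal_path))
```

&lt;p&gt;Because appends are sequential and each entry is small, logging a change this way is much cheaper than writing out the full pages it touches.&lt;/p&gt;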
&lt;p&gt;To explain why using a write-ahead log is beneficial over writing the pages directly to the disk, I&#39;ll introduce one final concept known as &lt;em&gt;Transactions.&lt;/em&gt;&lt;/p&gt;
&lt;h4 id=&quot;transactions&quot;&gt;Transactions &lt;a class=&quot;direct-link&quot; href=&quot;#transactions&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;Think of a transaction as a group of one or more database operations which act independently of other groups. If &lt;em&gt;at least one&lt;/em&gt; operation in the group fails, the whole group is declared a failure and the transaction &lt;em&gt;aborts.&lt;/em&gt; If &lt;em&gt;all&lt;/em&gt; the operations in the group complete successfully, the transaction &lt;em&gt;commits.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To make this more practical, imagine you have a &#39;Cars&#39; table in your database, you could be performing the following operations in one transaction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Inserting 5 new cars into the table&lt;/li&gt;
&lt;li&gt;Updating the prices of 3 cars&lt;/li&gt;
&lt;li&gt;Deleting cars manufactured before 2010&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If at least one of these operations fails, say one of the five insertions violates a uniqueness constraint, the whole transaction fails - including the updates and deletions!&lt;/p&gt;
&lt;p&gt;Without this concept of transactions, if there&#39;s a failure in the middle of a number of database changes, it&#39;s difficult to keep track of which changes have happened, and it could leave you with incorrect data if you make some changes twice.&lt;/p&gt;
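&lt;p&gt;The all-or-nothing behaviour can be sketched in Python as staging changes on a copy of the table and only swapping them in if every operation succeeds. The operations and the uniqueness check are invented for illustration; real databases implement this with logging and locking rather than copying:&lt;/p&gt;

```python
class TransactionAborted(Exception):
    pass

def run_transaction(table, operations):
    """table: dict of id to row. operations: list of (op, id, row_or_None)."""
    staged = dict(table)             # work on a copy; commit = swap in
    for op, row_id, row in operations:
        if op == "insert":
            if row_id in staged:     # uniqueness constraint violated
                raise TransactionAborted(f"duplicate id {row_id}")
            staged[row_id] = row
        elif op == "update":
            staged[row_id] = row
        elif op == "delete":
            staged.pop(row_id, None)
    table.clear()
    table.update(staged)             # commit: all changes become visible

cars = {1: "Civic 2012"}
try:
    run_transaction(cars, [
        ("insert", 2, "Model 3 2021"),
        ("insert", 1, "Corolla 2019"),   # violates uniqueness, so we abort
        ("delete", 1, None),
    ])
except TransactionAborted:
    pass

print(cars)   # unchanged: the whole group failed together
```

&lt;p&gt;Even though the first insert was valid on its own, none of the three operations took effect, which is exactly the behaviour described above.&lt;/p&gt;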
&lt;p&gt;Going back to why a write-ahead log is beneficial compared to writing data pages directly to disk: the log is written to disk only once per transaction, &lt;em&gt;when the transaction is committed.&lt;/em&gt; Contrast this with a situation where data pages are written directly to disk immediately: if a transaction changes multiple data pages, each of those pages would need to be written to disk before the write is successful, which would slow down the performance of the database. A write-ahead log greatly reduces the number of disk writes needed for database operations.&lt;/p&gt;
&lt;h2 id=&quot;conclusion%2Fnext-steps&quot;&gt;Conclusion/Next Steps &lt;a class=&quot;direct-link&quot; href=&quot;#conclusion%2Fnext-steps&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I&#39;ve only scratched the surface of this and there are still questions I haven&#39;t answered like: &lt;em&gt;What determines what page a record is stored on? What determines how a page is laid out on disk? How does the database know what pages to search for a record? And more...&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I&#39;ll answer those in the next few posts about database indexing structures: B-Trees and LSM Trees in particular.&lt;/p&gt;
&lt;h2 id=&quot;open-question-i-still-have&quot;&gt;Open Question I Still Have &lt;a class=&quot;direct-link&quot; href=&quot;#open-question-i-still-have&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;What rules govern when writeback is triggered? Does the application developer just set an interval for that to happen?&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading &lt;a class=&quot;direct-link&quot; href=&quot;#further-reading&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;I&#39;ll divide this section by different topics if you&#39;re interested in going into more detail.&lt;/p&gt;
&lt;h4 id=&quot;general-overview-of-distributed-systems&quot;&gt;General Overview of Distributed Systems &lt;a class=&quot;direct-link&quot; href=&quot;#general-overview-of-distributed-systems&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://dataintensive.net/&quot;&gt;Designing Data-Intensive Applications&lt;/a&gt; by Martin Kleppmann - I&#39;m about 80% done with the book and I&#39;ve learned so much from it already. I&#39;ll highly recommend it.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;records-%26-pages&quot;&gt;Records &amp;amp; Pages &lt;a class=&quot;direct-link&quot; href=&quot;#records-%26-pages&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/&quot;&gt;SQL Server Storage Internals&lt;/a&gt; by Mark S Rasmussen - I really enjoyed this post as it helped me make sense of how information is stored on records and pages. It was also useful for understanding indexes better.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/@adamzerner/spatial-and-temporal-locality-for-dummies-b080f2799dd&quot;&gt;Spatial and Temporal Locality for Dummies&lt;/a&gt; by Adam Zerner.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lwn.net/Articles/682582/&quot;&gt;Toward less-annoying background writeback&lt;/a&gt; by Jonathan Corbet.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/databasss/on-disk-io-part-1-flavours-of-io-8e1ace1de017&quot;&gt;On Disk IO, Part 1: Flavors of IO&lt;/a&gt; by Alex Petrov - I found the bit on the writeback process useful.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://blogs.msdn.microsoft.com/askjay/2011/01/07/what-is-a-slot-array/&quot;&gt;What is a slot array?&lt;/a&gt; by Jamesask.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;write-ahead-logs-2&quot;&gt;Write-ahead Logs &lt;a class=&quot;direct-link&quot; href=&quot;#write-ahead-logs-2&quot; aria-hidden=&quot;true&quot;&gt;#&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.postgresql.org/docs/9.1/wal-intro.html&quot;&gt;Write-ahead Logging&lt;/a&gt; from the PostgreSQL documentation.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://redislabs.com/ebook/part-2-core-concepts/chapter-4-keeping-data-safe-and-ensuring-performance/4-1-persistence-options/4-1-2-append-only-file-persistence/&quot;&gt;Append-only File Persistence&lt;/a&gt; from RedisLabs.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://wildwolf.name/how-to-make-system-logs-append-only/&quot;&gt;How to Make System Logs Append-Only&lt;/a&gt; from WildWildWolf.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Last updated on 19-06-2020.&lt;/em&gt;&lt;/p&gt;
</content>
	</entry>
</feed>
