<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom">
 
 <title>Aaron Morton</title>
 
 <link href="http://thelastpickle.com/" />
 <updated>2013-05-01T11:26:41-07:00</updated>
 <id>http://thelastpickle.com/</id>
 <author>
   <name>Aaron Morton</name>
   <email>aaron@thelastpickle.com</email>
 </author>

 
 <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/AaronMorton" /><feedburner:info xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" uri="aaronmorton" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry>
   <title>PRIMARY KEY's in CQL</title>
   <link href="http://thelastpickle.com/2013/01/11/primary-keys-in-cql" />
   <updated>2013-01-11T00:00:00-08:00</updated>
   <id>http://thelastpickle.com/2013/01/11/primary-keys-in-cql</id>
   <content type="html">&lt;p&gt;The final version of &lt;a href='https://github.com/apache/cassandra/blob/cassandra-1.2/doc/cql3/CQL.textile' title='Cassandra Query Language (CQL) v3.0.0'&gt;CQL 3&lt;/a&gt; that ships with Cassandra v1.2 adds some new features to the &lt;code&gt;PRIMARY KEY&lt;/code&gt; clause. It overloads the concept in ways that differ from the standard &lt;a href='http://en.wikipedia.org/wiki/Primary_key'&gt;SQL&lt;/a&gt; definition, and in some places shares ideas with &lt;a href='http://hive.apache.org/docs/r0.9.0/language_manual/data-manipulation-statements.html'&gt;Hive&lt;/a&gt;. But from a Cassandra point of view it allows for the same flexibility as the Thrift API.&lt;/p&gt;

&lt;h2 id='schema_schema_every_where'&gt;Schema, Schema, Every Where&lt;/h2&gt;

&lt;p&gt;There are two ways to specify the primary key in the &lt;a href='http://www.datastax.com/docs/1.2/cql_cli/cql/CREATE_TABLE'&gt;CREATE TABLE&lt;/a&gt; statement. It can be specified in line.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;create table foo (
    bar     int PRIMARY KEY, 
    baz     int
);&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or it can be specified as a separate clause, which is the method we will be using.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;create table foo (
    bar     int, 
    baz     int,
    PRIMARY KEY (bar)
);&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The definition of the &lt;code&gt;PRIMARY KEY&lt;/code&gt; clause in the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-1.2/doc/cql3/CQL.textile'&gt;spec&lt;/a&gt; can appear confusing at first.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PRIMARY KEY &amp;#39;(&amp;#39; &amp;lt;partition-key&amp;gt; ( &amp;#39;,&amp;#39; &amp;lt;identifier&amp;gt; )* &amp;#39;)&amp;#39;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However the comments further down the tell us all we need to know.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In CQL, the order in which columns are defined for the PRIMARY KEY matters. The first column of the key is called the partition key. It has the property that all the rows sharing the same partition key (even across table in fact) are stored on the same physical node. Also, insertion/update/deletion on rows sharing the same partition key for a given table are performed atomically and in isolation. Note that it is possible to have a composite partition key, i.e. a partition key formed of multiple columns, using an extra set of parentheses to define which columns forms the partition key.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So lets get started.&lt;/p&gt;

&lt;h2 id='the_setup'&gt;The Setup&lt;/h2&gt;

&lt;p&gt;In the examples below I used a 3 node local cluster created with the &lt;a href='https://github.com/pcmanus/ccm'&gt;ccm&lt;/a&gt; tool from Sylvain Lebresne. Before starting the cluster I brought up 2 additional network interfaces for the nodes to bind to.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ sudo ifconfig lo0 alias 127.0.0.2 up
$ sudo ifconfig lo0 alias 127.0.0.3 up&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I could then start a 3 node cluster using version 1.2.0 from the ccm directory.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ccm create dev -v 1.2.0
Current cluster is now: dev
$ ccm populate -n 3
$ ccm start&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Checked everything was as expected.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ccm node1 ring
Note: Ownership information does not include topology; for complete information, specify a keyspace

Datacenter: datacenter1
==========
Address         Rack        Status State   Load            Owns                Token                                       
                                                                               3074457345618258602                         
127.0.0.1       rack1       Up     Normal  24.88 KB        33.33%              -9223372036854775808                        
127.0.0.2       rack1       Up     Normal  24.89 KB        33.33%              -3074457345618258603                        
127.0.0.3       rack1       Up     Normal  15.54 KB        33.33%              3074457345618258602  &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then I started a &lt;code&gt;cqlsh&lt;/code&gt; (in &lt;code&gt;bin/&lt;/code&gt; of the standard distribution) session against node 1.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/cqlsh 127.0.0.1
Connected to dev at 127.0.0.1:9160.
[cqlsh 2.3.0 | Cassandra 1.2.0-SNAPSHOT | CQL spec 3.0.0 | Thrift protocol 19.35.0]
Use HELP for help.
cqlsh&amp;gt; &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And created a Keyspace with RF 1.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cqlsh&amp;gt; create keyspace dev 
   ... WITH replication = {&amp;#39;class&amp;#39;:&amp;#39;SimpleStrategy&amp;#39;, &amp;#39;replication_factor&amp;#39;:1};
cqlsh&amp;gt; use dev;
cqlsh:dev&amp;gt; &lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='the_sound_of_one_column_indexing'&gt;The Sound of One Column Indexing&lt;/h2&gt;

&lt;p&gt;Cassandra 1.2 &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-4361' title='[CASSANDRA-4361] CQL3: allow definition with only a PK'&gt;allows&lt;/a&gt; tables to be defined with one column that is also the &lt;code&gt;PRIMARY KEY&lt;/code&gt;. If you&amp;#8217;ve used Cassandra before this may sound muy loco as internally a row without columns is purged during compaction. This allows rows that only contain &lt;a href='https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/db/ExpiringColumn.java'&gt;ExpiringColumns&lt;/a&gt; to be automatically removed. If you wanted a row without any columns you would need a place holder column, and this pretty much what CQL 3 does.&lt;/p&gt;

&lt;p&gt;My one column table looked like this.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE TABLE device (
  device_id int,
  PRIMARY KEY (device_id)
);&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I then put three rows in it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;INSERT INTO device 
(device_id)
values
(1);
INSERT INTO device 
(device_id)
values
(2);
INSERT INTO device 
(device_id)
values
(3);&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In &lt;code&gt;cqlsh&lt;/code&gt; this all looked sane.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cqlsh:dev&amp;gt; select * from device;

 device_id
-----------
         1
         2
         3 &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Jumping over to the old fashioned &lt;code&gt;cassandra-cli&lt;/code&gt; we can see the place holder columns.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/cassandra-cli -h 127.0.0.1
Connected to: &amp;quot;dev&amp;quot; on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.2.0

Type &amp;#39;help;&amp;#39; or &amp;#39;?&amp;#39; for help.
Type &amp;#39;quit;&amp;#39; or &amp;#39;exit;&amp;#39; to quit.

[default@unknown] use dev;
Authenticated to keyspace: dev
[default@dev] list device;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 1
=&amp;gt; (column=, value=, timestamp=1357864824406000)
-------------------
RowKey: 2
=&amp;gt; (column=, value=, timestamp=1357864824413000)
-------------------
RowKey: 3
=&amp;gt; (column=, value=, timestamp=1357864825075000)

3 Rows Returned.
Elapsed time: 49 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='partitioning_and_clustering'&gt;Partitioning and Clustering&lt;/h2&gt;

&lt;p&gt;The PRIMARY KEY definition is made up of two parts: the &lt;em&gt;Partition Key&lt;/em&gt; and the &lt;em&gt;Clustering Columns&lt;/em&gt;. The first part maps to the storage engine row key, while the second is used to group columns in a row. In the storage engine the columns are grouped by prefixing their name with the value of the clustering columns. This is a standard design pattern when using the Thrift API. But now CQL takes care of transposing the clustering column values to and from the non key fields in the table.&lt;/p&gt;

&lt;p&gt;My table with both a partitioning key and clustering columns looked like this.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE TABLE device_check (
  device_id   int,
  checked_at  timestamp, 
  is_power    boolean, 
  is_locked   boolean,
  PRIMARY KEY (device_id, checked_at)
);&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The partitioning key is the &lt;code&gt;device_id&lt;/code&gt; and the clustering column is &lt;code&gt;checked_at&lt;/code&gt;. The specification allows for more than one clustering column, I just chose one here. It also allows for multiple partitioning key columns as we will see later.&lt;/p&gt;

&lt;p&gt;To see what these keys do I inserted some data. As before there are three devices in the example and each one is checked once a month to see if it is locked and powered.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;INSERT INTO device_check
  (device_id, checked_at, is_power, is_locked)
values
  (1, &amp;#39;2013-01-01T09:00+1300&amp;#39;, true, true)
;
INSERT INTO device_check
  (device_id, checked_at, is_power, is_locked)
values
  (2, &amp;#39;2013-01-01T09:10+1300&amp;#39;, true, true)
;
INSERT INTO device_check
  (device_id, checked_at, is_power, is_locked)
values
  (3, &amp;#39;2013-01-01T09:10+1300&amp;#39;, true, false)
;
INSERT INTO device_check
  (device_id, checked_at, is_power, is_locked)
values
  (1, &amp;#39;2013-02-01T09:00+1300&amp;#39;, true, false)
;
INSERT INTO device_check
  (device_id, checked_at, is_power, is_locked)
values
  (2, &amp;#39;2013-02-01T09:10+1300&amp;#39;, true, false)
;
INSERT INTO device_check
  (device_id, checked_at, is_power, is_locked)
values
  (3, &amp;#39;2013-02-01T09:10+1300&amp;#39;, true, true)
;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Poking around in &lt;code&gt;cqlsh&lt;/code&gt; everything looks as expected.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cqlsh:dev&amp;gt; select * from device_check;

 device_id | checked_at               | is_locked | is_power
-----------+--------------------------+-----------+----------
         1 | 2013-01-01 09:00:00+1300 |      True |     True
         1 | 2013-02-01 09:00:00+1300 |     False |     True
         2 | 2013-01-01 09:10:00+1300 |      True |     True
         2 | 2013-02-01 09:10:00+1300 |     False |     True
         3 | 2013-01-01 09:10:00+1300 |     False |     True
         3 | 2013-02-01 09:10:00+1300 |      True |     True
cqlsh:dev&amp;gt; select * from device_check where device_id = 1;

 device_id | checked_at               | is_locked | is_power
-----------+--------------------------+-----------+----------
         1 | 2013-01-01 09:00:00+1300 |      True |     True
         1 | 2013-02-01 09:00:00+1300 |     False |     True             &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So back to the &lt;code&gt;cassandra-cli&lt;/code&gt; we go to see what&amp;#8217;s happening with the clustering columns for device 1.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@dev] get device_check[1];
=&amp;gt; (column=2013-01-01 09\:00\:00+1300:, value=, timestamp=1357866010549000)
=&amp;gt; (column=2013-01-01 09\:00\:00+1300:is_locked, value=01, timestamp=1357866010549000)
=&amp;gt; (column=2013-01-01 09\:00\:00+1300:is_power, value=01, timestamp=1357866010549000)
=&amp;gt; (column=2013-02-01 09\:00\:00+1300:, value=, timestamp=1357866056217000)
=&amp;gt; (column=2013-02-01 09\:00\:00+1300:is_locked, value=00, timestamp=1357866056217000)
=&amp;gt; (column=2013-02-01 09\:00\:00+1300:is_power, value=01, timestamp=1357866056217000)
Returned 6 results.
Elapsed time: 38 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Where we had two rows in &lt;code&gt;cqlsh&lt;/code&gt; we now have one row with six columns in &lt;code&gt;cassandra-cli&lt;/code&gt; which uses the Thrift API. CQL is mapping multiple instances of our entity (the &lt;code&gt;device_check&lt;/code&gt;) to the same partition, and the partition is identified by the value of &lt;code&gt;device_id&lt;/code&gt;. And you can probably guess that the partition is implemented as a row in the storage engine. The column names look a little strange, and technically I should not call them columns. In the current Cassandra lexicon the internal storage engine columns are called &lt;em&gt;Cells&lt;/em&gt;, Columns are used for CQL. The first cell / storage column has the value of the first &lt;code&gt;checked_at&lt;/code&gt; CQL column as it&amp;#8217;s name &lt;em&gt;2013-01-01 09:00:00+1300:&lt;/em&gt;. The &amp;#8217;:&amp;#8217; indicates there are multiple components to this cell name, however this cell does not supply values for all parts. The cell does not have a value as the CQL column value is stored in the cell name. The second cell has a value for &lt;code&gt;checked_at&lt;/code&gt; and the name of the first none primary key column &lt;code&gt;is_locked&lt;/code&gt; &lt;em&gt;2013-01-01 09:00:00+1300:is&lt;/em&gt;locked_. In this case the cell value is the value for the &lt;code&gt;is_locked&lt;/code&gt; CQL column. This pattern continues for the second CQL row, with &lt;code&gt;checked_at&lt;/code&gt; equal to &lt;code&gt;2013-02-01 09:00:00+1300&lt;/code&gt;, and it will continue for all entities in this partition.&lt;/p&gt;

&lt;p&gt;Now to take a look at the effect of the partition key on data placement. My expectation is that each unique &lt;code&gt;device_id&lt;/code&gt; value, and so each partition, will be stored on a different node. In reality the storage engine rows are randomly distributed between nodes so I&amp;#8217;ve adjusted my expectations appropriately. I also expect that each row will be replicated once, as I set the &lt;code&gt;replication_factor&lt;/code&gt; to one.&lt;/p&gt;

&lt;p&gt;To see the node a row is stored on we use the &lt;code&gt;nodetool getendpoints&lt;/code&gt; command. It returns the replicas for a (storage engine) row key in a given Keyspace and Column Family.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h 127.0.0.1 -p 7100 getendpoints dev device_check 1
127.0.0.2
$ bin/nodetool -h 127.0.0.1 -p 7100 getendpoints dev device_check 2
127.0.0.2
$ bin/nodetool -h 127.0.0.1 -p 7100 getendpoints dev device_check 3
127.0.0.1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That&amp;#8217;s close enough for me. Different partitions, identified by different &lt;code&gt;device_id&lt;/code&gt; values, are stored on different nodes.&lt;/p&gt;

&lt;h2 id='the_composite_enchilada'&gt;The Composite Enchilada&lt;/h2&gt;

&lt;p&gt;Now lets expand the partition key to use a &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-4179' title='[CASSANDRA-3761] Add more general support for composites (to row key, column value)'&gt;composite type&lt;/a&gt;. This is useful when you have a time series and you need to partition the events to avoid huge rows. Rather than have one partition based on, in my example, the &lt;code&gt;device_id&lt;/code&gt; I can have several where &lt;code&gt;device_id&lt;/code&gt; is a part of the partition selection. Meaning CQL rows in this table with the same &lt;code&gt;device_id&lt;/code&gt; may be located on different nodes. But all rows with the same values for the partitioning keys will be located on the same nodes.&lt;/p&gt;

&lt;p&gt;For my example I used the devices I was tracking to check for &lt;a href='http://www.youtube.com/watch?feature=player_detailpage&amp;amp;v=ZJT2vJMsYc4#t=67s'&gt;dam dirty apes&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE TABLE events (
  device_id   int,
  year_month  int,
  sequence    timestamp,
  pressure    int,
  temperature int,
  is_dam_dirty_apes  boolean,
  PRIMARY KEY ((device_id, year_month), sequence)
);&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The partition key&lt;em&gt;s&lt;/em&gt; are &lt;code&gt;device_id&lt;/code&gt; and &lt;code&gt;year_month&lt;/code&gt;, every event from the same device in the same month will be placed in the same partition. The grouping column is the time of the event which I called &lt;code&gt;sequence&lt;/code&gt; (to avoid confusion with Cassandra timestamps).&lt;/p&gt;

&lt;p&gt;So we turn on the network and start checking for apes.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;insert into events 
(device_id, year_month, sequence, pressure, temperature, is_dam_dirty_apes)
values
(1, 201301, &amp;#39;2013-01-20T10:58:35+1300&amp;#39;, 123, 10, false);
insert into events 
(device_id, year_month, sequence, pressure, temperature, is_dam_dirty_apes)
values
(2, 201301, &amp;#39;2013-01-20T10:58:40+1300&amp;#39;, 456, 20, false);
insert into events 
(device_id, year_month, sequence, pressure, temperature, is_dam_dirty_apes)
values
(3, 201301, &amp;#39;2013-01-20T10:58:45+1300&amp;#39;, 789, 30, true);
insert into events 
(device_id, year_month, sequence, pressure, temperature, is_dam_dirty_apes)
values
(1, 201302, &amp;#39;2013-02-20T10:58:35+1300&amp;#39;, 1230, 11, true);
insert into events 
(device_id, year_month, sequence, pressure, temperature, is_dam_dirty_apes)
values
(2, 201302, &amp;#39;2013-02-20T10:58:40+1300&amp;#39;, 4560, 21, true);
insert into events 
(device_id, year_month, sequence, pressure, temperature, is_dam_dirty_apes)
values
(3, 201302, &amp;#39;2013-02-20T10:58:45+1300&amp;#39;, 7890, 31, true);&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In &lt;code&gt;cqlsh&lt;/code&gt; we now have 6 rows.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cqlsh:dev&amp;gt; select * from events;

 device_id | year_month | sequence                 | is_dam_dirty_apes | pressure | temperature
-----------+------------+--------------------------+-------------------+----------+-------------
         2 |     201302 | 2013-02-20 10:58:40+1300 |              True |     4560 |          21
         3 |     201302 | 2013-02-20 10:58:45+1300 |              True |     7890 |          31
         1 |     201302 | 2013-02-20 10:58:35+1300 |              True |     1230 |          11
         1 |     201301 | 2013-01-20 10:58:35+1300 |             False |      123 |          10
         3 |     201301 | 2013-01-20 10:58:45+1300 |              True |      789 |          30
         2 |     201301 | 2013-01-20 10:58:40+1300 |             False |      456 |          20&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Checking in the &lt;code&gt;cassandra-cli&lt;/code&gt; we see a similar layout to before, with the addition of a composite value used for the row key. Each CQL row has been transposed to four columns in a storage engine row.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@dev] list events;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 2:201302
=&amp;gt; (column=2013-02-20 10\:58\:40+1300:, value=, timestamp=1357869160739000)
=&amp;gt; (column=2013-02-20 10\:58\:40+1300:is_dam_dirty_apes, value=01, timestamp=1357869160739000)
=&amp;gt; (column=2013-02-20 10\:58\:40+1300:pressure, value=000011d0, timestamp=1357869160739000)
=&amp;gt; (column=2013-02-20 10\:58\:40+1300:temperature, value=00000015, timestamp=1357869160739000)
-------------------
RowKey: 3:201302
=&amp;gt; (column=2013-02-20 10\:58\:45+1300:, value=, timestamp=1357869161380000)
=&amp;gt; (column=2013-02-20 10\:58\:45+1300:is_dam_dirty_apes, value=01, timestamp=1357869161380000)
=&amp;gt; (column=2013-02-20 10\:58\:45+1300:pressure, value=00001ed2, timestamp=1357869161380000)
=&amp;gt; (column=2013-02-20 10\:58\:45+1300:temperature, value=0000001f, timestamp=1357869161380000)
...&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Nothing to scary there. Now over to &lt;code&gt;nodetool&lt;/code&gt; to see where the rows are placed. We need to specify the value of &lt;code&gt;device_id&lt;/code&gt; and &lt;code&gt;year_month&lt;/code&gt; as these are used in the storage engine row key.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h 127.0.0.1 -p 7100 getendpoints dev events 1:201301
127.0.0.3
$ bin/nodetool -h 127.0.0.1 -p 7100 getendpoints dev events 2:201301
127.0.0.1
$ bin/nodetool -h 127.0.0.1 -p 7100 getendpoints dev events 3:201301
127.0.0.1
$ bin/nodetool -h 127.0.0.1 -p 7100 getendpoints dev events 1:201302
127.0.0.3
$ bin/nodetool -h 127.0.0.1 -p 7100 getendpoints dev events 2:201302
127.0.0.2
$ bin/nodetool -h 127.0.0.1 -p 7100 getendpoints dev events 3:201302
127.0.0.2&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The two partitions each for devices 2 and 3 have been placed on different nodes. The partitions for device 1 are on the same node, but with enough nodes they would probably be on different ones.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Sorting Lists For Humans</title>
   <link href="http://thelastpickle.com/2012/08/18/Sorting-Lists-For-Humans" />
   <updated>2012-08-18T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2012/08/18/Sorting-Lists-For-Humans</id>
   <content type="html">&lt;p&gt;I like humans. Many of my friends a humans; my wife is a human. And I admire their ability to arbitrarily order items in a list. Applying a manual order to a list of items has been discussed a few times on the Cassandra user list. And I&amp;#8217;ve been thinking about it recently.&lt;/p&gt;

&lt;p&gt;To keep things simple let&amp;#8217;s start with an artificially constrained problem. Our list will only contain 50 &amp;#8220;things&amp;#8221; that must be manually sorted. The things consist of an Integer &lt;code&gt;id&lt;/code&gt; and an ASCII &lt;code&gt;label&lt;/code&gt;. We can also say that the rate of changes is not excessive. In this scenario there are two issues we want to look at:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;How do I move an item without re-ordering everything?&lt;/em&gt;&lt;/li&gt;

&lt;li&gt;&lt;em&gt;How do I handle concurrent changes to the list?&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id='the_natural_order'&gt;The Natural Order&lt;/h2&gt;

&lt;p&gt;My first instinct is to duplicate the list order in Cassandra Columns. For example we could use a &lt;a href='http://www.datastax.com/dev/blog/introduction-to-composite-columns-part-1'&gt;Composite Column&lt;/a&gt; name such as &lt;code&gt;(weight, id)&lt;/code&gt;. Cassandra would then sort the Columns by &lt;code&gt;weight&lt;/code&gt;, using &lt;code&gt;id&lt;/code&gt; as a tie breaker for duplicate &lt;code&gt;weight&lt;/code&gt;&amp;#8217;s. To get the list in order we simply select the Columns from the row.&lt;/p&gt;

&lt;p&gt;Moving items using this data model can be painful. Consider the &lt;code&gt;(weight, id&lt;/code&gt;) list below:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(1, 101)
(2, 102)
(3, 103)
(4, 104)
(5, 105)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If we want to move item &lt;code&gt;105&lt;/code&gt; to be before item &lt;code&gt;102&lt;/code&gt; we would change it&amp;#8217;s weight to &lt;code&gt;2&lt;/code&gt;. To avoid duplicate weights we would then change the &lt;code&gt;weight&lt;/code&gt; of 3 and the items that follow it. The reason to avoid duplicate weight&amp;#8217;s is that the sort order for duplicates would be based on the &lt;code&gt;id&lt;/code&gt;. If we had a better tie breaker duplicate weights would not be a problem.&lt;/p&gt;

&lt;p&gt;The unceasing March Of Time is a handy tie breaker. With it we can say that if two items have the same &lt;code&gt;weight&lt;/code&gt; the most recently updated one should be ordered last. If there are items with duplicate &lt;code&gt;(weight, timestamp)&lt;/code&gt; values we can then use the &lt;code&gt;id&lt;/code&gt; to break the tie. To avoid confusion with the Timestamp Cassandra stores for Columns I&amp;#8217;ll refer to our timestamp as the &amp;#8216;sequence&amp;#8217; or &amp;#8216;seq&amp;#8217;.&lt;/p&gt;

&lt;p&gt;Using the weights and sequence model the Column names wil have the form &lt;code&gt;(weight, seq, id)&lt;/code&gt;. If we initialise the &lt;code&gt;weight&lt;/code&gt; and &lt;code&gt;seq&lt;/code&gt; to the current time stamp our example from above may now look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(1, 1, 101)
(2, 2, 102)
(3, 3, 103)
(4, 4, 104)
(5, 5, 105)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To move an item so that it preceeds another we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set the &lt;code&gt;weight&lt;/code&gt; to the moving item of the weight of the new following sibling -1.&lt;/li&gt;

&lt;li&gt;Set the &lt;code&gt;seq&lt;/code&gt; of the moving item to the current timestamp.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After moving item 105 to preceed item 102 the list would look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(1, 1, 101)
(1, 6, 105) # items with duplicate weight sorted on seq
(2, 2, 102)
(3, 3, 103)
(4, 4, 104)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We were able to move item 105 without having to change any other items in the list. And it almost looks like we did it just by chaning one value. However we cannot change column names in Cassandra, we can only create and delete them. This &lt;em&gt;read-modify-write&lt;/em&gt; pattern has serious implications when it comes to concurrent changes as we&amp;#8217;ll see later. Until then let&amp;#8217;s implement the current design.&lt;/p&gt;

&lt;p&gt;To store natually sorted lists in Cassandra start with a &lt;a href='http://www.datastax.com/docs/1.1/dml/using_cli'&gt;cassandra-cli&lt;/a&gt; schema:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=natural_list-schema.txt'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;To load and modify the list using Python and &lt;a href='http://pycassa.github.com/pycassa/'&gt;pycassa&lt;/a&gt;:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=natural_list.py'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;Note that the &lt;code&gt;move()&lt;/code&gt; function deletes the old column before inserting the new one.&lt;/p&gt;

&lt;p&gt;To exercise the code I first called &lt;code&gt;initialise()&lt;/code&gt; to fill the list and used &lt;code&gt;get()&lt;/code&gt; to view it. I then called &lt;code&gt;move()&lt;/code&gt; to place item &lt;code&gt;5&lt;/code&gt; before &lt;code&gt;2&lt;/code&gt;. When item &lt;code&gt;4&lt;/code&gt; is moved to be before &lt;code&gt;2&lt;/code&gt; it is given then same weight as &lt;code&gt;5&lt;/code&gt; but a higher &lt;code&gt;seq&lt;/code&gt;. This places &lt;code&gt;4&lt;/code&gt; &lt;em&gt;after&lt;/em&gt; &lt;code&gt;5&lt;/code&gt; and both of them before &lt;code&gt;2&lt;/code&gt;:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=natural_list-example.txt'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;We moved the item but it was not pretty; we moved it by deleting and then re-inserting. If our &lt;em&gt;read-modify-write&lt;/em&gt; operation for item &lt;code&gt;5&lt;/code&gt; &lt;em&gt;overlaped&lt;/em&gt; with another client moving &lt;code&gt;5&lt;/code&gt; it would result in two entries for &lt;code&gt;5&lt;/code&gt;. In a tradtional RDBMS we could prevent this by running an &lt;a href='http://en.wikipedia.org/wiki/ACID'&gt;ACID Transaction&lt;/a&gt; working at &lt;a href='http://en.wikipedia.org/wiki/Isolation_%28database_systems%29#Repeatable_reads'&gt;Repeatable Read&lt;/a&gt; or better. But this is Sparta / Cassandra (cross out as applicable) and we don&amp;#8217;t have locking transactions.&lt;/p&gt;

&lt;h2 id='the_timeless_wonder_of_the_whole_thing'&gt;&lt;a href='http://www.youtube.com/watch?v=0X2GD5C_wHY&amp;amp;feature=player_detailpage#t=45s'&gt;The timeless wonder of the whole thing&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;The overlapping operations we are trying to avoid look like this:&lt;/p&gt;

&lt;p&gt;&lt;img src='/files/2012-08-18-Sorting-Lists-For-Humans/overlapped-operations.png' alt='Overlapping Operations' /&gt;&lt;/p&gt;

&lt;p&gt;If the Clients used a &lt;code&gt;REPEATABLE READ&lt;/code&gt; Transaction around all their work they would block each other. The &lt;code&gt;Shared&lt;/code&gt; locks taken when reading would prevent the other Client from obtaining the &lt;code&gt;Exclusive&lt;/code&gt; lock needed to delete. The DB Engine would pick an victim and &lt;code&gt;ABORT&lt;/code&gt; it&amp;#8217;s Transaction.&lt;/p&gt;

&lt;p&gt;A Transaction around the Delete and Write calls would work. The Delete from Client 1 would be blocked until Client 2 completed. Client 1 could then notice it&amp;#8217;s Delete updated 0 rows and &lt;code&gt;ABORT&lt;/code&gt; it&amp;#8217;s own Transaction.&lt;/p&gt;

&lt;p&gt;The &lt;a href='http://en.wikipedia.org/wiki/Multiple_granularity_locking'&gt;locks&lt;/a&gt; taken in the Transaction would prevent the Clients from modifying the same data. This is known as &lt;a href='http://en.wikipedia.org/wiki/Serializability'&gt;serializability&lt;/a&gt; and is the purpose of Transactions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&amp;#8230;a transaction schedule is serializable if its outcome (e.g., the resulting database state) is equal to the outcome of its transactions executed serially, i.e., sequentially without overlapping in time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I can think of two approaches to making our &lt;code&gt;move()&lt;/code&gt; operations &lt;a href='http://www.youtube.com/watch?v=h05YfP_8UsU'&gt;super cereal&lt;/a&gt;. The first is to make the Cassandra requests inherently serialisable. The second is to record the &lt;code&gt;move()&lt;/code&gt; transformations and have the application apply them in a serial fashion.&lt;/p&gt;

&lt;h2 id='take_it_and_turn_it'&gt;Take It and Turn It&lt;/h2&gt;

&lt;p&gt;At the start of the post I restricted the problem to lists of 50 &amp;#8220;things&amp;#8221;. I chose 50 for no particular reason. So instead of 50 let&amp;#8217;s say lists where it&amp;#8217;s &lt;em&gt;reasonable&lt;/em&gt; to read &lt;em&gt;all&lt;/em&gt; of the items at once. Even if you don&amp;#8217;t want to display all of the items. To get idea of what is reasonable, I can read 50 Columns that are 50 bytes long from a hot row cache in about &lt;a href='http://www.slideshare.net/aaronmorton/cassandra-sf-2012-technical-deep-dive-query-performance/56'&gt;260 Microseconds&lt;/a&gt; (excluding network IO).&lt;/p&gt;

&lt;p&gt;While we are at it lets say it&amp;#8217;s &lt;em&gt;unnecessary&lt;/em&gt; to have Cassandra maintain the order of the items in the list. It won&amp;#8217;t take much effort for the client to sort them. So lets see what happens when we give up the ability to ask the database for exactly the right data in the correct order.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;weight&lt;/code&gt;, &lt;code&gt;seq&lt;/code&gt; and &lt;code&gt;label&lt;/code&gt; for items in the Read Sorted list will be stored in Column values. While the Column names will contain the &lt;code&gt;item_id&lt;/code&gt; and a property name:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=read_sorted-schema.txt'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;To move an item we update it&amp;#8217;s &lt;code&gt;weight&lt;/code&gt; and &lt;code&gt;seq&lt;/code&gt; Columns, and because they are column values we do not need to delete first. &lt;a href='http://thelastpickle.com/2012/08/16/Row-Isolation-and-Consensus/'&gt;Row Level Isolation&lt;/a&gt; ensures that reading clients see either all updates or none. Overlapping requests from clients that update the same row will be serialised by the server. And the Column Timestamp included in the request means that the order they are applied in does not matter. Moving an item is now an atomic and serialisable operation.&lt;/p&gt;

&lt;p&gt;Read Sorted lists are implemented by the Python code:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=read_sorted.py'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;Note that &lt;code&gt;move()&lt;/code&gt; now makes a single insert call.&lt;/p&gt;

&lt;p&gt;I exercised the code using the same steps as above:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=read_sorted-example.txt'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;The output from the &lt;code&gt;cassandra-cli&lt;/code&gt; shows that the order of the Columns does not change when items are moved.&lt;/p&gt;

&lt;h2 id='ledger_list'&gt;Ledger List&lt;/h2&gt;

&lt;p&gt;It&amp;#8217;s time to start adding functionality, and complexity. The Ledger List maintains the order of items in Cassandra Columns, so uses a delete and insert for moving. The move operations are written to an application Transaction Log rather then applied directly to the list. When a client reads the list it applies the current transactions to it&amp;#8217;s client local copy; a process that is inherently serializable. Updates to the list stored in Cassandra are done through background worker processes which can be syncronised at the application level. The Data Model is roughly the same as Natural Lists so lets start there.&lt;/p&gt;

&lt;p&gt;Each item will usually be represented by one column in a row in the &lt;code&gt;LedgerList&lt;/code&gt; Column Family. The column name will be a composite of &lt;code&gt;weight&lt;/code&gt;, &lt;code&gt;seq&lt;/code&gt;, &lt;code&gt;item_id&lt;/code&gt; and &lt;code&gt;deleted&lt;/code&gt;. The column value will be used to store the item label. The &lt;code&gt;deleted&lt;/code&gt; component is used to soft delete an item in the list when applying a Transaction. When an item is soft deleted a new column is created with the same &lt;code&gt;(weight, seq, item_id)&lt;/code&gt; that has deleted set to &lt;code&gt;True&lt;/code&gt;. Transactions are serialised as JSON and stored in a single column where the column name is the Transaction ID:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=ledger_list-schema.txt'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;The &lt;code&gt;move()&lt;/code&gt; function creates a Transaction and stores it in the &lt;code&gt;LedgerTransactions&lt;/code&gt; Column Family. When &lt;code&gt;get()&lt;/code&gt; reads the list it applies the Transactions to it&amp;#8217;s local copy. Later a background process can call &lt;code&gt;apply_tx()&lt;/code&gt; to update the list in Cassandra. While both &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;apply_tx()&lt;/code&gt; move items their implementation is very different.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;apply_tx()&lt;/code&gt; has a tough life; to move an item it must delete the old column and insert a new one. It needs to make sure that Clients see either the state before the move or after it. But the Row Level Isolation guarantee used previously cannot be relied on, as it only applies to a single row mutation. And deleting and inserting columns for the same row must be done with two mutations. So our approach cannot rely on a delete &lt;em&gt;and&lt;/em&gt; an insert being processed together.&lt;/p&gt;

&lt;p&gt;To apply a Transaction to the list in Cassandra we:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Soft delete the old item and insert the new one by inserting new columns.&lt;/li&gt;

&lt;li&gt;Hard delete the old item and the soft delete marker.&lt;/li&gt;

&lt;li&gt;Hard delete the Transaction.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the insert in step 1 has completed was can consider the Transaction &amp;#8220;in flight&amp;#8221; until step 3 completes. During this time there may be &lt;em&gt;three&lt;/em&gt; columns in the row that identify the item we are moving. The original entry will still be there with it&amp;#8217;s original &lt;code&gt;weight&lt;/code&gt; and &lt;code&gt;seq&lt;/code&gt;. A soft delete column will be there with the same &lt;code&gt;weight&lt;/code&gt; and &lt;code&gt;seq&lt;/code&gt; as the original, but marked as deleted. And finally a second non deleted column will be there placing the item at it&amp;#8217;s new position. After step 3 the Transaction has been deleted and there will only be a single column for the item again. Reads that take place while a Transaction is in flight have to account for the intermediate state they are seeing.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;get()&lt;/code&gt; reads the list it has to apply all the Transactions to it&amp;#8217;s local copy, excluding those that are in flight. For each Transaction record it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Checks if the item being moved has been soft deleted. If so the Transaction is in flight (after step 1 in &lt;code&gt;apply_tx()&lt;/code&gt;) and can be skipped.&lt;/li&gt;

&lt;li&gt;Tries to remove the item being moved from it&amp;#8217;s local copy. If the item is missing the Transaction is still in flight (between steps 2 and 3 in &lt;code&gt;apply_tx()&lt;/code&gt;) and can be skipped.&lt;/li&gt;

&lt;li&gt;Inserts the item at the new location.&lt;/li&gt;

&lt;li&gt;Filters to remove soft deletes left by step 1.&lt;/li&gt;

&lt;li&gt;Sorts the list.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ledger Lists are implemented by the Python code below which includes some diagnostic &lt;code&gt;print&lt;/code&gt; statements:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=ledger_list.py'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;To exercise the code I used the ISO list of countries found in the &lt;a href='https://gist.github.com/3394160'&gt;gist&lt;/a&gt; that contains the sample code. First the list was initialised and a couple of countries moved:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=ledger_list-example-1.txt'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;Note that when an item is moved the row in the &lt;code&gt;LedgerList&lt;/code&gt; Column Family does not change.&lt;/p&gt;

&lt;p&gt;Next I ran &lt;code&gt;apply_tx()&lt;/code&gt; and passed a flag to stop processing after step 1. This places the Transaction in flight by inserting the soft delete for the old item and writing the new one. Log messages from &lt;code&gt;get()&lt;/code&gt; show it detected the in flight Transaction and skipped processing:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=ledger_list-example-2.txt'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;A second call to &lt;code&gt;apply_tx()&lt;/code&gt; had it re-process the in flight Transactions and stop at step 2. Log messages from &lt;code&gt;get()&lt;/code&gt; show it detected the items had already been moved and skipped processing:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=ledger_list-example-3.txt'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;A final call to &lt;code&gt;apply_tx()&lt;/code&gt; without any parameters processed the Transactions to completion. &lt;code&gt;get()&lt;/code&gt; no longer has to apply Transactions:&lt;/p&gt;
&lt;script src='https://gist.github.com/3394160.js?file=ledger_list-example-4.txt'&gt;&amp;nbsp;&lt;/script&gt;
&lt;h2 id='whats_missing_'&gt;What&amp;#8217;s Missing ?&lt;/h2&gt;

&lt;p&gt;The next things to look at are pagination and adding more operations such as &amp;#8220;add&amp;#8221; and &amp;#8220;delete&amp;#8221;. It would also be handy to move items outside of the current page.&lt;/p&gt;

&lt;p&gt;So far this is has been an experiment conducted on my couch, for fun. If you think I have missed something or got it wrong let me know.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Row Level Isolation and Consensus</title>
   <link href="http://thelastpickle.com/2012/08/16/Row-Isolation-and-Consensus" />
   <updated>2012-08-16T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2012/08/16/Row-Isolation-and-Consensus</id>
   <content type="html">&lt;p&gt;Row Level &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2893'&gt;Isolation&lt;/a&gt; in Cassandra 1.1 is the most important new feature in Cassandra so far. There have been a lot of great improvements since version 0.5, but row level isolation adds an entirely new feature. One that opens the door to new use cases.&lt;/p&gt;

&lt;h2 id='but_if_you_could_just_see_the_beauty'&gt;&lt;a href='http://www.youtube.com/watch?v=7Mz5AEgE24o'&gt;But if you could just see the beauty&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;If you are not familiar with row level isolation Data Stax have a &lt;a href='http://www.datastax.com/dev/blog/row-level-isolation'&gt;blog post&lt;/a&gt; on it. In addition I covered some of the implications in my &lt;a href='http://www.datastax.com/events/cassandrasummit2012/presentations'&gt;Cassandra Summit 2012&lt;/a&gt; talk on &lt;a href='http://www.slideshare.net/aaronmorton/cassandra-sf-2012-technical-deep-dive-query-performance'&gt;Query Performance&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To recap, row isolation leverage&amp;#8217;s the &amp;#8220;zero copy clone&amp;#8221; feature of &lt;a href='https://github.com/nbronson/snaptree'&gt;Snap Tree&lt;/a&gt;. This allows write threads in Cassandra to clone the current Memtable columns for a row and work on the cloned copy. Changes are then only applied to the Memtable if the row has not been updated while the write thread was working.&lt;/p&gt;

&lt;p&gt;In my Cassandra Summit talk I replicated the logic as best I could using &lt;a href='http://en.wikipedia.org/wiki/Commodore_64'&gt;Commodore 64&lt;/a&gt; basic:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;00 REM Cassandra for C64
05 REM Clone row_cols into my_cols
10 GOSUB 1000
20 FOR col = write.first_col TO write.last_col
25 REM Add or Reconcile col with my_cols
30 GOSUB 2000
40 IF my_cols != row_cols THEN GOTO 05
50 NEXT col
55 REM Atomic swap row_cols with my_cols
60 GOSUB 3000
70 IF swapped_cols = FALSE THEN GOTO 05  &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At line 10 we clone the current Memtable columns for the row we want to update. Each column in the mutation is then reconciled against the &lt;code&gt;my_cols&lt;/code&gt; copy, isolating the changes from the Memtable. After each column is added the code checks if the Memtable&amp;#8217;s &lt;code&gt;row_cols&lt;/code&gt; object has been replaced (line 40). If it has we jump back to 10 and try our changes again. A final check in the form of an atomic &lt;code&gt;compareAndSet&lt;/code&gt; is done at line 60. Again if the Memtable &lt;code&gt;row_cols&lt;/code&gt; has been replaced this write thread tries again.&lt;/p&gt;

&lt;p&gt;As a result readers will see either all or none of the changes applied by a writer. It also means that writer threads may take several attempts to complete their write. In my talk I looked at clients trying to write 10,000 (50 byte) Columns to a row in chunks of 50 columns. For the grey line below 10 clients wrote to different rows, for the red line they wrote to the same row.&lt;/p&gt;

&lt;p&gt;&lt;img src='/files/2012-08-16-Row-Isolation-and-Consensus/CF-Write-Latency-and-row-concurrency-10-clients.png' alt='CF Write Latency and row concurrency (10 clients)' /&gt;&lt;/p&gt;

&lt;p&gt;There are some trade off&amp;#8217;s. The increased latency is due to write threads detecting changes to the Memtable and restarting their operation. On the upside however we now have a database that provides lock free atomic assignment to multiple registers / values.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(AFAIK we cannot call this wait free as there is no guarantee that a particular write thread will complete given infinite number of steps. For example a client writing 1,000 columns may be continually stopped by clients writing 1 column.)&lt;/em&gt;&lt;/p&gt;

&lt;h2 id='we_can_all_agree_on_consensus'&gt;We Can All Agree on Consensus&lt;/h2&gt;

&lt;p&gt;&lt;a href='http://en.wikipedia.org/wiki/Consensus_%28computer_science%29'&gt;Consensus&lt;/a&gt; is an interesting topic in distributed systems, and one I&amp;#8217;m enjoying learning about from &lt;a href='http://amzn.com/0123973376'&gt;The Art of Multiprocessor Programming&lt;/a&gt;. It&amp;#8217;s an excellent book, and the inspiration for this experiment. From page 100:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;em&gt;consensus object&lt;/em&gt; provides a single method &lt;code&gt;decide()&lt;/code&gt;. Each thread calls the &lt;code&gt;decide()&lt;/code&gt; method with its input &lt;em&gt;v at most once&lt;/em&gt;. The object&amp;#8217;s &lt;code&gt;decide()&lt;/code&gt; method will return a value meeting the following conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;consistent:&lt;/em&gt; all threads decide the same value.&lt;/li&gt;

&lt;li&gt;&lt;em&gt;valid:&lt;/em&gt; the common decision value is some threads input.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;A further restricted placed on the algorithm is that is wait free, and there for lock free.&lt;/p&gt;

&lt;p&gt;A Consensus Protocol has a &lt;em&gt;Consensus Number&lt;/em&gt; &lt;strong&gt;n&lt;/strong&gt; that describes the maximum number of threads it provides consensus for. Obviously the simplest level of Consensus is between two threads, so lets look at a protocol with a Consensus Number of 2.&lt;/p&gt;

&lt;h2 id='assign23'&gt;Assign23&lt;/h2&gt;

&lt;p&gt;2-Consensus with a &amp;#8220;(2/3) array&amp;#8221; is discussed in the book which credits the idea to &lt;a href='http://dl.acm.org/citation.cfm?id=102808'&gt;Maurice Herlihy, Wait-Free Synchronisation, 1991&lt;/a&gt; (in case you missed it Maurice Herlihy is one of the book authors). It is also discussed in the &lt;a href='http://www.elsevierdirect.com/v2/companion.jsp?ISBN=9780123705914'&gt;teaching slide deck&lt;/a&gt; for chapter 5. The (2/3) array is actually used to prove it&amp;#8217;s impossible to implement (wait free, so lock free) 2-consensus using only atomic registers.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Assign23&lt;/code&gt; class from the book looks like this:&lt;/p&gt;
&lt;script src='https://gist.github.com/5f1a81114d4c6a7f85c3.js?file=assign23.java'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;&lt;code&gt;Assign23&lt;/code&gt; wraps access to an array of three elements. It uses a lock via the &lt;code&gt;synchronized&lt;/code&gt; keyword to allow a caller to atomically write 2 array elements at a time. It uses the same lock when providing read access to ensure reads are consistent. That is reads only see the state before two elements are updated or after. The effect is to isolate writes from reads.&lt;/p&gt;

&lt;p&gt;To build a 2-consensus protocol using &lt;code&gt;Assign23&lt;/code&gt; Threads write to elements in the array and then observe the state of the array. Each Thread (remember we only have 2) writes to 2 elements in the array. Thread 0 writes 1 to elements 0 and 1, while Thread 1 writes 1 to elements 1 and 2. To decide the outcome both Threads read part of the array. The consensus we are after is which Thread wrote to the array first.&lt;/p&gt;

&lt;p&gt;Lets say we initialise an instance of &lt;code&gt;Assign23&lt;/code&gt; with &lt;code&gt;-1&lt;/code&gt; in all array elements. If Thread 0 writes and then reads from the array it will see three possible states:&lt;/p&gt;

&lt;p&gt;&lt;img src='/files/2012-08-16-Row-Isolation-and-Consensus/Assign23.png' alt='Assign23 states for Thread 0' /&gt;&lt;/p&gt;

&lt;p&gt;From Thread 0s point of view it will decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Case 1: Thread 0 wins. Thread 1 has not not written to the array.&lt;/li&gt;

&lt;li&gt;Case 2: Thread 0 wins. Thread 0 wrote first, then Thread 1.&lt;/li&gt;

&lt;li&gt;Case 3: Thread 1 wins. Thread 1 wrote first, then Thread 0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 2-consensus protocol is then constructed as:&lt;/p&gt;
&lt;script src='https://gist.github.com/5f1a81114d4c6a7f85c3.js?file=consensus.java'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;Note that &lt;code&gt;decide()&lt;/code&gt; does not tell you which thread wrote it&amp;#8217;s value to the &lt;code&gt;proposed&lt;/code&gt; array first. Or which made the call to &lt;code&gt;assign()&lt;/code&gt; first. The Thread that is considered the &lt;em&gt;winner&lt;/em&gt; is the one that acquired the lock (on the instance) implemented by the &lt;code&gt;synchronized&lt;/code&gt; keyword.&lt;/p&gt;

&lt;h2 id='cf23'&gt;CF23&lt;/h2&gt;

&lt;p&gt;So to implement &lt;code&gt;Assign23&lt;/code&gt; all we needed was atomic, isolated, read-write access to an array. Hmmm, I think we can do that in Cassandra.&lt;/p&gt;

&lt;p&gt;Start with a Column Family definition using &lt;a href='http://www.datastax.com/docs/1.1/references/cql/index'&gt;CQL3&lt;/a&gt;:&lt;/p&gt;
&lt;script src='https://gist.github.com/5f1a81114d4c6a7f85c3.js?file=cql.txt'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;&lt;em&gt;(Defining the column names up front is not necessary. It just makes the select statements below look nicer.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The row key for the &lt;code&gt;cf23&lt;/code&gt; Column Family is the name of the Consensus object. The &lt;code&gt;element_*&lt;/code&gt; columns represent the &lt;code&gt;r&lt;/code&gt; array from &lt;code&gt;Assign23&lt;/code&gt;, and the &lt;code&gt;propose_*&lt;/code&gt; columns are the values proposed by the threads / clients.&lt;/p&gt;

&lt;p&gt;To make the example easier let&amp;#8217;s use Python and &lt;a href='http://pycassa.github.com/pycassa/'&gt;pycassa&lt;/a&gt;.&lt;/p&gt;
&lt;script src='https://gist.github.com/5f1a81114d4c6a7f85c3.js?file=assign23.py'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;To use the &lt;code&gt;Consensus2&lt;/code&gt; class two clients needs to know the name of the consensus object they are sharing, and their client id. With that they can propose a value and agree on which value &amp;#8220;won&amp;#8221;.&lt;/p&gt;

&lt;p&gt;In the examples below I ran two iPython clients and examined the database using &lt;code&gt;cqlsh&lt;/code&gt;. In the first test player 1 answers the question first:&lt;/p&gt;
&lt;script src='https://gist.github.com/5f1a81114d4c6a7f85c3.js?file=example_1.py'&gt;&amp;nbsp;&lt;/script&gt;
&lt;p&gt;In the second test player 2 gets in first:&lt;/p&gt;
&lt;script src='https://gist.github.com/5f1a81114d4c6a7f85c3.js?file=example_2.py'&gt;&amp;nbsp;&lt;/script&gt;
&lt;h2 id='the_fine_print'&gt;The Fine Print&lt;/h2&gt;

&lt;p&gt;Is this fair? Does the client who calls Cassandra first always win? No.&lt;/p&gt;

&lt;p&gt;For example say the write thread for client 0 starts executing nanoseconds before the write for client 1. The first write thread may be pre-empted by the JVM before it gets a chance to complete, allowing the second thread to finish first. When the JVM resumes the first writer it will discover that the row columns have changed and re-apply it&amp;#8217;s mutation. Unfortunately by then it&amp;#8217;s too late, client 1 has won. A small &amp;#8220;wall clock&amp;#8221; advantage in starting a write is not enough to win. The race between the clients is the first to pass the &lt;code&gt;GOSUB&lt;/code&gt; call at line 60 of our C64 code above.&lt;/p&gt;

&lt;p&gt;This being Cassandra there is still a potential issue with timestamps and clock skew between clients. The value of &lt;code&gt;element_1&lt;/code&gt; in the &lt;code&gt;cf23&lt;/code&gt; row depends on the timestamps used by clients. If client 0 writes with a timestamp skewed to the future it&amp;#8217;s write for &lt;code&gt;element_1&lt;/code&gt; may always win when reconciled with client 1. If this happened when client 0 wrote first, the array would look like Case 3 (from above) rather than Case 2. The result would be client 1 &amp;#8220;winning&amp;#8221; when really client 0 won.&lt;/p&gt;

&lt;p&gt;In this situation CQL has an advantage over the RPC interface. If a &lt;code&gt;TIMESTAMP&lt;/code&gt; is not included in an &lt;a href='http://www.datastax.com/docs/1.1/references/cql/INSERT'&gt;&lt;code&gt;INSERT&lt;/code&gt;&lt;/a&gt; statement the coordinating node generates a timestamp for the new Columns. Baring &lt;a href='http://en.wikipedia.org/wiki/Network_Time_Protocol'&gt;NTP&lt;/a&gt; moving the clock backwards, successive calls to the same server &lt;em&gt;should&lt;/em&gt; result in monotonic timestamps.&lt;/p&gt;

&lt;p&gt;Finally you cannot reuse the &lt;code&gt;Consensus2&lt;/code&gt; object, the same is true for the in memory Java version in the book. If you do weird things happen:&lt;/p&gt;
&lt;script src='https://gist.github.com/5f1a81114d4c6a7f85c3.js?file=example_3.py'&gt;&amp;nbsp;&lt;/script&gt;</content>
 </entry>
 
 <entry>
   <title>ZooKeeper Reading 12-01-2012</title>
   <link href="http://thelastpickle.com/2012/01/12/ZooKeeper-Reading" />
   <updated>2012-01-12T00:00:00-08:00</updated>
   <id>http://thelastpickle.com/2012/01/12/ZooKeeper-Reading</id>
   <content type="html">&lt;h2 id='overview'&gt;&lt;a href='http://zookeeper.apache.org/doc/current/zookeeperOver.html'&gt;Overview&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;There is a leader in the cluster.&lt;/li&gt;

&lt;li&gt;Hierarchical model of &lt;code&gt;znodes&lt;/code&gt; which act like directories and files. Uses &amp;#8221;/&amp;#8221; as the path separator.&lt;/li&gt;

&lt;li&gt;All in memory, high throughput low latency.&lt;/li&gt;

&lt;li&gt;&amp;#8220;The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access.&amp;#8221;&lt;/li&gt;

&lt;li&gt;Transaction logs and snapshots on disk.&lt;/li&gt;

&lt;li&gt;Clients hold a TCP connection for duplex messaging.&lt;/li&gt;

&lt;li&gt;All updates have a globally ordered TxID.&lt;/li&gt;

&lt;li&gt;Works best in read heavy workloads, think 10:1 R:W ratios.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;znodes&lt;/code&gt; may have data and children.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;znodes&lt;/code&gt; have a version number for their local (?) state.&lt;/li&gt;

&lt;li&gt;Reads and writes on a &lt;code&gt;znode&lt;/code&gt; are atomic with respect to the version of data.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;znodes&lt;/code&gt; can be protected by an ACL.&lt;/li&gt;

&lt;li&gt;Ephemeral &lt;code&gt;znodes&lt;/code&gt; are deleted when the session that created them ends.&lt;/li&gt;

&lt;li&gt;Clients can set a watch on &lt;code&gt;znode&lt;/code&gt; that is triggered when it changes. Is this for the local data or local data and children ?&lt;/li&gt;

&lt;li&gt;Guarantees: * Sequential Consistency - Updates from a client will be applied in the order that they were sent. * Atomicity - Updates either succeed or fail. No partial results. * Single System Image - A client will see the same view of the service regardless of the server that it connects to. * Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update. * Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.&lt;/li&gt;

&lt;li&gt;Write to WAL before apply to in memory DB.&lt;/li&gt;

&lt;li&gt;Reads are serviced by the local DB on the node, writes are services by an agreement protocol. Guess this is why it&amp;#8217;s tuned for read.&lt;/li&gt;

&lt;li&gt;Writes go to the leader, are then distributed to the other (follower) nodes. Local DB&amp;#8217;s should never diverge.&lt;/li&gt;

&lt;li&gt;&lt;a href='http://zookeeper.apache.org/doc/current/zookeeperOver.html#Performance'&gt;Performance&lt;/a&gt; 3 servers should give between 20k/sec and 80k/sec requests depending on the read/write mix.&lt;/li&gt;

&lt;li&gt;&lt;a href='http://zookeeper.apache.org/doc/current/zookeeperOver.html#Reliability'&gt;Reliability&lt;/a&gt; less than 200ms to elect a new leader, failure of a follower reduces throughput.&lt;/li&gt;

&lt;li&gt;What&amp;#8217;s the recover model for a follower that is down for a while ? Does this affect performance ? &lt;strong&gt;Answer&lt;/strong&gt; (from the internals) If too many &lt;code&gt;Proposals&lt;/code&gt; are missing a snapshot is sent.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='zookeeper_internals'&gt;&lt;a href='http://zookeeper.apache.org/doc/current/zookeeperInternals.html'&gt;ZooKeeper Internals&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&amp;#8220;At the heart of ZooKeeper is an &lt;a href='http://img135.imageshack.us/img135/5011/atomics.gif'&gt;atomic&lt;/a&gt; messaging system that keeps all of the servers in sync.&amp;#8221;&lt;/li&gt;

&lt;li&gt;Guarantees: * Reliable Delivery * Total Order * Causal Order&lt;/li&gt;

&lt;li&gt;The messaging layer is build around FIFO channels between nodes, and relies on the properties of TCP for this. Specifically: * Ordered delivery * No message after close&lt;/li&gt;

&lt;li&gt;The protocol is composed of: * Packet: a sequence of bytes sent through a FIFO channel * Proposal: a unit of agreement. Proposals are agreed upon by exchanging packets with a quorum of ZooKeeper servers. Most proposals contain messages, however the NEW_LEADER proposal is an example of a proposal that does not correspond to a message. * Message: a sequence of bytes to be atomically broadcast to all ZooKeeper servers. A message put into a proposal and agreed upon before it is delivered.&lt;/li&gt;

&lt;li&gt;QUORUM is (n/2) +1 by default.&lt;/li&gt;

&lt;li&gt;QUORUM can be majority quorums, weights, or a hierarchy of groups.&lt;/li&gt;

&lt;li&gt;Proposals are stamped with the &lt;code&gt;zxid&lt;/code&gt; and sent to all servers, a server ack&amp;#8217;s when it is on persistent store. Messages in the proposal are then delivered.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;zxid&lt;/code&gt; has two parts: the epoch and a counter. Implemented as a 64 bit int, high 32 bits are the epoch, low 32 are the count.&lt;/li&gt;

&lt;li&gt;&amp;#8220;The epoch number represents a change in leadership. Each time a new leader comes into power it will have its own epoch number.&amp;#8221;&lt;/li&gt;

&lt;li&gt;Messaging consists of two phases, Leader Activation and Active Messaging.&lt;/li&gt;

&lt;li&gt;Leader Activation may appear to have worked but later fail when checking the invariant that a QUORUM of followers follow the same leader. During the election it must only hold with a high probability.&lt;/li&gt;

&lt;li&gt;In Active Messaging: * Leader sends &lt;code&gt;PROPOSE&lt;/code&gt; to all followers for a new proposal. * Followers commit to non-volatile storage and then &lt;code&gt;ACK&lt;/code&gt; * Leader sends &lt;code&gt;COMMIT&lt;/code&gt; to all followers once a &lt;code&gt;QUOURM&lt;/code&gt; have &lt;code&gt;ACK&lt;/code&gt;&amp;#8216;d.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='getting_started'&gt;&lt;a href='http://zookeeper.apache.org/doc/current/zookeeperStarted.html'&gt;Getting Started&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Grab the latest distro and start a single node with &lt;code&gt;bin/zkServer.sh start-foreground conf/zoo_sample.cfg&lt;/code&gt;&lt;/li&gt;

&lt;li&gt;Fire up the command line interface with &lt;code&gt;bin/zkCli.sh -server 127.0.0.1:2181&lt;/code&gt; and work through the examples in the doc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='zope_zookeeper_client_for_python'&gt;&lt;a href='http://pypi.python.org/pypi/zc.zk/0.5.2'&gt;Zope ZooKeeper client for Python&lt;/a&gt;&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Requires &lt;a href='http://pypi.python.org/pypi/zc-zookeeper-static/3.3.4.0'&gt;zc-zookeeper-static&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;code&gt;zc-zookeeper-static&lt;/code&gt; is a wrapper around the C libs, it&amp;#8217;s pretty low level. e.g. you get an int handle and pass that into methods, not OO. &lt;code&gt;zc.zk&lt;/code&gt; ads an OO wrapper and some other stuff I cannot work out.&lt;/li&gt;

&lt;li&gt;A lot of methods on the zc.zk.ZooKepper object are pass through to the &lt;code&gt;zc-zookeeper-static&lt;/code&gt; package and do not have any docs. Check to docs on &lt;code&gt;zookeeper&lt;/code&gt; for the function help. For example &lt;code&gt;zc.zk.ZooKeeper.get&lt;/code&gt; has no docs and a crap &lt;code&gt;(*arg, **kwargs)&lt;/code&gt; param list, look at &lt;code&gt;zookeeper.get&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id='get_a_connection'&gt;Get a connection&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;import zc.zk
zk = zc.zk.ZooKeeper(&amp;#39;127.0.0.1:2181&amp;#39;)&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='get_the_children_of_a_'&gt;Get the children of a &lt;code&gt;znode&lt;/code&gt;&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;In [6]: zk.get_children(&amp;quot;/&amp;quot;)
Out[6]: [&amp;#39;consumers&amp;#39;, &amp;#39;brokers&amp;#39;, &amp;#39;zookeeper&amp;#39;, &amp;#39;zk_test&amp;#39;]
# some stuff from kafka there.&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='get_the_properties_of_a_'&gt;Get the properties of a &lt;code&gt;znode&lt;/code&gt;&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;# Get a zc.zk.Properties for the path
# **NOTE:** This is heavy weight, for single reads use get_properties()
In [55]: p = zk.properties(&amp;quot;/zookeeper&amp;quot;)

In [56]: p.data
Out[56]: {}

In [58]: p.values()
Out[58]: []

# simple get_properties()
In [64]: zk.get_properties(&amp;quot;/zk_test&amp;quot;)
Out[64]: {&amp;#39;string_value&amp;#39;: &amp;#39;foo&amp;#39;}

#zv.zk assume node data is json
In [49]: zk.set(&amp;quot;/zk_test&amp;quot;, &amp;quot;foo&amp;quot;)
Out[49]: 0

In [51]: p = zk.properties(&amp;quot;/zk_test&amp;quot;)

In [53]: p.values()
Out[53]: [&amp;#39;foo&amp;#39;]

In [54]: p.data
Out[54]: {&amp;#39;string_value&amp;#39;: &amp;#39;foo&amp;#39;}&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='tree_operations'&gt;Tree operations&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;In [59]: zk.print_tree(&amp;quot;/zookeeper&amp;quot;)
/zookeeper
  /quota&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='_operations'&gt;&lt;code&gt;znode&lt;/code&gt; operations&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;# create an ephemeral node

# must have an ACL this is an open one 
In [78]: acl = [{&amp;quot;perms&amp;quot; : zookeeper.PERM_ALL, &amp;quot;scheme&amp;quot; : &amp;quot;world&amp;quot;, &amp;quot;id&amp;quot; : &amp;quot;anyone&amp;quot;}]

# Parent path must exist
In [85]: zk.create( &amp;quot;/fake/ephemeral&amp;quot;, &amp;quot;some data&amp;quot;, acl, zookeeper.EPHEMERAL)
...
NoNodeException: no node

In [84]: zk.create( &amp;quot;/zk_test/ephemeral&amp;quot;, &amp;quot;some data&amp;quot;, acl, zookeeper.EPHEMERAL)
Out[84]: &amp;#39;/zk_test/ephemeral&amp;#39;

# node now listed (locally) on the connection 
In [86]: zk.ephemeral
Out[86]: 
{&amp;#39;/zk_test/ephemeral&amp;#39;: {&amp;#39;acl&amp;#39;: [{&amp;#39;id&amp;#39;: &amp;#39;anyone&amp;#39;,
                                 &amp;#39;perms&amp;#39;: 31,
                                 &amp;#39;scheme&amp;#39;: &amp;#39;world&amp;#39;}],
                        &amp;#39;data&amp;#39;: &amp;#39;some data&amp;#39;,
                        &amp;#39;flags&amp;#39;: 1}}
                        
# View from the cluster
In [88]: zk.get_properties(&amp;quot;/zk_test/ephemeral&amp;quot;)
Out[88]: {&amp;#39;string_value&amp;#39;: &amp;#39;some data&amp;#39;}

In [89]: p = zk.properties(&amp;quot;/zk_test/ephemeral&amp;quot;)
In [91]: p.meta_data
Out[91]: 
{&amp;#39;aversion&amp;#39;: 0,
 &amp;#39;ctime&amp;#39;: 1326337991257L,
 &amp;#39;cversion&amp;#39;: 0,
 &amp;#39;czxid&amp;#39;: 1950L,
 &amp;#39;dataLength&amp;#39;: 9,
 &amp;#39;ephemeralOwner&amp;#39;: 86922380708675587L,
 &amp;#39;mtime&amp;#39;: 1326337991257L,
 &amp;#39;mzxid&amp;#39;: 1950L,
 &amp;#39;numChildren&amp;#39;: 0,
 &amp;#39;pzxid&amp;#39;: 1950L,
 &amp;#39;version&amp;#39;: 0}&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='watch_for_changes'&gt;Watch for changes&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;In [8]: children = zk.children(&amp;quot;/zk_test&amp;quot;)

In [9]: def my_callback(node):
   ...:   print &amp;quot;Called with node: &amp;quot;, str(node)
   ...: 

In [11]: children(my_callback)
Called with node:  zc.zk.Children(0, /zk_test)
Out[11]: zc.zk.Children(0, /zk_test)

In [14]: acl = [{&amp;quot;perms&amp;quot; : zookeeper.PERM_ALL, &amp;quot;scheme&amp;quot; : &amp;quot;world&amp;quot;, &amp;quot;id&amp;quot; : &amp;quot;anyone&amp;quot;}]

In [15]: zk.create( &amp;quot;/zk_test/ephemeral&amp;quot;, &amp;quot;some data&amp;quot;, acl, zookeeper.EPHEMERAL)
Out[15]: &amp;#39;/zk_test/ephemeral&amp;#39;
Called with node:  zc.zk.Children(0, /zk_test)    &lt;/code&gt;&lt;/pre&gt;</content>
 </entry>
 
 <entry>
   <title>Cassandra Reading 02-01-2012</title>
   <link href="http://thelastpickle.com/2012/01/02/Cassandra-Reading" />
   <updated>2012-01-02T00:00:00-08:00</updated>
   <id>http://thelastpickle.com/2012/01/02/Cassandra-Reading</id>
   <content type="html">&lt;h2 id='cql_sql_in_cassandra'&gt;&lt;a href='http://www.slideshare.net/jericevans/cql-sql-in-cassandra'&gt;CQL: SQL In Cassandra&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;And a &lt;a href='http://www.acunu.com/blogs/eric-evans/cql-benchmarking/'&gt;blog post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By Eric Evans&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Good examples of simple statements.&lt;/li&gt;

&lt;li&gt;Performance comparisons show CQL has 5% to 10% lower throughput and higher latency.&lt;/li&gt;

&lt;li&gt;List of drivers and &lt;a href='http://code.google.com/a/apache-extras.org/hosting/search?q=label%3ACassandra'&gt;where&lt;/a&gt; to get them.&lt;/li&gt;

&lt;li&gt;More improvements coming.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='storm_cassandra_integration'&gt;&lt;a href='https://github.com/ptgoetz/storm-cassandra'&gt;Storm Cassandra Integration&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By P. Taylor Goetz&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;a href='https://github.com/nathanmarz/storm'&gt;storm&lt;/a&gt; bolt to persist data to Cassandra.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='cassandra_in_online_advertising_real_time_bidding'&gt;&lt;a href='http://www.slideshare.net/edwardcapriolo/m6d-cassandrapresentation'&gt;Cassandra in Online Advertising: Real Time Bidding&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Edward Capriolo&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10TB of data and growing.&lt;/li&gt;

&lt;li&gt;Latency requirements of less than 120ms.&lt;/li&gt;

&lt;li&gt;&amp;#8220;Distributed not Duplicated&amp;#8221; is a great phrase.&lt;/li&gt;

&lt;li&gt;Ed still loves the Cacti :)&lt;/li&gt;

&lt;li&gt;Good idea to use JMX to modify the (Cassandra) cache settings for a single node to compare against the others.&lt;/li&gt;

&lt;li&gt;Pay attention to how the cache hit rate varies with regard to the cache size, at some point may be better to accept a (say) 90% hit rate and give more memory to another CF.&lt;/li&gt;

&lt;li&gt;Tuning IO performance for peak and off-peak by modifying &lt;code&gt;nodetool setcompactionlimit&lt;/code&gt; to improve compaction performance.&lt;/li&gt;

&lt;li&gt;Running night time major compactions like Urban Airship. I wonder how leveled compaction would work with the mixed workload?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='replication_and_the_latencyconsistency_tradeoff'&gt;&lt;a href='http://dbmsmusings.blogspot.com/2011/12/replication-and-latency-consistency.html'&gt;Replication and the latency-consistency tradeoff&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Daniel Abadi&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A discussion about latency and consistency.&lt;/li&gt;

&lt;li&gt;&amp;#8220;there&amp;#8217;s no way to perform consistent replication across database replicas without some level of synchronous network communication.&amp;#8221;&lt;/li&gt;

&lt;li&gt;Not sure I agree that in Dynamo / Cassandra (not sure about Riak) &amp;#8220;updates generally go to the same node, and are then propagated synchronously to W other nodes (case (2)(c))&amp;#8221;.&lt;/li&gt;

&lt;li&gt;&lt;a href='http://dbmsmusings.blogspot.com/2011/12/replication-and-latency-consistency.html?showComment=1325454055745#c307358897668058915'&gt;I think&lt;/a&gt; how inconsistencies are handled during read requests is another source of latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='cassandra_for_sys_admins'&gt;&lt;a href='http://www.slideshare.net/nmilford/cassandra-for-sysadmins'&gt;Cassandra for sys admins&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Nathan Milford&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;14 nodes in two DC&amp;#8217;s&lt;/li&gt;

&lt;li&gt;In productions since Cassandra version 0.4, awesome!&lt;/li&gt;

&lt;li&gt;A good list of things to monitor in a cluster.&lt;/li&gt;

&lt;li&gt;Good best practices for shutting down a node that give the fastest startup, also &lt;a href='http://blog.milford.io/2011/11/rolling-upgrades-for-cassandra'&gt;see&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='cassandra_for_lobs'&gt;&lt;a href='http://ruby.dzone.com/articles/cassandra-lobs'&gt;Cassandra for LOBS&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Dan Pritchett&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expensive SAN storage bombshell.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='expedia_hotel_price_cache'&gt;&lt;a href='http://www.slideshare.net/clibou/seattle-scalability-meetup-10505322/25'&gt;Expedia Hotel Price Cache&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By B. Todd Burruss&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-calculate, trading space for time.&lt;/li&gt;

&lt;li&gt;A rolling window of 2.8 billion data points.&lt;/li&gt;

&lt;li&gt;Test, measure, tune.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='data_modeling_examples'&gt;&lt;a href='http://www.slideshare.net/mattdennis/cassandra-nyc-2011-data-modeling'&gt;Data Modeling Examples&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Matthew Dennis&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I always check out Matthew&amp;#8217;s data model presentations to see what the best practices are.&lt;/li&gt;

&lt;li&gt;&amp;#8220;Usually better to keep a record that something happened as opposed to changing a value&amp;#8221;.&lt;/li&gt;

&lt;li&gt;Good advice on time series and the XACT_LOG.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='cassandra_in_production_things_we_learned'&gt;&lt;a href='http://devblog.seomoz.org/2011/11/cassandra-in-production-things-we-learned/'&gt;Cassandra In Production: Things We Learned&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Walt Jones&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using Cassandra 0.7, there are a &lt;em&gt;lot&lt;/em&gt; of improvements in Cassandra 1.0.&lt;/li&gt;

&lt;li&gt;12 AWS EC2 m1.xlarge nodes with 5TB of data.&lt;/li&gt;

&lt;li&gt;S3 archive&lt;/li&gt;

&lt;li&gt;The memory footprint issues has been eliminated in Cassandra 1.0.&lt;/li&gt;

&lt;li&gt;Please avoid using Super Columns.&lt;/li&gt;

&lt;li&gt;Please use the Random Partitioner.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='data_modeling_with_cassandra'&gt;&lt;a href='http://www.acunu.com/blogs/sam-overton/cassandra-data-modelling/'&gt;Data Modeling with Cassandra&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;By Sam Overton&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;De-normalize for a brighter future.&lt;/li&gt;

&lt;li&gt;No SQL &amp;#8220;Hello World&amp;#8221; twitter example.&lt;/li&gt;
&lt;/ul&gt;</content>
 </entry>
 
 <entry>
   <title>Anatomy of a Cassandra Partition</title>
   <link href="http://thelastpickle.com/2011/12/15/Anatomy-of-a-Cassandra-Partition" />
   <updated>2011-12-15T00:00:00-08:00</updated>
   <id>http://thelastpickle.com/2011/12/15/Anatomy-of-a-Cassandra-Partition</id>
   <content type="html">&lt;p&gt;Recently I was working on a Cassandra cluster and experienced a strange situation that resulted in a partition of sorts between the nodes. Whether you actually call it a partition or not is a matter for discussion (see &lt;a href='http://codahale.com/you-cant-sacrifice-partition-tolerance/'&gt;&amp;#8221; You Can’t Sacrifice Partition Tolerance - Updated October 22, 2010&amp;#8221;&lt;/a&gt;]. But weird stuff happened, Cassandra remained available, and it was fixed with zero site down time. It was also a good example of how and why Cassandra is a Highly Available data store, and fun to fix. So here are all the nerdy details&amp;#8230;&lt;/p&gt;

&lt;p&gt;The cluster was running Cassandra 1.0.5 on 6 nodes, 3 in the East DC (nodes 20, 21 and 22) and 3 in the West DC (nodes 23, 24 and 25), and the Keyspace was configured with RF 3 in each DC. Application servers were running in each DC and it was important to keep a &lt;code&gt;LOCAL_QUOURM&lt;/code&gt; in each data centre. The code would drop back to QUORUM if necessary, but that was something we wanted to avoid.&lt;/p&gt;

&lt;p&gt;I was deploying some configuration changes that required a rolling restart. The first few nodes worked fine, after restarting Cassandra I watched the other nodes detect it was &lt;code&gt;UP&lt;/code&gt; and start sending hints to it. However when node #23 restarted the rest of the cluster continued to see it as &lt;code&gt;DOWN&lt;/code&gt;, even though node #23 could see the rest of the cluster as &lt;code&gt;UP&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id='is_cassandra_available_'&gt;Is Cassandra available ?&lt;/h2&gt;

&lt;p&gt;This was the first question to answer. For this situation available meant all keys could be written to or read from using the &lt;code&gt;LOCAL_QUOURM&lt;/code&gt; CL. Technically as I still had at least 2 nodes in each data centre I still had a Quorum, but strange stuff was happening so I wanted to confirm this.&lt;/p&gt;

&lt;p&gt;Node #23 appeared to be happy:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/nodetool -h localhost info
Token            : 28356863910078205288614550619314017621
Gossip active    : true
Load             : 275.44 GB
Generation No    : 1762556151
Uptime (seconds) : 67548
Heap Memory (MB) : 2926.44 / 8032.00
Data Center      : DC1
Rack             : RAC_unknown  
Exceptions       : 0&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;(Note: this is from a little later, so the Uptime is not correct)&lt;/p&gt;

&lt;p&gt;When I looked at the ring from any node other than #23 all nodes were &lt;code&gt;UP&lt;/code&gt; other than node #23 (10.29.60.10):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra20# nodetool -h localhost ring
Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                               141784319550391026443072753096570088106     
10.37.114.8     DC1         RAC20       Up     Normal  285.86 GB       16.67%  0                                           
10.29.60.10     DC2         RAC23       Down   Normal  277.86 GB       16.67%  28356863910078205288614550619314017621      
10.6.130.70     DC1         RAC21       Up     Normal  244.9 GB        16.67%  56713727820156410577229101238628035242      
10.29.60.14     DC2         RAC24       Up     Normal  296.85 GB       16.67%  85070591730234615865843651857942052864      
10.37.114.10    DC1         RAC22       Up     Normal  255.81 GB       16.67%  113427455640312821154458202477256070485     
10.29.60.12     DC2         RAC25       Up     Normal  316.88 GB       16.67%  141784319550391026443072753096570088106  &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But when I looked at the ring from node #23 all nodes were &lt;code&gt;UP&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# nodetool -h localhost ring
    Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                                   141784319550391026443072753096570088106     
    10.37.114.8     DC1         RAC20       Up     Normal  285.86 GB       16.67%  0                                           
    10.29.60.10     DC2         RAC23       Up     Normal  277.86 GB       16.67%  28356863910078205288614550619314017621      
    10.6.130.70     DC1         RAC21       Up     Normal  244.9 GB        16.67%  56713727820156410577229101238628035242      
    10.29.60.14     DC2         RAC24       Up     Normal  296.85 GB       16.67%  85070591730234615865843651857942052864      
    10.37.114.10    DC1         RAC22       Up     Normal  255.81 GB       16.67%  113427455640312821154458202477256070485     
    10.29.60.12     DC2         RAC25       Up     Normal  316.88 GB       16.67%  141784319550391026443072753096570088106  &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So it did not appear to be a cross DC networking issue, and technically we have a Quorum of replicas in each data centre.&lt;/p&gt;

&lt;p&gt;To confirm the consistency levels available to the application I opened the &lt;code&gt;cassandra-cli&lt;/code&gt; on nodes #20 and #23 and did some &lt;code&gt;get&lt;/code&gt;&amp;#8217;s changing the consistency level with &lt;code&gt;consistencylevel as&lt;/code&gt;. It was still working but things were strange, this is what I found.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node #23 could serve requests at &lt;code&gt;LOCAL_QUORUM&lt;/code&gt;, &lt;code&gt;QUORUM&lt;/code&gt; and &lt;code&gt;ALL&lt;/code&gt; consistency.&lt;/li&gt;

&lt;li&gt;All other nodes could serve requests at &lt;code&gt;LOCAL_QUOURM&lt;/code&gt; and &lt;code&gt;QUORUM&lt;/code&gt; but not &lt;code&gt;ALL&lt;/code&gt; consistency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also I could see requests been processed by node #23 using &lt;code&gt;nodetool tpstats&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So from the point of view of the application the cluster was up, and the application guys confirmed this. However we had lost redundancy in the West DC, if we lost another node for any reason we would lose the &lt;code&gt;LOCAL_QUORUM&lt;/code&gt;. Any reason can include things like a planned restart.&lt;/p&gt;

&lt;p&gt;To be sure this had to be fixed, but we had time to fix it. This is part of the ops story with Cassandra, if something goes wrong you can normally fix it while the application keeps working.&lt;/p&gt;

&lt;h2 id='orientate'&gt;Orientate&lt;/h2&gt;

&lt;p&gt;My initial thought was that either the Gossip heartbeat from node #23 was not getting through, or it was been ignored. So I started looking around to get an idea of what view the nodes in the cluster had of node #23.&lt;/p&gt;

&lt;p&gt;Nodes in the rest of the cluster could telnet to node #23 on port 7000 the default &lt;code&gt;storage_port&lt;/code&gt;. A simple test but it helps to know the basic assumptions are still valid.&lt;/p&gt;

&lt;p&gt;All the nodes (including #23) had a consistent view of the state of the other nodes propagated via Gossip, for example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra20# bin/nodetool -h localhost gossipinfo
...
/10.29.60.10
  LOAD:2.98347080902E11
  STATUS:NORMAL,28356863910078205288614550619314017621
  RPC_ADDRESS:10.29.60.10
  SCHEMA:fe933880-19bd-11e1-0000-5ff37d368cb6
  RELEASE_VERSION:1.0.5 &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &amp;#8220;Application State&amp;#8221; transmitted via Gossip is not directly related to a node been marked as &lt;code&gt;DOWN&lt;/code&gt; (more later), but at least I could see that Gossip was distributing the same information about the node around the cluster. The properties listed above have the following meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;LOAD&lt;/code&gt;: The total live disk space used, in bytes.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;STATUS&lt;/code&gt;: The ring state of the node, it&amp;#8217;s token (if any) and some other information. &lt;code&gt;NORMAL&lt;/code&gt; indicates the node is a normal member of the ring, the other options are &lt;code&gt;BOOT&lt;/code&gt;, &lt;code&gt;LEAVING&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, &lt;code&gt;MOVING&lt;/code&gt;.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;RPC_ADDRESS&lt;/code&gt;: IP address to connect to this node on.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;SCHEMA&lt;/code&gt;: UUID of the schema this node is using.&lt;/li&gt;

&lt;li&gt;&lt;code&gt;RELEASE_VERSION&lt;/code&gt;: Cassandra version.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If Application State was responsible &lt;code&gt;nodetool ring&lt;/code&gt; would have said something other than &lt;code&gt;DOWN&lt;/code&gt; for node #23.&lt;/p&gt;

&lt;p&gt;Node #23 should have been trying to tell the other nodes that it is here and happy to answer queries. So the next step was to look at the Gossip traffic and see if that was happening. I started by enabling &lt;code&gt;TRACE&lt;/code&gt; debugging on the main &lt;code&gt;org.apache.cassandra.gms.Gossiper&lt;/code&gt; class by adding the line below to &lt;code&gt;conf/log4j-server.properties&lt;/code&gt;, log4j is configured to watch for changes so you can just change the file and wait a few seconds.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;log4j.logger.org.apache.cassandra.gms.Gossiper=TRACE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To disable a live change to the logging config like this it&amp;#8217;s not enough to just comment out the line. You will need to a provide a new value for the &lt;code&gt;Gossiper&lt;/code&gt; logger. So after I saw what I wanted in &lt;code&gt;/var/log/cassandra/system.log&lt;/code&gt; on node #20 I changed the line the logging config to be:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;log4j.logger.org.apache.cassandra.gms.Gossiper=INFO&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This was the first thing I looked at in the logs on node #20:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 647) local heartbeat version 526912 greater than 7951 for /10.29.60.10
TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 661) Adding state LOAD: 2.98347080902E11
TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 647) local heartbeat version 1231 greater than 1229 for /10.37.114.10
TRACE [GossipStage:1] 2011-12-13 00:58:49,636 Gossiper.java (line 647) local heartbeat version 4892 greater than 4890 for /10.29.60.12&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first line was the important one, something in the cluster was telling node #20 about node #23. The &lt;code&gt;version&lt;/code&gt; number referred to in the message is part of the heartbeat, a higher version number means the heartbeat is more recent. So it looked like node #20 was remembering something about node #23 from one of the machines that had been restarted, that would explain the higher local version number. But that didn&amp;#8217;t make sense, Cassandra can handle nodes restarting.&lt;/p&gt;

&lt;p&gt;This was not the right place to be looking. The log message could have been logged in response to information about node #23 that was sent from any other node. I needed to find proof of Gossip messages been received directly from node #23.&lt;/p&gt;

&lt;p&gt;Gossip uses a 3 step protocol, the initiating node sends an &lt;code&gt;Syn&lt;/code&gt; message, the target replies with an &lt;code&gt;Ack&lt;/code&gt; and the initiator replies with an &lt;code&gt;Ack2&lt;/code&gt;. So I went hunting information about &lt;code&gt;Syn&lt;/code&gt;&amp;#8217;s received from node #23 by enabling &lt;code&gt;TRACE&lt;/code&gt; logging on the verb handler:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is what I found in the node #20 logs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;TRACE [GossipStage:1] 2011-12-13 02:04:56,165 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.29.60.10
TRACE [GossipStage:1] 2011-12-13 02:04:56,166 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.6.130.70:1323732220:9792 /10.37.114.10:1323736718:5242 /10.29.60.12:1323733099:8904 /10.29.60.10:1762556151:11964 /10.29.60.14:1323732392:9619 /10.37.114.8:1323731527:10494 
TRACE [GossipStage:1] 2011-12-13 02:04:56,166 GossipDigestSynVerbHandler.java (line 90) Sending a GossipDigestAckMessage to /10.29.60.10&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It got a &lt;code&gt;Syn&lt;/code&gt; from node #23 and replied with an &lt;code&gt;Ack&lt;/code&gt;. It was receiving and responding to messages from node #23 but it still thought it was &lt;code&gt;DOWN&lt;/code&gt;. Things were getting weirder, time to go check the code.&lt;/p&gt;

&lt;p&gt;Right after logging that message &lt;code&gt;GossipDigestSynVerbHandler&lt;/code&gt; &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/GossipDigestSynVerbHandler.java#L79'&gt;notifies the &lt;code&gt;FailureDetector&lt;/code&gt;&lt;/a&gt; that it had heard from the endpoint. The &lt;code&gt;FailureDetector&lt;/code&gt; is ultimately responsible for deciding if a node is UP or DOWN, the clue is in the name. &lt;code&gt;GossipDigestSynVerbHandler&lt;/code&gt; actually calls the main &lt;code&gt;Gossiper&lt;/code&gt; class which does a few things and then decides if it wants to report to the &lt;code&gt;FailureDetector&lt;/code&gt;. I just wanted to know if the call was made, so I enabled the logging again:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;log4j.logger.org.apache.cassandra.gms.GossipDigestSynVerbHandler=TRACE
log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And found this in the logs on node #20:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.29.60.10
TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.29.60.10:1762556151:12552 /10.29.60.14:1323732392:10208 /10.37.114.8:1323731527:11082 /10.37.114.10:1323736718:5830 /10.6.130.70:1323732220:10379 /10.29.60.12:1323733099:9493 
TRACE [GossipStage:1] 2011-12-13 02:14:37,033 GossipDigestSynVerbHandler.java (line 90) Sending a GossipDigestAckMessage to /10.29.60.10
TRACE [GossipStage:1] 2011-12-13 02:14:37,037 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.37.114.10
TRACE [GossipStage:1] 2011-12-13 02:14:37,037 GossipDigestSynVerbHandler.java (line 76) Gossip syn digests are : /10.37.114.10:1323736718:5831 /10.29.60.12:1323733099:9493 /10.6.130.70:1323732220:10379 /10.29.60.14:1323732392:10208 /10.37.114.8:1323731527:11082 /10.29.60.10:1762556151:526912 
TRACE [GossipStage:1] 2011-12-13 02:14:37,037 GossipDigestSynVerbHandler.java (line 90) Sending a GossipDigestAckMessage to /10.37.114.10
TRACE [GossipStage:1] 2011-12-13 02:14:37,038 FailureDetector.java (line 164) reporting /10.37.114.10
TRACE [GossipStage:1] 2011-12-13 02:14:37,038 FailureDetector.java (line 164) reporting /10.29.60.12
TRACE [GossipTasks:1] 2011-12-13 02:14:37,447 FailureDetector.java (line 185) PHI for /10.29.60.12 : 0.1740028340787402
TRACE [GossipTasks:1] 2011-12-13 02:14:37,447 FailureDetector.java (line 185) PHI for /10.6.130.70 : 0.17769446955863263
TRACE [GossipTasks:1] 2011-12-13 02:14:37,447 FailureDetector.java (line 185) PHI for /10.29.60.14 : 0.3766607052738764
TRACE [GossipTasks:1] 2011-12-13 02:14:37,447 FailureDetector.java (line 185) PHI for /10.37.114.10 : 0.16292431576785454
TRACE [GossipTasks:1] 2011-12-13 02:14:37,448 FailureDetector.java (line 185) PHI for /10.29.60.10 : 9511.26282656631
TRACE [GossipTasks:1] 2011-12-13 02:14:37,448 FailureDetector.java (line 189) notifying listeners that /10.29.60.10 is down
TRACE [GossipTasks:1] 2011-12-13 02:14:37,448 FailureDetector.java (line 190) intervals: 500.0 mean: 500.0
TRACE [GossipTasks:1] 2011-12-13 02:14:38,453 FailureDetector.java (line 185) PHI for /10.29.60.12 : 0.6019902450401402
TRACE [GossipTasks:1] 2011-12-13 02:14:38,453 FailureDetector.java (line 185) PHI for /10.6.130.70 : 0.5983077316197724
TRACE [GossipTasks:1] 2011-12-13 02:14:38,453 FailureDetector.java (line 185) PHI for /10.29.60.14 : 0.7697319392007641
TRACE [GossipTasks:1] 2011-12-13 02:14:38,453 FailureDetector.java (line 185) PHI for /10.37.114.10 : 0.5636623638423329
TRACE [GossipTasks:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 185) PHI for /10.29.60.10 : 9512.137495652863
TRACE [GossipTasks:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 189) notifying listeners that /10.29.60.10 is down
TRACE [GossipTasks:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 190) intervals: 500.0 mean: 500.0
TRACE [GossipStage:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 164) reporting /10.29.60.14
TRACE [GossipStage:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 164) reporting /10.37.114.10
TRACE [GossipStage:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 164) reporting /10.6.130.70
TRACE [GossipStage:1] 2011-12-13 02:14:38,454 FailureDetector.java (line 164) reporting /10.29.60.12
TRACE [GossipStage:1] 2011-12-13 02:14:39,030 GossipDigestSynVerbHandler.java (line 46) Received a GossipDigestSynMessage from /10.6.130.70&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The first line shows a &lt;code&gt;Syn&lt;/code&gt; message received from node #23, later a response &lt;code&gt;Ack&lt;/code&gt; was sent and the next &lt;code&gt;Syn&lt;/code&gt; handled without reporting to the &lt;code&gt;FailureDetector&lt;/code&gt;. By contrast look at the next &lt;code&gt;Syn&lt;/code&gt; message from 10.37.114.10, it results in the &lt;code&gt;FailureDetector&lt;/code&gt; been told (via the &lt;code&gt;report()&lt;/code&gt; function) that we&amp;#8217;ve head from both 10.37.114.10 and 10.29.60.12.&lt;/p&gt;

&lt;p&gt;You can also see that the &lt;code&gt;FailureDetector&lt;/code&gt; has logged some information about node #23 (10.29.60.10). At the &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L167'&gt;end of a Gossip round&lt;/a&gt; the node &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L559'&gt;asks the &lt;code&gt;FailureDetector&lt;/code&gt;&lt;/a&gt; to &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/FailureDetector.java#L175'&gt;interpret&lt;/a&gt; the information it has about each other node. The &lt;code&gt;FailureDetector&lt;/code&gt; tracks the intervals between when it is told about a node, these values are interpreted every second to &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2597'&gt;calculate the PHI value&lt;/a&gt; used to determine the nodes liveness. By default a node with a &lt;code&gt;PHI&lt;/code&gt; higher than 8 is considered &lt;code&gt;DOWN&lt;/code&gt;, as you can see node #23 was way &lt;code&gt;DOWN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At the time I did not pay any attention to the other log messages from the &lt;code&gt;FailureDetector&lt;/code&gt;, I just wanted to know why &lt;code&gt;report()&lt;/code&gt; was not been called.&lt;/p&gt;

&lt;h2 id='stop_and_think'&gt;Stop and think&lt;/h2&gt;

&lt;p&gt;This is what I knew:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All nodes in the cluster other than node #23 had a consistent view that #23 was down.&lt;/li&gt;

&lt;li&gt;New &amp;#8220;Application State&amp;#8221; about node #23 transmitted via Gossip was been ignored as the nodes thought it was from the past.&lt;/li&gt;

&lt;li&gt;Direct Gossip messages from node #23 were not been reported to the FailureDetector.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I put out a call for help on the #Cassandra IRC channel and started poking around &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L694'&gt;Gossiper.notifyFailureDetector()&lt;/a&gt; to see why it was not reporting messages from node #23.&lt;/p&gt;

&lt;p&gt;My assumption was that the high heartbeat version number node #20 had locally for node #23 was the reason for not reporting to the &lt;code&gt;FailureDetector&lt;/code&gt;, and &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L724'&gt;line 724&lt;/a&gt; reinforce this view of the world. The &lt;code&gt;FailureDetector&lt;/code&gt; was not called because the local version number was higher than the remote one. Done. I saw what I wanted to see, the version number is the problem.&lt;/p&gt;

&lt;p&gt;But I still didn&amp;#8217;t know why this was happening, it was unlikely to be a bug in the Gossip code. Rolling restarts are standard practice, we do them all the time. About this time I got some help from Brandon Williams (driftx) from &lt;a href='http://www.datastax.com'&gt;Data Stax&lt;/a&gt; in the IRC room. Getting help is always good, and Gossip is Brandon&amp;#8217;s thing.&lt;/p&gt;

&lt;h2 id='generational_issues'&gt;Generational issues&lt;/h2&gt;

&lt;p&gt;Heartbeat versions are only part of how Gossip works out if new information it receives really is new, or is old information delivered out of order. Sitting above the version number is the Generation. When a major change occurs on a node, such as a restart or a changing tokens, the Generation number is increased&lt;/p&gt;

&lt;p&gt;When a node detects a higher Generation for another endpoint in a &lt;code&gt;Syn&lt;/code&gt; Gossip message it &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L706'&gt;clears the&lt;/a&gt; the &lt;code&gt;FailureDetector&lt;/code&gt; information for endpoint. More importantly it takes action to get the most recent information about the node. First the node &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L951'&gt;includes a request&lt;/a&gt; in the &lt;code&gt;Ack&lt;/code&gt; it returns to the gossiper for all the information it has about the endpoint where the generation has increased. Second when &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L844'&gt;processing&lt;/a&gt; the digests in the returned &lt;code&gt;Ack2&lt;/code&gt; message the local endpoint state is &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/gms/Gossiper.java#L783'&gt;replaced&lt;/a&gt; with the new state.&lt;/p&gt;

&lt;p&gt;When node #23 restarted it should have picked up a new Generation number and the other nodes in the cluster should have noticed the change and updated their local state. The simple way to test if this was happening was do another restart on node #23 and see what happened, yay for redundant clusters with no single point of failure.&lt;/p&gt;

&lt;p&gt;While that was going on Brandon noticed that the Generation number for node #23 was strange. The number is initialized to the current UTC time in seconds when the server is first started. After than it is increased by 1 for every major change, such as a restart. The number shown in the logs and other places for node #23 was &lt;code&gt;1762556151&lt;/code&gt;, which equates to &lt;strong&gt;Fri, 07 Nov 2025 22:55:51 GMT&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This server was from the future.&lt;/p&gt;

&lt;p&gt;Still is should have incremented the Generation on restart and maintained the forward trajectory of the space time continuum. The restart completed. Node #23 came back on line with the same generation number &lt;code&gt;1762556151&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It was a server from the future that was stuck in present which was really it&amp;#8217;s past!&lt;/p&gt;

&lt;h2 id='back_to_the_future'&gt;Back to the future&lt;/h2&gt;

&lt;p&gt;The current Generation for a node is stored in the &lt;code&gt;LocationInfo&lt;/code&gt; CF in the &lt;code&gt;System&lt;/code&gt; KS. Every time the Generation is bumped it&amp;#8217;s read from there, incremented and written back. So the next step was to see what was actually stored in the CF using the &lt;code&gt;cassandra-cli&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] list LocationInfo;                      
Using default limit of 100
-------------------
RowKey: Ring
=&amp;gt; (column=00, value=0a257208, timestamp=1323730075935)
=&amp;gt; (column=2aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, value=0a068246, timestamp=1323730075923)
=&amp;gt; (column=40000000000000000000000000000000, value=0a1d3c0e, timestamp=1323730075897)
=&amp;gt; (column=55555555555555555555555555555555, value=0a25720a, timestamp=1323730075907)
=&amp;gt; (column=6aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, value=0a1d3c0c, timestamp=1323730075944)
-------------------
RowKey: L
=&amp;gt; (column=436c75737465724e616d65, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=47656e65726174696f6e, value=690e78f6, timestamp=1762556150811000)
=&amp;gt; (column=50617274696f6e6572, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=546f6b656e, value=15555555555555555555555555555555, timestamp=1762556150877)
-------------------
RowKey: Cookies
=&amp;gt; (column=5072652d312e302068696e747320707572676564, value=6f68207965732c2074686579207765726520707572676564, timestamp=1320437246470)
-------------------
RowKey: Bootstrap
=&amp;gt; (column=42, value=01, timestamp=1762556150877)

4 Rows Returned.
Elapsed time: 37 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There were columns in the &amp;#8220;L&amp;#8221; row (the local node state) and the &amp;#8220;Cookies&amp;#8221; row that were written with a timestamp from the future (the ones that start with 17&amp;#8230;). Any attempt to overwrite these columns would fail unless it used a higher time stamp.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s the &amp;#8220;L&amp;#8221; row with proper formatting for the column names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] assume LocationInfo comparator as ascii; 
Assumption for column family &amp;#39;LocationInfo&amp;#39; added successfully.
[default@system] get LocationInfo[&amp;#39;L&amp;#39;];
=&amp;gt; (column=ClusterName, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=Generation, value=690e78f6, timestamp=1762556150811000)
=&amp;gt; (column=Partioner, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=Token, value=15555555555555555555555555555555, timestamp=1762556150877)
Returned 4 results.
Elapsed time: 2 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Generation is hex &lt;code&gt;690e78f6&lt;/code&gt;, which is &lt;code&gt;1762556150&lt;/code&gt; binary. When node #23 started it &lt;a href='https://github.com/apache/cassandra/blob/3d5d9a40720034368feddd2f442cd6851ce6a233/src/java/org/apache/cassandra/db/SystemTable.java#L312'&gt;read this value&lt;/a&gt;, added 1 to it to get &lt;code&gt;1762556151&lt;/code&gt; and wrote it back. However the timestamp used to write it back was the current time, which was less than the one on disk so the write was ignored during reads. So the next time it started it got the same Generation number.&lt;/p&gt;

&lt;h2 id='back_to_the_present'&gt;Back to the present&lt;/h2&gt;

&lt;p&gt;The fix was to somehow get the Generation number to be increased. The approach I took was to delete the &lt;code&gt;Generation&lt;/code&gt; column with the high time stamp, and purge it&amp;#8217;s existence out of the SSTables. It wasn&amp;#8217;t enough to just delete the column, I needed to delete it with a higher time stamp and then see that Tombstone was purged so that future writes with current time stamps would work as expected.&lt;/p&gt;

&lt;p&gt;This approach was possible because the &lt;code&gt;GC_GRACE_SECONDS&lt;/code&gt; on the &lt;code&gt;LocationInfo&lt;/code&gt; CF is 0. So tombstones are committed to disk, but will be purged the first time compaction processes them. If &lt;code&gt;GC_GRACE_SECONDS&lt;/code&gt; was higher I would have to either wait for the time to pass or temporarily reduce it.&lt;/p&gt;

&lt;p&gt;The other possible options were a full cluster stop and restart passing the &lt;code&gt;Dcassandra.load_ring_state=false&lt;/code&gt; option. Or some form of monkeying with the token and/or IP address.&lt;/p&gt;

&lt;p&gt;Here are the steps I took to get there, I&amp;#8217;ve included a few additional ones to illustrate what was happening with the tombstone.&lt;/p&gt;

&lt;h3 id='step_1__snapshot'&gt;Step 1 - Snapshot&lt;/h3&gt;

&lt;p&gt;Always backup data before you start exploring it&amp;#8217;s limits, think of it as safe word. If it all get&amp;#8217;s a little to scary just say the safe word and we can stop.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/nodetool -h localhost snapshot system -t pre-generation-delete
Requested snapshot for: system 
Snapshot directory: pre-generation-delete&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I snap shot a lot when doing stuff like this so that I can roll back to any point.&lt;/p&gt;

&lt;h3 id='step_1__delete_the_generation_column'&gt;Step 1 - Delete the Generation column&lt;/h3&gt;

&lt;p&gt;To specify a timestamp for the delete I used the &lt;a href='http://pycassa.github.com/pycassa/assorted/pycassa_shell.html'&gt;pycassaShell&lt;/a&gt; bundled with the &lt;a href='https://github.com/pycassa/pycassa'&gt;pycassa&lt;/a&gt; client, this is an interactive Python session with some handy tools.&lt;/p&gt;

&lt;p&gt;Remember I wanted to delete the Generation column:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] get LocationInfo[&amp;#39;L&amp;#39;];
=&amp;gt; (column=ClusterName, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=Generation, value=690e78f6, timestamp=1762556150811000)
=&amp;gt; (column=Partioner, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=Token, value=15555555555555555555555555555555, timestamp=1323805490467000)
Returned 4 results.
Elapsed time: 2 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So I needed to specify a timestamp higher than 1762556150811000:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# ./pycassaShell -H 10.29.60.10 -k system
[I]: IPython not found, falling back to default interpreter.
----------------------------------
Cassandra Interactive Python Shell
----------------------------------
Keyspace: system
Host: 10.29.60.10:9160

Available ColumnFamily instances:
 * HINTSCOLUMNFAMILY          ( HintsColumnFamily )
 * VERSIONS                   ( Versions )
 * INDEXINFO                  ( IndexInfo )
 * LOCATIONINFO               ( LocationInfo )
 * MIGRATIONS                 ( Migrations )
 * NODEIDINFO                 ( NodeIdInfo )
 * SCHEMA                     ( Schema )

Schema definition tools and cluster information are available through SYSTEM_MANAGER.
&amp;gt;&amp;gt;&amp;gt; LOCATIONINFO.remove(&amp;quot;L&amp;quot;, columns=[&amp;quot;Generation&amp;quot;], timestamp=1762556150811001)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The Generation column was no longer returned in query results:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] get LocationInfo[&amp;#39;L&amp;#39;];
=&amp;gt; (column=ClusterName, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=Partioner, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=Token, value=15555555555555555555555555555555, timestamp=1323805490467000)
Returned 4 results.
Elapsed time: 2 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='step_3__snapshot'&gt;Step 3 - Snapshot&lt;/h3&gt;

&lt;p&gt;Another snapshot for good luck. This one will also flush the &lt;code&gt;LocationInfo&lt;/code&gt; CF so all of our changes are on disk.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/nodetool -h localhost snapshot system -t post-generation-delete
Requested snapshot for: system 
Snapshot directory: post-generation-delete&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='step_4__check_the_delete'&gt;Step 4 - Check the delete&lt;/h3&gt;

&lt;p&gt;This was not necessary, I knew what was going to happen later but I like to confirm that my expectations are correct and it helps illustrate how tombstones work. What I expected to see was two SSTables for &lt;code&gt;LocationInfo&lt;/code&gt;, the first would have all the columns we saw originally and the second would have the tombstone for the &lt;code&gt;Generation&lt;/code&gt; column delete.&lt;/p&gt;

&lt;p&gt;First check which SSTables are on disk:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# ls -lah data/system
total 9428
...
-rw-r--r--  2 root  wheel   617B Dec 13 20:56 LocationInfo-hb-52-Data.db
-rw-r--r--  2 root  wheel    68B Dec 13 20:56 LocationInfo-hb-52-Digest.sha1
-rw-r--r--  2 root  wheel   1.9K Dec 13 20:56 LocationInfo-hb-52-Filter.db
-rw-r--r--  2 root  wheel    61B Dec 13 20:56 LocationInfo-hb-52-Index.db
-rw-r--r--  2 root  wheel   4.2K Dec 13 20:56 LocationInfo-hb-52-Statistics.db
-rw-r--r--  2 root  wheel    80B Dec 13 21:18 LocationInfo-hb-54-Data.db
-rw-r--r--  2 root  wheel    68B Dec 13 21:18 LocationInfo-hb-54-Digest.sha1
-rw-r--r--  2 root  wheel    16B Dec 13 21:18 LocationInfo-hb-54-Filter.db
-rw-r--r--  2 root  wheel    11B Dec 13 21:18 LocationInfo-hb-54-Index.db
-rw-r--r--  2 root  wheel   4.2K Dec 13 21:18 LocationInfo-hb-54-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Excellent, there were two. So I used &lt;code&gt;sstable2json&lt;/code&gt; to take a look in the oldest one:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/sstable2json data/system/LocationInfo-hb-52-Data.db
{&amp;#39;426f6f747374726170&amp;#39;: [[&amp;#39;42&amp;#39;, &amp;#39;01&amp;#39;, 1323805212929000]],
 &amp;#39;436f6f6b696573&amp;#39;: [[&amp;#39;5072652d312e302068696e747320707572676564&amp;#39;,
                     &amp;#39;6f68207965732c2074686579207765726520707572676564&amp;#39;,
                     1320437246470]],
 &amp;#39;4c&amp;#39;: [[&amp;#39;436c75737465724e616d65&amp;#39;, &amp;#39;737069&amp;#39;, 1320437246450000],
        [&amp;#39;47656e65726174696f6e&amp;#39;, &amp;#39;690e78f6&amp;#39;, 1762556150811000],
        [&amp;#39;50617274696f6e6572&amp;#39;,
         &amp;#39;6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572&amp;#39;,
         1320437246450000],
        [&amp;#39;546f6b656e&amp;#39;, &amp;#39;15555555555555555555555555555555&amp;#39;, 1323805490467000]],
 &amp;#39;52696e67&amp;#39;: [[&amp;#39;00&amp;#39;, &amp;#39;0a257208&amp;#39;, 1323730075935],
              [&amp;#39;2aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&amp;#39;, &amp;#39;0a068246&amp;#39;, 1323730075923],
              [&amp;#39;40000000000000000000000000000000&amp;#39;, &amp;#39;0a1d3c0e&amp;#39;, 1323730075897],
              [&amp;#39;55555555555555555555555555555555&amp;#39;, &amp;#39;0a25720a&amp;#39;, 1323730075907],
              [&amp;#39;6aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&amp;#39;, &amp;#39;0a1d3c0c&amp;#39;, 1323730075944]]}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I&amp;#8217;ve reformatted the output a little. The row key &lt;code&gt;4c&lt;/code&gt; is the &amp;#8220;L&amp;#8221; row from above, and the &amp;#8220;Generation&amp;#8221; column is the second column in the list with &lt;code&gt;47656e65726174696f6e&lt;/code&gt; as the column name (it&amp;#8217;s just the hex encoded ASCII values). The column value and timestamp are the next two values.&lt;/p&gt;

&lt;p&gt;Now the newer SSTable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/sstable2json data/system/LocationInfo-hb-54-Data.db
{
&amp;quot;4c&amp;quot;: [[&amp;quot;47656e65726174696f6e&amp;quot;,&amp;quot;4ee7c086&amp;quot;,1762556150811001,&amp;quot;d&amp;quot;]]
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There was a single row with the &lt;code&gt;4c&lt;/code&gt; / &amp;#8220;L&amp;#8221; row key. That had a single column with &amp;#8220;Generation&amp;#8221; as the name, the local time of the deletion as the value, a sensible timestamp and &amp;#8220;d&amp;#8221; as a flag to indicate it&amp;#8217;s a &lt;code&gt;DeletedColumn&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id='step_5__compact_away_the_tombstone'&gt;Step 5 - Compact away the Tombstone.&lt;/h3&gt;

&lt;p&gt;It was time to purge the Tombstone by running a manual compaction on the &lt;code&gt;LocationInfo&lt;/code&gt; CF.&lt;/p&gt;

&lt;p&gt;cassandra23# bin/nodetool -h localhost compact system LocationInfo&lt;/p&gt;

&lt;h3 id='step_6__confirm_the_purge'&gt;Step 6 - Confirm the purge&lt;/h3&gt;

&lt;p&gt;I now expected to see a single &lt;code&gt;LocationInfo&lt;/code&gt; SSTable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# ls -lah data/system/
...
-rw-r--r--  1 root  wheel    68B Dec 13 21:24 LocationInfo-hb-55-Digest.sha1
-rw-r--r--  1 root  wheel   976B Dec 13 21:24 LocationInfo-hb-55-Filter.db
-rw-r--r--  1 root  wheel    61B Dec 13 21:24 LocationInfo-hb-55-Index.db
-rw-r--r--  1 root  wheel   4.2K Dec 13 21:24 LocationInfo-hb-55-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the SSTable the &lt;code&gt;4c&lt;/code&gt; row should be present but the &lt;code&gt;47656e65726174696f6e&lt;/code&gt; &amp;#8220;Generation&amp;#8221; column should not:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/sstable2json data/system/LocationInfo-hb-55-Data.db
{&amp;#39;426f6f747374726170&amp;#39;: [[&amp;#39;42&amp;#39;, &amp;#39;01&amp;#39;, 1323805212929000]],
 &amp;#39;436f6f6b696573&amp;#39;: [[&amp;#39;5072652d312e302068696e747320707572676564&amp;#39;,
                     &amp;#39;6f68207965732c2074686579207765726520707572676564&amp;#39;,
                     1320437246470]],
 &amp;#39;4c&amp;#39;: [[&amp;#39;436c75737465724e616d65&amp;#39;, &amp;#39;737069&amp;#39;, 1320437246450000],
        [&amp;#39;50617274696f6e6572&amp;#39;,
         &amp;#39;6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572&amp;#39;,
         1320437246450000],
        [&amp;#39;546f6b656e&amp;#39;, &amp;#39;15555555555555555555555555555555&amp;#39;, 1323805490467000]],
 &amp;#39;52696e67&amp;#39;: [[&amp;#39;00&amp;#39;, &amp;#39;0a257208&amp;#39;, 1323730075935],
              [&amp;#39;2aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&amp;#39;, &amp;#39;0a068246&amp;#39;, 1323730075923],
              [&amp;#39;40000000000000000000000000000000&amp;#39;, &amp;#39;0a1d3c0e&amp;#39;, 1323730075897],
              [&amp;#39;55555555555555555555555555555555&amp;#39;, &amp;#39;0a25720a&amp;#39;, 1323730075907],
              [&amp;#39;6aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa&amp;#39;, &amp;#39;0a1d3c0c&amp;#39;, 1323730075944]]}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And just like that node #23 has no record of ever having a value for the &amp;#8220;Generation&amp;#8221; column.&lt;/p&gt;

&lt;h3 id='step_7__rewrite_the_generation_number'&gt;Step 7 - Re-write the Generation number&lt;/h3&gt;

&lt;p&gt;The next time I restarted node #23 I needed it to get a Generation number greater than the one all the other nodes have for it. And for all restarts after that it should continue to increase the Generation.&lt;/p&gt;

&lt;p&gt;So I set the Generation column to &lt;code&gt;1762556151&lt;/code&gt; decimal or &lt;code&gt;690e78f7&lt;/code&gt; hex using the &lt;code&gt;cassandra-cli&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] set LocationInfo[&amp;#39;L&amp;#39;][&amp;#39;Generation&amp;#39;] = bytes(&amp;#39;690e78f7&amp;#39;);
Value inserted.
Elapsed time: 1 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And then just to check:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] get LocationInfo[&amp;#39;L&amp;#39;];                                  
=&amp;gt; (column=ClusterName, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=Generation, value=690e78f7, timestamp=1323811677675000)
=&amp;gt; (column=Partioner, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=Token, value=15555555555555555555555555555555, timestamp=1323805490467000)
Returned 4 results.
Elapsed time: 1 msec(s).
[default@system] &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Looks good, got the next Generation number in there and the timestamp is sane.&lt;/p&gt;

&lt;h3 id='step_8__confirm_the_generation_rewrite'&gt;Step 8 - Confirm the Generation re-write&lt;/h3&gt;

&lt;p&gt;To get the change on disk I flushed the CF:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/nodetool -h localhost flush system LocationInfo&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I now expected to see two SSTables for &lt;code&gt;LocationInfo&lt;/code&gt; on disk:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# ls -lah data/system/
total 9428
-rw-r--r--  1 root  wheel   588B Dec 13 21:24 LocationInfo-hb-55-Data.db
-rw-r--r--  1 root  wheel    68B Dec 13 21:24 LocationInfo-hb-55-Digest.sha1
-rw-r--r--  1 root  wheel   976B Dec 13 21:24 LocationInfo-hb-55-Filter.db
-rw-r--r--  1 root  wheel    61B Dec 13 21:24 LocationInfo-hb-55-Index.db
-rw-r--r--  1 root  wheel   4.2K Dec 13 21:24 LocationInfo-hb-55-Statistics.db
-rw-r--r--  1 root  wheel    80B Dec 13 21:32 LocationInfo-hb-57-Data.db
-rw-r--r--  1 root  wheel    68B Dec 13 21:32 LocationInfo-hb-57-Digest.sha1
-rw-r--r--  1 root  wheel    16B Dec 13 21:32 LocationInfo-hb-57-Filter.db
-rw-r--r--  1 root  wheel    11B Dec 13 21:32 LocationInfo-hb-57-Index.db
-rw-r--r--  1 root  wheel   4.2K Dec 13 21:32 LocationInfo-hb-57-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And if we poke into the new one:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/sstable2json data/system/LocationInfo-hb-57-Data.db
{
&amp;quot;4c&amp;quot;: [[&amp;quot;47656e65726174696f6e&amp;quot;,&amp;quot;690e78f7&amp;quot;,1323811677675000]]
}&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id='step_9__restart_node_23'&gt;Step 9 - Restart node #23&lt;/h3&gt;

&lt;p&gt;Before the restart node #23 was using 1762556151 as the Generation. This was because the previous on disk value was 1762556150, and when it started it read this and added 1. I had now set the on disk value to be 1762556151 so after the restart I expected the Generation to be 1762556152:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cassandra23# bin/nodetool -h localhost info
Token            : 28356863910078205288614550619314017621
Gossip active    : true
Load             : 275.44 GB
Generation No    : 1762556152
Uptime (seconds) : 495
Heap Memory (MB) : 959.72 / 8032.00
Data Center      : DC1
Rack             : RAC_unknown  
Exceptions       : 0&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Aces. And finally on disk the value should be 1762556152 decimal or 690E78F8 hex:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@system] get LocationInfo[&amp;#39;L&amp;#39;];            
=&amp;gt; (column=ClusterName, value=737069, timestamp=1320437246450000)
=&amp;gt; (column=Generation, value=690e78f8, timestamp=1323812553690000)
=&amp;gt; (column=Partioner, value=6f72672e6170616368652e63617373616e6472612e6468742e52616e646f6d506172746974696f6e6572, timestamp=1320437246450000)
=&amp;gt; (column=Token, value=15555555555555555555555555555555, timestamp=1323805490467000)
Returned 4 results.
Elapsed time: 22 msec(s).&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After the node came back online it started using the new Generation number and the other nodes saw it as UP.&lt;/p&gt;

&lt;h2 id='wrapping_up'&gt;Wrapping up&lt;/h2&gt;

&lt;p&gt;I think in theory there is a danger that if we removed node #23 from the ring and added another node with the same IP in less than 4 days it will have problems with it&amp;#8217;s Generation not been seen as new. But thats a pretty small danger.&lt;/p&gt;

&lt;p&gt;The important thing in all this was that the site stayed up. It&amp;#8217;s one of the nicest things about working on Cassandra, you can fix problems while it&amp;#8217;s working.&lt;/p&gt;

&lt;p&gt;Oh and I fixed the other columns with wacky time stamps and am about to write a patch to log when things like this happen.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Reversed Comparators</title>
   <link href="http://thelastpickle.com/2011/10/03/Reverse-Comparators" />
   <updated>2011-10-03T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/10/03/Reverse-Comparators</id>
   <content type="html">&lt;p&gt;Cassandra 0.8.1 &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2355'&gt;added&lt;/a&gt; support for Composite Types and Reversed Types, through a handy little type composition language. Hopefully I&amp;#8217;ll get to show some of the things you can do with Composite Types later, for now the best resource I know is Ed Anuff&amp;#8217;s presentation on &lt;a href='http://www.slideshare.net/edanuff/indexing-in-cassandra'&gt;Cassandra Indexing Techniques&lt;/a&gt; at Cassandra SF 2011.&lt;/p&gt;

&lt;p&gt;The Reversed Comparator wraps another Comparator (.e.g. &lt;code&gt;AsciiType&lt;/code&gt;) and well, there is easy way to say this, &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.6/src/java/org/apache/cassandra/db/marshal/ReversedType.java#L60'&gt;reverses&lt;/a&gt; the order of the comparisons so that columns are stored in &lt;strong&gt;descending&lt;/strong&gt; order. If you are using something like a time stamp for a column name, the most recent value will be the first column on the row rather than the last. In a Relational DB with tree based indexes there is not much difference between getting the first or last row in an index, but Cassandra does not use tree indexes.&lt;/p&gt;

&lt;p&gt;Recall from my post on &lt;a href='http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/'&gt;Cassandra Query Plans&lt;/a&gt; that once rows get to a certain size they include an index of the columns. And that the entire index must be read whenever any part of the index needs to be used, which is the case when using a Slice Range that specifies start or reversed. So the fastest slice query to run against a row was one that retrieved the first X columns in a row by only specifying a column count.&lt;/p&gt;

&lt;p&gt;To get an idea of how different the performance is I took the test script from &lt;a href='http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/'&gt;Cassandra Query Plans&lt;/a&gt; and modified it to test two query patterns using both (regular) Ascending and (reversed) Descending comparators.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting the 10 most recent columns using a start column and a column count.&lt;/li&gt;

&lt;li&gt;Getting the 10 most recent columns using only a column count.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the second case the query would specify &lt;code&gt;reversed&lt;/code&gt; when using (regular) Ascending ordering so Cassandra would get the last 10 columns. However when using (reversed) Descending columns it would only specify the column count.&lt;/p&gt;

&lt;h2 id='in_motion__the_setup'&gt;In Motion - The Setup&lt;/h2&gt;

&lt;p&gt;The &lt;a href='https://gist.github.com/1258711'&gt;reverse_query_profile.py gist&lt;/a&gt; contains the Python code I used to setup the data and profile queries using the pycassa library. All tests were done locally on a 2011 Mac Book Pro with 8GB of RAM and spinning disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; you will need to edit the file and set the CASSANDRA_PATH at the top of the file.&lt;/p&gt;

&lt;p&gt;The tests used columns with a 10 byte name and 25 bytes of data. Together with the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.6/src/java/org/apache/cassandra/db/Column.java#L116'&gt;15 bytes of column meta data&lt;/a&gt; this will create columns that use 50 bytes on disk. For the standard 64KB column page this should give 1,310 columns per column page.&lt;/p&gt;

&lt;p&gt;To test the latency of various queries I used the (recent) &amp;#8216;ReadLatency: &amp;#8217; metric available via &lt;code&gt;nodetool cfstats&lt;/code&gt;. This metric wraps the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.6/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1295'&gt;local processing&lt;/a&gt; of a query for a single CF. The value presented by &lt;code&gt;nodetool&lt;/code&gt; is the most recent value.&lt;/p&gt;

&lt;p&gt;All tests were run 10 times and the min, max 80th and 95th percentiles of latency were recorded.&lt;/p&gt;

&lt;p&gt;The Reversed Comparator is specified by using a string to specify the comparator and appending &lt;code&gt;(reversed=true)&lt;/code&gt;. I created the schema using a clean 0.8.6 install and the following &lt;code&gt;bin/cassandra-cli&lt;/code&gt; script:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;create keyspace reverse
    with strategy_options=[{replication_factor:1}]
    and placement_strategy = &amp;#39;org.apache.cassandra.locator.SimpleStrategy&amp;#39;;

use reverse;

create column family NoCache_Ascending
    with comparator = AsciiType
    and default_validation_class = AsciiType
    and key_validation_class = AsciiType
    and keys_cached = 0
    and rows_cached = 0;

create column family NoCache_Descending
    with comparator = &amp;#39;AsciiType(reversed=true)&amp;#39;
    and default_validation_class = AsciiType
    and key_validation_class = AsciiType
    and keys_cached = 0
    and rows_cached = 0;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can use the Reversed Comparator anywhere, but it only makes sense to do so for the &lt;code&gt;comparator&lt;/code&gt;, &lt;code&gt;subcomparator&lt;/code&gt; and any secondary indexes.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;reverse_query_profile&lt;/code&gt; module inserts the following rows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;#8220;small-row&amp;#8221; with 100 columns, 5K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;no-col-index&amp;#8221; with 1200 columns, 60K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;five-thousand&amp;#8221; with 5000 columns, 244K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;ten-thousand&amp;#8221; with 10000 columns, 488K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;hundred-thousand&amp;#8221; with 100000 columns, 4.8M of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;one-million&amp;#8221; with 1000000 columns, 48M of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;ten-million&amp;#8221; with 10000000 columns, 480M of data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Insert the data by executing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python reverse_query_profile insert_rows&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To keep things simple flush and compact the database so we only have one SSTable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost flush
$ bin/nodetool -h localhost compact query&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At the end of this my &lt;code&gt;/var/lib/cassandra/data/query&lt;/code&gt; directory contained the following SSTables (ignoring the &lt;code&gt;-Compacted&lt;/code&gt; SSTables):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-rw-r--r--   1 aaron  wheel   536M  3 Oct 21:49 NoCache_Ascending-g-184-Data.db
-rw-r--r--   1 aaron  wheel   2.4K  3 Oct 21:49 NoCache_Ascending-g-184-Filter.db
-rw-r--r--   1 aaron  wheel   154B  3 Oct 21:49 NoCache_Ascending-g-184-Index.db
-rw-r--r--   1 aaron  wheel   4.2K  3 Oct 21:49 NoCache_Ascending-g-184-Statistics.db
-rw-r--r--   1 aaron  wheel   536M  3 Oct 21:49 NoCache_Descending-g-149-Data.db
-rw-r--r--   1 aaron  wheel   2.4K  3 Oct 21:49 NoCache_Descending-g-149-Filter.db
-rw-r--r--   1 aaron  wheel   154B  3 Oct 21:49 NoCache_Descending-g-149-Index.db
-rw-r--r--   1 aaron  wheel   4.2K  3 Oct 21:49 NoCache_Descending-g-149-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;reverse_query_profile&lt;/code&gt; contains a warm up function that will slice through all the columns in all the rows, warm up the database by executing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python reverse_query_profile warm_up&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='in_motion__start_column'&gt;In Motion - Start Column&lt;/h2&gt;

&lt;p&gt;I was not expecting much difference between the Ascending and Descending CF&amp;#8217;s for this test as both use a start column. For the Ascending CF the script specifies the name of the tenth last column as the column name. For the Descending CF it uses the name of the last column because it will be the first column on the row and the others will follow it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ./reverse_query_profile.py recent_100_start
Latency is min, 80th percentile, 95th percentile and max.
100 most recent columns, using start

Testing CF: NoCache_Ascending
Row            small-row latency in ms       0.14     0.1512     0.1542      0.157
Row         no-col-index latency in ms      0.361     0.5082     0.6311      0.751
Row        five-thousand latency in ms      0.337      0.368     0.5269      0.721
Row         ten-thousand latency in ms      0.307       0.32     0.3214      0.323
Row     hundred-thousand latency in ms      0.205     0.2362     0.2506      0.266
Row          one-million latency in ms      0.429      0.476     0.5234       0.58
Row          ten-million latency in ms      1.247      1.432       4.71       8.71
Testing CF: NoCache_Descending
Row            small-row latency in ms      0.138     0.1526     0.1697       0.19
Row         no-col-index latency in ms       0.32     0.3532     0.3572       0.36
Row        five-thousand latency in ms      0.329      0.374     0.3863        0.4
Row         ten-thousand latency in ms      0.351     0.3868     0.3875      0.388
Row     hundred-thousand latency in ms      0.357     0.4094     0.4955        0.6
Row          one-million latency in ms      0.451      0.496     0.5069      0.519
Row          ten-million latency in ms      1.312      1.401      1.415      1.424&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Looks like there was a couple of outliers that mucked up the 95th percentile, but if we look at the 80th percent there is not much difference between the two CF&amp;#8217;s.&lt;/p&gt;

&lt;h2 id='in_motion__count_only'&gt;In Motion - Count Only&lt;/h2&gt;

&lt;p&gt;This is where the differences are. When a query does not specify a start column (and does not specify reversed) the server can just start reading columns from the start without having to worry about finding the right place to start. This is exactly what we can do for the Descending CF.&lt;/p&gt;

&lt;p&gt;For the regular Ascending CF we need to specify reversed, so the server must read the row index and work out which column is column count from the end of the row.&lt;/p&gt;

&lt;p&gt;There is no comparison really.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ./reverse_query_profile.py recent_100_count
Latency is min, 80th percentile, 95th percentile and max.
100 most recent columns, using no start

Testing CF: NoCache_Ascending
Row            small-row latency in ms      0.119     0.1496     0.1577      0.167
Row         no-col-index latency in ms      0.313     0.3504     0.3879      0.433
Row        five-thousand latency in ms      0.315      0.328     0.3361      0.346
Row         ten-thousand latency in ms      0.264      0.291     0.4032      0.539
Row     hundred-thousand latency in ms      0.181     0.2038     0.2049      0.206
Row          one-million latency in ms      0.397     0.4308     0.4315      0.432
Row          ten-million latency in ms       1.29      1.421      4.538      8.341
Testing CF: NoCache_Descending
Row            small-row latency in ms      0.133     0.1508     0.1514      0.152
Row         no-col-index latency in ms      0.139      0.157     0.1633      0.171
Row        five-thousand latency in ms      0.136     0.1506     0.1532      0.156
Row         ten-thousand latency in ms      0.142     0.1558     0.1565      0.157
Row     hundred-thousand latency in ms      0.139     0.1498     0.1518      0.154
Row          one-million latency in ms      0.139     0.1592     0.1661      0.171
Row          ten-million latency in ms      0.134     0.1602     0.1678      0.175&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For this type of query the Reversed Comparator provided consistent query performance no matter how many columns the row container. From now on if you are modeling time series data, such as in our old friend the Twitter clone, you should probably use the Reversed Comparator.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Token Range Ownership</title>
   <link href="http://thelastpickle.com/2011/09/30/Token-Range-Ownership" />
   <updated>2011-09-30T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/09/30/Token-Range-Ownership</id>
   <content type="html">&lt;p&gt;Recently, like 2 hours ago, I was planning some work to rebalance a Cassandra cluster and I wanted to see how the steps involved would effect the range ownership of the nodes. So I replicated the logic from &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.6/src/java/org/apache/cassandra/dht/RandomPartitioner.java#L152'&gt;RandomPartitioner.describeOwnership()&lt;/a&gt; in a handy python script.&lt;/p&gt;

&lt;p&gt;The script is available at &lt;a href='https://gist.github.com/1250496'&gt;git hub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In my case I had an unbalanced cluster that looked like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                              154009024815050802110273337963779530663     
127.0.0.3    datacenter1 rack1       Up     Normal  430.51 GB       69.95%  	102889564695022956386161396156024583904     
127.0.0.2     datacenter1 rack1       Up     Normal  430.26 GB       22.81%   	141704132449535340642001248672108470009     
127.0.0.1    datacenter1 rack1       Up     Normal  725.61 GB       7.23%   	154009024815050802110273337963779530663&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So I wanted to change the nodes to use the balanced tokens:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;0
56713727820156410577229101238628035242
113427455640312821154458202477256070484&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As a side affect I also want to make the token order match the node IP / host name order, which makes things a little more confusing than they need to be. So each node, together with it&amp;#8217;s old and new token looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;127.0.0.1
    154009024815050802110273337963779530663
    0
127.0.0.2
    141704132449535340642001248672108470009
    56713727820156410577229101238628035242
127.0.0.3
    102889564695022956386161396156024583904
    113427455640312821154458202477256070484&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To get an idea of what the cluster balance would look like as I performed the &lt;code&gt;nodetool move&lt;/code&gt; operations I passed the initial tokens as a space separated list to the script and used the &lt;code&gt;--interactive&lt;/code&gt; arg so I could easily enter the changes I wanted to make.&lt;/p&gt;

&lt;p&gt;Here are the initial tokens, in the order of the host names.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;./token_range.py --interactive 154009024815050802110273337963779530663 141704132449535340642001248672108470009 102889564695022956386161396156024583904
154009024815050802110273337963779530663 141704132449535340642001248672108470009 102889564695022956386161396156024583904
69.95% - 102889564695022956386161396156024583904
22.81% - 141704132449535340642001248672108470009
 7.23% - 154009024815050802110273337963779530663&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;First change is to wrap the token for &lt;code&gt;0.1&lt;/code&gt; around to 0:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Next tokens: 0 141704132449535340642001248672108470009 102889564695022956386161396156024583904
0 141704132449535340642001248672108470009 102889564695022956386161396156024583904
16.71% - 0
60.47% - 102889564695022956386161396156024583904
22.81% - 141704132449535340642001248672108470009&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next center the middle token:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Next tokens: 0 56713727820156410577229101238628035242 102889564695022956386161396156024583904
0 56713727820156410577229101238628035242 102889564695022956386161396156024583904
39.53% - 0
33.33% - 56713727820156410577229101238628035242
27.14% - 102889564695022956386161396156024583904&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally sort out the little piggy at the end:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Next tokens: 0 56713727820156410577229101238628035242 113427455640312821154458202477256070484
0 56713727820156410577229101238628035242 113427455640312821154458202477256070484
33.33% - 0
33.33% - 56713727820156410577229101238628035242
33.33% - 113427455640312821154458202477256070484&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The only testing this had was that it worked for me and matched what a live cluster was saying.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Cassandra Query Plans</title>
   <link href="http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans" />
   <updated>2011-07-04T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans</id>
   <content type="html">&lt;p&gt;I&amp;#8217;ve had a few conversations about query performance on wide rows recently, so it seemed about time to dig into how the different slice queries work.&lt;/p&gt;

&lt;h2 id='lotsogarbage'&gt;Lots-o-garbage&lt;/h2&gt;

&lt;p&gt;Clearly lots of garbage in an SSTable in the form of deleted, expired, overwritten or tombstoned columns will have an impact on reads. As they increase the amount of data that will be read, examined and then discarded.&lt;/p&gt;

&lt;p&gt;Like any database one part of query evaluation involves reading data from disk and another involves filtering the candidate data into the final result set. The filtering part of a Cassandra read query involves collating the various row fragments, reconciling multiple columns with the same name, and handling tombstones and expiring columns.&lt;/p&gt;

&lt;p&gt;For now I&amp;#8217;m going to focus on the reading part and I&amp;#8217;m going to restrict it further by only considering cases where all the data is in a single SSTable and there are no garbage columns.&lt;/p&gt;

&lt;h2 id='whats_in_a_row'&gt;What&amp;#8217;s in a row?&lt;/h2&gt;

&lt;p&gt;Aside from the columns there are a few other things stored in a row in an SSTable that are of use when performing read queries.&lt;/p&gt;

&lt;p&gt;A row will be written to an SSTable with zero columns if it has a row level Tombstone. The Tombstone will be used when reconciling the columns for a query as it may delete columns in the row that have a lower timestamp.&lt;/p&gt;

&lt;p&gt;Each row also contains a &lt;a href='http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html'&gt;Bloom Filter&lt;/a&gt; of the column names in the row. This can be used to test if the row &lt;em&gt;may&lt;/em&gt; contain any of the columns a query is looking for. Bloom Filters may give false positives, just like the SSTable bloom filter for row keys, so the only way to know if the column really exists is to read it.&lt;/p&gt;

&lt;p&gt;Finally if the serialised size of the columns in the row is greater than the &lt;code&gt;column_index_size_in_kb&lt;/code&gt; config setting (default is 64) a column index is also written. The index is a sampling of first and last column names in each &lt;code&gt;column_index_size_in_kb&lt;/code&gt; (or more) chunk of column data, along with the offset and width of the chunk. Queries can de-serialise the index and work out how may pages of columns to skip before dropping down to the data file to scan for matching columns.&lt;/p&gt;

&lt;h2 id='slice_by_names'&gt;Slice By Names&lt;/h2&gt;

&lt;p&gt;A slice query that specifies a list of columns names in the &lt;a href='http://wiki.apache.org/cassandra/API'&gt;SlicePredicate&lt;/a&gt; is turned into a &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/SliceByNamesReadCommand.java'&gt;SliceByNamesReadCommand&lt;/a&gt; internally. This is the command you will see logged by the server when &lt;code&gt;DEBUG&lt;/code&gt; level logging is enabled. It uses the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/SSTableNamesIterator.java'&gt;SSTableNamesIterator&lt;/a&gt; to read bytes off disk and create the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/Column.java'&gt;Column&lt;/a&gt; objects that will be collated and reconciled to arrive at the query result.&lt;/p&gt;

&lt;p&gt;The first thing the Iterator does is determine if the SSTable contains the row fragment we are interested in, this follows the read path described in &lt;a href='http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/'&gt;The forces of Write and Read&lt;/a&gt;. When the file is opened a buffer of size &lt;code&gt;column_index_size_in_kb&lt;/code&gt; (specified in cassandra.yaml, default is 64K) will be allocated if Standard (non memory mapped) file access is been used. If the default &lt;a href='http://en.wikipedia.org/wiki/Memory-mapped_file'&gt;Memory-mapped file access&lt;/a&gt; is used the buffer size is ignored.&lt;/p&gt;

&lt;p&gt;If the row exists in the SSTable the Iterator will seek to the start of the row in the &lt;code&gt;Data&lt;/code&gt; component of the SSTable and take the following initial steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the row key.&lt;/li&gt;

&lt;li&gt;Read the row size.&lt;/li&gt;

&lt;li&gt;Read the row level Bloom Filter.&lt;/li&gt;

&lt;li&gt;Read the row level column index, if present.&lt;/li&gt;

&lt;li&gt;Read the row level Tombstone.&lt;/li&gt;

&lt;li&gt;Filter the requested columns using the row level Bloom Filter.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At the end of these initial steps the Iterator will know if the SSTable &lt;em&gt;may&lt;/em&gt; contain any of the columns requested in the query. Even if the SSTable table does not contain any columns of interest it may contain a row level Tombstone which will need to be reconciled with columns contained in other SSTables.&lt;/p&gt;

&lt;p&gt;If the row may contain columns for the query the path the Iterator will take depends on presence of the column index.&lt;/p&gt;

&lt;p&gt;If the row does not contain a column index the iterator will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the number of columns in the row.&lt;/li&gt;

&lt;li&gt;Read each column fully (including it&amp;#8217;s value) from the SSTable.&lt;/li&gt;

&lt;li&gt;Add the column to the result set if it&amp;#8217;s name is in the list of columns queried for.&lt;/li&gt;

&lt;li&gt;Stop the scan early if the number of filtered columns (from the Bloom Filter) is reached.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a basic scan operation over potentially all columns in the row. However the code knows it will only have to scan at most one page of columns that will contain at most &lt;code&gt;column_index_size_in_kb&lt;/code&gt; (64K) of columns. The number of columns in the column page will vary with regard to the size of the columns, so the amount of de-serialisation and object creation will vary for each row.&lt;/p&gt;

&lt;p&gt;When using Standard disk access with the default file buffer size is 64K the entire row may have already been read from disk. When using Memory-mapped file access the row will be backed by a &lt;a href='http://download.oracle.com/javase/6/docs/api/java/nio/MappedByteBuffer.html'&gt;MappedByteBuffer&lt;/a&gt; of up to 2GB. The MappedByteBuffer will use the native &lt;a href='http://en.wikipedia.org/wiki/Page_(computer_memory'&gt;page size&lt;/a&gt;) when it asks for data, on Mac Book this is 4096 bytes (&lt;code&gt;getconf PAGESIZE&lt;/code&gt;). The &lt;code&gt;MappedByteBuffer&lt;/code&gt; has the advantage of not needing to copy the bytes read from disk, and it will also keep the data in memory until the OS needs to page it out.&lt;/p&gt;

&lt;p&gt;In both cases it&amp;#8217;s reasonable to assume there maybe no additional IO to read all the bytes for the column page.&lt;/p&gt;

&lt;p&gt;If the row does contain a column index the Iterator will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the the column index to work out the distinct ordered set of column pages that contain the columns which may exist in the row (filtered columns from the Bloom Filter).&lt;/li&gt;

&lt;li&gt;Seek to the start position for each page of columns identified in step 1.&lt;/li&gt;

&lt;li&gt;Scan through the entire page of columns, de-serialise each column fully (including it&amp;#8217;s value) and add it to the result set if it&amp;#8217;s name is in the list of columns queried for.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The search path is reduced by only scanning the pages that contain columns which passed through the bloom filter. However there is no facility to stop a particular page scan until the end of the page is reached.&lt;/p&gt;

&lt;p&gt;For both indexed and non indexed rows all the columns which match the query are eagerly read from disk for each SSTable and held in the Iterator.&lt;/p&gt;

&lt;h2 id='slice_from_read'&gt;Slice From Read&lt;/h2&gt;

&lt;p&gt;A slice query that does not specify a list of columns is turned into a &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/SliceFromReadCommand.java'&gt;SliceFromReadCommand&lt;/a&gt;. It will contain optional start and finish column names and a column count. Again the work for reading columns from disk is done by an Iterator, this time the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/SSTableSliceIterator.java'&gt;SSTableSliceIterator&lt;/a&gt; which wraps either a &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/SimpleSliceReader.java'&gt;SimpleSliceReader&lt;/a&gt; or an &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/IndexedSliceReader.java'&gt;IndexedSliceReader&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;SSTableSliceIterator&lt;/code&gt; is responsible for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ensuring the key exists in the SSTable and seeking to it&amp;#8217;s position.&lt;/li&gt;

&lt;li&gt;Reading the key from the row.&lt;/li&gt;

&lt;li&gt;Reading the row size.&lt;/li&gt;

&lt;li&gt;Deciding which Iterator to use.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;SimpleSliceReader&lt;/code&gt; is used when the query does not specify a start column and uses the default ascending column order (i.e. does not specify reversed). Otherwise the &lt;code&gt;IndexedSliceReader&lt;/code&gt; is used.&lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;SimpleSliceReader&lt;/code&gt; is used it will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Skip and ignore the row level Bloom Filter.&lt;/li&gt;

&lt;li&gt;Skip and ignore the column index.&lt;/li&gt;

&lt;li&gt;Read the row level Tombstone.&lt;/li&gt;

&lt;li&gt;Read the column count.&lt;/li&gt;

&lt;li&gt;De-serialise each column in turn on demand as requested through the Iterator interface.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;SimpleSliceReader&lt;/code&gt; will only stop the scan when either the end of the row is reached, or a column with a name greater than the finish column is reached.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s the responsibility of the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/filter/SliceQueryFilter.java'&gt;SliceQueryFilter&lt;/a&gt; to only request the columns it needs for the query. This involves determining which columns are &lt;code&gt;live&lt;/code&gt; and contribute to the column count specified in the &lt;code&gt;SliceRange&lt;/code&gt;. The query will be stopped by either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reading &lt;code&gt;count&lt;/code&gt; live columns from the SSTables.&lt;/li&gt;

&lt;li&gt;Reading past the finish column if specified.&lt;/li&gt;

&lt;li&gt;Exhausting all SSTables involved in the read.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;IndexedSliceReader&lt;/code&gt; has a much tougher life. It is used when there is a start column or the columns must be returned in reverse order. It will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Skip and ignore the row level Bloom Filter.&lt;/li&gt;

&lt;li&gt;Read the column index.&lt;/li&gt;

&lt;li&gt;De-serialise the row and read the row level tombstone if present.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Depending on the presence of the column index one of two algorithms will be used to read the columns from disk. The &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/IndexedSliceReader.java#L200'&gt;SimpleBlockFetcher&lt;/a&gt; is used when the row does not contain a column index. It de-serialises the columns in the row, and builds a list that is either disk ordered or reverse disk ordered depending on the query. The scan is stopped by either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reading all columns in the row.&lt;/li&gt;

&lt;li&gt;Reading a column beyond the finish column, if a finish column was specified and reverse is false.&lt;/li&gt;

&lt;li&gt;Reading a column beyond the start column, if a start column was specified and reverse is true.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Unlike the &lt;code&gt;SimpleSliceReader&lt;/code&gt; when the &lt;code&gt;IndexedSliceReader&lt;/code&gt; is used on a small row it will eagerly read all columns from the row which match the query. This only happens in the context of a row which has less then &lt;code&gt;column_index_size_in_kb&lt;/code&gt; of column data .&lt;/p&gt;

&lt;p&gt;The &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/columniterator/IndexedSliceReader.java#L140'&gt;IndexedBlockFetcher&lt;/a&gt; is used for all other cases, it will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the column index to find the first column page to read from.&lt;/li&gt;

&lt;li&gt;Step forwards or backwards (for reversed) through the column pages for the row on demand as each page of columns is exhausted.&lt;/li&gt;

&lt;li&gt;Stop the iteration if the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/io/sstable/IndexHelper.java#L164'&gt;IndexInfo&lt;/a&gt; for the next column page does not contain columns inside the start or finish column range, if a start or finish column are specified.&lt;/li&gt;

&lt;li&gt;De-serialise all columns in each page, building a list of columns in either disk order or reverse disk order.&lt;/li&gt;

&lt;li&gt;Stop a page scan using the same criteria as the &lt;code&gt;SimpleBlockFetcher&lt;/code&gt; above.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;IndexedBlockFetcher&lt;/code&gt; greedily reads the columns in each page just as the &lt;code&gt;SimpleBlockFetcher&lt;/code&gt; does. But will only advance to the next page when the query requests more columns and the previous block of columns has been exhausted.&lt;/p&gt;

&lt;h2 id='so_what_'&gt;So What ?&lt;/h2&gt;

&lt;p&gt;Not much.&lt;/p&gt;

&lt;p&gt;Cassandra works like other databases by using a seek and partial scan approach to finding data. Read requests that require fewer seek+scan operations will be faster than those the require more.&lt;/p&gt;

&lt;p&gt;There are a couple of things we can say about what will make some queries go faster than others though.&lt;/p&gt;

&lt;h3 id='name_locality'&gt;Name Locality&lt;/h3&gt;

&lt;p&gt;Queries by column name that select columns on fewer column pages should be faster that those which are spread out over more column pages. Querying more column pages means reading more data and creating more Column objects. The pathological case is selecting n columns from n column pages.&lt;/p&gt;

&lt;p&gt;Note that queries by column name must also de-serialise the column index, and will pay a constant cost for every query regardless of the number of columns requested or their distribution in the row.&lt;/p&gt;

&lt;h3 id='start_position'&gt;Start Position&lt;/h3&gt;

&lt;p&gt;The entire column index must be de-serialised whenever an offset into the row needs to be found. Queries by that slice columns without specifying a start column and use (default) ascending order will perform better than those that either use a start column or reverse order. However the cost of de-serialising the column index is constant and the position of the start column should have little affect on query performance.&lt;/p&gt;

&lt;p&gt;The overhead should naturally increase with the width of the rows.&lt;/p&gt;

&lt;h2 id='in_motion__the_setup'&gt;In Motion - The Setup&lt;/h2&gt;

&lt;p&gt;The &lt;a href='https://gist.github.com/1074715'&gt;query_profile gist&lt;/a&gt; contains the Python code I used to setup the data and profile queries using the &lt;a href='https://github.com/pycassa/pycassa'&gt;pycassa&lt;/a&gt; library. All tests were done locally on a 2011 Mac Book Pro with 8GB of RAM and spinning disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: you will need to edit the file and set the &lt;code&gt;CASSANDRA_PATH&lt;/code&gt; at the top of the file.&lt;/p&gt;

&lt;p&gt;The tests used columns with a 10 byte name and 25 bytes of data. Together with the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/Column.java#L116'&gt;15 bytes of column meta data&lt;/a&gt; this will create columns that use 50 bytes on disk. For the standard 64KB column page this should give 1,310 columns per column page.&lt;/p&gt;

&lt;p&gt;To verify the column paging I recompiled the 0.8.1 source to include a change to the &lt;a href='https://gist.github.com/1068855'&gt;SSTableNamesIterator that logged&lt;/a&gt; the number of index entries each row had and how many column pages were used during a read.&lt;/p&gt;

&lt;p&gt;The output made the paging look &amp;#8220;good enough&amp;#8221;, for example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#selecting from 1 page
INFO 13:21:52,662 Column Page Count 4/1
INFO 13:22:03,309 Column Page Count 8/1
INFO 13:22:14,070 Column Page Count 77/1
INFO 13:22:26,140 Column Page Count 763/1
INFO 13:22:49,362 Column Page Count 7628/1

#selecting from 50 pages 
INFO 13:32:15,501 Column Page Count 4/4
INFO 13:32:25,696 Column Page Count 8/8
INFO 13:32:36,790 Column Page Count 77/50
INFO 13:32:54,012 Column Page Count 763/50
INFO 13:33:37,511 Column Page Count 7628/50

#sometimes it was off
INFO 13:35:02,987 Column Page Count 77/53
INFO 13:35:04,153 Column Page Count 77/50
INFO 13:35:05,325 Column Page Count 77/51
INFO 13:35:06,489 Column Page Count 77/52
INFO 13:35:15,812 Column Page Count 763/71
INFO 13:35:18,182 Column Page Count 763/63
INFO 13:35:20,536 Column Page Count 763/69
INFO 13:35:22,898 Column Page Count 763/70
INFO 13:35:51,997 Column Page Count 7628/67
INFO 13:36:07,069 Column Page Count 7628/65
INFO 13:36:22,012 Column Page Count 7628/66&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To test the latency of various queries I used the (recent) &amp;#8216;ReadLatency: &amp;#8217; metric available via &lt;code&gt;nodetool cfstats&lt;/code&gt;. This metric wraps the &lt;a href='https://github.com/apache/cassandra/blob/cassandra-0.8.1/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1180'&gt;local processing&lt;/a&gt; of a query for a single CF. The value presented by &lt;code&gt;nodetool&lt;/code&gt; is the most recent value.&lt;/p&gt;

&lt;p&gt;All tests were run 10 times and the min, max 80th and 95th percentiles of latency were recorded.&lt;/p&gt;

&lt;p&gt;Using a clean 0.8.1 install create the following keyspace using &lt;code&gt;bin/cassandra-cli&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;create keyspace query
    with strategy_options=[{replication_factor:1}]
    and placement_strategy = &amp;#39;org.apache.cassandra.locator.SimpleStrategy&amp;#39;;

use query;

create column family NoCache
    with comparator = AsciiType
    and default_validation_class = AsciiType
    and key_validation_class = AsciiType
    and keys_cached = 0
    and rows_cached = 0;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code&gt;query_profile&lt;/code&gt; module will insert the following rows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;#8220;small-row&amp;#8221; with 100 columns, 5K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;no-col-index&amp;#8221; with 1200 columns, 60K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;five-thousand&amp;#8221; with 5000 columns, 244K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;ten-thousand&amp;#8221; with 10000 columns, 488K of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;hundred-thousand&amp;#8221; with 100000 columns, 4.8M of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;one-million&amp;#8221; with 1000000 columns, 48M of data&lt;/li&gt;

&lt;li&gt;&amp;#8220;ten-million&amp;#8221; with 10000000 columns, 480M of data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Insert the data by executing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python query_profile insert_rows&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To keep things simple flush and compact the database so we only have one SSTable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost flush
$ bin/nodetool -h localhost compact query&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At the end of this my &lt;code&gt;/var/lib/cassandra/data/query&lt;/code&gt; directory contained the following SSTables (ignoring the &lt;code&gt;-Compacted&lt;/code&gt; SSTables):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-rw-r--r--  1 aaron  wheel   536M  6 Jul 13:03 NoCache-g-80-Data.db
-rw-r--r--  1 aaron  wheel   1.4K  6 Jul 13:03 NoCache-g-80-Filter.db
-rw-r--r--  1 aaron  wheel   154B  6 Jul 13:03 NoCache-g-80-Index.db
-rw-r--r--  1 aaron  wheel   4.2K  6 Jul 13:03 NoCache-g-80-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;query_profile&lt;/code&gt; contains a warm up function that will slice through all the columns in all the rows, warm up the database by executing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python query_profile warm_up&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='in_motion__name_locality'&gt;In Motion - Name Locality&lt;/h2&gt;

&lt;p&gt;We want to test that selecting columns by name when they are tightly grouped has better performance than selecting widely distributed columns.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;name_locality&lt;/code&gt; test runs through several tests that select up to 100 columns by name from each of the rows. The numbers show a reasonably stable latency for queries that touch the same number of pages, and an increase when the number of pages increases. Which matches the theory.&lt;/p&gt;

&lt;p&gt;Note: the code outputs a WARN message when the test selects less than 100 columns from a row. For various tests this happened on the &amp;#8220;small-row&amp;#8221;, &amp;#8220;no-col-index&amp;#8221;, &amp;#8220;five-thousand&amp;#8221; and &amp;#8220;ten-thousand&amp;#8221; rows. Mostly when selecting a set number of columns from each column page.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python query_profile.py name_locality
Latency is min, 80th percentile, 95th percentile and max.
Test name locality...

100 columns by name, start of the row.
Row            small-row had latency in ms      0.285      0.362     0.3828      0.396
Row         no-col-index had latency in ms      0.848     0.9146     0.9159      0.917
Row        five-thousand had latency in ms      1.257      1.321      1.324      1.324
Row         ten-thousand had latency in ms      1.226      1.311      1.333      1.358
Row     hundred-thousand had latency in ms      1.406      1.483      1.489      1.492
Row          one-million had latency in ms      2.848      2.957      3.019      3.089
Row          ten-million had latency in ms      17.75      18.72      18.76      18.78

100 columns by name, end of the row.
Row            small-row had latency in ms      0.273     0.2846     0.2891      0.294
Row         no-col-index had latency in ms      0.674     0.7078     0.7107      0.714
Row        five-thousand had latency in ms      0.928      0.962     0.9638      0.966
Row         ten-thousand had latency in ms      0.882      0.901     0.9172      0.937
Row     hundred-thousand had latency in ms      0.868     0.9138     0.9154      0.917
Row          one-million had latency in ms      2.553       2.72       2.73      2.739
Row          ten-million had latency in ms      16.95      17.67      17.82      17.97

100 columns by name, middle of row.
Row            small-row had latency in ms      0.273     0.3134     0.3422      0.368
Row         no-col-index had latency in ms      0.835     0.9052     0.9092      0.913
Row        five-thousand had latency in ms      1.719      1.772      1.839      1.915
Row         ten-thousand had latency in ms      1.726      1.802      1.818      1.837
Row     hundred-thousand had latency in ms      1.853      1.978      6.987       13.1
Row          one-million had latency in ms      2.682      2.832      2.927      3.041
Row          ten-million had latency in ms      17.27      18.59      18.72      18.75

100 columns by name, first 2 cols from 50 random pages
Row            small-row had latency in ms      0.115      0.119     0.1195       0.12
Row         no-col-index had latency in ms       0.37     0.4008     0.4019      0.403
Row        five-thousand had latency in ms      1.504      1.596      1.614      1.637
Row         ten-thousand had latency in ms       3.14      3.445      3.493      3.547
Row     hundred-thousand had latency in ms      25.25      27.63      36.62       47.5
Row          one-million had latency in ms      26.68      28.47      28.92      29.47
Row          ten-million had latency in ms      40.95      43.55      43.96      44.13

100 columns by name, last 2 cols from 50 random pages
Row            small-row had latency in ms      0.109      0.115     0.1218       0.13
Row         no-col-index had latency in ms      0.349     0.3536     0.3553      0.357
Row        five-thousand had latency in ms      1.778      1.854      1.877      1.897
Row         ten-thousand had latency in ms       3.39      3.664      3.715      3.765
Row     hundred-thousand had latency in ms      25.17      26.77       28.0      29.35
Row          one-million had latency in ms      26.44      28.04      28.19      28.35
Row          ten-million had latency in ms      40.57      44.26      44.69      44.74

100 columns by name, random 2 cols from 50 random pages
Row            small-row had latency in ms      0.107      0.117     0.1174      0.118
Row         no-col-index had latency in ms       0.36     0.4356     0.4445      0.445
Row        five-thousand had latency in ms       1.72      1.808      2.795      3.998
Row         ten-thousand had latency in ms      3.367      3.682      3.705      3.715
Row     hundred-thousand had latency in ms      25.45      27.23      28.93      30.95
Row          one-million had latency in ms       32.6      38.43      38.91      39.45
Row          ten-million had latency in ms      48.76      53.88      54.09      54.19

100 columns by name, first col from 100 random pages
Row            small-row had latency in ms      0.106      0.113     0.1306      0.152
Row         no-col-index had latency in ms      0.326     0.3508      0.351      0.351
Row        five-thousand had latency in ms      1.361      1.451      1.483      1.518
Row         ten-thousand had latency in ms      2.853      3.071      3.102      3.136
Row     hundred-thousand had latency in ms      37.65      40.08      40.92      41.84
Row          one-million had latency in ms      51.23      54.41      55.93       57.6
Row          ten-million had latency in ms      64.67      68.56      69.24       69.6

100 columns by name, last col from 100 random pages
Row            small-row had latency in ms      0.105      0.113     0.1153      0.118
Row         no-col-index had latency in ms      0.325      0.349     0.3553      0.363
Row        five-thousand had latency in ms      1.601      1.695      1.698        1.7
Row         ten-thousand had latency in ms      3.058      3.404      4.317      5.429
Row     hundred-thousand had latency in ms      37.34      39.52      40.54      41.72
Row          one-million had latency in ms      50.76      53.92      54.58      55.14
Row          ten-million had latency in ms       64.2      68.46      69.28      70.07

100 columns by name, random col from 100 random pages
Row            small-row had latency in ms       0.11     0.1156     0.1169      0.118
Row         no-col-index had latency in ms      0.327     0.3628     0.3721      0.382
Row        five-thousand had latency in ms      1.541      1.656      1.664      1.674
Row         ten-thousand had latency in ms      2.719      3.272      3.279      3.282
Row     hundred-thousand had latency in ms      35.51      37.52       38.8      40.35
Row          one-million had latency in ms      50.11      52.91       53.8       54.7
Row          ten-million had latency in ms      65.06      68.67       69.3      69.88&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='in_motion__start_position'&gt;In Motion - Start Position&lt;/h2&gt;

&lt;p&gt;We want to show that any start position in a (forward) slice query has worse performance than one that does not. And that the position of the start column has little impact on the query performance.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;start_position&lt;/code&gt; test runs through several tests that select a slice of up to 100 columns with different start positions.&lt;/p&gt;

&lt;p&gt;The test for &amp;#8220;100 columns from the start of the row with a start col&amp;#8221; takes significantly longer than &amp;#8220;100 columns from with no start column&amp;#8221; for all rows other than &amp;#8220;small-row&amp;#8221;. For all rows other than &amp;#8220;no-col-index&amp;#8221; I attribute this to the column index. For &amp;#8220;no-col-index&amp;#8221; row I attribute the increased latency to eagerly reading the entire page of columns. Performance was then reasonably similar for the other tests which used a start column.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;m not sure why the test for &amp;#8220;100 columns from the start of the second page&amp;#8221; was an outlier. Hopefully I&amp;#8217;ll get to take a longer look, but it seemed to consistency perform worse than other tests.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ python query_profile.py start_position
Test start position...

Latency is min, 80th percentile, 95th percentile and max.
100 columns from with no start column
Row            small-row latency in ms      0.181     0.1934     0.2134      0.237
Row         no-col-index latency in ms       0.18     0.1918     0.1956        0.2
Row        five-thousand latency in ms      0.169      0.185     0.1864      0.188
Row         ten-thousand latency in ms      0.187     0.1936     0.1972      0.201
Row     hundred-thousand latency in ms      0.183     0.1936     0.1963      0.199
Row          one-million latency in ms      0.181     0.1948     0.1959      0.197
Row          ten-million latency in ms      0.178     0.1952     0.1965      0.197

Latency is min, 80th percentile, 95th percentile and max.
100 columns from the start of the row with a start col
Row            small-row latency in ms      0.135      0.142     0.1424      0.143
Row         no-col-index latency in ms       0.34     0.3498       0.35       0.35
Row        five-thousand latency in ms      0.358     0.3676     0.3684      0.369
Row         ten-thousand latency in ms       0.36      0.376      0.511      0.676
Row     hundred-thousand latency in ms      0.358        0.4     0.5005      0.622
Row          one-million latency in ms      0.456      0.495     0.5006      0.505
Row          ten-million latency in ms      1.345      1.418       1.44      1.467

Latency is min, 80th percentile, 95th percentile and max.
100 columns from the start of the second page
Row            small-row latency in ms      0.128     0.1412     0.1424      0.143
Row         no-col-index latency in ms      0.327      0.338     0.3385      0.339
Row        five-thousand latency in ms      0.637     0.6488     0.6634      0.681
Row         ten-thousand latency in ms      0.615      0.651     0.6519      0.653
Row     hundred-thousand latency in ms      0.628     0.6726     0.6762       0.68
Row          one-million latency in ms      0.706     0.7636     0.7671      0.771
Row          ten-million latency in ms      1.569      1.706      3.148       4.91

Latency is min, 80th percentile, 95th percentile and max.
100 columns starting half way through the row
Row            small-row latency in ms      0.099     0.1028     0.1125      0.124
Row         no-col-index latency in ms      0.354     0.3626      0.363      0.363
Row        five-thousand latency in ms      0.397     0.4068     0.4318      0.462
Row         ten-thousand latency in ms      0.389     0.4138     0.4167       0.42
Row     hundred-thousand latency in ms      0.358     0.3828      0.383      0.383
Row          one-million latency in ms      0.456     0.5048     0.5135      0.524
Row          ten-million latency in ms      1.347       1.47      1.489      1.504

Latency is min, 80th percentile, 95th percentile and max.
100 columns starting from the last page 
Row            small-row latency in ms      0.126      0.143     0.1434      0.144
Row         no-col-index latency in ms      0.325     0.3406     0.3425      0.343
Row        five-thousand latency in ms      0.595      0.634     0.6646      0.696
Row         ten-thousand latency in ms      0.556     0.5694     0.5795      0.591
Row     hundred-thousand latency in ms      0.441     0.4942     0.5027      0.512
Row          one-million latency in ms      0.405     0.4468     0.4507      0.454
Row          ten-million latency in ms      1.272      1.372      1.393      1.417&lt;/code&gt;&lt;/pre&gt;</content>
 </entry>
 
 <entry>
   <title>Down For Me?</title>
   <link href="http://thelastpickle.com/2011/06/13/Down-For-Me" />
   <updated>2011-06-13T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/06/13/Down-For-Me</id>
   <content type="html">&lt;p&gt;For a read or write request to &lt;em&gt;start&lt;/em&gt; in Cassandra at least as many nodes must be seen as &lt;code&gt;UP&lt;/code&gt; by the coordinator node as the request has specified via the ConsistencyLevel. Otherwise the client will get an &lt;code&gt;UnavailableException&lt;/code&gt; and the cluster will appear down for that request. That may not necessarily mean it is down for all keys or all requests.&lt;/p&gt;

&lt;p&gt;The Replication Factor, number of nodes, the Consistency Level and luck all play a part of determining how many nodes can be lost in a Cassandra cluster before it is unavailable for 100% of the keys. Before it reaches that point though the cluster may go through a period of partial failure where some keys will not be available at some CL levels.&lt;/p&gt;

&lt;p&gt;The partial failure support baked into the system is nice thing to have. But most people will be interested in keeping 100% of the keys available at the required Consistency Level. So most of discussion below talks about keeping the cluster up for 100% of the keys at the &lt;code&gt;QUORUM&lt;/code&gt; CL.&lt;/p&gt;

&lt;h2 id='which_nodes_'&gt;Which nodes ?&lt;/h2&gt;

&lt;p&gt;When it comes to counting the &lt;code&gt;UP&lt;/code&gt; nodes for a request we only consider the &lt;em&gt;Natural Endpoints&lt;/em&gt; for a key. These are the nodes identified by the &lt;code&gt;placement_strategy&lt;/code&gt; (set when the Keyspace was created) as the replicas for a key, and they never change. All read and write operations for a key, using the same partitioner, will select those same endpoints. Otherwise write operations could plonk down data that reads could never find.&lt;/p&gt;

&lt;p&gt;The row key is first &lt;em&gt;Decorated&lt;/em&gt; by the &lt;code&gt;partitioner&lt;/code&gt; (specified in &lt;code&gt;conf/cassandra.yaml&lt;/code&gt;) to create the token used to locate the row in the cluster. For example the &lt;code&gt;RandomPartitioner&lt;/code&gt; uses an MD5 transform to turn the key into a 128bit token.&lt;/p&gt;

&lt;p&gt;Is using the &lt;code&gt;SimpleStrategy&lt;/code&gt; it will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Order the nodes in the ring by their &lt;code&gt;initial_token&lt;/code&gt;.&lt;/li&gt;

&lt;li&gt;Select the node whose token range includes the token as the first replica.&lt;/li&gt;

&lt;li&gt;Select the next RF-1 nodes in the ordered ring as the remaining replicas.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The nodes loop around, so a row may be replicated on the last 2 nodes and the first 1.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;NetworkTopologyStrategy&lt;/code&gt; use a more involved algorithm that considers the Data Centre and Rack the node is assigned to by the Snitch. There is a good discussion of the process from &lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12092.html'&gt;Peter Schuller&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id='the_weird_consistency_level'&gt;The weird Consistency Level&lt;/h2&gt;

&lt;p&gt;There is one Consistency Level that does not behave like the others, so lets just get it out of the way first. CL &lt;code&gt;ANY&lt;/code&gt; for write requests allows the coordinator node to store the mutation in the form of a &lt;a href='http://wiki.apache.org/cassandra/HintedHandoff'&gt;Hinted Handoff&lt;/a&gt; on &lt;strong&gt;any node in the cluster&lt;/strong&gt;, which in practice means on the coordinator itself.&lt;/p&gt;

&lt;p&gt;This is useful in cases where extreme write uptime is needed. The sort of extreme where the write &lt;strong&gt;cannot&lt;/strong&gt; be reliably read until a &lt;code&gt;nodetool repair&lt;/code&gt; operation has been completed. &lt;code&gt;Hinted Handoffs&lt;/code&gt; must be delivered to Natural Endpoints before they can be included in a read operation.&lt;/p&gt;

&lt;p&gt;If you write at CL &lt;code&gt;ANY&lt;/code&gt; and some of the Natural Endpoints are up, the write and the Hints will be sent to them. The coordinator will only be used to store the Hints in cases where all the Natural Endpoints are down.&lt;/p&gt;

&lt;p&gt;For more information on Hinted Handoff see &lt;a href='http://www.datastax.com/dev/blog/understanding-hinted-handoff'&gt;Jonathan&amp;#8217;s recent post&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id='consistency_levels'&gt;Consistency Levels&lt;/h2&gt;

&lt;p&gt;For all the other Consistency Levels the read or write request is directed to one of the Natural Endpoints. Hinted Handoffs may be used as part of the request but are do not considered when determining if the cluster is available for the request.&lt;/p&gt;

&lt;p&gt;The named Consistency Levels &lt;code&gt;ONE&lt;/code&gt;, &lt;code&gt;TWO&lt;/code&gt; and &lt;code&gt;THREE&lt;/code&gt; are pretty easy to understand. Once, two or three replicas for the key must be seen as &lt;code&gt;UP&lt;/code&gt; before the operation will start. CL &lt;code&gt;ONE&lt;/code&gt; is the most often used of these.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;QUORUM&lt;/code&gt; is calculated as &lt;code&gt;floor(RF \ 2) + 1&lt;/code&gt;. This is most used CL level and in my opinion should be the starting CL for all applications until a reason is found to change (performance is &lt;em&gt;not&lt;/em&gt; a reason).&lt;/p&gt;

&lt;p&gt;For RF levels below 3 the &lt;code&gt;QUORUM&lt;/code&gt; is the same as the RF level so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RF 1 - QUORUM = 1&lt;/li&gt;

&lt;li&gt;RF 2 - QUORUM = 2&lt;/li&gt;

&lt;li&gt;RF 3 - QUORUM = 2&lt;/li&gt;

&lt;li&gt;RF 4 - QUORUM = 3&lt;/li&gt;

&lt;li&gt;RF 5 - QUORUM = 3&lt;/li&gt;

&lt;li&gt;RF 6 - QUORUM = 4&lt;/li&gt;

&lt;li&gt;RF 7 - QUORUM = 4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the &lt;code&gt;NetworkTopologyStrategy&lt;/code&gt; is used each data centre has it&amp;#8217;s own RF, and the standard &lt;code&gt;QUORUM&lt;/code&gt; is calculated using the total RF for the cluster.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LOCAL_QUORUM&lt;/code&gt; and &lt;code&gt;EACH_QUORUM&lt;/code&gt; can be used with the &lt;code&gt;NetworkTopologyStrategy&lt;/code&gt; and they instruct the coordinator to also consider the Data Centre the nodes are located in. A write is always sent to all &lt;code&gt;UP&lt;/code&gt; replicas, this&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;LOCAL_QUORUM&lt;/code&gt; only the RF of the local data centre is considered when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculating how many nodes to block for.&lt;/li&gt;

&lt;li&gt;Checking if enough nodes are &lt;code&gt;UP&lt;/code&gt; for the request.&lt;/li&gt;

&lt;li&gt;Counting if CL nodes have responded to the request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;EACH_QUORUM&lt;/code&gt; works in a similar way but the tests apply to every DC in the cluster.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALL&lt;/code&gt; requires that all replicas for the row be &lt;code&gt;UP&lt;/code&gt; before the request will start.&lt;/p&gt;

&lt;h2 id='a_failing_range'&gt;A failing Range&lt;/h2&gt;

&lt;p&gt;A simple, but incomplete, way to think about the cluster been available is to focus a one key range and it&amp;#8217;s replicas.&lt;/p&gt;

&lt;p&gt;Consider RF 3, at QUORUM if one node is lost the range will still be available. If two nodes are lost the range will not be available for QUORUM operations, but will still be available for ONE and ANY requests.&lt;/p&gt;

&lt;p&gt;For any number of nodes in the cluster, a range will become unavailable it more than (RF - CL) nodes are &lt;code&gt;DOWN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now consider a cluster with 50 nodes, RF 5, QUORUM operations and the SimpleStrategy. If 3 adjacent nodes nodes go down the range assigned to the first one will no longer be available, as their will only be 2 replicas &lt;code&gt;UP&lt;/code&gt;. The cluster will be down for 2% of the possible keys, not terrible but it&amp;#8217;s no longer up for 100%. The range assigned to the second down node will have 3 &lt;code&gt;UP&lt;/code&gt; replicas and the range assigned to the third will have 4 &lt;code&gt;UP&lt;/code&gt; replicas. The nodes do not have to be adjacent in the ring for this occur, it could be any nodes in the replica set for the range. It&amp;#8217;s just easier to think about when they are adjacent.&lt;/p&gt;

&lt;p&gt;Spreading replicas for a key across nodes with different physical infrastructure is a good way to mitigate this risk. The &lt;code&gt;NetworkTopologyStrategy&lt;/code&gt; distributes the replicas for a DC across the available Racks. As defined by either the &lt;code&gt;RackInferringSnitch&lt;/code&gt;, &lt;code&gt;PropertyFileSnitch&lt;/code&gt; or the &lt;code&gt;EC2Snitch&lt;/code&gt; which uses AWS Availability Zones as racks. The &lt;code&gt;SimpleSnitch&lt;/code&gt; puts all nodes into &lt;code&gt;rack1&lt;/code&gt; in &lt;code&gt;datacenter1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the worst case failure scenario the cluster can sustain up to (RF - CL) failures and still remain available for 100% of the keys.&lt;/p&gt;

&lt;h2 id='a_failing_cluster'&gt;A failing Cluster&lt;/h2&gt;

&lt;p&gt;The best case scenario for failure is when the node failures are evenly divided amongst the replicas for a range. So that every RF number of failures only removes one node from the available replica set for each range. To know how many nodes we can lose for a Consistency Level to still be available for 100% of the keys multiply by RF-CL.&lt;/p&gt;

&lt;p&gt;For a 5 node cluster with RF 3 at &lt;code&gt;QUORUM&lt;/code&gt; this is (5 / 3) * (3 - 2) or 1. For other cluster sizes the number is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def max_failure(num_nodes, rf, block_for):
    print &amp;quot;For up to %s nodes with RF %s and blocking for %s nodes...&amp;quot; % (
        num_nodes, rf, block_for)
    print &amp;quot;Number Nodes / max_failure&amp;quot;;
    for n in range(1, num_nodes + 1):
         print &amp;quot;%s / %s&amp;quot; % (n, ( int(n/rf) * (rf - block_for)))

max_failure(10, 3, 2)

For up to 10 nodes with RF 3 and blocking for 2 nodes...
Number Nodes / max_failure
1 / 0
2 / 0
3 / 1
4 / 1
5 / 1
6 / 2
7 / 2
8 / 2
9 / 3
10 / 3&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the best case failure scenario the cluster can sustain up to floor(number of nodes / RF) * (RF - CL) failed nodes and still remain up for 100% of the keys.&lt;/p&gt;

&lt;h2 id='the_view_from_one_node'&gt;The view from one node&lt;/h2&gt;

&lt;p&gt;Considering which nodes are &lt;code&gt;UP&lt;/code&gt; or &lt;code&gt;DOWN&lt;/code&gt; is always from the perspective of the coordinator node. Network Partitions also play a part in the deciding if the cluster is available, as the nodes must be both running and contactable by the coordinator.&lt;/p&gt;

&lt;p&gt;At the small scale if a client connects to a node in the cluster that has lost connectivity to other nodes in the cluster it will consider them all &lt;code&gt;DOWN&lt;/code&gt; and be unavailable for all &lt;code&gt;QUORUM&lt;/code&gt; requests. The client will receive a &lt;code&gt;UnavailableException&lt;/code&gt; and should connect to another node and try request. Other nodes in the cluster may be in a bigger partition that contains enough &lt;code&gt;UP&lt;/code&gt; replicas for the request to complete.&lt;/p&gt;

&lt;p&gt;At a bigger scale when using Amazon AWS it&amp;#8217;s more likely that nodes an Availability Zone will lose connectivity with nodes from a different AZ then from nodes in the same AZ.&lt;/p&gt;

&lt;p&gt;With two AZ&amp;#8217;s operations at &lt;code&gt;QUORUM&lt;/code&gt; will require nodes in both AZ&amp;#8217;s as neither will hold &lt;code&gt;QUORUM&lt;/code&gt; replicas. So a network partition between the two would result in 100% of the keys been down in both AZ&amp;#8217;s.&lt;/p&gt;

&lt;p&gt;With three AZ&amp;#8217;s each AZ will hold one third of the replicas, and any two together may provide enough &lt;code&gt;UP&lt;/code&gt; replicas to support &lt;code&gt;QUORUM&lt;/code&gt; operations. The cluster could sustain a network partition so long as each AZ can talk to at least one other AZ.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Recent Cassandra AWS Reading</title>
   <link href="http://thelastpickle.com/2011/05/21/Recent-Cassandra-AWS-Reading" />
   <updated>2011-05-21T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/05/21/Recent-Cassandra-AWS-Reading</id>
   <content type="html">&lt;p&gt;There&amp;#8217;s been a few good AWS discussions recently on the Cassandra User List, and some interesting blog posts.&lt;/p&gt;

&lt;h2 id='user_list_discussions'&gt;User List discussions&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12831.html'&gt;best way to backup&lt;/a&gt; on EC2 and the &lt;a href='https://github.com/simplegeo/tablesnap'&gt;SimpleGeo.com table snap&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12586.html'&gt;EC2 Stress Tests&lt;/a&gt; with lots of hard numbers&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12502.html'&gt;Multi DC Deployment&lt;/a&gt; discussion&lt;/li&gt;

&lt;li&gt;Work on a new &lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg13214.html'&gt;EC2 Snitch&lt;/a&gt; to make multi region deployments easier using &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2452'&gt;CASSANDRA-2452&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Using &lt;a href='http://www.vyatta.com/'&gt;Vyatta&lt;/a&gt; and &lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12733.html'&gt;IP address resolution in MultiDC setup (EC2)/VIP&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Token selection for a balanced &lt;a href='http://www.mail-archive.com/user@cassandra.apache.org/msg12975.html'&gt;Replica data distributing between racks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id='blogs_and_such'&gt;Blogs and such&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href='http://www.datastax.com/dev/blog/setting-up-a-cassandra-cluster-with-the-datastax-ami'&gt;Setting up a Cassandra cluster with the DataStax AMI&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.datastax.com/docs/0.8/brisk/install_brisk_ami'&gt;Installing the Brisk AMI on Amazon EC2&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;Amazon documentation on &lt;a href='http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?instance-storage-concepts.html'&gt;EC2 Instance Storage&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.theserverlabs.com/blog/2010/07/08/ec2-persistence-strategies/'&gt;Persistence Strategies for Amazon EC2&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://aws.amazon.com/message/65648/'&gt;Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html'&gt;RAID0 ephemeral storage on AWS EC2&lt;/a&gt;&lt;/li&gt;

&lt;li&gt;&lt;a href='http://amistrongeryet.blogspot.com/2010/04/three-latency-anomalies.html'&gt;Three Latency Anomalies&lt;/a&gt; a look at EC2 networking&lt;/li&gt;
&lt;/ul&gt;</content>
 </entry>
 
 <entry>
   <title>Deletes and Tombstones</title>
   <link href="http://thelastpickle.com/2011/05/15/Deletes-and-Tombstones" />
   <updated>2011-05-15T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/05/15/Deletes-and-Tombstones</id>
   <content type="html">&lt;p&gt;Deletes in Cassandra rely on Tombstones to support the Eventual Consistency model. Tombstones are markers that can exist at different levels of the data model and let the cluster know that a delete was recored on a replica, and when it happened. Tombstones then play a role in keeping deleted data hidden and help with freeing space used by deleted columns on disk.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s possible to delete columns, super columns and entire rows in Cassandra. However in this post I&amp;#8217;m going to look at the simple case of deleting a single column in a Standard Column Family.&lt;/p&gt;

&lt;h2 id='remembering_things_forgotten'&gt;Remembering things forgotten&lt;/h2&gt;

&lt;p&gt;In a simple case of a single node traditional RDBMS deleting data is relatively straight forward process. Rows in Index and Data pages are marked as unused and other than recording the transaction in the commit log, the fact that the delete happened is forgotten.&lt;/p&gt;

&lt;p&gt;As the &lt;a href='http://wiki.apache.org/cassandra/DistributedDeletes'&gt;Distributed Deletes&lt;/a&gt; wiki page points out, things are a bit more complicated in Cassandra. From the perspective of the coordinator node (the one the client is connected to) one or more nodes may be down when the delete is executed. So long as the requested Consistency Level is achieved the delete can still proceed, but if we forget that the delete happened it will not be possible to reach the &lt;em&gt;correct&lt;/em&gt; consistent view of the data later. The nodes that were offline will say &amp;#8220;foo == bar&amp;#8221; and the nodes that did the delete will have nothing to say.&lt;/p&gt;

&lt;p&gt;When a column is deleted a &lt;code&gt;DeletedColumn&lt;/code&gt; aka Tombstone is created in Cassandra. The DeletedColumn will have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;name: Name of the column deleted&lt;/li&gt;

&lt;li&gt;value: Current server time as seconds since the unix epoch (integer). This is known as the &lt;code&gt;localDeleteTime&lt;/code&gt; and is used during the (cassandra) GC process.&lt;/li&gt;

&lt;li&gt;timestamp: As provided by the client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mutation is then &lt;a href='http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/'&gt;applied&lt;/a&gt; to the memtable in one of two ways. If the memtable does not contain the named column for the row it is simply added to the memtable. If there is an existing column it is &lt;a href='https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/Column.java#L179'&gt;reconcile() &amp;#8216;d&lt;/a&gt; with the Deleted Column. The new &lt;code&gt;DeletedColumn&lt;/code&gt; will replace the existing column if it has a higher (client provided) timestamp. The localDeleteTime is not used for reconciliation. At this point any previous column value in the memtable is lost and will not be persisted to disk.&lt;/p&gt;

&lt;p&gt;We now have a tombstone. If there are no other mutations the &lt;code&gt;DeletedColumn&lt;/code&gt; will later be persisted to the SSTable just like any other column.&lt;/p&gt;

&lt;h2 id='local_reads_for_local_queries'&gt;Local Reads for &lt;a href='http://www.bbc.co.uk/comedy/clips/p006vm6j/the_league_of_gentlemen_a_local_shop_for_local_people/'&gt;Local&lt;/a&gt; Queries&lt;/h2&gt;

&lt;p&gt;During a local read for a row value the same reconciliation process that was used during the delete request runs. Multiple row fragments are retrieved from the current memtable, memtables pending flush and SSTables on disk. The fragments are reduced and the columns with the same name reconciled to arrive at the current value.&lt;/p&gt;

&lt;p&gt;For example if there is a row fragment in an SSTable for key &amp;#8220;foo&amp;#8221; that says columns &amp;#8220;bar&amp;#8221; is &amp;#8220;baz&amp;#8221;, and a DeletedColumn in another SSTable with a higher time stamp when they are reconciled the DeletedColumn will &amp;#8220;win&amp;#8221;. The current view of the row will be that the &amp;#8220;bar&amp;#8221; column is deleted.&lt;/p&gt;

&lt;p&gt;When a read query is filtering the reconciled candidate columns it will include the &lt;code&gt;DeletedColumn&lt;/code&gt; in the result set if the localDeletionTime recorded for it is beyond the current &lt;code&gt;gcBefore&lt;/code&gt; time. &lt;code&gt;gcBefore&lt;/code&gt; is determined when the query starts as the current time (to 1 second resolution) less the &lt;code&gt;GCGraceSeconds&lt;/code&gt; value specified for the Column Family. If the deletion is before &lt;code&gt;gcBefore&lt;/code&gt; it is totally ignored, more on the GC process below. When a slice based query (e.g. get the first 10 columns) executes it adds the &lt;code&gt;DeletedColumn&lt;/code&gt; to the result set, but does not count it towards the limit of columns the query has asked for.&lt;/p&gt;

&lt;p&gt;The deleted columns are filtered out of the result set as the last step before returning it to the client. In the simple case of a local read at CL ONE the result of the local query is filtered and returned. For cluster reads the result of the query must be reconciled with CL replicas before the coordinator can say it has a consistent result. And this is why the &lt;code&gt;DeletedColumn&lt;/code&gt;&amp;#8217;s are still in the result set.&lt;/p&gt;

&lt;h2 id='readers_digest'&gt;Readers Digest&lt;/h2&gt;

&lt;p&gt;If a read request involves more than one replica only the &amp;#8220;closest&amp;#8221; (as determined by the &lt;a href='http://wiki.apache.org/cassandra/ArchitectureInternals?highlight=%28snitch%29'&gt;snitch&lt;/a&gt;) is asked to return the full result set. The result set it returns is will include any &lt;code&gt;DeletedColumn&lt;/code&gt;&amp;#8217;s read during the local read.&lt;/p&gt;

&lt;p&gt;The other nodes in the cluster are asked to return &lt;a href='http://wiki.apache.org/cassandra/DigestQueries'&gt;digest&lt;/a&gt; of their local read. The digest is a &lt;a href='http://en.wikipedia.org/wiki/Md5'&gt;MD5&lt;/a&gt; hash of the columns, their values, timestamps and other meta data. Once the coordinator node has received CL read responses, including the data response, it compares the digests with each other and the digest of the data response.&lt;/p&gt;

&lt;p&gt;If they all match then the read is consistent at the CL requested. Otherwise there is an inconsistency that needs to be repaired. The inconsistency could come from one replica including a &lt;code&gt;DeletedColumn&lt;/code&gt; in it&amp;#8217;s digest while another includes the previously deleted value.&lt;/p&gt;

&lt;h2 id='read_repair_sort_of'&gt;Read Repair (sort of)&lt;/h2&gt;

&lt;p&gt;Once a &lt;a href='https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/DigestMismatchException.java'&gt;&lt;code&gt;DigestMismatch&lt;/code&gt;&lt;/a&gt; is detected the differences have to be reconciled before the read response is returned to the client. The process that does this is part of the &lt;a href='http://wiki.apache.org/cassandra/ReadRepair'&gt;ReadRepair&lt;/a&gt; feature, but depending on the circumstances it may not be considered a full Read Repair.&lt;/p&gt;

&lt;p&gt;Read Repair is considered to be happening when the coordinator requests a response from all replicas for a row, no just those that are needed to meet the Consistency Level for the request. This read repair will run for normal get / multi_get / indexed get operations but not range scans, and can be controlled with the &lt;code&gt;read_repair_chance&lt;/code&gt; config setting. Even though all replicas receive the request, the request will only block of Consistency Level replicas. For now I&amp;#8217;m going to ignore this process and just consider what happens if replicas required for the Consistency Level do not agree.&lt;/p&gt;

&lt;p&gt;Once the &lt;code&gt;DigestMismatch&lt;/code&gt; is detected all the replicas that were involved in the read are asked to do the read again and return a full data response to the coordinator. Their responses are then reconciled using the same process as a normal mutation request to get a consistent result for the query.&lt;/p&gt;

&lt;p&gt;This is where the tombstone do their work of remembering that the delete happened. For example if replica 1 says &amp;#8216;foo&amp;#8217; is a normal column, and replica 2 says &amp;#8216;foo&amp;#8217; is a &lt;code&gt;DeletedColumn&lt;/code&gt; with a higher time stamp the value from replica 2 will be used.&lt;/p&gt;

&lt;p&gt;Once a reconciled view of the row has been created each replica is asynchronously sent a mutation with the difference between it&amp;#8217;s data and the reconciled data. The read can then return the reconciled view of the data to the client while the repair of the replicas that participated in the request is going on in the background.&lt;/p&gt;

&lt;h2 id='free'&gt;Free&lt;/h2&gt;

&lt;p&gt;Existing data on disk for a column is not deleted when the delete mutation is processed. Cassandra never mutates on disk data.&lt;/p&gt;

&lt;p&gt;Instead the &lt;a href='http://wiki.apache.org/cassandra/MemtableSSTable?highlight=%28compaction%29'&gt;compaction&lt;/a&gt; process reconciles the data in multiple SSTables on disk. The row fragments from each SSTable are collated and columns with the same name reconciled using the process we&amp;#8217;ve already seen. The result of the compaction is a single SSTable that contains the same &amp;#8220;truth&amp;#8221; as the input files, but may be considerably smaller due to reconciling overwrites and deletions.&lt;/p&gt;

&lt;p&gt;For example, there could be three SSTables that contain a value for the &amp;#8220;foo&amp;#8221; column. In the first the value is &amp;#8220;bar&amp;#8221;, in the second the value is a 16KB string, and in the third it&amp;#8217;s a &lt;code&gt;DeletedColumn&lt;/code&gt;. Before compaction runs the value of the column is nothing, however on disk it uses at least 16KB. After compaction the value will still be nothing, but it will be stored in a single SSTable and use only a few bytes.&lt;/p&gt;

&lt;p&gt;Minor compaction typically runs frequently, so data that is created and deleted reasonably quickly will be deleted from disk quickly. Data that has been through several generations of compaction before it is deleted will not be deleted from disk as quickly. The &lt;code&gt;DeletedColumn&lt;/code&gt; will continue to be written into the new, compacted, SStables until it&amp;#8217;s &lt;code&gt;localDeletionTime&lt;/code&gt; occurs before the current (server) time less the &lt;code&gt;GCGraceSeconds&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id='the_reincarnation_of_paul_reveres_horse'&gt;The reincarnation of &lt;a href='http://www.bobdylan.com/songs/tombstone-blues'&gt;Paul Revere’s horse&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Compactions run locally and by default automatically based on load. Once &lt;code&gt;GCGraceSeconds&lt;/code&gt; has elapsed since the delete a new compaction on the SSTable will purge the tombstone from disk and the delete will be forgotten. But how do we guarantee that other nodes in the cluster have seen the delete before it&amp;#8217;s deleted?&lt;/p&gt;

&lt;p&gt;If a node goes down for longer than &lt;code&gt;max_hint_window_in_ms&lt;/code&gt; it will no longer have hints recorded for it. If the column was never read a Read Repair could not have run. If the column was read but Read Repair was not active and the node was not included in CL nodes it would not have received a repair.&lt;/p&gt;

&lt;p&gt;Deletes operate under Eventual Consistency just like writing a value, with the added complication that they have an built in expiry time (&lt;code&gt;GCGraceSeconds&lt;/code&gt;). If the replicas for a value have not seen the delete before that time there is a risk of deleted data reappearing. The stop that happening &lt;code&gt;nodetool reapir&lt;/code&gt; &lt;a href='http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds'&gt;needs to be run&lt;/a&gt; at least every &lt;code&gt;GCGraceSeconds&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id='in_motion_on_a_single_node'&gt;In Motion on a single node&lt;/h2&gt;

&lt;p&gt;There is not a lot to look at on a single node, but it&amp;#8217;s pretty easy to see the tombstones and the column value persisted into an SSTable.&lt;/p&gt;

&lt;p&gt;One a fresh 0.7 install create the sample schema:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/cassandra-cli -h localhost -f conf/schema-sample.txt&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The jump into the &lt;code&gt;cassandra-cli&lt;/code&gt; and insert one column:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/cassandra-cli -h localhost
Connected to: &amp;quot;Test Cluster&amp;quot; on localhost/9160
Welcome to cassandra CLI.

Type &amp;#39;help;&amp;#39; or &amp;#39;?&amp;#39; for help. Type &amp;#39;quit;&amp;#39; or &amp;#39;exit;&amp;#39; to quit.
[default@unknown] use Keyspace1;                                    
Authenticated to keyspace: Keyspace1
[default@Keyspace1] set Standard1[&amp;#39;foo&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Flush the data from the memtable so our delete cannot be applied in memory:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost flush Keyspace1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now delete the column using the cli:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@Keyspace1] del Standard1[&amp;#39;foo&amp;#39;][&amp;#39;bar&amp;#39;];        
column removed.&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Flush the delete to disk:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost flush Keyspace1&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally use the &lt;code&gt;sstable2json&lt;/code&gt; dump the data from the SSTables:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/sstable2json /var/lib/cassandra/data/Keyspace1/Standard1-f-1-Data.db 
{
&amp;quot;666f6f&amp;quot;: [[&amp;quot;626172&amp;quot;, &amp;quot;62617a&amp;quot;, 1305412876934000, false]]
}
$ bin/sstable2json /var/lib/cassandra/data/Keyspace1/Standard1-f-2-Data.db 
{
&amp;quot;666f6f&amp;quot;: [[&amp;quot;626172&amp;quot;, &amp;quot;4dcf05ab&amp;quot;, 1305413035092000, true]]
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The top level key in the output is the row key and each column is formatted as: name, value, timestamp, delete flag. The first SSTable still contains the &amp;#8220;bar&amp;#8221; column and the &amp;#8220;baz&amp;#8221; data written in the first set operation. The second SSTable also has the &amp;#8220;bar&amp;#8221; column, however the deleted flag is &lt;code&gt;true&lt;/code&gt; and the value is now the (server) time stamp of deletion. This is the timestamp used with &lt;code&gt;GCGraceSeconds&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Read the data back:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@Keyspace1] get Standard1[&amp;#39;foo&amp;#39;][&amp;#39;bar&amp;#39;];
Value was not found&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id='in_motion_on_a_cluster'&gt;In Motion on a cluster&lt;/h2&gt;

&lt;p&gt;It&amp;#8217;s a bit more involved but it&amp;#8217;s also possible to see a Read Repair happening on a cluster.&lt;/p&gt;

&lt;p&gt;I normally run a 2 node cluster on my mac book using the direction from &lt;a href='http://www.onemanclapping.org/2010/03/running-multiple-cassandra-nodes-on.html'&gt;Gary Dusbabek&lt;/a&gt;. There is also this really handy tool from &lt;a href='https://github.com/pcmanus/ccm'&gt;Sylvain&lt;/a&gt; that I&amp;#8217;ve been meaning to try, or you can just run a normal two node cluster.&lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;conf/cassandra.yaml&lt;/code&gt; for both nodes to disable Hinted Handoff:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# See http://wiki.apache.org/cassandra/HintedHandoff
hinted_handoff_enabled: true&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will prevent the nodes telling each other about mutations that happen while the other is down.&lt;/p&gt;

&lt;p&gt;I&amp;#8217;ve also edited conf/log4j-server.properties to set logging at &lt;code&gt;DEBUG&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Start the nodes and create and populate schema below using &lt;code&gt;bin/cassandra-cli&lt;/code&gt; (schema definition is for a 0.7.5 cluster):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;create keyspace dev 
  with placement_strategy = &amp;#39;org.apache.cassandra.locator.SimpleStrategy&amp;#39;
  and replication_factor = 2;

use dev;

create column family data 
  with comparator = AsciiType;

set data[ascii(&amp;#39;foo&amp;#39;)][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Flush the data on both nodes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h your-node1 flush dev
$ bin/nodetool -h your-node2 flush dev&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Shutdown node 1, connect to node 2 and delete the column using the cli:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/cassandra-cli -h your-node2
Connected to: &amp;quot;Test Cluster&amp;quot; on 127.0.0.2/9160
Welcome to cassandra CLI.

Type &amp;#39;help;&amp;#39; or &amp;#39;?&amp;#39; for help. Type &amp;#39;quit;&amp;#39; or &amp;#39;exit;&amp;#39; to quit.
[default@unknown] use dev;
Authenticated to keyspace: dev
[default@dev] del data[&amp;#39;foo&amp;#39;][&amp;#39;bar&amp;#39;];
column removed.&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Restart node 1, connect via the cli, set the consistency level to ALL to ensure Read Repair runs and read the deleted column:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/cassandra-cli -h your-node1
Connected to: &amp;quot;Test Cluster&amp;quot; on 127.0.0.1/9160
Welcome to cassandra CLI.

Type &amp;#39;help;&amp;#39; or &amp;#39;?&amp;#39; for help. Type &amp;#39;quit;&amp;#39; or &amp;#39;exit;&amp;#39; to quit.
[default@unknown] use dev;
Authenticated to keyspace: dev
[default@dev] consistencylevel as ALL;
Consistency level is set to &amp;#39;ALL&amp;#39;.
[default@dev] get data[&amp;#39;foo&amp;#39;][&amp;#39;bar&amp;#39;]; 
Value was not found&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This would have also worked at QUORUM (which is 2 for RF 2).&lt;/p&gt;

&lt;p&gt;Digging through the logs on node1 you should see that the read will block for 2 nodes and Read Repair is enabled:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ReadCallback.java (line 84) Blockfor/repair is 2/true; setting up requests
    to localhost/127.0.0.1,/127.0.0.2&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next should be some messages about reading the data locally and handling the response from your-node2. Then the &lt;code&gt;RowDigestResolver&lt;/code&gt; will say it&amp;#8217;s resolving 2 responses before detecting the mismatch and raising an error:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;RowDigestResolver.java (line 62) resolving 2 responses
StorageProxy.java (line 398) Digest mismatch:
    org.apache.cassandra.service.DigestMismatchException: Mismatch for key
        DecoratedKey(110673303387115207421586718101067225896, 666f6f)
        (34ec2eb2ec21eb3d05fb6f97cbf84c51 vs ba720207d87132da833ae2579487b172)
    at org.apache.cassandra.service.RowDigestResolver.resolve(
        RowDigestResolver.java:106)
    at org.apache.cassandra.service.RowDigestResolver.resolve(
        RowDigestResolver.java:30)
...&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There should now be a few messages about doing the read again, and receiving the response from your-node2 again. Once the data responses have been received by node1 it will resolve the differences and send out mutations to the nodes than need them. In this case it&amp;#8217;s node1 and you can see the mutation logged from &lt;code&gt;Table&lt;/code&gt; (aka Keyspace):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;RowRepairResolver.java (line 50) resolving 2 responses
RowRepairResolver.java (line 76) versions merged
RowRepairResolver.java (line 85) resolve: 1 ms.
Table.java (line 337) applying mutation of row 666f6f&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you run the read again you should see the RowRepairResolver log &amp;#8216;digests verified&amp;#8217; to say the data matched and there are no repairs to run.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>How are Memtables measured?</title>
   <link href="http://thelastpickle.com/2011/05/04/How-are-Memtables-measured" />
   <updated>2011-05-04T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/05/04/How-are-Memtables-measured</id>
   <content type="html">&lt;p&gt;&lt;strong&gt;Updated:&lt;/strong&gt; I&amp;#8217;ve added information on the new &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; setting in version 0.8 and improved the information about &lt;code&gt;memtable_throughput&lt;/code&gt;. Thanks for the feedback.&lt;/p&gt;

&lt;p&gt;In version 0.7 of Cassandra there are 3 configuration settings that can trigger flushing a memtable to disk. Version 0.8 adds support for a global &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; which may replace the previous 3 settings.&lt;/p&gt;

&lt;p&gt;First the 0.7 settings.&lt;/p&gt;

&lt;h2 id='memtable_flush_after_minutes'&gt;memtable_flush_after (minutes)&lt;/h2&gt;

&lt;p&gt;This is the maximum number of minutes a memtable should stay in memory for if it has received writes. When the memtable is created the current time is recorded which is then checked every 10 seconds. If after the time span either the primary memtable for the Column Family or any secondary index memtables have received writes they are replaced and flushed to disk.&lt;/p&gt;

&lt;p&gt;Unfortunately as the &lt;a href='http://wiki.apache.org/cassandra/StorageConfiguration'&gt;wiki&lt;/a&gt; points out, there is a good reason to make this value small and a good reason to make it large.&lt;/p&gt;

&lt;p&gt;A log file cannot be deleted until all of the segments / records it contains have been marked as completed. This &lt;a href='http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/'&gt;happens&lt;/a&gt; when a memtable is flushed to disk. Because the log file is shared by all Column Familes, one Column Family that has long living memtables can prevent log files from been deleted and their disk space freed up.&lt;/p&gt;

&lt;p&gt;However smaller values can cause multiple memtables to expire at the same time, prior to version 0.7 this could cause flush requests to block. Version 0.7 added &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-1099'&gt;memtable_flush_writers&lt;/a&gt; and &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2333'&gt;memtable_flush_queue_size&lt;/a&gt; ,but it can still slow down the IO system (see conf/cassandra.yaml for more info). The best approach is to tune the other memtable thresholds to trigger when you want them, and leave this setting as a backup.&lt;/p&gt;

&lt;p&gt;The wiki recommends setting a default of 1440 minutes, or 24 hours which is also the default.&lt;/p&gt;

&lt;h2 id='memtable_operations_millions'&gt;memtable_operations (millions)&lt;/h2&gt;

&lt;p&gt;The Memtable tracks the number of operations applied to it by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;counting the number of top level columns (i.e. Columns for a Standard CF or Super Columns for a Super CF) in a mutation.&lt;/li&gt;

&lt;li&gt;considering a row level deletion as a single operation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deletions and Insertions are both considered a &lt;code&gt;mutation&lt;/code&gt;, inserting 3 columns increases the count by 3 and deleting 3 named columns increases the count by 3. Note that the number of sub columns in a Super Column is ignored. So inserting 10 sub columns into 1 super column increases the count by one, and deleting a super column by name that has 10 sub column increases the count by one.&lt;/p&gt;

&lt;p&gt;The operations threshold (and the size threshold) is checked before applying a mutation and the flush is not requested until after the mutation has completed. So it&amp;#8217;s possible for the memtable to contain more than &lt;code&gt;memtable_operations&lt;/code&gt; when it is flushed to disk.&lt;/p&gt;

&lt;p&gt;A different way to think about this setting, and &lt;code&gt;memtable_throughput&lt;/code&gt;, is as &lt;code&gt;sstable_min_operations&lt;/code&gt; and &lt;code&gt;sstable_min_bytes&lt;/code&gt;. In general operation new sstables are created after at least &lt;code&gt;sstable_min_operations&lt;/code&gt; operations have occurred or at most &lt;code&gt;sstable_min_bytes&lt;/code&gt; bytes will be written.&lt;/p&gt;

&lt;p&gt;If no value is provided when the CF is created it is set to the default memtable throughput in MB (below) / 64 * 0.3, so it&amp;#8217;s 300k ops per 64MB of throughput. If you have a CF that contains many small columns it&amp;#8217;s a good idea to look at the log entries for memtable flushes to see if the ops threshold is triggering early and causing small memtables to be frequently written.&lt;/p&gt;

&lt;h2 id='memtable_throughput'&gt;memtable_throughput&lt;/h2&gt;

&lt;p&gt;Throughput for the memtable is tracked and tested at the same time as the operation count. But counting the byte size of the data is more involved and depends on the type of the column. The size of the data when serialised to disk is counted as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;standard column byte size is&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;length of the key byte array plus 2 bytes to store the length&lt;/li&gt;

&lt;li&gt;1 byte to indicate if the column has been deleted&lt;/li&gt;

&lt;li&gt;8 bytes for the timestamp&lt;/li&gt;

&lt;li&gt;length of the value byte array plus 4 bytes to store the length&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;expiring columns (those with a TTL) add another 8 bytes to the length of a standard column.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;deleted columns (tombstones) are the same as standard columns but the value is always 4 bytes long.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;counter columns (in Cassandra v0.8) add another 8 bytes to the length of a standard column. Note that for a counter column the value will always be an 8 byte long.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;super columns sum the size of all contained columns and then add&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;length of the name byte array plus 2 bytes to store the length&lt;/li&gt;

&lt;li&gt;4 bytes to indicate when it was deleted&lt;/li&gt;

&lt;li&gt;8 bytes to store the timestamp for the deletion&lt;/li&gt;

&lt;li&gt;4 bytes to store the number of sub columns&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Currently the calculation for the super column only includes the sum of the sub columns. I think this needs to be changed.)&lt;/p&gt;

&lt;p&gt;A row deletion will add zero bytes to the throughput counter.&lt;/p&gt;

&lt;p&gt;The byte size of the mutation is always added to the counter, if one mutation replaces columns in the memtable their byte size &lt;em&gt;is not&lt;/em&gt; subtracted from the counter.&lt;/p&gt;

&lt;p&gt;Getting this setting wrong is a very easy way to run out of memory. From version 0.7 onwards the worse case scenario is up to CF Count + Secondary Index Count + &lt;code&gt;memtable_flush_queue_size&lt;/code&gt; (defaults to 4) + &lt;code&gt;memtable_flush_writers&lt;/code&gt; (defaults to 1 per data directory) memtables in memory the JVM at once. It&amp;#8217;s best to be conservative, follow the &lt;a href='http://wiki.apache.org/cassandra/MemtableThresholds'&gt;wiki advice&lt;/a&gt; and consider that the JVM may take up to 10 times as much memory as it takes to serialise the data to disk.&lt;/p&gt;

&lt;p&gt;And that&amp;#8217;s the problem with this threshold. It&amp;#8217;s &lt;strong&gt;not&lt;/strong&gt; measuring how much memory a memtable is using in the JVM Heap, it&amp;#8217;s measuring the maximum amount of bytes it could take to serialise the data (excluding the index and bloom filter) to disk. Which makes it a difficult knob to use when tuning how much memory Cassandra uses.&lt;/p&gt;

&lt;p&gt;If a value is not provided when the Column Family is created it will default to 1/16th the maximum size of the JVM Heap at the time. This value stored with the Column Family meta data and will not change again. Typical values are around 128MB to 256MB.&lt;/p&gt;

&lt;h2 id='memtable_total_space_in_mb'&gt;memtable_total_space_in_mb&lt;/h2&gt;

&lt;p&gt;Version 0.8 &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2006'&gt;adds&lt;/a&gt; the per node &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; setting which makes life easier and may eventually &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2449'&gt;replace&lt;/a&gt; the 3 previous settings. While it&amp;#8217;s fun to play with the per CF settings, it can also be a pain when building real systems that need to stay up.&lt;/p&gt;

&lt;p&gt;If no value is set in &lt;code&gt;conf/cassandra.yaml&lt;/code&gt; the setting will default to one third of the JVM max Heap size. If it is set to zero the setting is disabled and only the old per CF thresholds will be used. If the global setting is enabled and there are per CF settings &lt;strong&gt;both&lt;/strong&gt; of them will be used.&lt;/p&gt;

&lt;p&gt;There are two parts to the global memtable size, measuring the real memory usage of the memtable and flushing. First the measuring.&lt;/p&gt;

&lt;p&gt;Rather than track every byte allocated the server periodically works out the ratio between the throughput as measured above and the real in memory bytes as measured by JVM. The in memory byte count is worked out using the &lt;a href='http://download.oracle.com/javase/6/docs/api/java/lang/instrument/package-summary.html'&gt;Instrumentation Java Package&lt;/a&gt; and code from &lt;a href='https://github.com/jbellis/jamm'&gt;Jonathan Ellis&lt;/a&gt;. After a mutation has been applied to the memtable, but before a flush is requested, Cassandra calculates the &amp;#8220;Live Ratio&amp;#8221; if more than twice as many operations (as calculated above) have been processed since the last time it was calculated.&lt;/p&gt;

&lt;p&gt;Measuring the Live Ratio is done asynchronously and involves measuring the real memory size of all the keys, super columns and columns in the memtable and dividing it by the throughput as measured above. For sanity the ratio is clamped between 1.0 and 64.0, if the value is outside of this range a &lt;code&gt;WARN&lt;/code&gt; level log message will let you know. Finally the ratio for the Column Family is updated to the new ratio if and only if the new ratio is higher than the previous one. An &lt;code&gt;INFO&lt;/code&gt; level message will let you know when the ratio is calculated, how long it took and if it changed.&lt;/p&gt;

&lt;p&gt;Next the &lt;code&gt;MeteredFlusher&lt;/code&gt; runs every second and uses a two phase approach to keeping the live memory use under the setting. First it looks at the total live bytes for each Column Family, including it&amp;#8217;s secondary indexes, and flushes CF&amp;#8217;s that could potentially fill the memory if &lt;a href='https://issues.apache.org/jira/browse/CASSANDRA-2006?focusedCommentId=13010860&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13010860'&gt;allowed to create&lt;/a&gt; memtables of this size. Live bytes are calculated by multiplying the throughput as perviously measured by the Live Ratio. The Flusher considers the Column Family to be using too much memory if it&amp;#8217;s current live size is more than &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; divided by the maximum number of memtables the Column Family could have in memory. The calculation for this is similar to the one presented above for &lt;code&gt;memtable_throughput&lt;/code&gt; but it includes secondary indexes and a fudge factor that takes into account how the live size is measured.&lt;/p&gt;

&lt;p&gt;For example if &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; is 100MB, and &lt;code&gt;memtable_flush_writers&lt;/code&gt; is the default 1 (with one data directory), and &lt;code&gt;memtable_flush_queue_size&lt;/code&gt; is the default 4, and a Column Family has no secondary indexes. The CF will not be allowed to get above one seventh of 100MB or 14MB, as if the CF filled the flush pipeline with 7 memtables of this size it would take 98MB. At a more sensible 2GB for &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt; (1/3 of a 6GB JVM Heap) the CF will be flushed if it is using 292MB of live memory.&lt;/p&gt;

&lt;p&gt;(I&amp;#8217;ve skipped a couple of things here such as considering the bytes currently been flushed.)&lt;/p&gt;

&lt;p&gt;The flusher process will end there if the number of bytes that were flushing when it started plus the bytes for all the CF&amp;#8217;s that were not flushed in the first phase is less than &lt;code&gt;memtable_total_space_in_mb&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The second phase flushes the CF&amp;#8217;s in order of largest to smallest until the total live size (including the bytes currently been flushed) gets down below the target setting.&lt;/p&gt;

&lt;p&gt;This new setting (and the existing &lt;code&gt;flush_largest_memtables_at&lt;/code&gt;) should make it harder to shot yourself in the foot with memory management and easier for new users to feel comfortable with the server.&lt;/p&gt;

&lt;h2 id='in_motion'&gt;In Motion&lt;/h2&gt;

&lt;p&gt;You can check the per CF thresholds as well as the current tracked values for a memtable using &lt;code&gt;bin/nodetool&lt;/code&gt;, &lt;code&gt;bin/cassandra-cli&lt;/code&gt; or JConsole. I&amp;#8217;m not aware of any current features to check the &lt;code&gt;Live Ratio&lt;/code&gt; or &lt;code&gt;Live Size&lt;/code&gt; of a CF.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;bin/nodetool cfstats&lt;/code&gt; can tell you the current operation count (&amp;#8216;Memtable Columns Count&amp;#8217;) and throughput (&amp;#8216;Memtable Data Size&amp;#8217;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ./bin/nodetool -h localhost cfstats
Keyspace: dev
    Read Count: 1
    Read Latency: 0.897 ms.
    Write Count: 2
    Write Latency: 0.051 ms.
    Pending Tasks: 0
        Column Family: data
        SSTable count: 2
        Space used (live): 9530
        Space used (total): 9530
        Memtable Columns Count: 1
        Memtable Data Size: 26
        Memtable Switch Count: 1
        Read Count: 1
        Read Latency: 0.897 ms.
        Write Count: 2
        Write Latency: 0.020 ms.
        Pending Tasks: 0
        Key cache capacity: 200000
        Key cache size: 2
        Key cache hit rate: 0.0
        Row cache: disabled
        Compacted row minimum size: 51
        Compacted row maximum size: 86
        Compacted row mean size: 73&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&amp;#8216;bin/cassandra-cli&amp;#8217; can tell you the current thresholds using either &lt;code&gt;describe keyspace&lt;/code&gt; or &lt;code&gt;show keyspaces&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@dev] describe keyspace;
Keyspace: dev:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
    Options: [datacenter1:2]
  Column Families:
    ColumnFamily: data
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.AsciiType
      Row cache size / save period in seconds: 0.0/0
      Key cache size / save period in seconds: 200000.0/14400
      Memtable thresholds: 0.29062499999999997/62/1440 (millions of ops/MB/minutes)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or using JConsole connect to the server, select MBeans and then navigate to org.apacge.cassandra.db.ColumnFamilies.&amp;#60;your-keyspace&amp;#62;.&amp;#60;your-column-family&amp;#62;. There you can find the current thresholds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MemtableFlushAfterMins&lt;/li&gt;

&lt;li&gt;MemtableOperationsInMillions&lt;/li&gt;

&lt;li&gt;MemtableThroughputInMB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the running values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MemtableColumnCount&lt;/li&gt;

&lt;li&gt;MemtableDataSize&lt;/li&gt;
&lt;/ul&gt;</content>
 </entry>
 
 <entry>
   <title>The forces of Write and Read</title>
   <link href="http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read" />
   <updated>2011-04-28T00:00:00-07:00</updated>
   <id>http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read</id>
   <content type="html">&lt;p&gt;Requests that write and read data in Cassandra, like any data base, have competing characteristics that need to be balanced. This post compares the approach taken by Cassandra to traditional Relation Database Systems.&lt;/p&gt;

&lt;p&gt;When writing data the most efficient approach to splat it somewhere on disk with the minimum of fuss. Preferably at the end of a file, or even in a new file. The write is going to take longer if the data is carefully placed in the correct ordered position in an existing file.&lt;/p&gt;

&lt;p&gt;Problem is reading data is going to be much more efficient if the data is in the correct ordered location, and preferably in just one location. If the data is unordered and spread out in a file, or multiple files, the read is going to take longer.&lt;/p&gt;

&lt;p&gt;Traditional Relational databases such as SQL Server are optimised for read requests, so write requests ensure that data is written to the correct ordered location. However the &lt;a href='http://en.wikipedia.org/wiki/Database_log'&gt;transaction log / write ahead log&lt;/a&gt; lets the database delay writing the data to the correct location on disk, providing a handy performance boost while still supporting &lt;a href='http://en.wikipedia.org/wiki/ACID'&gt;ACID&lt;/a&gt; transactions.&lt;/p&gt;

&lt;h2 id='roughly_speaking_write_requests'&gt;Roughly speaking Write requests&lt;/h2&gt;

&lt;p&gt;(very) Roughly speaking a write request in a RDBMS follows this path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Append the write request to the end of the transaction log.&lt;/li&gt;

&lt;li&gt;Locate the rows modified by the request or those adjacent to new rows, by reading index, data or other &lt;a href='http://msdn.microsoft.com/en-us/library/ms190969.aspx'&gt;pages&lt;/a&gt; from disk if not already in memory.&lt;/li&gt;

&lt;li&gt;Modify the index and data pages in memory to insert or update rows.&lt;/li&gt;

&lt;li&gt;Acknowledge the write to the client.&lt;/li&gt;

&lt;li&gt;&lt;a href='http://en.wikipedia.org/wiki/Transaction_checkpoint#Recovery_process'&gt;Checkpoint&lt;/a&gt; the modified pages by flushing them to disk in the correct ordered location, and mark the log records associated with the changes as no longer needed. The Checkpoint process is typically executed asynchronously (&lt;a href='http://msdn.microsoft.com/en-us/library/ms189573.aspx'&gt;SQL Server&lt;/a&gt; &lt;a href='http://dev.mysql.com/doc/refman/5.0/en/innodb-checkpoints.html'&gt;MySQL&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(I think in reality the log record is written once a statement completes inserting/updating an individual row. Also a record needs to be written to say if the transaction was committed or aborted, so that during log file recovery the system knows if the transaction should be &lt;a href='http://en.wikipedia.org/wiki/Transaction_processing#Rollforward'&gt;replayed&lt;/a&gt; or &lt;a href='http://en.wikipedia.org/wiki/Transaction_processing#Rollback'&gt;ignored&lt;/a&gt;. I&amp;#8217;m also ignoring all the locking.)&lt;/p&gt;

&lt;p&gt;The write ahead log lets the write request dump the data &lt;em&gt;somewhere&lt;/em&gt; on disk fast, then work on structures which are &lt;em&gt;hopefully&lt;/em&gt; in memory. For updates care is taken to check for existing data, which is then read and mutated in memory. The read request ignores the transaction log, it reads the single source of truth from the correctly ordered index and data pages that are hopefully still in memory.&lt;/p&gt;

&lt;p&gt;If the RDBMS uses &lt;a href='http://en.wikipedia.org/wiki/Multiversion_concurrency_control'&gt;Multi Version Concurrency Control&lt;/a&gt; life is a bit more difficult for the read request, so lets ignore that for now.&lt;/p&gt;

&lt;p&gt;On the other hand Cassandra is optimised for write requests, so read requests need to do more work to ensure they have the correct data.&lt;/p&gt;

&lt;p&gt;Again roughly speaking a write request for Cassandra follows the path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Appended the write to the end of the write ahead log.&lt;/li&gt;

&lt;li&gt;Modify an in memory structure called a &lt;a href='http://wiki.apache.org/cassandra/MemtableSSTable'&gt;Memtable&lt;/a&gt; to store the row changes present in the request.&lt;/li&gt;

&lt;li&gt;Queue the Memtable to be flushed to disk (in the background) if per Memtable &lt;a href='http://wiki.apache.org/cassandra/MemtableThresholds'&gt;thresholds&lt;/a&gt; on operation count, data size or time to be violated. A new empty Memtable is immediately put in it&amp;#8217;s place.&lt;/li&gt;

&lt;li&gt;Acknowledge the write to the client.&lt;/li&gt;

&lt;li&gt;Flush the Memtable to disk by creating a new &lt;a href='http://wiki.apache.org/cassandra/ArchitectureSSTable'&gt;SSTable (Sorted String Table)&lt;/a&gt; on disk, and mark the log records as no longer needed. Typically the flush task is executed asynchronously.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(This is the write process on a single node, I&amp;#8217;m ignoring how the write is applied to the cluster and other features required for &amp;#8216;Eventual Consistency&amp;#8217;.)&lt;/p&gt;

&lt;p&gt;Again the write ahead log lets the write process dump the data &lt;em&gt;somewhere&lt;/em&gt; on disk fast, and then work on structures which are &lt;em&gt;guaranteed&lt;/em&gt; to be in memory. No effort is taken to read the existing data, if a row was updated it may now partially exist in memory and in multiple SSTables on disk.&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s probably important to point out that the RDBMS Checkpoint process works with an &lt;em&gt;extent&lt;/em&gt; which defines it&amp;#8217;s on disk allocation unit. Sql Server uses &lt;a href='http://msdn.microsoft.com/en-us/library/ms190969.aspx'&gt;extents&lt;/a&gt; that contain 8 8KB pages. MySQL uses a &lt;a href='http://dev.mysql.com/doc/refman/5.1/en/create-tablespace.html'&gt;default EXTENT_SIZE&lt;/a&gt; of 1MB for the extent, and the InnoDB engine has a &lt;a href='http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html'&gt;hard coded&lt;/a&gt; 16KB page size. Once one extent has been written it may be necessary to seek to anther random part of the disk to write to next extent. Though I&amp;#8217;m sure they do all sorts of clever things to make it nice and fast, this is the basic unit they deal with.&lt;/p&gt;

&lt;p&gt;In Cassandra the maximum Memtable size is configured per &lt;a href='http://www.datastax.com/docs/0.7/data_model/column_families'&gt;Column Family&lt;/a&gt; (sort of like a table but I dislike that analogy). Memory thresholds of 128MB, 256MB or higher are &lt;a href='http://wiki.apache.org/cassandra/MemtableThresholds'&gt;common&lt;/a&gt;, though Memtables may flush at lower sizes due violating other thresholds or in response to certain system events. The Flush task is able to write the entire contents of the Memtable to disk as a new file without having to perform random IO seeks.&lt;/p&gt;

&lt;h2 id='dear_readers'&gt;Dear Readers&lt;/h2&gt;

&lt;p&gt;In Cassandra read requests potentially have a lot more to do that those in a traditional RDBMS. A typical read request in a RDBMS to get a row by Primary Key seeks through the nicely ordered &lt;a href='http://en.wikipedia.org/wiki/B_tree'&gt;b-tree&lt;/a&gt; index pages to find the right data page, scans the page to find the appropriate row and reads the complete row (ignoring off page large data). Reading the index and data pages may require disk access, and a &lt;a href='http://dev.mysql.com/doc/refman/5.1/en/query-cache.html'&gt;query cache&lt;/a&gt; may provide direct access to the results of a previous query without needing to negotiate the index.&lt;/p&gt;

&lt;p&gt;Cassandra has two caches that can expedite a read request, a key cache (discussed below) and a &lt;a href='http://www.datastax.com/dev/blog/maximizing-cache-benefit-with-cassandra'&gt;row cache&lt;/a&gt;. The row cache contains all the current data for the row, and a query that needs to read any column from the row can be fully resolved by reading the cache.&lt;/p&gt;

&lt;p&gt;If the requested row cannot be served from the row cache it must be read from disk, however the write path may have left fragments of the row in various SSTables. The read request follows the path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check the in memory &lt;a href='http://en.wikipedia.org/wiki/Bloom_filter'&gt;Bloom Filter&lt;/a&gt; for each SSTable on disk, to build a list of candidate SSTables where the row &lt;em&gt;may&lt;/em&gt; exist.&lt;/li&gt;

&lt;li&gt;If enabled probe the key cache for each candidate SSTable from step 1 to get the position of the row in the data file. The key cache may hold only a sample of keys from the file so misses are possible.&lt;/li&gt;

&lt;li&gt;If the key cache missed, probe the SSTable index summary held in memory which contains a regular sampling of the keys (and their positions) in the SSTable index file to find the preceding (sampled) key. Seek to the position of the preceding sampled key in the SSTable index file, and then scan the index file until the requested key and it&amp;#8217;s data file position is found.&lt;/li&gt;

&lt;li&gt;Seek to the row position in the SSTable data file and then scan the row data to read the columns that match the read request.&lt;/li&gt;

&lt;li&gt;Reduce (potentially) multiple values for each column provided by each SSTable and the current Memtable to a single value by considering column &lt;a href='http://wiki.apache.org/cassandra/API?highlight=%28timestamp%29'&gt;timestamps&lt;/a&gt;, Time To Live settings and &lt;a href='http://wiki.apache.org/cassandra/DistributedDeletes'&gt;deletes&lt;/a&gt;.&lt;/li&gt;

&lt;li&gt;Return the result to the client.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(This is the read path on a single node, I&amp;#8217;m ignoring how the read is processed by the cluster, features required for &amp;#8216;Eventual Consistency&amp;#8217;, and have not mentioned that memory mapped file access is used by default on 64 bit platforms.)&lt;/p&gt;

&lt;p&gt;The read request must piece together the row from potentially many files, which may each involve random IO. Each SSTable which contains any columns for a row must be read to determine if it contains columns that match the request criteria. Hot rows may be present in present in memory in the form of Cassandra key and row caches or OS caches.&lt;/p&gt;

&lt;p&gt;Rows which receive many writes (including overwriting the same column) over time have the potential to perform worse than those that are written to within a short time span. Multiple writes to the same row and column are reconciled in memory by the Memtable so that only the current values for a row are written to disk in the SSTable. The &lt;a href='http://wiki.apache.org/cassandra/MemtableSSTable'&gt;Compaction Process&lt;/a&gt; provides an opposing force to the propensity of write requests to spread data out over multiple files.&lt;/p&gt;

&lt;p&gt;With all this going on it&amp;#8217;s still possible to get millisecond or better read performance. The import thing to note is that it&amp;#8217;s possible to have poorly performing reads due to spread out rows.&lt;/p&gt;

&lt;h2 id='compacting_for_a_sustainable_future'&gt;Compacting for a sustainable future&lt;/h2&gt;

&lt;p&gt;The &lt;a href='http://wiki.apache.org/cassandra/MemtableSSTable'&gt;Compaction Process&lt;/a&gt; is there to help the read requests by reducing the number of SSTables they may have visit. Compaction is also responsible for finalising &lt;a href='http://wiki.apache.org/cassandra/DistributedDeletes'&gt;Distributed Deletes&lt;/a&gt; and Time To Live, but I will ignore those for now.&lt;/p&gt;

&lt;p&gt;Major Compactions are triggered manually via the &lt;a href='http://wiki.apache.org/cassandra/NodeTool'&gt;nodetool&lt;/a&gt; utility and compact all SSTables for a Column Family into one file. Owing to the way compaction chooses which files to process this may result in the new file not been compacted for a very long time. As a result Major compactions are no longer recommended as Minor compactions can do the same things. I&amp;#8217;ve mentioned them here just for completeness.&lt;/p&gt;

&lt;p&gt;Cassandra checks to see if a Minor compaction is needed whenever an SSTable is written to disk. The process groups the files into buckets where every file in a bucket is within 50% of the average size of files in the bucket, small files (less than 50MB) are put in the first bucket. If the bucket contains more than than the Column Family defined min_compaction_threshold files the compaction process will compact up to max_compaction_threshold files together.&lt;/p&gt;

&lt;p&gt;During compaction the row fragments from the SSTables are reconciled to create a single row. When merging columns the one with the highest timestamp is used, if the timestamps are equal the column value is used as a tie breaker.&lt;/p&gt;

&lt;p&gt;The result is a single SSTable that contains an aggregated view of the row that was present in the mulitple input SStables. And a shorter path for any read request that previously needed to potentially read from several SSTables.&lt;/p&gt;

&lt;h2 id='in_motion'&gt;In motion&lt;/h2&gt;

&lt;p&gt;It&amp;#8217;s reasonably easy to see this process happening in slow motion by constructing a contrived schema in Cassandra and playing with some of the command line tools.&lt;/p&gt;

&lt;p&gt;This schema creates two Column Families. The OneOp CF sets the Memtable to flush to disk after just one operation is written, and FiveOp after 5 operations (&lt;code&gt;memtable_operations&lt;/code&gt; is expressed in millions of operations). The &lt;code&gt;min_compaction_threshold&lt;/code&gt; tells compaction to start after we see 4 files. The script then inserts some sample data, all against the same row and the same columns to demonstrate overwrites.&lt;/p&gt;

&lt;p&gt;Note that the Memtable thresholds are checked before the operation starts, but it is not flushed to disk until after the operation completes. For writes it counts the each columns in the request as an operation. In our case it means the SSTables will be created with after 2 or 6 operations rather than the expected 1 or 5.&lt;/p&gt;

&lt;p&gt;Start the cassandra cli and paste the script into the cli.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;create keyspace ReadWrite;

use ReadWrite;

create column family OneOp 
  with memtable_operations = 0.000001
  and min_compaction_threshold = 4;
create column family FiveOp 
  with memtable_operations = 0.000005
  and min_compaction_threshold = 4;

set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz1&amp;#39;;
set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz2&amp;#39;;
set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz3&amp;#39;;
set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz3&amp;#39;;
set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz1&amp;#39;;
set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz1&amp;#39;;


set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;
set FiveOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz&amp;#39;;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The data directory for the ReadWrite keyspace should now look like.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/var/lib/cassandra/data/ReadWrite:
total 200
drwxr-xr-x  22 aaron  wheel   748B 27 Apr 10:42 .
drwxr-xr-x   4 aaron  wheel   136B 27 Apr 10:42 ..
-rw-r--r--   1 aaron  wheel    75B 27 Apr 10:42 FiveOp-f-1-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 FiveOp-f-1-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 FiveOp-f-1-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 FiveOp-f-1-Statistics.db
-rw-r--r--   1 aaron  wheel    75B 27 Apr 10:42 FiveOp-f-2-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 FiveOp-f-2-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 FiveOp-f-2-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 FiveOp-f-2-Statistics.db
-rw-r--r--   1 aaron  wheel    76B 27 Apr 10:42 OneOp-f-1-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 OneOp-f-1-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 OneOp-f-1-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 OneOp-f-1-Statistics.db
-rw-r--r--   1 aaron  wheel    76B 27 Apr 10:42 OneOp-f-2-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 OneOp-f-2-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 OneOp-f-2-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 OneOp-f-2-Statistics.db
-rw-r--r--   1 aaron  wheel    76B 27 Apr 10:42 OneOp-f-3-Data.db
-rw-r--r--   1 aaron  wheel    16B 27 Apr 10:42 OneOp-f-3-Filter.db
-rw-r--r--   1 aaron  wheel    14B 27 Apr 10:42 OneOp-f-3-Index.db
-rw-r--r--   1 aaron  wheel   4.2K 27 Apr 10:42 OneOp-f-3-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are 3 SSTables for OneOp and 2 for FiveOp, and all of them contain the a fragment of the &amp;#8216;foo1&amp;#8217; row.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;nodetool cfhistograms&lt;/code&gt; utility shows recent statistics for requests against a Column Family, where recent means &amp;#8220;since the last time it was run&amp;#8221;. The same statistics are also available via the org.apache.cassandra.db.ColumnFamilies.&amp;#60;keyspace&amp;#62;.&amp;#60;column_family&amp;#62; MBean in JConsole.&lt;/p&gt;

&lt;p&gt;Clear the stats for both Column Families on the command line.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost cfhistograms ReadWrite OneOp
$ bin/nodetool -h localhost cfhistograms ReadWrite FiveOp&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now execute two reads via the cli.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@ReadWrite] get OneOp[&amp;#39;foo1&amp;#39;];
=&amp;gt; (column=626172, value=62617a31, timestamp=1303875486224000)
Returned 1 results.
[default@ReadWrite] get FiveOp[&amp;#39;foo1&amp;#39;];
=&amp;gt; (column=626172, value=62617a, timestamp=1303875486248000)
Returned 1 results.
[default@ReadWrite] &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check the stats for the OneOp Column Family.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost cfhistograms ReadWrite OneOp
ReadWrite/OneOp histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1                  0                 0                 0                 0                 3
2                  0                 0                 0                 0                 0
3                  1                 0                 0                 0                 0
4                  0                 0                 0                 0                 0&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The histogram is saying that 1 request used 3 SSTables, and there are an estimated 3 columns in all the SSTables the Column Family is tracking. The Column Family should be tracking 3 SSTables, and each should have only one column for the &amp;#8216;foo1&amp;#8217; row. The insert script overwrote the same column and the Memtable absorbed one overwrite before been flushed to disk.&lt;/p&gt;

&lt;p&gt;The stats for the FiveOp Column Family also make sense.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost cfhistograms ReadWrite FiveOp
ReadWrite/FiveOp histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1                  0                 0                 0                 0                 2
2                  1                 0                 0                 0                 0&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are only 2 SSTables for the Column Family, both contain a fragment of the &amp;#8216;foo1&amp;#8217; row and both contain only one column.&lt;/p&gt;

&lt;p&gt;A minor compaction on the OneOp Column Family can be triggered by adding just two more &lt;a href='http://www.youtube.com/watch?v=rXH_12QWWg8'&gt;wafer thin&lt;/a&gt; rows via the cli.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@ReadWrite] set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz1&amp;#39;;
Value inserted.
[default@ReadWrite] set OneOp[&amp;#39;foo1&amp;#39;][&amp;#39;bar&amp;#39;] = &amp;#39;baz2&amp;#39;;
Value inserted.
[default@ReadWrite]&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This should trigger a minor compaction as there is now 4 SSTables of similar size. The data directory should now look something like this.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ls -l /var/lib/cassandra/data/ReadWrite/
total 280
-rw-r--r--  1 aaron  wheel    75 27 Apr 15:38 FiveOp-f-1-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 FiveOp-f-1-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 FiveOp-f-1-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 FiveOp-f-1-Statistics.db
-rw-r--r--  1 aaron  wheel    75 27 Apr 15:38 FiveOp-f-2-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 FiveOp-f-2-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 FiveOp-f-2-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 FiveOp-f-2-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-1-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 15:38 OneOp-f-1-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 OneOp-f-1-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 OneOp-f-1-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 OneOp-f-1-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-2-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 15:38 OneOp-f-2-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 OneOp-f-2-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 OneOp-f-2-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 OneOp-f-2-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-3-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 15:38 OneOp-f-3-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 15:38 OneOp-f-3-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 15:38 OneOp-f-3-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 15:38 OneOp-f-3-Statistics.db
-rw-r--r--  1 aaron  wheel     0 27 Apr 16:16 OneOp-f-4-Compacted
-rw-r--r--  1 aaron  wheel    76 27 Apr 16:16 OneOp-f-4-Data.db
-rw-r--r--  1 aaron  wheel    16 27 Apr 16:16 OneOp-f-4-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 16:16 OneOp-f-4-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 16:16 OneOp-f-4-Statistics.db
-rw-r--r--  1 aaron  wheel    76 27 Apr 16:16 OneOp-f-5-Data.db
-rw-r--r--  1 aaron  wheel  1936 27 Apr 16:16 OneOp-f-5-Filter.db
-rw-r--r--  1 aaron  wheel    14 27 Apr 16:16 OneOp-f-5-Index.db
-rw-r--r--  1 aaron  wheel  4264 27 Apr 16:16 OneOp-f-5-Statistics.db&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The SSTables for OneOp numbered 1 through 4 have been compacted, this is tracked in the server and a compacted marker file (e.g. &amp;#8220;OneOp-f-4-Compacted&amp;#8221;) is written to disk to preserve the information across system restarts. The unused files will be physically deleted during &lt;a href='http://wiki.apache.org/cassandra/ArchitectureInternals?highlight=%28delete%29'&gt;JVM Garbage Collection&lt;/a&gt;. If Cassandra detects it is low on disk space when about to write data to disk, it will trigger GC in an effort to reclaim unused space.&lt;/p&gt;

&lt;p&gt;There is also a new SSTable &lt;code&gt;OneOp-f-5-Data.db&lt;/code&gt; that contains the single reconciled row from the other 4 SSTables. A read request against that row should now only use one SSTable:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[default@ReadWrite] get OneOp[&amp;#39;foo1&amp;#39;];
=&amp;gt; (column=626172, value=62617a32, timestamp=1303877811057000)
Returned 1 results.
[default@ReadWrite] &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Check the stats.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ bin/nodetool -h localhost cfhistograms ReadWrite OneOp
ReadWrite/OneOp histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1                  1                 0                 0                 0                 1
2                  0                 0                 0                 0                 0
3                  0                 0                 0                 0                 0
4                  0                 0                 0                 0                 0&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The request used one SSTable and the Column Family has only 1 column in all the SSTables it is tracking.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Introduction to Cassandra</title>
   <link href="http://thelastpickle.com/2011/02/07/Introduction-to-Cassandra" />
   <updated>2011-02-07T00:00:00-08:00</updated>
   <id>http://thelastpickle.com/2011/02/07/Introduction-to-Cassandra</id>
   <content type="html">&lt;p&gt;I&amp;#8217;ve tidied up a previous Introduction to Cassandra presentation with better diagrams and improved explanations. It&amp;#8217;s available as &lt;a href='/files/2011-02-07-introduction-to-cassandra/introduction-to-cassandra.key'&gt;Keynote&lt;/a&gt;, &lt;a href='/files/2011-02-07-introduction-to-cassandra/introduction-to-cassandra.pdf'&gt;PDF&lt;/a&gt; or plain old &lt;a href='/files/2011-02-07-introduction-to-cassandra/introduction-to-cassandra.html'&gt;web pages&lt;/a&gt;. It&amp;#8217;s a basic introduction to the way Cassandra works as a clustered system. It covers Partitioning, Replication, Hinted Handoff, Read Repair and Anti Entrophy. It does not cover the data model.&lt;/p&gt;

&lt;p&gt;Presentation classic is still &lt;a href='http://www.slideshare.net/aaronmorton/well-railedcassandra24112010-5901169'&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have any questions or suggestions let me know. You can also &lt;a href='http://creativecommons.org/licenses/by-nc/3.0/nz/deed.en'&gt;use it&lt;/a&gt; as a starting point if you are creating a Cassandra presentation of your own.&lt;/p&gt;</content>
 </entry>
 
 <entry>
   <title>Hello World</title>
   <link href="http://thelastpickle.com/2011/01/31/Hello-World" />
   <updated>2011-01-31T00:00:00-08:00</updated>
   <id>http://thelastpickle.com/2011/01/31/Hello-World</id>
   <content type="html">&lt;p&gt;How are you?&lt;/p&gt;</content>
 </entry>
 
 
</feed>
