The Last Pickle

Medusa 0.16 was released

2023-09-21T00:00:00+00:00

The k8ssandra team is happy to announce the release of Medusa for Apache Cassandra™ v0.16. This is a special release as we did a major overhaul of the storage classes implementation. We now have less code (and less dependencies) while providing much faster and resilient storage communications.

Back to ~basics~ official SDKs

Medusa has been an open source project for about 4 years now, and a private one for a few more. Over such a long time, other software it depends upon (or doesn’t) evolves as well. More specifically, the SDKs of the major cloud providers evolved a lot. We decided to check if we could replace a lot of custom code doing asynchronous parallelisation and calls to the cloud storage CLI utilities with the official SDKs.

Our storage classes so far relied on two different ways of interacting with the object storage backends:

Apache Libcloud, which provided a Python API for abstracting ourselves from the different protocols. It was convenient and fast for uploading a lot of small files, but very inefficient for large transfers.
Specific cloud vendors CLIs, which were much more efficient with large file transfers, but invoked through subprocesses. This created an overhead that made them inefficient for small file transfers. Relying on subprocesses also created a much more brittle implementation which led the community to create a lot of issues we’ve been struggling to fix.

To cut a long story short, we did it!

We started by looking at S3, where we went for the official boto3. As it turns out, boto3 does all the chunking, throttling, retries and parallelisation for us. Yay!
Next we looked at GCP. Here we went with TalkIQ’s gcloud-aio-storage. It works very well for everything, including the large files. The only thing missing is the throughput throttling.
Finally, we used Azure’s official SDK to cover Azure compatibility. Sadly, this still works without throttling as well.

Right after finishing these replacements, we spotted the following improvements:

The integration tests duration against the storage backends dropped from ~45 min to ~15 min.
- This means Medusa became far more efficient.
- There is now much less time spent managing storage interaction thanks to it being asynchronous to the core.
The Medusa uncompressed image size we bundle into k8ssandra dropped from ~2GB to ~600MB and its build time went from 2 hours to about 15 minutes.
- Aside from giving us much faster feedback loops when working on k8ssandra, this should help k8ssandra itself move a little bit faster.
The file transfers are now much faster.
- We observed up to several hundreds of MB/s per node when moving data from a VM to blob storage within the same provider. The available network speed is the limit now.
- We are also aware that consuming the whole network throughput is not great. That’s why we now have proper throttling for S3 and are working on a solution for this in other backends too.

The only compromise we had to make was to drop Python 3.6 support. This is because the Pythons asyncio features only come in Python 3.7.

The other good stuff

Even though we are the happiest about the storage backends, there is a number of changes that should not go without mention:

We fixed a bug with hierarchical storage containers in Azure. This flavor of blob storage works more like a regular file system, meaning it has a concept of directories. None of the other backends do this (including the vanilla Azure ones), and Medusa was not dealing gracefully with this.
We are now able to build Medusa images for multiple architectures, including the arm64 one.
Medusa can now purge backups of nodes that have been decommissioned, meaning they are no longer present in the most recent backups. Use the new medusa purge-decommissioned command to trigger such a purge.

Upgrade now

We encourage all Medusa users to upgrade to version 0.16 to benefit from all these storage improvements, making it much faster and reliable.

Medusa v0.16 is the default version in the newly released k8ssandra-operator v1.9.0, and it can be used with previous releases by setting the .spec.medusa.containerImage.tag field in your K8ssandraCluster manifests.

Reaper 3.0 for Apache Cassandra was released

2021-10-29T00:00:00+00:00

The K8ssandra team is pleased to announce the release of Reaper 3.0. Let’s dive into the main new features and improvements this major version brings, along with some notable removals.

Storage backends

Over the years, we regularly discussed dropping support for Postgres and H2 with the TLP team. The effort for maintaining these storage backends was moderate, despite our lack of expertise in Postgres, as long as Reaper’s architecture was simple. Complexity grew with more deployment options, culminating with the addition of the sidecar mode.
Some features require different consensus strategies depending on the backend, which sometimes led to implementations that worked well with one backend and were buggy with others.
In order to allow building new features faster, while providing a consistent experience for all users, we decided to drop the Postgres and H2 backends in 3.0.

Apache Cassandra and the managed DataStax Astra DB service are now the only production storage backends for Reaper. The free tier of Astra DB will be more than sufficient for most deployments.

Reaper does not generally require high availability - even complete data loss has mild consequences. Where Astra is not an option, a single Cassandra server can be started on the instance that hosts Reaper, or an existing cluster can be used as a backend data store.

Adaptive Repairs and Schedules

One of the pain points we observed when people start using Reaper is understanding the segment orchestration and knowing how the default timeout impacts the execution of repairs.
Repair is a complex choreography of operations in a distributed system. As such, and especially in the days when Reaper was created, the process could get blocked for several reasons and required a manual restart. The smart folks that designed Reaper at Spotify decided to put a timeout on segments to deal with such blockage, over which they would be terminated and rescheduled.
Problems arise when segments are too big (or have too much entropy) to process within the default 30 minutes timeout, despite not being blocked. They are repeatedly terminated and recreated, and the repair appears to make no progress.
Reaper did a poor job at dealing with this for mainly two reasons:

Each retry will use the same timeout, possibly failing segments forever
Nothing obvious was reported to explain what was failing and how to fix the situation

We fixed the former by using a longer timeout on subsequent retries, which is a simple trick to make repairs more “adaptive”. If the segments are too big, they’ll eventually pass after a few retries. It’s a good first step to improve the experience, but it’s not enough for scheduled repairs as they could end up with the same repeated failures for each run.
This is where we introduce adaptive schedules, which use feedback from past repair runs to adjust either the number of segments or the timeout for the next repair run.

Adaptive schedules will be updated at the end of each repair if the run metrics justify it. The schedule can get a different number of segments or a higher segment timeout depending on the latest run.
The rules are the following:

if more than 20% segments were extended, the number of segments will be raised by 20% on the schedule
if less than 20% segments were extended (and at least one), the timeout will be set to twice the current timeout
if no segment was extended and the maximum duration of segments is below 5 minutes, the number of segments will be reduced by 10% with a minimum of 16 segments per node.

This feature is disabled by default and is configurable on a per schedule basis. The timeout can now be set differently for each schedule, from the UI or the REST API, instead of having to change the Reaper config file and restart the process.

Incremental Repair Triggers

As we celebrate the long awaited improvements in incremental repairs brought by Cassandra 4.0, it was time to embrace them with more appropriate triggers. One metric that incremental repair makes available is the percentage of repaired data per table. When running against too much unrepaired data, incremental repair can put a lot of pressure on a cluster due to the heavy anticompaction process.

The best practice is to run it on a regular basis so that the amount of unrepaired data is kept low. Since your throughput may vary from one table/keyspace to the other, it can be challenging to set the right interval for your incremental repair schedules.

Reaper 3.0 introduces a new trigger for the incremental schedules, which is a threshold of unrepaired data. This allows creating schedules that will start a new run as soon as, for example, 10% of the data for at least one table from the keyspace is unrepaired.

Those triggers are complementary to the interval in days, which could still be necessary for low traffic keyspaces that need to be repaired to secure tombstones.

These new features will allow to securely optimize tombstone deletions by enabling the only_purge_repaired_tombstones compaction subproperty in Cassandra, permitting to reduce gc_grace_seconds down to 3 hours without fearing that deleted data reappears.

Schedules can be edited

That may sound like an obvious feature but previous versions of Reaper didn’t allow for editing of an existing schedule. This led to an annoying procedure where you had to delete the schedule (which isn’t made easy by Reaper either) and recreate it with the new settings.

3.0 fixes that embarrassing situation and adds an edit button to schedules, which allows to change the mutable settings of schedules:

More improvements

In order to protect clusters from running mixed incremental and full repairs in older versions of Cassandra, Reaper would disallow the creation of an incremental repair run/schedule if a full repair had been created on the same set of tables in the past (and vice versa).

Now that incremental repair is safe for production use, it is necessary to allow such mixed repair types. In case of conflict, Reaper 3.0 will display a pop up informing you and allowing to force create the schedule/run:

We’ve also added a special “schema migration mode” for Reaper, which will exit after the schema was created/upgraded. We use this mode in K8ssandra to prevent schema conflicts and allow the schema creation to be executed in an init container that won’t be subject to liveness probes that could trigger the premature termination of the Reaper pod:

java -jar path/to/reaper.jar schema-migration path/to/cassandra-reaper.yaml

There are many other improvements and we invite all users to check the changelog in the GitHub repo.

Upgrade Now

We encourage all Reaper users to upgrade to 3.0.0, while recommending users to carefully prepare their migration out of Postgres/H2. Note that there is no export/import feature and schedules will need to be recreated after the migration.

All instructions to download, install, configure, and use Reaper 3.0.0 are available on the Reaper website.

Certificates management and Cassandra Pt II - cert-manager and Kubernetes

2021-10-28T00:00:00+00:00

The joys of certificate management

Certificate management has long been a bugbear of enterprise environments, and expired certs have been the cause of countless outages. When managing large numbers of services at scale, it helps to have an automated approach to managing certs in order to handle renewal and avoid embarrassing and avoidable downtime.

This is part II of our exploration of certificates and encrypting Cassandra. In this blog post, we will dive into certificate management in Kubernetes. This post builds on a few of the concepts in Part I of this series, where Anthony explained the components of SSL encryption.

Recent years have seen the rise of some fantastic, free, automation-first services like letsencrypt, and no one should be caught flat footed by certificate renewals in 2021. In this blog post, we will look at one Kubernetes native tool that aims to make this process much more ergonomic on Kubernetes; cert-manager.

Recap

Anthony has already discussed several points about certificates. To recap:

In asymmetric encryption and digital signing processes we always have public/private key pairs. We are referring to these as the Keystore Private Signing Key (KS PSK) and Keystore Public Certificate (KS PC).
Public keys can always be openly published and allow senders to communicate to the holder of the matching private key.
A certificate is just a public key - and some additional fields - which has been signed by a certificate authority (CA). A CA is a party trusted by all parties to an encrypted conversation.
When a CA signs a certificate, this is a way for that mutually trusted party to attest that the party holding that certificate is who they say they are.
CA’s themselves use public certificates (Certificate Authority Public Certificate; CA PC) and private signing keys (the Certificate Authority Private Signing Key; CA PSK) to sign certificates in a verifiable way.

The many certificates that Cassandra might be using

In a moderately complex Cassandra configuration, we might have:

A root CA (cert A) for internode encryption.
A certificate per node signed by cert A.
A root CA (cert B) for the client-server encryption.
A certificate per node signed by cert B.
A certificate per client signed by cert B.

Even in a three node cluster, we can envisage a case where we must create two root CAs and 6 certificates, plus a certificate for each client application; for a total of 8+ certificates!

To compound the problem, this isn’t a one-off setup. Instead, we need to be able to rotate these certificates at regular intervals as they expire.

Ergonomic certificate management on Kubernetes with cert-manager

Thankfully, these processes are well supported on Kubernetes by a tool called cert-manager.

cert-manager is an all-in-one tool that should save you from ever having to reach for openssl or keytool again. As a Kubernetes operator, it manages a variety of custom resources (CRs) such as (Cluster)Issuers, CertificateRequests and Certificates. Critically it integrates with Automated Certificate Management Environment (ACME) Issuers, such as LetsEncrypt (which we will not be discussing today).

The workfow reduces to:

Create an Issuer (via ACME, or a custom CA).
Create a Certificate CR.
Pick up your certificates and signing keys from the secrets cert-manager creates, and mount them as volumes in your pods’ containers.

Everything is managed declaratively, and you can reissue certificates at will simply by deleting and re-creating the certificates and secrets.

Or you can use the kubectl plugin which allows you to write a simple kubectl cert-manager renew. We won’t discuss this in depth here, see the cert-manager documentation for more information

Java batteries included (mostly)

At this point, Cassandra users are probably about to interject with a loud “Yes, but I need keystores and truststores, so this solution only gets me halfway”. As luck would have it, from version .15, cert-manager also allows you to create JKS truststores and keystores directly from the Certificate CR.

The fine print

There are two caveats to be aware of here:

Most Cassandra deployment options currently available (including statefulSets, cass-operator or k8ssandra) do not currently support using a cert-per-node configuration in a convenient fashion. This is because the PodTemplate.spec portions of these resources are identical for each pod in the StatefulSet. This precludes the possibility of adding per-node certs via environment or volume mounts.
There are currently some open questions about how to rotate certificates without downtime when using internode encryption.
- Our current recommendation is to use a CA PC per Cassandra datacenter (DC) and add some basic scripts to merge both CA PCs into a single truststore to be propagated across all nodes. By renewing the CA PC independently you can ensure one DC is always online, but you still do suffer a network partition. Hinted handoff should theoretically rescue the situation but it is a less than robust solution, particularly on larger clusters. This solution is not recommended when using lightweight transactions or non LOCAL consistency levels.
- One mitigation to consider is using non-expiring CA PCs, in which case no CA PC rotation is ever performed without a manual trigger. KS PCs and KS PSKs may still be rotated. When CA PC rotation is essential this approach allows for careful planning ahead of time, but it is not always possible when using a 3rd party CA.
- Istio or other service mesh approaches can fully automate mTLS in clusters, but Istio is a fairly large committment and can create its own complexities.
- Manual management of certificates may be possible using a secure vault (e.g. HashiCorp vault), sealed secrets, or similar approaches. In this case, cert manager may not be involved.

These caveats are not trivial. To address (2) more elegantly you could also implement Anthony’s solution from part one of this blog series; but you’ll need to script this up yourself to suit your k8s environment.

We are also in discussions with the folks over at cert-manager about how their ecosystem can better support Cassandra. We hope to report progress on this front over the coming months.

These caveats present challenges, but there are also specific cases where they matter less.

cert-manager and Reaper - a match made in heaven

One case where we really don’t care if a client is unavailable for a short period is when Reaper is the client.

Cassandra is an eventually consistent system and suffers from entropy. Data on nodes can become out of sync with other nodes due to transient network failures, node restarts and the general wear and tear incurred by a server operating 24/7 for several years.

Cassandra contemplates that this may occur. It provides a variety of consistency level settings allowing you to control how many nodes must agree for a piece of data to be considered the truth. But even though properly set consistency levels ensure that the data returned will be accurate, the process of reconciling data across the network degrades read performance - it is best to have consistent data on hand when you go to read it.

As a result, we recommend the use of Reaper, which runs as a Cassandra client and automatically repairs the cluster in a slow trickle, ensuring that a high volume of repairs are not scheduled all at once (which would overwhelm the cluster and degrade the performance of real clients) while also making sure that all data is eventually repaired for when it is needed.

The set up

The manifests for this blog post can be found here.

Environment

We assume that you’re running Kubernetes 1.21, and we’ll be running with a Cassandra 3.11.10 install. The demo environment we’ll be setting up is a 3 node environment, and we have tested this configuration against 3 nodes.

We will be installing the cass-operator and Cassandra cluster into the cass-operator namespace, while the cert-manager operator will sit within the cert-manager namespace.

Setting up kind

For testing, we often use kind to provide a local k8s cluster. You can use minikube or whatever solution you prefer (including a real cluster running on GKE, EKS, or AKS), but we’ll include some kind instructions and scripts here to ease the way.

If you want a quick fix to get you started, try running the setup-kind-multicluster.sh script from the k8ssandra-operator repository, with setup-kind-multicluster.sh --kind-worker-nodes 3. I have included this script in the root of the code examples repo that accompanies this blog.

A demo CA certificate

We aren’t going to use LetsEncrypt for this demo, firstly because ACME certificate issuance has some complexities (including needing a DNS or a publicly hosted HTTP server) and secondly because I want to reinforce that cert-manager is useful to organisations who are bringing their own certs and don’t need one issued. This is especially useful for on-prem deployments.

First off, create a new private key and certificate pair for your root CA. Note that the file names tls.crt and tls.key will become important in a moment.

openssl genrsa -out manifests/demoCA/tls.key 4096
openssl req -new -x509 -key manifests/demoCA/tls.key -out manifests/demoCA/tls.crt -subj "/C=AU/ST=NSW/L=Sydney/O=Global Security/OU=IT Department/CN=example.com"

(Or you can just run the generate-certs.sh script in the manifests/demoCA directory - ensure you run it from the root of the project so that the secrets appear in .manifests/demoCA/.)

When running this process on MacOS be aware of this issue which affects the creation of self signed certificates. The repo referenced in this blog post contains example certificates which you can use for demo purposes - but do not use these outside your local machine.

Now we’re going to use kustomize (which comes with kubectl) to add these files to Kubernetes as secrets. kustomize is not a templating language like Helm. But it fulfills a similar role by allowing you to build a set of base manifests that are then bundled, and which can be customised for your particular deployment scenario by patching.

Run kubectl apply -k manifests/demoCA. This will build the secrets resources using the kustomize secretGenerator and add them to Kubernetes. Breaking this process down piece by piece:

# ./manifests/demoCA
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: cass-operator
generatorOptions:
 disableNameSuffixHash: true
secretGenerator:
- name: demo-ca
  type: tls
  files:
  - tls.crt
  - tls.key

We use disableNameSuffixHash, because otherwise kustomize will add hashes to each of our secret names. This makes it harder to build these deployments one component at a time.
The tls type secret conventionally takes two keys with these names, as per the next point. cert-manager expects a secret in this format in order to create the Issuer which we will explain in the next step.
We are adding the files tls.crt and tls.key. The file names will become the keys of a secret called demo-ca.

cert-manager

cert-manager can be installed by running kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.3/cert-manager.yaml. It will install into the cert-manager namespace because a Kubernetes cluster should only ever have a single cert-manager operator installed.

cert-manager will install a deployment, as well as various custom resource definitions (CRDs) and webhooks to deal with the lifecycle of the Custom Resources (CRs).

A cert-manager Issuer

Issuers come in various forms. Today we’ll be using a CA Issuer because our components need to trust each other, but don’t need to be trusted by a web browser.

Other options include ACME based Issuers compatible with LetsEncrypt, but these would require that we have control of a public facing DNS or HTTP server, and that isn’t always the case for Cassandra, especially on-prem.

Dive into the truststore-keystore directory and you’ll find the Issuer, it is very simple so we won’t reproduce it here. The only thing to note is that it takes a secret which has keys of tls.crt and tls.key - the secret you pass in must have these keys. These are the CA PC and CA PSK we mentioned earlier.

We’ll apply this manifest to the cluster in the next step.

Some cert-manager certs

Let’s start with the Cassandra-Certificate.yaml resource:

spec:
  # Secret names are always required.
  secretName: cassandra-jks-keystore
  duration: 2160h # 90d
  renewBefore: 360h # 15d
  subject:
    organizations:
    - datastax
  dnsNames:
  - dc1.cass-operator.svc.cluster.local
  isCA: false
  usages:
    - server auth
    - client auth
  issuerRef:
    name: ca-issuer
    # We can reference ClusterIssuers by changing the kind here.
    # The default value is `Issuer` (i.e. a locally namespaced Issuer)
    kind: Issuer
    # This is optional since cert-manager will default to this value however
    # if you are using an external issuer, change this to that `Issuer` group.
    group: cert-manager.io
  keystores:
    jks:
      create: true
      passwordSecretRef: # Password used to encrypt the keystore
        key: keystore-pass
        name: jks-password
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048

The first part of the spec here tells us a few things:

The keystore, truststore and certificates will be fields within a secret called cassandra-jks-keystore. This secret will end up holding our KS PSK and KS PC.
It will be valid for 90 days.
15 days before expiry, it will be renewed automatically by cert manager, which will contact the Issuer to do so.
It has a subject organisation. You can add any of the X509 subject fields here, but it needs to have one of them.
It has a DNS name - you could also provide a URI or IP address. In this case we have used the service address of the Cassandra datacenter which we are about to create via the operator. This has a format of <DC_NAME>.<NAMESPACE>.svc.cluster.local.
It is not a CA (isCA), and can be used for server auth or client auth (usages). You can tune these settings according to your needs. If you make your cert a CA you can even reference it in a new Issuer, and define cute tree like structures (if you’re into that).

Outside the certificates themselves, there are additional settings controlling how they are issued and what format this happens in.

IssuerRef is used to define the Issuer we want to issue the certificate. The Issuer will sign the certificate with its CA PSK.
We are specifying that we would like a keystore created with the keystore key, and that we’d like it in jks format with the corresponding key.
passwordSecretKeyRef references a secret and a key within it. It will be used to provide the password for the keystore (the truststore is unencrypted as it contains only public certs and no signing keys).

The Reaper-Certificate.yaml is similar in structure, but has a different DNS name. We aren’t configuring Cassandra to verify that the DNS name on the certificate matches the DNS name of the parties in this particular case.

Apply all of the certs and the Issuer using kubectl apply -k manifests/truststore-keystore.

Cass-operator

Examining the cass-operator directory, we’ll see that there is a kustomization.yaml which references the remote cass-operator repository and a local cassandraDatacenter.yaml. This applies the manifests required to run up a cass-operator installation namespaced to the cass-operator namespace.

Note that this installation of the operator will only watch its own namespace for CassandraDatacenter CRs. So if you create a DC in a different namespace, nothing will happen.

We will apply these manifests in the next step.

CassandraDatacenter

Finally, the CassandraDatacenter resource in the ./cass-operator/ directory will describe the kind of DC we want:

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.10
  managementApiAuth:
    insecure: {}
  size: 1
  podTemplateSpec:
    spec:
      containers:
        - name: "cassandra"
          volumeMounts:
          - name: certs
            mountPath: "/crypto"
      volumes:
      - name: certs
        secret:
          secretName: cassandra-jks-keystore
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: standard
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
  config:
    cassandra-yaml:
      authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
      authorizer: org.apache.cassandra.auth.AllowAllAuthorizer
      role_manager: org.apache.cassandra.auth.CassandraRoleManager
      client_encryption_options:
        enabled: true
        # If enabled and optional is set to true encrypted and unencrypted connections are handled.
        optional: false
        keystore: /crypto/keystore.jks
        keystore_password: dc1
        require_client_auth: true
        # Set trustore and truststore_password if require_client_auth is true
        truststore: /crypto/truststore.jks
        truststore_password: dc1
        protocol: TLS
        # cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA] # An earlier version of this manifest configured cipher suites but the proposed config was less secure. This section does not need to be modified.
      server_encryption_options:
        internode_encryption: all
        keystore: /crypto/keystore.jks
        keystore_password: dc1
        truststore: /crypto/truststore.jks
        truststore_password: dc1
    jvm-options:
      initial_heap_size: 800M
      max_heap_size: 800M

We provide a name for the DC - dc1.
We provide a name for the cluster - the DC would join other DCs if they already exist in the k8s cluster and we configured the additionalSeeds property.
We use the podTemplateSpec.volumes array to declare the volumes for the Cassandra pods, and we use the podTemplateSpec.containers.volumeMounts array to describe where and how to mount them.

The config.cassandra-yaml field is where most of the encryption configuration happens, and we are using it to enable both internode and client-server encryption, which both use the same keystore and truststore for simplicity. Remember that using internode encryption means your DC needs to go offline briefly for a full restart when the CA’s keys rotate.

We are not using authz/n in this case to keep things simple. Don’t do this in production.
For both encryption types we need to specify (1) the keystore location, (2) the truststore location and (3) the passwords for the keystores. The locations of the keystore/truststore come from where we mounted them in volumeMounts.
We are specifying JVM options just to make this run politely on a smaller machine. You would tune this for a production deployment.

Roll out the cass-operator and the CassandraDatacenter using kubectl apply -k manifests/cass-operator. Because the CRDs might take a moment to propagate, there is a chance you’ll see errors stating that the resource type does not exist. Just keep re-applying until everything works - this is a declarative system so applying the same manifests multiple times is an idempotent operation.

Reaper deployment

The k8ssandra project offers a Reaper operator, but for simplicity we are using a simple deployment (because not every deployment needs an operator). The deployment is standard kubernetes fare, and if you want more information on how these work you should refer to the Kubernetes docs.

We are injecting the keystore and truststore passwords into the environment here, to avoid placing them in the manifests. cass-operator does not currently support this approach without an initContainer to pre-process the cassandra.yaml using envsubst or a similar tool.

The only other note is that we are also pulling down a Cassandra image and using it in an initContainer to create a keyspace for Reaper, if it does not exist. In this container, we are also adding a ~/.cassandra/cqlshrc file under the home directory. This provides SSL connectivity configurations for the container. The critical part of the cqlshrc file that we are adding is:

[ssl]
certfile = /crypto/ca.crt
validate = true
userkey = /crypto/tls.key
usercert = /crypto/tls.crt
version = TLSv1_2

The version = TLSv1_2 tripped me up a few times, as it seems to be a recent requirement. Failing to add this line will give you back the rather fierce Last error: [SSL] internal error in the initContainer. The commands run in this container are not ideal. In particular, the fact that we are sleeping for 840 seconds to wait for Cassandra to start is sloppy. In a real deployment we’d want to health check and wait until the Cassandra service became available.

Apply the manifests using kubectl apply -k manifests/reaper.

Results

If you use a GUI, look at the logs for Reaper, you should see that it has connected to the cluster and provided some nice ASCII art to your console.

If you don’t use a GUI, you can run kubectl get pods -n cass-operator to find your Reaper pod (which we’ll call REAPER_PODNAME) and then run kubectl logs -n cass-operator REAPER_PODNAME to pull the logs.

Conclusion

While the above might seem like a complex procedure, we’ve just created a Cassandra cluster with both client-server and internode encryption enabled, all of the required certs, and a Reaper deployment which is configured to connect using the correct certs. Not bad.

Do keep in mind the weaknesses relating to key rotation, and watch this space for progress on that front.

Cassandra Certificate Management Part 1 - How to Rotate Keys Without Downtime

2021-06-15T00:00:00+00:00

Welcome to this three part blog series where we dive into the management of certificates in an Apache Cassandra cluster. For this first post in the series we will focus on how to rotate keys in an Apache Cassandra cluster without downtime.

Usability and Security at Odds

If you have downloaded and installed a vanilla installation of Apache Cassandra, you may have noticed when it is first started all security is disabled. Your “Hello World” application works out of the box because the Cassandra project chose usability over security. This is deliberately done so everyone benefits from the usability, as security requirement for each deployment differ. While only some deployments require multiple layers of security, others require no security features to be enabled.

Security of a system is applied in layers. For example one layer is isolating the nodes in a cluster behind a proxy. Another layer is locking down OS permissions. Encrypting connections between nodes, and between nodes and the application is another layer that can be applied. If this is the only layer applied, it leaves other areas of a system insecure. When securing a Cassandra cluster, we recommend pursuing an informed approach which offers defence-in-depth. Consider additional aspects such as encryption at rest (e.g. disk encryption), authorization, authentication, network architecture, and hardware, host and OS security.

Encrypting connections between two hosts can be difficult to set up as it involves a number of tools and commands to generate the necessary assets for the first time. We covered this process in previous posts: Hardening Cassandra Step by Step - Part 1 Inter-Node Encryption and Hardening Cassandra Step by Step - Part 2 Hostname Verification for Internode Encryption. I recommend reading both posts before reading through the rest of the series, as we will build off concepts explained in them.

Here is a quick summary of the basic steps to create the assets necessary to encrypt connections between two hosts.

Create the Root Certificate Authority (CA) key pair from a configuration file using openssl.
Create a keystore for each host (node or client) using keytool.
Export the Public Certificate from each host keystore as a “Signing Request” using keytool.
Sign each Public Certificate “Signing Request” with our Root CA to generate a Signed Certificate using openssl.
Import the Root CA Public Certificate and the Signed Certificate into each keystore using keytool.
Create a common truststore and import the CA Public Certificate into it using keytool.

Security Requires Ongoing Maintenance

Setting up SSL encryption for the various connections to Cassandra is only half the story. Like all other software out in the wild, there are ongoing maintenance to ensure the SSL encrypted connections continue to work.

At some point you wil need to update the certificates and stores used to implement the SSL encrypted connections because they will expire. If the certificates for a node expire it will be unable to communicate with other nodes in the cluster. This will lead to at least data inconsistencies or, in the worst case, unavailable data.

This point is specifically called out towards the end of the Inter-Node Encryption blog post. The note refers to steps 1, 2 and 4 in the above summary of commands to set up the certificates and stores. The validity periods are set for the certificates and stores in their respective steps.

One Certificate Authority to Rule Them All

Before we jump into how we handle expiring certificates and stores in a cluster, we first need to understand the role a certificate plays in securing a connection.

Certificates (and encryption) are often considered a hard topic. However, there are only a few concepts that you need to bear in mind when managing certificates.

Consider the case where two parties A and B wish to communicate with one another. Both parties distrust each other and each needs a way to prove that they are who they claim to be, as well as verify the other party is who they claim to be. To do this a mutually trusted third party needs to be brought in. In our case the trusted third party is the Certificate Authority (CA); often referred to as the Root CA.

The Root CA is effectively just a key pair; similar to an SSH key pair. The main difference is the public portion of the key pair has additional fields detailing who the public key belongs to. It has the following two components.

Certificate Authority Private Signing Key (CA PSK) - Private component of the CA key pair. Used to sign a keystore’s public certificate.
Certificate Authority Public Certificate (CA PC) - Public component of the CA key pair. Used to provide the issuer name when signing a keystore’s public certificate, as well as by a node to confirm that a third party public certificate (when presented) has been signed by the Root CA PSK.

When you run openssl to create your CA key pair using a certificate configuration file, this is the command that is run.

$ openssl req \
      -config path/to/ca_certificate.config \
      -new \
      -x509 \
      -keyout path/to/ca_psk \
      -out path/to/ca_pc \
      -days <valid_days>

In the above command the -keyout specifies the path to the CA PSK, and the -out specifies the path to the CA PC.

And in the Darkness Sign Them

In addition to a common Root CA key pair, each party has its own certificate key pair to uniquely identify it and to encrypt communications. In the Cassandra world, two components are used to store the information needed to perform the above verification check and communication encryption; the keystore and the truststore.

The keystore contains a key pair which is made up of the following two components.

Keystore Private Signing Key (KS PSK) - Hidden in keystore. Used to sign messages sent by the node, and decrypt messages received by the node.
Keystore Public Certificate (KS PC) - Exported for signing by the Root CA. Used by a third party to encrypt messages sent to the node that owns this keystore.

When created, the keystore will contain the PC, and the PSK. The PC signed by the Root CA, and the CA PC are added to the keystore in subsequent operations to complete the trust chain. The certificates are always public and are presented to other parties, while PSK always remains secret. In an asymmetric/public key encryption system, messages can be encrypted with the PC but can only be decrypted using the PSK. In this way, a node can initiate encrypted communications without needing to share a secret.

The truststore stores one or more CA PCs of the parties which the node has chosen to trust, since they are the source of trust for the cluster. If a party tries to communicate with the node, it will refer to its truststore to see if it can validate the attempted communication using a CA PC that it knows about.

For a node’s KS PC to be trusted and verified by another node using the CA PC in the truststore, the KS PC needs to be signed by the Root CA key pair. Futhermore, the CA key pair is used to sign the KS PC of each party.

When you run openssl to sign an exported Keystore PC, this is the command that is run.

$ openssl x509 \
    -req \
    -CAkey path/to/ca_psk \
    -CA path/to/ca_pc \
    -in path/to/exported_ks_pc_sign_request \
    -out paht/to/signed_ks_pc \
    -days <valid_days> \
    -CAcreateserial \
    -passin pass:<ca_psk_password>

In the above command both the Root CA PSK and CA PC are used via -CAkey and -CA respectively when signing the KS PC.

More Than One Way to Secure a Connection

Now that we have a deeper understanding of the assets that are used to encrypt communications, we can examine various ways to implement it. There are multiple ways to implement SSL encryption in an Apache Cassandra cluster. Regardless of the encryption approach, the objective when applying this type of security to a cluster is to ensure;

Hosts (nodes or clients) can determine whether they should trust other hosts in cluster.
Any intercepted communication between two hosts is indecipherable.

The three most common methods vary in both ease of deployment and resulting level of security. They are as follows.

The Cheats Way

The easiest and least secure method for rolling out SSL encryption can be done in the following way

Generation

Single CA for the cluster.
Single truststore containing the CA PC.
Single keystore which has been signed by the CA.

Deployment

The same keystore and truststore are deployed to each node.

In this method a single Root CA and a single keystore is deployed to all nodes in the cluster. This means any node can decipher communications intended for any other node. If a bad actor gains control of a node in the cluster then they will be able to impersonate any other node. That is, compromise of one host will compromise all of them. Depending on your threat model, this approach can be better than no encryption at all. It will ensure that a bad actor with access to only the network will no longer be able to eavesdrop on traffic.

We would use this method as a stop gap to get internode encryption enabled in a cluster. The idea would be to quickly deploy internode encryption with the view of updating the deployment in the near future to be more secure.

Best Bang for Buck

Arguably the most popular and well documented method for rolling out SSL encryption is

Generation

Single CA for the cluster.
Single truststore containing the CA PC.
Unique keystore for each node all of which have been signed by the CA.

Deployment

Each keystore is deployed to its associated node.
The same truststore is deployed to each node.

Similar to the previous method, this method uses a cluster wide CA. However, unlike the previous method each node will have its own keystore. Each keystore has its own certificate that is signed by a Root CA common to all nodes. The process to generate and deploy the keystores in this way is practiced widely and well documented.

We would use this method as it provides better security over the previous method. Each keystore can have its own password and host verification, which further enhances the security that can be applied.

Fort Knox

The method that offers the strongest security of the three can be rolled out in following way

Generation

Unique CA for each node.
A single truststore containing the Public Certificate for each of the CAs.
Unique keystore for each node that has been signed by the CA specific to the node.

Deployment

Each keystore with its unique CA PC is deployed to its associated node.
The same truststore is deployed to each node.

Unlike the other two methods, this one uses a Root CA per host and similar to the previous method, each node will have its own keystore. Each keystore has its own PC that is signed by a Root CA unique to the node. The Root CA PC of each node needs to be added to the truststore that is deployed to all nodes. For large cluster deployments this encryption configuration is cumbersome and will result in a large truststore being generated. Deployments of this encryption configuration are less common in the wild.

We would use this method as it provides all the advantages of the previous method and in addition, provides the ability to isolate a node from the cluster. This can be done by simply rolling out a new truststore which excludes a specific node’s CA PC. In this way a compromised node could be isolated from the cluster by simply changing the truststore. Under the previous two approaches, isolation of a compromised node in this fashion would require a rollout of an entirely new Root CA and one or more new keystores. Furthermore, each new Keystore CA would need to be signed by the new Root CA.

WARNING: Ensure your Certificate Authority is secure!

Regardless of the deployment method chosen, the whole setup will depend on the security of the Root CA. Ideally both components should be secured, or at the very least the PSK needs to be secured properly after it is generated since all trust is based on it. If both components are compromised by a bad actor, then that actor can potentially impersonate another node in the cluster. The good news is, there are a variety of ways to secure the Root CA components, however that topic goes beyond the scope of this post.

The Need for Rotation

If we are following best practices when generating our CAs and keystores, they will have an expiry date. This is a good thing because it forces us to regenerate and roll out our new encryption assets (stores, certificates, passwords) to the cluster. By doing this we minimise the exposure that any one of the components has. For example, if a password for a keystore is unknowingly leaked, the password is only good up until the keystore expiry. Having a scheduled expiry reduces the chance of a security leak becoming a breach, and increases the difficulty for a bad actor to gain persistence in the system. In the worst case it limits the validity of compromised credentials.

Always Read the Expiry Label

The only catch to having an expiry date on our encryption assets is that we need to rotate (update) them before they expire. Otherwise, our data will be unavailable or may be inconsistent in our cluster for a period of time. Expired encryption assets when forgotten can be a silent, sinister problem. If, for example, our SSL certificates expire unnoticed we will only discover this blunder when we restart the Cassandra service. In this case the Cassandra service will fail to connect to the cluster on restart and SSL expiry error will appear in the logs. At this point there is nothing we can do without incurring some data unavailability or inconsistency in the cluster. We will cover what to do in this case in a subsequent post. However, it is best to avoid this situation by rotating the encryption assets before they expire.

How to Play Musical Certificates

Assuming we are going to rotate our SSL certificates before they expire, we can perform this operation live on the cluster without downtime. This process requires the replication factor and consistency level to configured to allow for a single node to be down for a short period of time in the cluster. Hence, it works best when use a replication factor >= 3 and use consistency level <= QUORUM or LOCAL_QUORUM depending on the cluster configuration.

Create the NEW encryption assets; NEW CA, NEW keystores, and NEW truststore, using the process described earlier.
Import the NEW CA to the OLD truststore already deployed in the cluster using keytool. The OLD truststore will increase in size, as it has both the OLD and NEW CAs in it.
```
$ keytool -keystore <old_truststore> -alias CARoot -importcert -file <new_ca_pc> -keypass <new_ca_psk_password> -storepass <old_truststore_password> -noprompt
```
Where:
- <old_truststore>: The path to the OLD truststore already deployed in the cluster. This can be just a copy of the OLD truststore deployed.
- <new_ca_pc>: The path to the NEW CA PC generated.
- <new_ca_psk_password>: The password for the NEW CA PSKz.
- <old_truststore_password>: The password for the OLD truststore.
Deploy the updated OLD truststore to all the nodes in the cluster. Specifically, perform these steps on a single node, then repeat them on the next node until all nodes are updated. Once this step is complete, all nodes in the cluster will be able to establish connections using both the OLD and NEW CAs.
1. Drain the node using nodetool drain.
2. Stop the Cassandra service on the node.
3. Copy the updated OLD truststore to the node.
4. Start the Cassandra service on the node.
Deploy the NEW keystores to their respective nodes in the cluster. Perform this operation one node at a time in the same way the OLD truststore was deployed in the previous step. Once this step is complete, all nodes in the cluster will be using their NEW SSL certificate to establish encrypted connections with each other.
Deploy the NEW truststore to all the nodes in the cluster. Once again, perform this operation one node at a time in the same way the OLD truststore was deployed in Step 3.

The key to ensuring uptime in the rotation are in Steps 2 and 3. That is, we have the OLD and the NEW CAs all in the truststore and deployed on every node prior to rolling out the NEW keystores. This allows nodes to communicate regardless of whether they have the OLD or NEW keystore. This is because both the OLD and NEW assets are trusted by all nodes. The process still works whether our NEW CAs are per host or cluster wide. If the NEW CAs are per host, then they all need to be added to the OLD truststore.

Example Certificate Rotation on a Cluster

Now that we understand the theory, let’s see the process in action. We will use ccm to create a three node cluster running Cassandra 3.11.10 with internode encryption configured.

As pre-cluster setup task we will generate the keystores and truststore to implement the internode encryption. Rather than carry out the steps manually to generate the stores, we have developed a script called generate_cluster_ssl_stores that does the job for us.

The script requires us to supply the node IP addresses, and a certificate configuration file. Our certificate configuration file, test_ca_cert.conf has the following contents:

[ req ]
distinguished_name     = req_distinguished_name
prompt                 = no
output_password        = mypass
default_bits           = 2048

[ req_distinguished_name ]
C                      = AU
ST                     = NSW
L                      = Sydney
O                      = TLP
OU                     = SSLTestCluster
CN                     = SSLTestClusterRootCA
emailAddress           = info@thelastpickle.com¡

The command used to call the generate_cluster_ssl_stores.sh script is as follows.

$ ./generate_cluster_ssl_stores.sh -g -c -n 127.0.0.1,127.0.0.2,127.0.0.3 test_ca_cert.conf

Let’s break down the options in the above command.

-g - Generate passwords for each keystore and the truststore.
-c - Create a Root CA for the cluster and sign each keystore PC with it.
-n - List of nodes to generate keystores for.

The above command generates the following encryption assets.

$ ls -alh ssl_artifacts_20210602_125353
total 72
drwxr-xr-x   9 anthony  staff   288B  2 Jun 12:53 .
drwxr-xr-x   5 anthony  staff   160B  2 Jun 12:53 ..
-rw-r--r--   1 anthony  staff    17B  2 Jun 12:53 .srl
-rw-r--r--   1 anthony  staff   4.2K  2 Jun 12:53 127-0-0-1-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  2 Jun 12:53 127-0-0-2-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  2 Jun 12:53 127-0-0-3-keystore.jks
drwxr-xr-x  10 anthony  staff   320B  2 Jun 12:53 certs
-rw-r--r--   1 anthony  staff   1.0K  2 Jun 12:53 common-truststore.jks
-rw-r--r--   1 anthony  staff   219B  2 Jun 12:53 stores.password

With the necessary stores generated we can create our three node cluster in ccm. Prior to starting the cluster our nodes should look something like this.

$ ccm status
Cluster: 'SSLTestCluster'
-------------------------
node1: DOWN (Not initialized)
node2: DOWN (Not initialized)
node3: DOWN (Not initialized)

We can configure internode encryption in the cluster by modifying the cassandra.yaml files for each node as follows. The passwords for each store are in the stores.password file created by the generate_cluster_ssl_stores.sh script.

node1 - cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210602_125353/127-0-0-1-keystore.jks
  keystore_password: HQR6xX4XQrYCz58CgAiFkWL9OTVDz08e
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node2 - cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210602_125353/127-0-0-2-keystore.jks
  keystore_password: Aw7pDCmrtacGLm6a1NCwVGxohB4E3eui
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node3 - cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210602_125353/127-0-0-3-keystore.jks
  keystore_password: 1DdFk27up3zsmP0E5959PCvuXIgZeLzd
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

Now that we configured internode encryption in the cluster, we can start the nodes and monitor the logs to make sure they start correctly.

$ ccm node1 start && touch ~/.ccm/SSLTestCluster/node1/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node1/logs/system.log
...
$ ccm node2 start && touch ~/.ccm/SSLTestCluster/node2/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node2/logs/system.log
...
$ ccm node3 start && touch ~/.ccm/SSLTestCluster/node3/logs/system.log && tail -n 40 -f ~/.ccm/SSLTestCluster/node3/logs/system.log

In all cases we see the following message in the logs indicating that internode encryption is enabled.

INFO  [main] ... MessagingService.java:704 - Starting Encrypted Messaging Service on SSL port 7001

Once all the nodes have started, we can check the cluster status. We are looking to see that all nodes are up and in a normal state.

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  90.65 KiB  16           65.8%             2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  66.31 KiB  16           65.5%             f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  71.46 KiB  16           68.7%             46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

We will create a NEW Root CA along with a NEW set of stores for the cluster. As part of this process, we will add the NEW Root CA PC to OLD (current) truststore that is already in use in the cluster. Once again we can use our generate_cluster_ssl_stores.sh script to this, including the additional step of adding the NEW Root CA PC to our OLD truststore. This can be done with the following commands.

# Make the password to our old truststore available to script so we can add the new Root CA to it.

$ export EXISTING_TRUSTSTORE_PASSWORD=$(cat ssl_artifacts_20210602_125353/stores.password | grep common-truststore.jks | cut -d':' -f2)
$ ./generate_cluster_ssl_stores.sh -g -c -n 127.0.0.1,127.0.0.2,127.0.0.3 -e ssl_artifacts_20210602_125353/common-truststore.jks test_ca_cert.conf 

We call our script using a similar command to the first time we used it. The difference now is we are using one additional option; -e.

-e - Path to our OLD (existing) truststore which we will add the new Root CA PC to. This option requires us to set the OLD truststore password in the EXISTING_TRUSTSTORE_PASSWORD variable.

The above command generates the following new encryption assets. These files are located in a different directory to the old ones. The directory with the old encryption assets is ssl_artifacts_20210602_125353 and the directory with the new encryption assets is ssl_artifacts_20210603_070951

$ ls -alh ssl_artifacts_20210603_070951
total 72
drwxr-xr-x   9 anthony  staff   288B  3 Jun 07:09 .
drwxr-xr-x   6 anthony  staff   192B  3 Jun 07:09 ..
-rw-r--r--   1 anthony  staff    17B  3 Jun 07:09 .srl
-rw-r--r--   1 anthony  staff   4.2K  3 Jun 07:09 127-0-0-1-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  3 Jun 07:09 127-0-0-2-keystore.jks
-rw-r--r--   1 anthony  staff   4.2K  3 Jun 07:09 127-0-0-3-keystore.jks
drwxr-xr-x  10 anthony  staff   320B  3 Jun 07:09 certs
-rw-r--r--   1 anthony  staff   1.0K  3 Jun 07:09 common-truststore.jks
-rw-r--r--   1 anthony  staff   223B  3 Jun 07:09 stores.password

When we look at our OLD truststore we can see that it has increased in size. Originally, it was 1.0K and it is now 2.0K in size after adding the new Root CA PC it.

$ ls -alh ssl_artifacts_20210602_125353/common-truststore.jks
-rw-r--r--  1 anthony  staff   2.0K  3 Jun 07:09 ssl_artifacts_20210602_125353/common-truststore.jks

We can now roll out the updated OLD truststore. In a production Cassandra deployment we would copy the updated OLD truststore to a node and restart the Cassandra service. Then repeat this process on the other nodes in the cluster, one node at a time. In our case, our locally running nodes are already pointing to the updated OLD truststore. We need to only restart the Cassandra service.

$ for i in $(ccm status | grep UP | cut -d':' -f1); do echo "restarting ${i}" && ccm ${i} stop && sleep 3 && ccm ${i} start; done
restarting node1
restarting node2
restarting node3

After the restart, our nodes are up and in a normal state.

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  140.35 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  167.23 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  173.7 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

Our nodes are using the updated OLD truststore which has the old Root CA PC and the new Root CA PC. This means that nodes will be able to communicate using either the old (current) keystore or the new keystore. We can now roll out the new keystore one node at a time and still have all our data available.

To do the new keystore roll out we will stop the Cassandra service, update its configuration to point to the new keystore, and then start the Cassandra service. A few notes before we start:

The node will need to point to the new keystore located in the directory with the new encryption assets; ssl_artifacts_20210603_070951.
The node will still need to use the OLD truststore, so its path will remain unchanged.

node1 - stop Cassandra service

$ ccm node1 stop
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
DN  127.0.0.1  140.35 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  142.19 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  148.66 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node1 - update keystore path to point to new keystore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-1-keystore.jks
  keystore_password: V3fKP76XfK67KTAti3CXAMc8hVJGJ7Jg
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node1 - start Cassandra service

$ ccm node1 start
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  179.23 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  142.19 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  148.66 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

At this point we have node1 using the new keystore while node2 and node3 are using the old keystore. Our nodes are once again up and in a normal state, so we can proceed to update the certificates on node2.

node2 - stop Cassandra service

$ ccm node2 stop
$ ccm node3 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  224.48 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
DN  127.0.0.2  188.46 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  194.35 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node2 - update keystore path to point to new keystore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-2-keystore.jks
  keystore_password: 3uEjkTiR0xI56RUDyo23TENJjtMk8VbY
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node2 - start Cassandra service

$ ccm node2 start
$ ccm node3 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  224.48 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  227.12 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  194.35 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

At this point we have node1 and node2 using the new keystore while node3 is using the old keystore. Our nodes are once again up and in a normal state, so we can proceed to update the certificates on node3.

node3 - stop Cassandra service

$ ccm node3 stop
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  225.42 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
DN  127.0.0.3  194.35 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node3 - update keystore path to point to new keystore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-3-keystore.jks
  keystore_password: hkjMwpn2y2aYllePAgCNzkBnpD7Vxl6f
  truststore: /ssl_artifacts_20210602_125353/common-truststore.jks
  truststore_password: 8dPhJ2oshBihAYHcaXzgfzq6kbJ13tQi
...

node3 - start Cassandra service

$ ccm node3 start
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  225.42 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  239.3 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

The keystore rotation is now complete on all nodes in our cluster. However, all nodes are still using the updated OLD truststore. To ensure that our old Root CA can no longer be used to intercept messages in our cluster we need to roll out the NEW truststore to all nodes. This can be done in the same way we deployed the new keystores.

node1 - stop Cassandra service

$ ccm node1 stop
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
DN  127.0.0.1  225.42 KiB  16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node1 - update truststore path to point to new truststore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-1-keystore.jks
  keystore_password: V3fKP76XfK67KTAti3CXAMc8hVJGJ7Jg
  truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
  truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...

node1 - start Cassandra service

$ ccm node1 start
$ ccm node2 status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

Now we update the truststore for node2.

node2 - stop Cassandra service

$ ccm node2 stop
$ ccm node3 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
DN  127.0.0.2  191.31 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node2 - update truststore path to point to NEW truststore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-2-keystore.jks
  keystore_password: 3uEjkTiR0xI56RUDyo23TENJjtMk8VbY
  truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
  truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...

node2 - start Cassandra service

$ ccm node2 start
$ ccm node3 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  294.05 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

Now we update the truststore for node3.

node3 - stop Cassandra service

$ ccm node3 stop
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  208.83 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
DN  127.0.0.3  185.37 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

node3 - update truststore path to point to NEW truststore in cassandra.yaml

...
server_encryption_options:
  internode_encryption: all
  keystore: /ssl_artifacts_20210603_070951/127-0-0-3-keystore.jks
  keystore_password: hkjMwpn2y2aYllePAgCNzkBnpD7Vxl6f
  truststore: /ssl_artifacts_20210603_070951/common-truststore.jks
  truststore_password: 0bYmrrXaKIPJQ5UrtQQTFpPLepMweaLc
...

node3 - start Cassandra service

$ ccm node3 start
$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  150 KiB    16           100.0%            2661807a-d8d3-4bba-8639-6c0fada2ac88  rack1
UN  127.0.0.2  208.83 KiB  16           100.0%            f3db4bbe-1f35-4edb-8513-cb55a05393a7  rack1
UN  127.0.0.3  288.6 KiB  16           100.0%            46c2f4b5-905b-42b4-8bb9-563a03c4b415  rack1

The rotation of the certificates is now complete and all while having only a single node down at any one time! This process can be used for all three of the deployment variations. In addition, it can be used to move between the different deployment variations without incurring downtime.

Conclusion

Internode encryption plays an important role in securing the internal communication of a cluster. When deployed, it is crucial that certificate expiry dates be tracked so the certificates can be rotated before they expire. Failure to do so will result in unavailability and inconsistencies.

Using the process discussed in this post and combined with the appropriate tooling, internode encryption can be easily deployed and associated certificates easily rotated. In addition, the process can be used to move between the different encryption deployments.

Regardless of the reason for using the process, it can be executed without incurring downtime in common Cassandra use cases.

Running your Database on OpenShift and CodeReady Containers

2021-06-09T00:00:00+00:00

Let’s take an introductory run-through of setting up your database on OpenShift, using your own hardware and RedHat’s CodeReady Containers.

CodeReady Containers is a great way to run OpenShift K8s locally, ideal for development and testing. The steps in this blog post will require a machine, laptop or desktop, of decent capability; preferably quad CPUs and 16GB+ RAM.

Download and Install RedHat’s CodeReady Containers

Download and install RedHat’s CodeReady Containers (version 1.27) as described in Red Hat OpenShift 4 on your laptop: Introducing Red Hat CodeReady Containers.

First configure CodeReady Containers, from the command line

❯ crc setup

…
Your system is correctly setup for using CodeReady Containers, you can now run 'crc start' to start the OpenShift cluster

Check the version is correct.

❯ crc version

…
CodeReady Containers version: 1.27.0+3d6bc39d
OpenShift version: 4.7.11 (not embedded in executable)

Then start it, entering the Pull Secret copied from the download page. Have patience here, this can take ten minutes or more.

❯ crc start

INFO Checking if running as non-root
…
Started the OpenShift cluster.

The server is accessible via web console at:
  https://console-openshift-console.apps-crc.testing 
…

The output above will include the kubeadmin password which is required in the following oc login … command.

❯ eval $(crc oc-env)
❯ oc login -u kubeadmin -p <password-from-crc-setup-output> https://api.crc.testing:6443

❯ oc version

Client Version: 4.7.11
Server Version: 4.7.11
Kubernetes Version: v1.20.0+75370d3

Open in a browser the URL https://console-openshift-console.apps-crc.testing

Log in using the kubeadmin username and password, as used above with the oc login … command. You might need to try a few times because of the self-signed certificate used.

Once OpenShift has started and is running you should see the following webpage

Some commands to help check status and the startup process are

❯ oc status	

In project default on server https://api.crc.testing:6443

svc/openshift - kubernetes.default.svc.cluster.local
svc/kubernetes - 10.217.4.1:443 -> 6443

View details with 'oc describe <resource>/<name>' or list resources with 'oc get all'.	

Before continuing, go to the CodeReady Containers Preferences dialog. Increase CPUs and Memory to >12 and >14GB correspondingly.

Create the OpenShift Local Volumes

Cassandra needs persistent volumes for its data directories. There are different ways to do this in OpenShift, from enabling local host paths in Rancher persistent volumes, to installing and using the OpenShift Local Storage Operator, and of course persistent volumes on the different cloud provider backends.

This blog post will use vanilla OpenShift volumes using folders on the master k8s node.

Go to the “Terminal” tab for the master node and create the required directories. The master node is found on the /cluster/nodes/ webpage.

Click on the node, named something like crc-m89r2-master-0, and then click on the “Terminal” tab. In the terminal, execute the following commands:

sh-4.4# chroot /host
sh-4.4# mkdir -p /mnt/cass-operator/pv000
sh-4.4# mkdir -p /mnt/cass-operator/pv001
sh-4.4# mkdir -p /mnt/cass-operator/pv002
sh-4.4# 

Persistent Volumes are to be created with affinity to the master node, declared in the following yaml. The name of the master node can vary from installation to installation. If your master node is not named crc-gm7cm-master-0 then the following command replaces its name. First download the cass-operator-1.7.0-openshift-storage.yaml file, check the name of the node in the nodeAffinity sections against your current CodeReady Containers instance, updating if necessary.

❯ wget https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-storage.yaml

# The name of your master node
❯ oc get nodes -o=custom-columns=NAME:.metadata.name --no-headers

# If it is not crc-gm7cm-master-0
❯ sed -i '' "s/crc-gm7cm-master-0/$(oc get nodes -o=custom-columns=NAME:.metadata.name --no-headers)/" cass-operator-1.7.0-openshift-storage.yaml

Create the Persistent Volumes (PV) and Storage Class (SC).

❯ oc apply -f cass-operator-1.7.0-openshift-storage.yaml

persistentvolume/server-storage-0 created
persistentvolume/server-storage-1 created
persistentvolume/server-storage-2 created
storageclass.storage.k8s.io/server-storage created

To check the existence of the PVs.

❯ oc get pv | grep server-storage

server-storage-0   10Gi   RWO    Delete   Available   server-storage     5m19s
server-storage-1   10Gi   RWO    Delete   Available   server-storage     5m19s
server-storage-2   10Gi   RWO    Delete   Available   server-storage     5m19s

To check the existence of the SC.

❯ oc get sc

NAME             PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
server-storage   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  5m36s

More information on using the can be found in the RedHat documentation for OpenShift volumes.

Deploy the Cass-Operator

Now create the cass-operator. Here we can use the upstream 1.7.0 version of the cass-operator. After creating (applying) the cass-operator, it is important to quickly execute the oc adm policy … commands in the following step so the pods have the privileges required and are successfully created.

❯ oc apply -f https://raw.githubusercontent.com/k8ssandra/cass-operator/v1.7.0/docs/user/cass-operator-manifests.yaml

namespace/cass-operator created
serviceaccount/cass-operator created
secret/cass-operator-webhook-config created
W0606 14:25:44.757092   27806 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0606 14:25:45.077394   27806 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
customresourcedefinition.apiextensions.k8s.io/cassandradatacenters.cassandra.datastax.com created
clusterrole.rbac.authorization.k8s.io/cass-operator-cr created
clusterrole.rbac.authorization.k8s.io/cass-operator-webhook created
clusterrolebinding.rbac.authorization.k8s.io/cass-operator-crb created
clusterrolebinding.rbac.authorization.k8s.io/cass-operator-webhook created
role.rbac.authorization.k8s.io/cass-operator created
rolebinding.rbac.authorization.k8s.io/cass-operator created
service/cassandradatacenter-webhook-service created
deployment.apps/cass-operator created
W0606 14:25:46.701712   27806 warnings.go:70] admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration
W0606 14:25:47.068795   27806 warnings.go:70] admissionregistration.k8s.io/v1beta1 ValidatingWebhookConfiguration is deprecated in v1.16+, unavailable in v1.22+; use admissionregistration.k8s.io/v1 ValidatingWebhookConfiguration
validatingwebhookconfiguration.admissionregistration.k8s.io/cassandradatacenter-webhook-registration created

❯ oc adm policy add-scc-to-user privileged -z default -n cass-operator

clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "default"

❯ oc adm policy add-scc-to-user privileged -z cass-operator -n cass-operator

clusterrole.rbac.authorization.k8s.io/system:openshift:scc:privileged added: "cass-operator"

Let’s check the deployment happened.

❯ oc get deployments -n cass-operator

NAME            READY   UP-TO-DATE   AVAILABLE   AGE
cass-operator   1/1     1            1           14m

Let’s also check the cass-operator pod was created and is successfully running. Note that the kubectl command is used here, for all k8s actions the oc and kubectl commands are interchangable.

❯ kubectl get pods -w -n cass-operator

NAME                             READY   STATUS    RESTARTS   AGE
cass-operator-7675b65744-hxc8z   1/1     Running   0          15m

Troubleshooting: If the cass-operator does not end up in Running status, or if any pods in later sections fail to start, it is recommended to use the OpenShift UI Events webpage for easy diagnostics.

Setup the Cassandra Cluster

The next step is to create the cluster. The following deployment file creates a 3 node cluster. It is largely a copy from the upstream cass-operator version 1.7.0 file example-cassdc-minimal.yaml but with a small modification made to allow all the pods to be deployed to the same worker node (as CodeReady Containers only uses one k8s node by default).

❯ oc apply -n cass-operator -f https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-minimal-3.11.yaml

cassandradatacenter.cassandra.datastax.com/dc1 created

Let’s watch the pods get created, initialise, and eventually becoming running, using the kubectl get pods … watch command.

❯ kubectl get pods -w -n cass-operator

NAME                             READY   STATUS    RESTARTS   AGE
cass-operator-7675b65744-28fhw   1/1     Running   0          102s
cluster1-dc1-default-sts-0       0/2     Pending   0          0s
cluster1-dc1-default-sts-1       0/2     Pending   0          0s
cluster1-dc1-default-sts-2       0/2     Pending   0          0s
cluster1-dc1-default-sts-0       2/2     Running   0          3m
cluster1-dc1-default-sts-1       2/2     Running   0          3m
cluster1-dc1-default-sts-2       2/2     Running   0          3m

Use the Cassandra Cluster

With the Cassandra pods each up and running, the cluster is ready to be used. Test it out using the nodetool status command.

❯ kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- nodetool status

Defaulting container name to cassandra.
Use 'kubectl describe pod/cluster1-dc1-default-sts-0 -n cass-operator' to see all of the containers in this pod.
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.217.0.73  84.42 KiB  1            83.6%             672baba8-9a05-45ac-aad1-46427027b57a  default
UN  10.217.0.72  70.2 KiB   1            65.3%             42758a86-ea7b-4e9b-a974-f9e71b958429  default
UN  10.217.0.71  65.31 KiB  1            51.1%             2fa73bc2-471a-4782-ae63-5a34cc27ab69  default

The above command can be run on `cluster1-dc1-default-sts-1` and `cluster1-dc1-default-sts-2` too.

Next, test out cqlsh. For this authentication is required, so first get the CQL username and password.

# Get the cql username
❯ kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " username" | awk -F" " '{print $2}' | base64 -d && echo ""

# Get the cql password
❯ kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " password" | awk -F" " '{print $2}' | base64 -d && echo ""

❯ kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- cqlsh -u <cql-username> -p <cql-password>

Connected to cluster1 at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cluster1-superuser@cqlsh>

Keep It Clean

CodeReady Containers are very simple to clean up, especially because it is a packaging of OpenShift intended only for development purposes. To wipe everything, just “delete”

❯ crc stop
❯ crc delete

If, on the other hand, you only want to delete individual steps, each of the following can be done (but in order).

❯ oc delete -n cass-operator -f https://thelastpickle.com/files/openshift-intro/cass-operator-1.7.0-openshift-minimal-3.11.yaml

❯ oc delete -f https://raw.githubusercontent.com/k8ssandra/cass-operator/v1.7.0/docs/user/cass-operator-manifests.yaml

❯ oc delete -f cass-operator-1.7.0-openshift-storage.yaml

Apache Cassandra's Continuous Integration Systems

2021-04-29T00:00:00+00:00

With Apache Cassandra 4.0 just around the corner, and the feature freeze on trunk lifted, let’s take a dive into the efforts ongoing with the project’s testing and Continuous Integration systems.

continuous integration in open source

Every software project benefits from sound testing practices and having a continuous integration in place. Even more so for open source projects. From contributors working around the world in many different timezones, particularly prone to broken builds and longer wait times and uncertainties, to contributors just not having the same communication bandwidths between each other because they work in different companies and are scratching different itches.

This is especially true for Apache Cassandra. As an early-maturity technology used everywhere on mission critical data, stability and reliability are crucial for deployments. Contributors from many companies: Alibaba, Amazon, Apple, Bloomberg, Dynatrace, DataStax, Huawei, Instaclustr, Netflix, Pythian, and more; need to coordinate and collaborate and most importantly trust each other.

During the feature freeze the project was fortunate to not just stabilise and fix tons of tests, but to also expand its continuous integration systems. This really helps set the stage for a post 4.0 roadmap that features heavy on pluggability, developer experience and safety, as well as aiming for an always-shippable trunk.

@ cassandra

The continuous integration systems at play are CircleCI and ci-cassandra.apache.org

CircleCI is a commercial solution. The main usage today of CircleCI is pre-commit, that is testing your patches while they get reviewed before they get merged. To effectively use CircleCI on Cassandra requires either the medium or high resource profiles that enables the use of hundreds of containers and lots of resources, and that’s basically only available for folk working in companies that are paying for a premium CircleCI account. There are lots stages to the CircleCI pipeline, and developers just trigger those stages they feel are relevant to test that patch on.

ci-cassandra is our community CI. It is based on CloudBees, provided by the ASF and running 40 agents (servers) around the world donated by numerous different companies in our community. Its main usage is post-commit, and its pipelines run every stage automatically. Today the pipeline consists of 40K tests. And for the first time in many years, on the lead up to 4.0, pipeline runs are completely green.

ci-cassandra is setup with a combination of Jenkins DSL script, and declarative Jenkinsfiles. These jobs use the build scripts found here.

forty thousand tests

The project has many types of tests. It has proper unit tests, and unit tests that have some embedded Cassandra server. The unit tests are run in a number of different parameterisations: from different Cassandra configuration, JDK 8 and JDK 11, to supporting the ARM architecture. There’s CQLSH tests written in Python against a single ccm node. Then there’s the Java distributed tests and Python distributed tests. The Python distributed tests are older, use CCM, and also run parameterised. The Java distributed tests are a recent addition and run the Cassandra nodes inside the JVM. Both types of distributed tests also include testing the upgrade paths of different Cassandra versions. Most new distributed tests today are written as Java distributed tests. There are also burn and microbench (JMH) tests.

distributed is difficult

Testing distributed tech is hardcore. Anyone who’s tried to run the Python upgrade dtests locally knows the pain. Running the tests in Docker helps a lot, and this is what CircleCI and ci-cassandra predominantly does. The base Docker images are found here. Distributed tests can fall over for numerous reasons, exacerbated in ci-cassandra with heterogenous servers around the world and all the possible network and disk issues that can occur. Just for the 4.0 release over 200 Jira tickets were focused just on strengthening flakey tests. Because ci-cassandra has limited storage, the logs and test results to all runs are archived in nightlies.apache.org/cassandra.

call for help

There’s still heaps to do. This is all part-time and volunteer efforts. No one in the community is dedicated to these systems or as a build engineer. The project can use all the help it can get.

There’s a ton of exciting stuff to add. Some examples are microbench and JMH reports, Jacoco test coverage reports, Harry for fuzz testing, Adelphi or Fallout for end-to-end performance and comparison testing, hooking up Apache Yetus for efficient resource usage, or putting our Jenkins stack into a k8s operator run script so you can run the pipeline on your own k8s cluster.

So don’t be afraid to jump in, pick your poison, we’d love to see you!

Reaper 2.2 for Apache Cassandra was released

2021-02-22T00:00:00+00:00

We’re pleased to announce that Reaper 2.2 for Apache Cassandra was just released. This release includes a major redesign of how segments are orchestrated, which allows users to run concurrent repairs on nodes. Let’s dive into these changes and see what they mean for Reaper’s users.

New Segment Orchestration

Reaper works in a variety of standalone or distributed modes, which create some challenges in meeting the following requirements:

A segment is processed successfully exactly once.
No more than one segment is running on a node at once.
Segments can only be started if the number of pending compactions on a node involved is lower than the defined threshold.

To make sure a segment won’t be handled by several Reaper instances at once, Reaper relies on LightWeight Transactions (LWT) to implement a leader election process. A Reaper instance will “take the lead” on a segment by using a LWT and then perform the checks for the last two conditions above.

To avoid race conditions between two different segments involving a common set of replicas that would start at the same time, a “master lock” was placed after the checks to guarantee that a single segment would be able to start. This required a double check to be performed before actually starting the segment.

There were several issues with this design:

It involved a lot of LWTs even if no segment could be started.
It was a complex design which made the code hard to maintain.
The “master lock” was creating a lot of contention as all Reaper instances would compete for the same partition, leading to some nasty situations. This was especially the case in sidecar mode as it involved running a lot of Reaper instances (one per Cassandra node).

As we were seeing suboptimal performance and high LWT contention in some setups, we redesigned how segments were orchestrated to reduce the number of LWTs and maximize concurrency during repairs (all nodes should be busy repairing if possible).
Instead of locking segments, we explored whether it would be possible to lock nodes instead. This approach would give us several benefits:

We could check which nodes are already busy without issuing JMX requests to the nodes.
We could easily filter segments to be processed to retain only those with available nodes.
We could remove the master lock as we would have no more race conditions between segments.

One of the hard parts was that locking several nodes in a consistent manner would be challenging as they would involve several rows, and Cassandra doesn’t have a concept of an atomic transaction that can be rolled back as RDBMS do. Luckily, we were able to leverage one feature of batch statements: All Cassandra batch statements which target a single partition will turn all operations into a single atomic one (at the node level). If one node out of all replicas was already locked, then none would be locked by the batched LWTs. We used the following model for the leader election table on nodes:

CREATE TABLE reaper_db.running_repairs (
    repair_id uuid,
    node text,
    reaper_instance_host text,
    reaper_instance_id uuid,
    segment_id uuid,
    PRIMARY KEY (repair_id, node)
) WITH CLUSTERING ORDER BY (node ASC)

The following LWTs are then issued in a batch for each replica:

BEGIN BATCH

UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1', 
    reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
    segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node1'
IF reaper_instance_id = null;

UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1',
    reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
    segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node2'
IF reaper_instance_id = null;

UPDATE reaper_db.running_repairs USING TTL 90
SET reaper_instance_host = 'reaper-host-1',
    reaper_instance_id = 62ce0425-ee46-4cdb-824f-4242ee7f86f4,
    segment_id = 70f52bc2-7519-11eb-809e-9f94505f3a3e
WHERE repair_id = 70f52bc0-7519-11eb-809e-9f94505f3a3e AND node = 'node3'
IF reaper_instance_id = null;

APPLY BATCH;

If all the conditional updates are able to be applied, we’ll get the following data in the table:

cqlsh> select * from reaper_db.running_repairs;

 repair_id                            | node  | reaper_instance_host | reaper_instance_id                   | segment_id
--------------------------------------+-------+----------------------+--------------------------------------+--------------------------------------
 70f52bc0-7519-11eb-809e-9f94505f3a3e | node1 |        reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
 70f52bc0-7519-11eb-809e-9f94505f3a3e | node2 |        reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
 70f52bc0-7519-11eb-809e-9f94505f3a3e | node3 |        reaper-host-1 | 62ce0425-ee46-4cdb-824f-4242ee7f86f4 | 70f52bc2-7519-11eb-809e-9f94505f3a3e
 

If one of the conditional updates fails because one node is already locked for the same repair_id, then none will be applied.

Note: the Postgres backend also benefits from these new features through the use of transactions, using commit and rollback to deal with success/failure cases.

The new design is now much simpler than the initial one:

Segments are now filtered on those that have no replica locked to avoid wasting energy in trying to lock them and the pending compactions check also happens before any locking.

This reduces the number of LWTs by four in the simplest cases and we expect more challenging repairs to benefit from even more reductions:

At the same time, repair duration on a 9-node cluster showed 15%-20% improvements thanks to the more efficient segment selection.

One prerequisite to make that design efficient was to store the replicas for each segment in the database when the repair run is created. You can now see which nodes are involved for each segment and which datacenter they belong to in the Segments view:

Concurrent repairs

Using the repair id as the partition key for the node leader election table gives us another feature that was long awaited: Concurrent repairs.
A node could be locked by different Reaper instances for different repair runs, allowing several repairs to run concurrently on each node. In order to control the level of concurrency, a new setting was introduced in Reaper: maxParallelRepairs
By default it is set to 2 and should be tuned carefully as heavy concurrent repairs could have a negative impact on clusters performance.
If you have small keyspaces that need to be repaired on a regular basis, they won’t be blocked by large keyspaces anymore.

Future upgrades

As some of you are probably aware, JFrog has decided to sunset Bintray and JCenter. Bintray is our main distribution medium and we will be working on replacement repositories. The 2.2.0 release is unaffected by this change but future upgrades could require an update to yum/apt repos. The documentation will be updated accordingly in due time.

Upgrade now

We encourage all Reaper users to upgrade to 2.2.0. It was tested successfully by some of our customers which had issues with LWT pressure and blocking repairs. This version is expected to make repairs faster and more lightweight on the Cassandra backend. We were able to remove a lot of legacy code and design which were fit to single token clusters, but failed at spreading segments efficiently for clusters using vnodes.

The binaries for Reaper 2.2.0 are available from yum, apt-get, Maven Central, Docker Hub, and are also downloadable as tarball packages. Remember to backup your database before starting the upgrade.

All instructions to download, install, configure, and use Reaper 2.2.0 are available on the Reaper website.

Creating Flamegraphs with Apache Cassandra in Kubernetes (cass-operator)

2021-01-31T00:00:00+00:00

In a previous blog post recommending disabling read repair chance, some flamegraphs were generated to demonstrate the effect read repair chance had on a cluster. Let’s go through how those flamegraphs were captured, step-by-step using Apache Cassandra 3.11.6, Kubernetes and the cass-operator, nosqlbench and the async-profiler.

In previous blog posts we would have used the existing tools of tlp-cluster or ccm, tlp-stress or cassandra-stress, and sjk. Here we take a new approach that is a lot more fun, as with k8s the same approach can be used locally or in the cloud. No need to switch between ccm clusters for local testing and tlp-cluster for cloud testing. Nor are you bound to AWS for big instance testing, that’s right: no vendor lock-in. Cass-operator and K8ssandra is getting a ton of momentum from DataStax, so it is only deserved and exciting to introduce them to as much of the open source world as we can.

This blog post is not an in-depth dive into using cass-operator, rather a simple teaser to demonstrate how we can grab some flamegraphs, as quickly as possible. The blog post is split into three sections

Setting up Kubernetes and getting Cassandra running
Getting access to Cassandra from outside Kubernetes
Stress testing and creating flamegraphs

Setup

Let’s go through a quick demonstration using Kubernetes, the cass-operator, and some flamegraphs.

First, download four yaml configuration files that will be used. This is not strictly necessary for the latter three, as kubectl may reference them by their URLs, but let’s download them for the sake of having the files locally and being able to make edits if and when desired.

wget https://thelastpickle.com/files/2021-01-31-cass_operator/01-kind-config.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/02-storageclass-kind.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/11-install-cass-operator-v1.1.yaml
wget https://thelastpickle.com/files/2021-01-31-cass_operator/13-cassandra-cluster-3nodes.yaml

The next steps involve kind and kubectl to create a local cluster we can test. To use kind you have docker running locally, it is recommended to have 4 CPU and 12GB RAM for this exercise.

kind create cluster --name read-repair-chance-test --config 01-kind-config.yaml

kubectl create ns cass-operator
kubectl -n cass-operator apply -f 02-storageclass-kind.yaml
kubectl -n cass-operator apply -f 11-install-cass-operator-v1.1.yaml

# watch and wait until the pod is running
watch kubectl -n cass-operator get pod

# create 3 node C* cluster
kubectl -n cass-operator apply -f 13-cassandra-cluster-3nodes.yaml

# again, wait for pods to be running
watch kubectl -n cass-operator get pod

# test the three nodes are up
kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- nodetool status

Access

For this example we are going to run NoSqlBench from outside the k8s cluster, so we will need access to a pod’s Native Protocol interface via port-forwarding. This approach is practical here because it was desired to have the benchmark connect to just one coordinator. In many situations you would instead run NoSqlBench from a separate dedicated pod inside the k8s cluster.

# get the cql username
kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " username" | awk -F" " '{print $2}' | base64 -d && echo ""

# get the cql password
kubectl -n cass-operator get secret cluster1-superuser -o yaml | grep " password" | awk -F" " '{print $2}' | base64 -d && echo ""

# port forward the native protocol (CQL)
kubectl -n cass-operator port-forward --address 0.0.0.0 cluster1-dc1-default-sts-0 9042:9042 

The above sets up the k8s cluster, a k8s storageClass, and the cass-operator with a three node Cassandra cluster. For a more in depth look at this setup checkout this tutorial.

Stress Testing and Flamegraphs

With a cluster to play with, let’s generate some load and then go grab some flamegraphs.

Instead of using SJK (Swiss Java Knife), as our previous blog posts have done, we will use the async-profiler. The async-profiler does not suffer from Safepoint bias problem, an issue we see more often than we would like in Cassandra nodes (protip: make sure you configure ParallelGCThreads and ConcGCThreads to the same value).

Open a new terminal window and do the following.

# get the latest NoSqlBench jarfile
wget https://github.com/nosqlbench/nosqlbench/releases/latest/download/nb.jar

# generate some load, use credentials as found above

java -jar nb.jar cql-keyvalue username=<cql_username> password=<cql_password> whitelist=127.0.0.1 rampup-cycles=10000 main-cycles=500000 rf=3 read_cl=LOCAL_ONE

# while the load is still running,
# open a shell in the coordinator pod, download async-profiler and generate a flamegraph
kubectl -n cass-operator exec -it cluster1-dc1-default-sts-0 -- /bin/bash

wget https://github.com/jvm-profiling-tools/async-profiler/releases/download/v1.8.3/async-profiler-1.8.3-linux-x64.tar.gz

tar xvf async-profiler-1.8.3-linux-x64.tar.gz

async-profiler-1.8.3-linux-x64/profiler.sh -d 300 -f /tmp/flame_away.svg <CASSANDRA_PID>

exit

# copy the flamegraph out of the pod
kubectl -n cass-operator cp cluster1-dc1-default-sts-0:/tmp/flame_away.svg flame_away.svg

Keep It Clean

After everything is done, it is time to clean up after yourself.

Delete the CassandraDatacenters first, otherwise Kubernetes will block deletion because we use a finalizer. Note, this will delete all data in the cluster.

kubectl delete cassdcs --all-namespaces --all

Remove the operator Deployment, CRD, etc.

# this command can take a while, be patient

kubectl delete -f https://raw.githubusercontent.com/datastax/cass-operator/v1.5.1/docs/user/cass-operator-manifests-v1.16.yaml

# if troubleshooting, to forcibly remove resources, though
# this should not be necessary, and take care as this will wipe all resources

kubectl delete "$(kubectl api-resources --namespaced=true --verbs=delete -o name | tr "\n" "," | sed -e 's/,$//')" --all

To remove the local Kubernetes cluster altogether

kind delete cluster --name read-repair-chance-test

To stop and remove the docker containers that are left running…

docker stop $(docker ps | grep kindest | cut -d" " -f1)
docker rm $(docker ps -a | grep kindest | cut -d" " -f1)

More… the cass-operator tutorials

There is a ton of documentation and tutorials getting released on how to use the cass-operator. If you are keen to learn more the following is highly recommended: Managing Cassandra Clusters in Kubernetes Using Cass-Operator.

The Impacts of Changing the Number of VNodes in Apache Cassandra

2021-01-29T00:00:00+00:00

Apache Cassandra’s default value for num_tokens is about to change in 4.0! This might seem like a small edit note in the CHANGES.txt, however such a change can have a profound effect on day-to-day operations of the cluster. In this post we will examine how changing the value for num_tokens impacts the cluster and its behaviour.

There are many knobs and levers that can be modified in Apache Cassandra to tune its behaviour. The num_tokens setting is one of those. Like many settings it lives in the cassandra.yaml file and has a defined default value. That’s where it stops being like many of Cassandra’s settings. You see, most of Cassandra’s settings will only affect a single aspect of the cluster. However, when changing the value of num_tokens there is an array of behaviours that are altered. The Apache Cassandra project has committed and resolved CASSANDRA-13701 which changed the default value for num_tokens from 256 to 16. This change is significant, and to understand the consequences we first need to understand the role that num_tokens play in the cluster.

Never try this on production

Before we dive into any details it is worth noting that the num_tokens setting on a node should never ever be changed once it has joined the cluster. For one thing the node will fail on a restart. The value of this setting should be the same for every node in a datacenter. Historically, different values were expected for heterogeneous clusters. While it’s rare to see, nor would we recommend, you can still in theory double the num_tokens on nodes that are twice as big in terms of hardware specifications. Furthermore, it is common to see the nodes in a datacenter have a value for num_tokens that differs to nodes in another datacenter. This is partly how changing the value of this setting on a live cluster can be safely done with zero downtime. It is out of scope for this blog post, but details can be found in migration to a new datacenter.

The Basics

The num_tokens setting influences the way Cassandra allocates data amongst the nodes, how that data is retrieved, and how that data is moved between nodes.

Under the hood Cassandra uses a partitioner to decide where data is stored in the cluster. The partitioner is a consistent hashing algorithm that maps a partition key (first part of the primary key) to a token. The token dictates which nodes will contain the data associated with the partition key. Each node in the cluster is assigned one or more unique token values from a token ring. This is just a fancy way of saying each node is assigned a number from a circular number range. That is, “the number” being the token hash, and “a circular number range” being the token ring. The token ring is circular because the next value after the maximum token value is the minimum token value.

An assigned token defines the range of tokens in the token ring the node is responsible for. This is commonly known as a “token range”. The “token range” a node is responsible for is bounded by its assigned token, and the next smallest token value going backwards in the ring. The assigned token is included in the range, and the smallest token value going backwards is excluded from the range. The smallest token value going backwards typically resides on the previous neighbouring node. Having a circular token ring means that the range of tokens a node is responsible for, could include both the minimum and maximum tokens in the ring. In at least one case the smallest token value going backwards will wrap back past the maximum token value in the ring.

For example, in the following Token Ring Assignment diagram we have a token ring with a range of hashes from 0 to 99. Token 10 is allocated to Node 1. The node before Node 1 in the cluster is Node 5. Node 5 is allocated token 90. Therefore, the range of tokens that Node 1 is responsible for is between 91 and 10. In this particular case, the token range wraps around past the maximum token in the ring.

Note that the above diagram is for only a single data replica. This is because only a single node is assigned to each token in the token ring. If multiple replicas of the data exists, a node’s neighbours become replicas for the token as well. This is illustrated in the Token Ring Assignment diagram below.

The reason the partitioner is defined as a consistent hashing algorithm is because it is just that; no matter how many times you feed in a specific input, it will always generate the same output value. It ensures that every node, coordinator, or otherwise, will always calculate the same token for a given partition key. The calculated token can then be used to reliably pinpoint the nodes with the sought after data.

Consequently, the minimum and maximum numbers for the token ring are defined by the partitioner. The default Murur3Partitioner based on the Murmur hash has for example, a minimum and maximum range of -2^63 to +2^63 - 1. The legacy RandomPartitioner (based on the MD5 hash) on the other hand has a range of 0 to 2^127 - 1. A critical side effect of this system is that once a partitioner for a cluster is picked, it can never be changed. Changing to a different partitioner requires the creation of a new cluster with the desired partitioner and then reloading the data into the new cluster.

Further information on consistent hashing functionality can be found in the Apache Cassandra documentation.

Back in the day…

Back in the pre-1.2 era, nodes could only be manually assigned a single token. This was done and can still be done today using the initial_token setting in the cassandra.yaml file. The default partitioner at that point was the RandomPartitioner. Despite token assignment being manual, the partitioner made the process of calculating the assigned tokens fairly straightforward when setting up a cluster from scratch. For example, if you had a three node cluster you would divide 2^127 - 1 by 3 and the quotient would give you the correct increment amount for each token value. Your first node would have an initial_token of 0, your next node would have an initial_token of (2^127 - 1) / 3, and your third node would have an initial_token of (2^127 - 1) / 3 * 2. Thus, each node will have the same sized token ranges.

Dividing the token ranges up evenly makes it less likely individual nodes are overloaded (assuming identical hardware for the nodes, and an even distribution of data across the cluster). Uneven token distribution can result in what is termed “hot spots”. This is where a node is under pressure as it is servicing more requests or carrying more data than other nodes.

Even though setting up a single token cluster can be a very manual process, their deployment is still common. Especially for very large Cassandra clusters where the node count typically exceeds 1,000 nodes. One of the advantages of this type of deployment, is you can ensure that the token distribution is even.

Although setting up a single token cluster from scratch can result in an even load distribution, growing the cluster is far less straight forward. If you insert a single node into your three node cluster, the result is that two out of the four nodes will have a smaller token range than the other two nodes. To fix this problem and re-balance, you then have to run nodetool move to relocate tokens to other nodes. This is a tedious and expensive task though, involving a lot of streaming around the whole cluster. The alternative is to double the size of your cluster each time you expand it. However, this usually means using more hardware than you need. Much like having an immaculate backyard garden, maintaining an even token range per node in a single token cluster requires time, care, and attention, or alternatively, a good deal of clever automation.

Scaling in a single token world is only half the challenge. Certain failure scenarios heavily reduce time to recovery. Let’s say for example you had a six node cluster with three replicas of the data in a single datacenter (Replication Factor = 3). Replicas might reside on Node 1 and Node 4, Node 2 and Node 5, and lastly on Node 3 and Node 6. In this scenario each node is responsible for a sixth of each of the three replicas.

In the above diagram, the tokens in the token ring are assigned an alpha character. This is to make tracking the token assignment to each node easier to follow. If the cluster had an outage where Node 1 and Node 6 are unavailable, you could only use Nodes 2 and 5 to recover the unique sixth of the data they each have. That is, only Node 2 could be used to recover the data associated with token range ‘F’, and similarly only Node 5 could be used to recover the data associated with token range ‘E’. This is illustrated in the diagram below.

vnodes to the rescue

To solve the shortcomings of a single token assignment, Cassandra version 1.2 was enhanced to allow a node to be assigned multiple tokens. That is a node could be responsible for multiple token ranges. This Cassandra feature is known as “virtual node” or vnodes for short. The vnodes feature was introduced via CASSANDRA-4119. As per the ticket description, the goals of vnodes were:

Reduced operations complexity for scaling up/down.
Reduced rebuild time in event of failure.
Evenly distributed load impact in the event of failure.
Evenly distributed impact of streaming operations.
More viable support for heterogeneity of hardware.

The introduction of this feature gave birth to the num_tokens setting in the cassandra.yaml file. The setting defined the number of vnodes (token ranges) a node was responsible for. By increasing the number of vnodes per node, the token ranges become smaller. This is because the token ring has a finite number of tokens. The more ranges it is divided up into the smaller each range is.

To maintain backwards compatibility with older 1.x series clusters, the num_tokens defaulted to a value of 1. Moreover, the setting was effectively disabled on a vanilla installation. Specifically, the value in the cassandra.yaml file was commented out. The commented line and previous development commits did give a glimpse into the future of where the feature was headed though.

As foretold by the cassandra.yaml file, and the git commit history, when Cassandra version 2.0 was released out the vnodes feature was enabled by default. The num_tokens line was no longer commented out, so its effective default value on a vanilla installation was 256. Thus ushering in a new era of clusters that had relatively even token distributions, and were simple to grow.

With nodes consisting of 256 vnodes and the accompanying additional features, expanding the cluster was a dream. You could insert one new node into your cluster and Cassandra would calculate and assign the tokens automatically! The token values were randomly calculated, and so over time as you added more nodes, the cluster would converge on being in a balanced state. This engineering wizardry put an end to spending hours doing calculations and nodetool move operations to grow a cluster. The option was still there though. If you had a very large cluster or another requirement, you could still use the initial_token setting which was commented out in Cassandra version 2.0. In this case, the value for the num_tokens still had to be set to the number of tokens manually defined in the initial_token setting.

Remember to read the fine print

This gave us a feature that was like a personal devops assistant; you handed them a node, told them to insert it, and then after some time it had tokens allocated and was part of the cluster. However, in a similar vein, there is a price to pay for the convenience…

While we get a more even token distribution when using 256 vnodes, the problem is that availability degrades earlier. Ironically, the more we break the token ranges up the more quickly we can get data unavailability. Then there is the issue of unbalanced token ranges when using a small number of vnodes. By small, I mean values less than 32. Cassandra’s random token allocation is hopeless when it comes to small vnode values. This is because there are insufficient tokens to balance out the wildly different token range sizes that are generated.

Pics or it didn’t happen

It is very easy to demonstrate the availability and token range imbalance issues, using a test cluster. We can set up a single token range cluster with six nodes using ccm. After calculating the tokens, configuring and starting our test cluster, it looked like this.

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  71.17 KiB  1            33.3%             8d483ae7-e7fa-4c06-9c68-22e71b78e91f  rack1
UN  127.0.0.2  65.99 KiB  1            33.3%             cc15803b-2b93-40f7-825f-4e7bdda327f8  rack1
UN  127.0.0.3  85.3 KiB   1            33.3%             d2dd4acb-b765-4b9e-a5ac-a49ec155f666  rack1
UN  127.0.0.4  104.58 KiB  1            33.3%             ad11be76-b65a-486a-8b78-ccf911db4aeb  rack1
UN  127.0.0.5  71.19 KiB  1            33.3%             76234ece-bf24-426a-8def-355239e8f17b  rack1
UN  127.0.0.6  30.45 KiB  1            33.3%             cca81c64-d3b9-47b8-ba03-46356133401b  rack1

We can then create a test keyspace and populated it using cqlsh.

$ ccm node1 cqlsh
Connected to SINGLETOKEN at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.9 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE test_keyspace WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
cqlsh> CREATE TABLE test_keyspace.test_table (
...   id int,
...   value text,
...   PRIMARY KEY (id));
cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (1, 'foo');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (2, 'bar');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (3, 'net');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (4, 'moo');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (5, 'car');
cqlsh> INSERT INTO test_keyspace.test_table (id, value) VALUES (6, 'set');

To confirm that the cluster is perfectly balanced, we can check the token ring.

$ ccm node1 nodetool ring test_keyspace


Datacenter: datacenter1
==========
Address    Rack   Status  State   Load        Owns     Token
                                                       6148914691236517202
127.0.0.1  rack1  Up      Normal  125.64 KiB  50.00%   -9223372036854775808
127.0.0.2  rack1  Up      Normal  125.31 KiB  50.00%   -6148914691236517206
127.0.0.3  rack1  Up      Normal  124.1 KiB   50.00%   -3074457345618258604
127.0.0.4  rack1  Up      Normal  104.01 KiB  50.00%   -2
127.0.0.5  rack1  Up      Normal  126.05 KiB  50.00%   3074457345618258600
127.0.0.6  rack1  Up      Normal  120.76 KiB  50.00%   6148914691236517202

We can see in the “Owns” column all nodes have 50% ownership of the data. To make the example easier to follow we can manually add a letter representation next to each token number. So the token ranges could be represented in the following way:

$ ccm node1 nodetool ring test_keyspace


Datacenter: datacenter1
==========
Address    Rack   Status  State   Load        Owns     Token                 Token Letter
                                                       6148914691236517202   F
127.0.0.1  rack1  Up      Normal  125.64 KiB  50.00%   -9223372036854775808  A
127.0.0.2  rack1  Up      Normal  125.31 KiB  50.00%   -6148914691236517206  B
127.0.0.3  rack1  Up      Normal  124.1 KiB   50.00%   -3074457345618258604  C
127.0.0.4  rack1  Up      Normal  104.01 KiB  50.00%   -2                    D
127.0.0.5  rack1  Up      Normal  126.05 KiB  50.00%   3074457345618258600   E
127.0.0.6  rack1  Up      Normal  120.76 KiB  50.00%   6148914691236517202   F

We can then capture the output of ccm node1 nodetool describering test_keyspace and change the token numbers to the corresponding letters in the above token ring output.

$ ccm node1 nodetool describering test_keyspace

Schema Version:6256fe3f-a41e-34ac-ad76-82dba04d92c3
TokenRange:
  TokenRange(start_token:A, end_token:B, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.4], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
  TokenRange(start_token:C, end_token:D, endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], rpc_endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])
  TokenRange(start_token:B, end_token:C, endpoints:[127.0.0.3, 127.0.0.4, 127.0.0.5], rpc_endpoints:[127.0.0.3, 127.0.0.4, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
  TokenRange(start_token:D, end_token:E, endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.1], rpc_endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
  TokenRange(start_token:F, end_token:A, endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], rpc_endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
  TokenRange(start_token:E, end_token:F, endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])

Using the above output, specifically the end_token, we can determine all the token ranges assigned to each node. As mentioned earlier, the token range is defined by the values after the previous token (start_token) up to and including the assigned token (end_token). The token ranges assigned to each node looked like this:

In this setup, if node3 and node6 were unavailable, we would lose an entire replica. Even if the application is using a Consistency Level of LOCAL_QUORUM, all the data is still available. We still have two other replicas across the other four nodes.

Now let’s consider the case where our cluster is using vnodes. For example purposes we can set num_tokens to 3. It will give us a smaller number of tokens making for an easier to follow example. After configuring and starting the nodes in ccm, our test cluster initially looked like this.

For the majority of production deployments where the cluster size is less than 500 nodes, it is recommended that you use a larger value for `num_tokens`. Further information can be found in the Apache Cassandra Production Recommendations.

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  71.21 KiB  3       46.2%             7d30cbd4-8356-4189-8c94-0abe8e4d4d73  rack1
UN  127.0.0.2  66.04 KiB  3       37.5%             16bb0b37-2260-440c-ae2a-08cbf9192f85  rack1
UN  127.0.0.3  90.48 KiB  3       28.9%             dc8c9dfd-cf5b-470c-836d-8391941a5a7e  rack1
UN  127.0.0.4  104.64 KiB  3      20.7%             3eecfe2f-65c4-4f41-bbe4-4236bcdf5bd2  rack1
UN  127.0.0.5  66.09 KiB  3       36.1%             4d5adf9f-fe0d-49a0-8ab3-e1f5f9f8e0a2  rack1
UN  127.0.0.6  71.23 KiB  3       30.6%             b41496e6-f391-471c-b3c4-6f56ed4442d6  rack1

Right off the blocks we can see signs that the cluster might be unbalanced. Similar to what we did with the single node cluster, here we create the test keyspace and populate it using cqlsh. We then grab a read out of the token ring to see what that looks like. Once again, to make the example easier to follow we manually add a letter representation next to each token number.

$ ccm node1 nodetool ring test_keyspace

Datacenter: datacenter1
==========
Address    Rack   Status  State   Load        Owns    Token                 Token Letter
                                                      8828652533728408318   R
0.0.5  rack1  Up      Normal  121.09 KiB  41.44%  -7586808982694641609  A
0.0.1  rack1  Up      Normal  126.49 KiB  64.03%  -6737339388913371534  B
0.0.2  rack1  Up      Normal  126.04 KiB  66.60%  -5657740186656828604  C
0.0.3  rack1  Up      Normal  135.71 KiB  39.89%  -3714593062517416200  D
0.0.6  rack1  Up      Normal  126.58 KiB  40.07%  -2697218374613409116  E
0.0.1  rack1  Up      Normal  126.49 KiB  64.03%  -1044956249817882006  F
0.0.2  rack1  Up      Normal  126.04 KiB  66.60%  -877178609551551982   G
0.0.4  rack1  Up      Normal  110.22 KiB  47.96%  -852432543207202252   H
0.0.5  rack1  Up      Normal  121.09 KiB  41.44%  117262867395611452    I
0.0.6  rack1  Up      Normal  126.58 KiB  40.07%  762725591397791743    J
0.0.3  rack1  Up      Normal  135.71 KiB  39.89%  1416289897444876127   K
0.0.1  rack1  Up      Normal  126.49 KiB  64.03%  3730403440915368492   L
0.0.4  rack1  Up      Normal  110.22 KiB  47.96%  4190414744358754863   M
0.0.2  rack1  Up      Normal  126.04 KiB  66.60%  6904945895761639194   N
0.0.5  rack1  Up      Normal  121.09 KiB  41.44%  7117770953638238964   O
0.0.4  rack1  Up      Normal  110.22 KiB  47.96%  7764578023697676989   P
0.0.3  rack1  Up      Normal  135.71 KiB  39.89%  8123167640761197831   Q
0.0.6  rack1  Up      Normal  126.58 KiB  40.07%  8828652533728408318   R

As we can see from the “Owns” column above, there are some large token range ownership imbalances. The smallest token range ownership is by node 127.0.0.3 at 39.89%. The largest token range ownership is by node 127.0.0.2 at 66.6%. This is about 26% difference!

Once again, we capture the output of ccm node1 nodetool describering test_keyspace and change the token numbers to the corresponding letters in the above token ring.

$ ccm node1 nodetool describering test_keyspace

Schema Version:4b2dc440-2e7c-33a4-aac6-ffea86cb0e21
TokenRange:
    TokenRange(start_token:J, end_token:K, endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.4], rpc_endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:K, end_token:L, endpoints:[127.0.0.1, 127.0.0.4, 127.0.0.2], rpc_endpoints:[127.0.0.1, 127.0.0.4, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:E, end_token:F, endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.4], rpc_endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:D, end_token:E, endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:I, end_token:J, endpoints:[127.0.0.6, 127.0.0.3, 127.0.0.1], rpc_endpoints:[127.0.0.6, 127.0.0.3, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:A, end_token:B, endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], rpc_endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:R, end_token:A, endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:M, end_token:N, endpoints:[127.0.0.2, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.2, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:H, end_token:I, endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:L, end_token:M, endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.5], rpc_endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:N, end_token:O, endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:P, end_token:Q, endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.5], rpc_endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:Q, end_token:R, endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.1], rpc_endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:F, end_token:G, endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.5], rpc_endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:C, end_token:D, endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.1], rpc_endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:G, end_token:H, endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], rpc_endpoints:[127.0.0.4, 127.0.0.5, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:B, end_token:C, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.6], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:O, end_token:P, endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.6], rpc_endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.6], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack1)])

Finally, we can determine all the token ranges assigned to each node. The token ranges assigned to each node looked like this:

Using this we can see what happens if we had the same outage as the single token cluster did, that is, node3 and node6 are unavailable. As we can see node3 and node6 are both responsible for tokens C, D, I, J, P, and Q. Hence, data associated with those tokens would be unavailable if our application is using a Consistency Level of LOCAL_QUORUM. To put that in different terms, unlike our single token cluster, in this case 33.3% of our data could no longer be retrieved.

Rack ‘em up

A seasoned Cassandra operator will notice that so far we have run our token distribution tests on clusters with only a single rack. To help increase the availability when using vnodes racks can be deployed. When racks are used Cassandra will try to place single replicas in each rack. That is, it will try to ensure no two identical token ranges appear in the same rack.

The key here is to configure the cluster so that for a given datacenter the number of racks is the same as the replication factor.

Let’s retry our previous example where we set num_tokens to 3, only this time we’ll define three racks in the test cluster. After configuring and starting the nodes in ccm, our newly configured test cluster initially looks like this:

$ ccm node1 nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  71.08 KiB  3       31.8%             49df615d-bfe5-46ce-a8dd-4748c086f639  rack1
UN  127.0.0.2  71.04 KiB  3       34.4%             3fef187e-00f5-476d-b31f-7aa03e9d813c  rack2
UN  127.0.0.3  66.04 KiB  3       37.3%             c6a0a5f4-91f8-4bd1-b814-1efc3dae208f  rack3
UN  127.0.0.4  109.79 KiB  3      52.9%             74ac0727-c03b-476b-8f52-38c154cfc759  rack1
UN  127.0.0.5  66.09 KiB  3       18.7%             5153bad4-07d7-4a24-8066-0189084bbc80  rack2
UN  127.0.0.6  66.09 KiB  3       25.0%             6693214b-a599-4f58-b1b4-a6cf0dd684ba  rack3

We can still see signs that the cluster might be unbalanced. This is a side issue, as the main point to take from the above is that we now have three racks defined in the cluster with two nodes assigned in each. Once again, similar to the single node cluster, we can create the test keyspace and populate it using cqlsh. We then grab a read out of the token ring to see what it looks like. Same as the previous tests, to make the example easier to follow, we manually add a letter representation next to each token number.

ccm node1 nodetool ring test_keyspace


Datacenter: datacenter1
==========
Address    Rack   Status  State   Load        Owns    Token                 Token Letter
                                                      8993942771016137629   R
0.0.5  rack2  Up      Normal  122.42 KiB  34.65%  -8459555739932651620  A
0.0.4  rack1  Up      Normal  111.07 KiB  53.84%  -8458588239787937390  B
0.0.3  rack3  Up      Normal  116.12 KiB  60.72%  -8347996802899210689  C
0.0.1  rack1  Up      Normal  121.31 KiB  46.16%  -5712162437894176338  D
0.0.4  rack1  Up      Normal  111.07 KiB  53.84%  -2744262056092270718  E
0.0.6  rack3  Up      Normal  122.39 KiB  39.28%  -2132400046698162304  F
0.0.2  rack2  Up      Normal  121.42 KiB  65.35%  -1232974565497331829  G
0.0.4  rack1  Up      Normal  111.07 KiB  53.84%  1026323925278501795   H
0.0.2  rack2  Up      Normal  121.42 KiB  65.35%  3093888090255198737   I
0.0.2  rack2  Up      Normal  121.42 KiB  65.35%  3596129656253861692   J
0.0.3  rack3  Up      Normal  116.12 KiB  60.72%  3674189467337391158   K
0.0.5  rack2  Up      Normal  122.42 KiB  34.65%  3846303495312788195   L
0.0.1  rack1  Up      Normal  121.31 KiB  46.16%  4699181476441710984   M
0.0.1  rack1  Up      Normal  121.31 KiB  46.16%  6795515568417945696   N
0.0.3  rack3  Up      Normal  116.12 KiB  60.72%  7964270297230943708   O
0.0.5  rack2  Up      Normal  122.42 KiB  34.65%  8105847793464083809   P
0.0.6  rack3  Up      Normal  122.39 KiB  39.28%  8813162133522758143   Q
0.0.6  rack3  Up      Normal  122.39 KiB  39.28%  8993942771016137629   R

Once again we capture the output of ccm node1 nodetool describering test_keyspace and change the token numbers to the corresponding letters in the above token ring.

$ ccm node1 nodetool describering test_keyspace

Schema Version:aff03498-f4c1-3be1-b133-25503becf208
TokenRange:
    TokenRange(start_token:B, end_token:C, endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.2], rpc_endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
    TokenRange(start_token:L, end_token:M, endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], rpc_endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2)])
    TokenRange(start_token:N, end_token:O, endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:P, end_token:Q, endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:K, end_token:L, endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.1, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
    TokenRange(start_token:R, end_token:A, endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], rpc_endpoints:[127.0.0.5, 127.0.0.4, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
    TokenRange(start_token:I, end_token:J, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:Q, end_token:R, endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], rpc_endpoints:[127.0.0.6, 127.0.0.5, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:E, end_token:F, endpoints:[127.0.0.6, 127.0.0.2, 127.0.0.4], rpc_endpoints:[127.0.0.6, 127.0.0.2, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:H, end_token:I, endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:D, end_token:E, endpoints:[127.0.0.4, 127.0.0.6, 127.0.0.2], rpc_endpoints:[127.0.0.4, 127.0.0.6, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
    TokenRange(start_token:A, end_token:B, endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.2], rpc_endpoints:[127.0.0.4, 127.0.0.3, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
    TokenRange(start_token:C, end_token:D, endpoints:[127.0.0.1, 127.0.0.6, 127.0.0.2], rpc_endpoints:[127.0.0.1, 127.0.0.6, 127.0.0.2], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2)])
    TokenRange(start_token:F, end_token:G, endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.3], rpc_endpoints:[127.0.0.2, 127.0.0.4, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
    TokenRange(start_token:O, end_token:P, endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.4], rpc_endpoints:[127.0.0.5, 127.0.0.6, 127.0.0.4], endpoint_details:[EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.6, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:J, end_token:K, endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.1], rpc_endpoints:[127.0.0.3, 127.0.0.5, 127.0.0.1], endpoint_details:[EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1)])
    TokenRange(start_token:G, end_token:H, endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.3], rpc_endpoints:[127.0.0.4, 127.0.0.2, 127.0.0.3], endpoint_details:[EndpointDetails(host:127.0.0.4, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.2, datacenter:datacenter1, rack:rack2), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3)])
    TokenRange(start_token:M, end_token:N, endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], rpc_endpoints:[127.0.0.1, 127.0.0.3, 127.0.0.5], endpoint_details:[EndpointDetails(host:127.0.0.1, datacenter:datacenter1, rack:rack1), EndpointDetails(host:127.0.0.3, datacenter:datacenter1, rack:rack3), EndpointDetails(host:127.0.0.5, datacenter:datacenter1, rack:rack2)])

Lastly, we once again determine all the token ranges assigned to each node:

As we can see from the way Cassandra has assigned the tokens, there is now a complete data replica spread across two nodes in each of our three racks. If we go back to our failure scenario where node3 and node6 become unavailable, we can still service queries using a Consistency Level of LOCAL_QUORUM. The only elephant in the room here is node3 has a lot more tokens distributed to it than other nodes. Its counterpart in the same rack, node6, is at the opposite end with fewer tokens allocated to it.

Too many vnodes spoil the cluster

Given the token distribution issues with a low numbers of vnodes, one would think the best option is to have a large vnode value. However, apart from having a higher chance of some data being unavailable in a multi-node outage, large vnode values also impact streaming operations. To repair data on a node, Cassandra will start one repair session per vnode. These repair sessions need to be processed sequentially. Hence, the larger the vnode value the longer the repair times, and the overhead needed to run a repair.

In an effort to fix slow repair times as a result of large vnode values, CASSANDRA-5220 was introduced in 3.0. This change allows Cassandra to group common token ranges for a set of nodes into a single repair session. It increased the size of the repair session as multiple token ranges were being repaired, but reduced the number of repair sessions being executed in parallel.

We can see the effect that vnodes have on repair by running a simple test on a cluster backed by real hardware. To do this test we first need create a cluster that uses single tokens run a repair. Then we can create the same cluster except with 256 vnodes, and run the same repair. We will use tlp-cluster to create a Cassandra cluster in AWS with the following properties.

Instance size: i3.2xlarge
Node count: 12
Rack count: 3 (4 nodes per rack)
Cassandra version: 3.11.9 (latest stable release at the time of writing)

The commands to build this cluster are as follows.

$ tlp-cluster init --azs a,b,c --cassandra 12 --instance i3.2xlarge --stress 1 TLP BLOG "Blogpost repair testing"
$ tlp-cluster up
$ tlp-cluster use --config "cluster_name:SingleToken" --config "num_tokens:1" 3.11.9
$ tlp-cluster install

Once we provision the hardware we set the initial_token property for each of the nodes individually. We can calculate the initial tokens for each node using a simple Python command.

Python 2.7.16 (default, Nov 23 2020, 08:01:20)
[GCC Apple LLVM 12.0.0 (clang-1200.0.30.4) [+internal-os, ptrauth-isa=sign+stri on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> num_tokens = 1
>>> num_nodes = 12
>>> print("\n".join(['[Node {}] initial_token: {}'.format(n + 1, ','.join([str(((2**64 / (num_tokens * num_nodes)) * (t * num_nodes + n)) - 2**63) for t in range(num_tokens)])) for n in range(num_nodes)]))
[Node 1] initial_token: -9223372036854775808
[Node 2] initial_token: -7686143364045646507
[Node 3] initial_token: -6148914691236517206
[Node 4] initial_token: -4611686018427387905
[Node 5] initial_token: -3074457345618258604
[Node 6] initial_token: -1537228672809129303
[Node 7] initial_token: -2
[Node 8] initial_token: 1537228672809129299
[Node 9] initial_token: 3074457345618258600
[Node 10] initial_token: 4611686018427387901
[Node 11] initial_token: 6148914691236517202
[Node 12] initial_token: 7686143364045646503

After starting Cassandra on all the nodes, around 3 GB of data per node can be preloaded using the following tlp-stress command. In this command we set our keyspace replication factor to 3 and set gc_grace_seconds to 0. This is done to make hints expire immediately when they are created, which ensures they are never delivered to the destination node.

ubuntu@ip-172-31-19-180:~$ tlp-stress run KeyValue --replication "{'class': 'NetworkTopologyStrategy', 'us-west-2':3 }" --cql "ALTER TABLE tlp_stress.keyvalue WITH gc_grace_seconds = 0" --reads 1 --partitions 100M --populate 100M --iterations 1

Upon completion of the data loading, the cluster status looks like this.

ubuntu@ip-172-31-30-95:~$ nodetool status
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.30.95   2.78 GiB   1            25.0%             6640c7b9-c026-4496-9001-9d79bea7e8e5  2a
UN  172.31.31.106  2.79 GiB   1            25.0%             ceaf9d56-3a62-40be-bfeb-79a7f7ade402  2a
UN  172.31.2.74    2.78 GiB   1            25.0%             4a90b071-830e-4dfe-9d9d-ab4674be3507  2c
UN  172.31.39.56   2.79 GiB   1            25.0%             37fd3fe0-598b-428f-a84b-c27fc65ee7d5  2b
UN  172.31.31.184  2.78 GiB   1            25.0%             40b4e538-476a-4f20-a012-022b10f257e9  2a
UN  172.31.10.87   2.79 GiB   1            25.0%             fdccabef-53a9-475b-9131-b73c9f08a180  2c
UN  172.31.18.118  2.79 GiB   1            25.0%             b41ab8fe-45e7-4628-94f0-a4ec3d21f8d0  2a
UN  172.31.35.4    2.79 GiB   1            25.0%             246bf6d8-8deb-42fe-bd11-05cca8f880d7  2b
UN  172.31.40.147  2.79 GiB   1            25.0%             bdd3dd61-bb6a-4849-a7a6-b60a2b8499f6  2b
UN  172.31.13.226  2.79 GiB   1            25.0%             d0389979-c38f-41e5-9836-5a7539b3d757  2c
UN  172.31.5.192   2.79 GiB   1            25.0%             b0031ef9-de9f-4044-a530-ffc67288ebb6  2c
UN  172.31.33.0    2.79 GiB   1            25.0%             da612776-4018-4cb7-afd5-79758a7b9cf8  2b

We can then run a full repair on each node using the following commands.

$ source env.sh
$ c_all "nodetool repair -full tlp_stress"

The repair times recorded for each node were.

[2021-01-22 20:20:13,952] Repair command #1 finished in 3 minutes 55 seconds
[2021-01-22 20:23:57,053] Repair command #1 finished in 3 minutes 36 seconds
[2021-01-22 20:27:42,123] Repair command #1 finished in 3 minutes 32 seconds
[2021-01-22 20:30:57,654] Repair command #1 finished in 3 minutes 21 seconds
[2021-01-22 20:34:27,740] Repair command #1 finished in 3 minutes 17 seconds
[2021-01-22 20:37:40,449] Repair command #1 finished in 3 minutes 23 seconds
[2021-01-22 20:41:32,391] Repair command #1 finished in 3 minutes 36 seconds
[2021-01-22 20:44:52,917] Repair command #1 finished in 3 minutes 25 seconds
[2021-01-22 20:47:57,729] Repair command #1 finished in 2 minutes 58 seconds
[2021-01-22 20:49:58,868] Repair command #1 finished in 1 minute 58 seconds
[2021-01-22 20:51:58,724] Repair command #1 finished in 1 minute 53 seconds
[2021-01-22 20:54:01,100] Repair command #1 finished in 1 minute 50 seconds

These times give us a total repair time of 36 minutes and 44 seconds.

The same cluster can be reused to test repair times when 256 vnodes are used. To do this we execute the following steps.

Shut down Cassandra on all the nodes.
Delete the contents in each of the directories data, commitlog, hints, and saved_caches (these are located in /var/lib/cassandra/ on each node).
Set num_tokens in the cassandra.yaml configuration file to a value of 256 and remove the initial_token setting.
Start up Cassandra on all the nodes.

After populating the cluster with data its status looked like this.

ubuntu@ip-172-31-30-95:~$ nodetool status
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  172.31.30.95   2.79 GiB   256          24.3%             10b0a8b5-aaa6-4528-9d14-65887a9b0b9c  2a
UN  172.31.2.74    2.81 GiB   256          24.4%             a748964d-0460-4f86-907d-a78edae2a2cb  2c
UN  172.31.31.106  3.1 GiB    256          26.4%             1fc68fbd-335d-4689-83b9-d62cca25c88a  2a
UN  172.31.31.184  2.78 GiB   256          23.9%             8a1b25e7-d2d8-4471-aa76-941c2556cc30  2a
UN  172.31.39.56   2.73 GiB   256          23.5%             3642a964-5d21-44f9-b330-74c03e017943  2b
UN  172.31.10.87   2.95 GiB   256          25.4%             540a38f5-ad05-4636-8768-241d85d88107  2c
UN  172.31.18.118  2.99 GiB   256          25.4%             41b9f16e-6e71-4631-9794-9321a6e875bd  2a
UN  172.31.35.4    2.96 GiB   256          25.6%             7f62d7fd-b9c2-46cf-89a1-83155feebb70  2b
UN  172.31.40.147  3.26 GiB   256          27.4%             e17fd867-2221-4fb5-99ec-5b33981a05ef  2b
UN  172.31.13.226  2.91 GiB   256          25.0%             4ef69969-d9fe-4336-9618-359877c4b570  2c
UN  172.31.33.0    2.74 GiB   256          23.6%             298ab053-0c29-44ab-8a0a-8dde03b4f125  2b
UN  172.31.5.192   2.93 GiB   256          25.2%             7c690640-24df-4345-aef3-dacd6643d6c0  2c

When we run the same repair test for the single token cluster on the vnode cluster, the following repair times were recorded.

[2021-01-22 22:45:56,689] Repair command #1 finished in 4 minutes 40 seconds
[2021-01-22 22:50:09,170] Repair command #1 finished in 4 minutes 6 seconds
[2021-01-22 22:54:04,820] Repair command #1 finished in 3 minutes 43 seconds
[2021-01-22 22:57:26,193] Repair command #1 finished in 3 minutes 27 seconds
[2021-01-22 23:01:23,554] Repair command #1 finished in 3 minutes 44 seconds
[2021-01-22 23:04:40,523] Repair command #1 finished in 3 minutes 27 seconds
[2021-01-22 23:08:20,231] Repair command #1 finished in 3 minutes 23 seconds
[2021-01-22 23:11:01,230] Repair command #1 finished in 2 minutes 45 seconds
[2021-01-22 23:13:48,682] Repair command #1 finished in 2 minutes 40 seconds
[2021-01-22 23:16:23,630] Repair command #1 finished in 2 minutes 32 seconds
[2021-01-22 23:18:56,786] Repair command #1 finished in 2 minutes 26 seconds
[2021-01-22 23:21:38,961] Repair command #1 finished in 2 minutes 30 seconds

These times give us a total repair time of 39 minutes and 23 seconds.

While the time difference is quite small for 3 GB of data per node (up to an additional 45 seconds per node), it is easy to see how the difference could balloon out when we have data sizes in the order of hundreds of gigabytes per node.

Unfortunately, all data streaming operations like bootstrap and datacenter rebuild fall victim to the same issue repairs have with large vnode values. Specifically, when a node needs to stream data to another node a streaming session is opened for each token range on the node. This results in a lot of unnecessary overhead, as data is transferred via the JVM.

Secondary indexes impacted too

To add insult to injury, the negative effect of a large vnode values extends to secondary indexes because of the way the read path works.

When a coordinator node receives a secondary index request from a client, it fans out the request to all the nodes in the cluster or datacenter depending on the locality of the consistency level. Each node then checks the SSTables for each of the token ranges assigned to it for a match to the secondary index query. Matches to the query are then returned to the coordinator node.

Hence, the larger the number of vnodes, the larger the impact to the responsiveness of the secondary index query. Furthermore, the performance impacts on secondary indexes grow exponentially with the number of replicas in the cluster. In a scenario where multiple datacenters have nodes using many vnodes, secondary indexes become even more inefficient.

A new hope

So what we are left with then is a property in Cassandra that really hits the mark in terms of reducing the complexities when resizing a cluster. Unfortunately, their benefits come at the expense of unbalanced token ranges on one end, and degraded operations performance at the other. That being said, the vnodes story is far from over.

Eventually, it became a well-known fact in the Apache Cassandra project that large vnode values had undesirable side effects on a cluster. To combat this issue, clever contributors and committers added CASSANDRA-7032 in 3.0; a replica aware token allocation algorithm. The idea was to allow a low value to be used for num_tokens while maintaining relatively even balanced token ranges. The enhancement includes the addition of the allocate_tokens_for_keyspace setting in the cassandra.yaml file. The new algorithm is used instead of the random token allocator when an existing user keyspace is assigned to the allocate_tokens_for_keyspace setting.

Behind the scenes, Cassandra takes the replication factor of the defined keyspace and uses it when calculating the token values for the node when it first enters the cluster. Unlike the random token generator, the replica aware generator is like an experienced member of a symphony orchestra; sophisticated and in tune with its surroundings. So much so, that the process it uses to generate token ranges involves:

Constructing an initial token ring state.
Computing candidates for new tokens by splitting all existing token ranges right in the middle.
Evaluating the expected improvements from all candidates and forming a priority queue.
Iterating through the candidates in the queue and selecting the best combination.
- During token selection, re-evaluate the candidate improvements in the queue.

While this was good advancement for Cassandra, there are a few gotchas to watch out for when using the replica aware token allocation algorithm. To start with, it only works with the Murmur3Partitioner partitioner. If you started with an old cluster that used another partitioner such as the RandomPartitioner and have upgraded over time to 3.0, the feature is unusable. The second and more common stumbling block is that some trickery is required to use this feature when creating a cluster from scratch. The question was common enough that we wrote a blog post specifically on how to use the new replica aware token allocation algorithm to set up a new cluster with even token distribution.

As you can see, Cassandra 3.0 made a genuine effort to address vnode’s rough edges. What’s more, there are additional beacons of light on the horizon with the upcoming Cassandra 4.0 major release. For instance, a new allocate_tokens_for_local_replication_factor setting has been added to the cassandra.yaml file via CASSANDRA-15260. Similar to its cousin the allocate_tokens_for_keyspace setting, the replica aware token allocation algorithm is activated when a value is supplied to it.

However, unlike its close relative, it is more user-friendly. This is because no phaffing is required to create a balanced cluster from scratch. In the simplest case, you can set a value for the allocate_tokens_for_local_replication_factor setting and just start adding nodes. Advanced operators can still manually assign tokens to the initial nodes to ensure the desired replication factor is met. After that, subsequent nodes can be added with the replication factor value assigned to the allocate_tokens_for_local_replication_factor setting.

Arguably, one of the longest time coming and significant changes to be released with Cassandra 4.0 is the update to the default value of the num_tokens setting. As mentioned at the beginning of this post thanks to CASSANDRA-13701 Cassandra 4.0 will ship with a num_tokens value set to 16 in the cassandra.yaml file. In addition, the allocate_tokens_for_local_replication_factor setting is enabled by default and set to a value of 3.

These changes are much better user defaults. On a vanilla installation of Cassandra 4.0, the replica aware token allocation algorithm kicks in as soon as there are enough hosts to satisfy a replication factor of 3. The result is an evenly distributed token ranges for new nodes with all the benefits that a low vnodes value has to offer.

Conclusion

The consistent hashing and token allocation functionality form part of Cassandra’s backbone. Virtual nodes take the guess work out of maintaining this critical functionality, specifically, making cluster resizing quicker and easier. As a rule of thumb, the lower the number of vnodes, the less even the token distribution will be, leading to some nodes being over worked. Alternatively, the higher the number of vnodes, the slower cluster wide operations take to complete and more likely data will be unavailable if multiple nodes are down. The features in 3.0 and the enhancements to those features thanks to 4.0, allow Cassandra to use a low number of vnodes while still maintaining a relatively even token distribution. Ultimately, it will produce a better out-of-the-box experience for new users when running a vanilla installation of Cassandra 4.0.

Get Rid of Read Repair Chance

2021-01-12T00:00:00+00:00

Apache Cassandra has a feature called Read Repair Chance that we always recommend our clients to disable. It is often an additional ~20% internal read load cost on your cluster that serves little purpose and provides no guarantees.

What is read repair chance?

The feature comes with two schema options at the table level: read_repair_chance and dclocal_read_repair_chance. Each representing the probability that the coordinator node will query the extra replica nodes, beyond the requested consistency level, for the purpose of read repairs.

The original setting read_repair_chance now defines the probability of issuing the extra queries to all replicas in all data centers. And the newer dclocal_read_repair_chance setting defines the probability of issuing the extra queries to all replicas within the current data center.

The default values are read_repair_chance = 0.0 and dclocal_read_repair_chance = 0.1. This means that cross-datacenter asynchronous read repair is disabled and asynchronous read repair within the datacenter occurs on 10% of read requests.

What does it cost?

Consider the following cluster deployment:

A keyspace with a replication factor of three (RF=3) in a single data center
The default value of dclocal_read_repair_chance = 0.1
Client reads using a consistency level of LOCAL_QUORUM
Client is using the token aware policy (default for most drivers)

In this setup, the cluster is going to see ~10% of the read requests result in the coordinator issuing two messaging system queries to two replicas, instead of just one. This results in an additional ~5% load.

If the requested consistency level is LOCAL_ONE, which is the default for the java-driver, then ~10% of the read requests result in the coordinator increasing messaging system queries from zero to two. This equates to a ~20% read load increase.

With read_repair_chance = 0.1 and multiple datacenters the situation is much worse. With three data centers each with RF=3, then 10% of the read requests will result in the coordinator issuing eight extra replica queries. And six of those extra replica queries are now via cross-datacenter queries. In this use-case it becomes a doubling of your read load.

Let’s take a look at this with some flamegraphs…

The first flamegraph shows the default configuration of dclocal_read_repair_chance = 0.1. When the coordinator’s code hits the AbstractReadExecutor.getReadExecutor(..) method, it splits paths depending on the ReadRepairDecision for the table. Stack traces containing either AlwaysSpeculatingReadExecutor, SpeculatingReadExecutor or NeverSpeculatingReadExecutor provide us a hint to which code path we are on, and whether either the read repair chance or speculative retry are in play.

The second flamegraph shows the behaviour when the configuration has been changed to dclocal_read_repair_chance = 0.0. The AlwaysSpeculatingReadExecutor flame is gone and this demonstrates the degree of complexity removed from runtime. Specifically, read requests from the client are now forwarded to every replica instead of only those defined by the consistency level.

ℹ️ These flamegraphs were created with Apache Cassandra 3.11.9, Kubernetes and the cass-operator, nosqlbench and the async-profiler.

Previously we relied upon the existing tools of tlp-cluster, ccm, tlp-stress and cassandra-stress. This new approach with new tools is remarkably easy, and by using k8s the same approach can be used locally or against a dedicated k8s infrastructure. That is, I don't need to switch between ccm clusters for local testing and tlp-cluster for cloud testing. The same recipe applies everywhere. Nor am I bound to AWS for my cloud testing. It is also worth mentioning that these new tools are gaining a lot of focus and momentum from DataStax, so the introduction of this new approach to the open source community is deserved.

The full approach and recipe to generating these flamegraphs will follow in a [subsequent blog post](/blog/2021/01/31/cassandra_and_kubernetes_cass_operator.html).

What is the benefit of this additional load?

The coordinator returns the result to the client once it has received the response from one of the replicas, per the user’s requested consistency level. This is why we call the feature asynchronous read repairs. This means that read latencies are not directly impacted though the additional background load will indirectly impact latencies.

Asynchronous read repairs also means that there’s no guarantee that the response to the client is repaired data. In summary, 10% of the data you read will be guaranteed to be repaired after you have read it. This is not a guarantee clients can use or rely upon. And it is not a guarantee Cassandra operators can rely upon to ensure data at rest is consistent. In fact it is not a guarantee an operator would want to rely upon anyway, as most inconsistencies are dealt with by hints and nodes down longer than the hint window are expected to be manually repaired.

Furthermore, systems that use strong consistency (i.e. where reads and writes are using quorum consistency levels) will not expose such unrepaired data anyway. Such systems only need repairs and consistent data on disk for lower read latencies (by avoiding the additional digest mismatch round trip between coordinator and replicas) and ensuring deleted data is not resurrected (i.e. tombstones are properly propagated).

So the feature gives us additional load for no usable benefit. This is why disabling the feature is always an immediate recommendation we give everyone.

It is also the rationale for the feature being removed altogether in the next major release, Cassandra version 4.0. And, since 3.0.17 and 3.11.3, if you still have values set for these properties in your table, you may have noticed the following warning during startup:

dclocal_read_repair_chance table option has been deprecated and will be removed in version 4.0

Get Rid of It

For Cassandra clusters not yet on version 4.0, do the following to disable all asynchronous read repairs:

cqlsh -e 'ALTER TABLE <keyspace_name>.<table_name> WITH read_repair_chance = 0.0 AND dclocal_read_repair_chance = 0.0;'

When upgrading to Cassandra 4.0 no action is required, these settings are ignored and disappear.