The Khangaonkar Report

Replication in modern data systems

2024-07-20T16:27:00.000-07:00

Overview

Replication means making a copy of the data so that it can be used after a failure, or to scale.

Why is it a big deal? We copy files for backup all the time. For static files that do not change, making a copy takes one copy command. But what if the data is being updated by users all the time? How often do you run the copy command? How do you keep the copy in sync with the source?

That is the problem of replication in databases and data systems. All databases have replication built in that you can set up with a command or two. So why read about or discuss it? If you are building a distributed system that involves data, you will need to replicate data. The concepts from databases will be useful.

While replication is best known for its use with databases, it is also a critical part of distributed systems where the data is unstructured, such as distributed file systems (HDFS) or messaging systems (Apache Kafka).

This post covers replication in traditional single node systems as well as modern distributed systems.

Why do we need replication ?

There are several reasons why replication is needed. It is more than just taking a backup.

Redundancy

Make a copy of the data. When the main server becomes unavailable for any reason, switch to the copy. This ensures that the data is always available.

Scalability

Your data becomes really popular and the database gets more read requests than it can keep up with. So you make copies of the database and have a load balancer distribute the requests across the copies (replicas).

Geo distribution of data

Bring the data close to the user. You have users in the Americas, Europe and Asia. Data from the Americas is replicated to Europe and Asia, so users there can read data locally without making a round trip to the Americas for every read.

Secondary use cases

These are lesser known and unconventional use cases. They might be done higher up in the stack, at the application layer or in middleware, rather than in the database.

Mirroring

Mirroring involves replicating the requests to the application to a copy of the entire application stack. You can think of this as application level replication.

For example, for a REST service, this involves sending the http request, not just to the production service but also to a mirror service.

The mirror service reads from and writes to the mirror database. The mirror database is a former replica that was in sync with the leader. Just before mirroring starts, it is detached as a replica so that it does not receive the same writes twice.

Mirroring can be used for testing large complex changes against production traffic.

Data in the mirror database is then compared with data in the production database for accuracy.

Testing

A regular database replica is used as a test database. Various kinds of tests - feature tests, performance tests, concurrency tests, scalability tests can be run with services running with the replica. This is a different use case from mirroring.

Migration

This can be used to eliminate or reduce downtimes needed for migration.

Create additional replicas.

Run migration on them.

Roll over the application services to the new database replicas.

Replication strategies

Single leader

This is the most common pattern. It is shown in Figure 1.

One server is designated as the leader. The others are followers. All writes go to the leader. The leader replicates the writes to the followers.

The advantages are :

Setting up is fairly easy.

Reads become scalable. You can put a load balancer in front and distribute read requests to followers.

High availability: If the leader fails, you fail over to one of the followers and let it become the leader.

The disadvantages are :

All the writes go to one server, the leader, so it can become a bottleneck. Writes are not scaled.

If you read from a replica that is behind on replication, you might read stale data.

Multi leader

Writes can go to more than one server.

Multi leader replication is needed when

(1) Writes and replication need to happen across geographically distributed areas.

(2) Connectivity to a single leader is not guaranteed. This is usually the case with mobile devices or laptops, or when people want the ability to work offline and/or on multiple devices.

In the geo distributed case, the writes go to a local leader. The local leader replicates not only to its local replicas but also to the leaders in other regions (which replicate to their own replicas).

In the mobile case, the writes are stored locally and then replicated periodically when connectivity is available.

Advantages:

Writes are also scaled.

Writes can be done locally or close to clients, which gives better write latency.

Disadvantages:

Since writes happen at multiple leaders, there can be conflicts. The conflicts need to be resolved.

Leaderless

In the leaderless model, all nodes are equal and no node is designated leader. Writes can go to any node and that node replicates the write to other nodes. This is the model made popular by Amazon's Dynamo and later adopted by Cassandra.

Consensus based replication

All the above methods have either write conflict or read consistency issues. Raft and Paxos are two well known protocols for replicating log entries. Data to be replicated is modeled as a list of entries in a log. The short story is that one server sends one entry or a sequence of entries to the others, and the entries are considered committed once a majority of servers acknowledge having received them. Raft is built around an elected leader, while basic Paxos does not require a fixed leader (practical variants such as Multi-Paxos usually elect one). The Raft protocol describes in detail leader election, replication, server crashes, recovery and consistency checks. The paper is a good read for anyone interested in distributed systems.
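The majority rule can be illustrated with a minimal sketch. This is not Raft or Paxos: it leaves out terms, elections, retries and log matching, and the `Follower` class and `replicate` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    """A hypothetical follower that may or may not be reachable."""
    name: str
    reachable: bool = True
    log: list = field(default_factory=list)

    def append(self, entry) -> bool:
        # An unreachable follower never acknowledges.
        if not self.reachable:
            return False
        self.log.append(entry)
        return True


def replicate(entry, leader_log: list, followers: list) -> bool:
    """Append to the leader's log, send to followers, report whether a majority has the entry."""
    leader_log.append(entry)
    acks = 1  # the leader counts itself
    for follower in followers:
        if follower.append(entry):
            acks += 1
    cluster_size = len(followers) + 1
    return acks > cluster_size // 2


leader_log = []
followers = [Follower("b"), Follower("c", reachable=False)]
committed = replicate("x=1", leader_log, followers)  # 2 of 3 servers have the entry
```

With three servers, one unreachable follower does not block the commit; two unreachable followers do.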

Replication Implementations

The first three techniques apply to databases which deal with structured data and are a little more complicated.

Statement based replication

In this approach, the SQL statements such as INSERT/UPDATE/DELETE etc are forwarded as they are from the leaders to the followers. While this can work in most cases, it does not work in certain cases such as timestamps or when you generate an id or a random number.

It is not efficient either. If you insert a record and then delete it, why replicate both commands ?
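The non-determinism problem can be shown with a small sketch. The two `apply_*` functions are hypothetical stand-ins: one re-executes a statement that calls a random function, the other copies the value the leader actually stored.

```python
import random


def apply_statement(table: dict, key: str, rng: random.Random) -> None:
    """Stand-in for a statement such as INSERT ... VALUES (key, RAND())."""
    table[key] = rng.random()


def apply_row(table: dict, key: str, value: float) -> None:
    """Row-based alternative: store the exact value the leader produced."""
    table[key] = value


leader, follower_stmt, follower_row = {}, {}, {}

# Each server has its own random source (different seeds stand in for that).
apply_statement(leader, "k", random.Random(1))
apply_statement(follower_stmt, "k", random.Random(2))  # re-executes the statement
apply_row(follower_row, "k", leader["k"])              # copies the resulting row

statement_diverged = follower_stmt["k"] != leader["k"]
row_matches = follower_row["k"] == leader["k"]
```

The follower that re-executes the statement ends up with a different value from the leader, while the follower that receives the resulting row stays identical.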

Write ahead log (WAL) replication

Databases first append every write to the WAL before doing anything else, that is, before writing it to the structured storage from which it will be read. The WAL is used for recovery: if the database crashes, its state is reconstructed from the WAL. A recent slogan has been "The WAL is the database". Replication here involves replicating the WAL.

A disadvantage is that WAL entries contain storage-specific details, such as which byte in which block is to be updated. This can create compatibility issues if the leader and followers are on different versions.

Logical replication

A logical log, on the other hand, captures at the row level how the table was changed. You can view this as an approach somewhere between statement based and WAL replication.

Change data capture is a form of logical replication. It is used to replicate changes in a database to other third party systems. A popular use case is data warehousing where data from multiple sources is aggregated and summarized for analytics.

Unstructured data replication

For unstructured data as in distributed file systems the unit for replication is a block of data. Data is first partitioned into blocks and each block is replicated independently.

Potential issues with replication

Replication Lag

Most of the time replication is asynchronous. The client writes to the leader, and the write returns before any follower acknowledges that it has been replicated. Fully synchronous replication to every follower is usually not viable because of both performance and availability issues: a single slow or failed follower can hold up all writes.

Lost write

One problem asynchronous replication creates is that if you read immediately after a write, the replica you are reading from may not yet have your last write, so the write appears to be lost.

Inconsistent read

If you issue the same read multiple times in quick succession, each read may get a different result depending on which replica services it, as the replicas may be at different stages of replication.

Cassandra addresses this issue using quorums. CockroachDb uses a consensus protocol (Raft).

Write Conflicts

Write conflict is an issue in multi leader replication. It happens when multiple clients update the same data while each is talking to a different leader. The database does not know which update to accept or how they should be merged. This is similar to a merge conflict in git.

One approach to handling conflicts is to store both versions on write and then, on read, send both versions to the client and let the client resolve the conflict.
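A minimal sketch of the "keep both versions" approach follows. The `SiblingStore` class is hypothetical; real systems track causality with version vectors or similar metadata, which this toy replaces with a simple per-leader tag.

```python
class SiblingStore:
    """Toy store that keeps every conflicting version instead of picking a winner."""

    def __init__(self):
        self._data = {}  # key -> list of (leader_id, value)

    def write(self, key, value, leader_id):
        # Replace this leader's own earlier version; keep versions from other leaders.
        versions = [v for v in self._data.get(key, []) if v[0] != leader_id]
        versions.append((leader_id, value))
        self._data[key] = versions

    def read(self, key):
        # Return all surviving versions; the caller decides how to merge.
        return [value for _, value in self._data.get(key, [])]

    def resolve(self, key, merged_value, leader_id):
        # The client writes back a single merged value, which replaces all siblings.
        self._data[key] = [(leader_id, merged_value)]


store = SiblingStore()
store.write("cart", {"milk"}, leader_id="us")
store.write("cart", {"eggs"}, leader_id="eu")

siblings = store.read("cart")        # two conflicting versions
merged = set().union(*siblings)      # client-side merge: union of both carts
store.resolve("cart", merged, leader_id="us")
```

Here the client merges two shopping carts by taking their union. Which merge rule is right depends entirely on the application.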

Replication in real world systems

The product documentation on replication for a database can be quite confusing. It is often best to follow a tutorial or blog on the internet.

Postgres

The documentation and blogs describe it in 2 ways.

You can set it up as synchronous, asynchronous, streaming, log file based etc.

And it can be WAL based or logical replication. Statement based is rarely seen.

The following tutorials have instructions to set it up:

https://kinsta.com/blog/postgresql-replication/

https://www.cherryservers.com/blog/how-to-set-up-postgresql-database-replication

In snapshot replication, a snapshot of the database is taken and replicated to followers.

Instead of streaming, you can also set up the replication as file based, where the WAL files are periodically shipped to followers.

In WAL replication, replication slots let the leader track how much of the WAL has been replicated to each replica. This helps the leader avoid discarding segments that have not yet been replicated. But this consumes resources on the leader. Replication slots need to be managed and deleted when not needed.

Mysql

The traditional way in mysql was a logical replication based on their binlog file - a binary format for logical changes.

The newer way is based on global transaction identifier (GTID) which is built on top of the binlog. It can be either statement based or row based.

The following tutorial shows how to setup replication in mysql

https://www.digitalocean.com/community/tutorials/how-to-set-up-replication-in-mysql

Dynamo / Cassandra

In this architecture, replication is fundamental to the architecture. All you need to do is to set the replication factor to greater than 1. All servers are equal - no leader and no follower. Writes can go to any server. Partitioning is also fundamental to the architecture. The server that receives the write redirects the write to appropriate server. From here it is replicated to other servers based on the replication factor.

Consistency issues are addressed using quorum based tunable consistency. Quorum means that a majority, which is floor(RF/2) + 1 nodes, agree on something. If you have replication factor (RF) 3, quorum is 2. So on a write, at least 2 nodes need to acknowledge that the write was saved. On a read, at least 2 nodes need to agree on the return value. In general, to avoid inconsistencies, you want Read quorum (R) + Write quorum (W) > RF.
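The quorum arithmetic is small enough to write down directly. The function names below are illustrative and are not part of any Cassandra API.

```python
def quorum(rf: int) -> int:
    """Majority of rf replicas: floor(rf / 2) + 1."""
    return rf // 2 + 1


def overlaps(r: int, w: int, rf: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return r + w > rf


rf = 3
q = quorum(rf)                # 2
strong = overlaps(q, q, rf)   # 2 + 2 > 3, so reads see the latest acknowledged write
weak = overlaps(1, 1, rf)     # 1 + 1 > 3 is false, so a read may miss the latest write
```

When R + W > RF, any set of R replicas and any set of W replicas must have at least one node in common, which is why a quorum read can find the most recent quorum write.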

CockroachDb

CockroachDB uses the Raft distributed consensus protocol to ensure that a majority of replicas are in consensus before any change is committed. This is the safest approach to ensure consistency but comes at a cost.

Apache Kafka

In Kafka, messages are sent to and received from topics. Topics are split into partitions. Each partition has one leader and a configurable number of follower replicas. Writes go to the leader, which replicates to the followers. By default reads are also served by the leader; since Kafka 2.4, consumers can be configured to fetch from follower replicas. Each broker is a leader for some partitions but a follower for other partitions. As with Cassandra and CockroachDb, replication is core to the architecture and easy to set up.

Apache Hadoop (HDFS)

This applies to any distributed file system. The file is a sequence of blocks of data. HDFS has a name node and data nodes. Name node maintains a map of which data nodes have the blocks of a file. Each block is replicated to a configurable number of data nodes.
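As a rough sketch of the block map that a name node maintains, the function below splits a file into fixed-size blocks and assigns each block to several data nodes. The round-robin placement is an assumption made for brevity; it is not the placement policy HDFS actually uses, and `place_blocks` is a hypothetical name.

```python
def place_blocks(file_size: int, block_size: int, data_nodes: list, replication: int) -> dict:
    """Return {block index: [data nodes holding a replica]} using round-robin placement."""
    if replication > len(data_nodes):
        raise ValueError("replication factor cannot exceed the number of data nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    block_map = {}
    for b in range(num_blocks):
        # Replica r of block b goes to the node r positions after the block's first node.
        block_map[b] = [data_nodes[(b + r) % len(data_nodes)] for r in range(replication)]
    return block_map


block_map = place_blocks(
    file_size=300, block_size=128, data_nodes=["dn1", "dn2", "dn3", "dn4"], replication=3
)
```

A 300 unit file with 128 unit blocks needs three blocks, and each block lands on three distinct data nodes, so losing any one node leaves two copies of every block.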

Conclusion

Replication is a critical piece of any distributed data system. It has to be part of the core architecture. It cannot come after the fact like it did in the past. While redundancy and HA are well known benefits, there are other benefits such as geo distribution of data as well. It can also cause side effects such as inconsistent reads. Care should be taken to address those. Different products use different strategies. You should be familiar with the replication strategies, configuration and side effects for your data product. If you are building a new system with data, understanding how existing systems replicate and the issues they face can help you design your replication.

How modern distributed systems scale by partitioning ?

2024-07-03T17:12:00.000-07:00

1.0 Introduction

In the last 20 years, software systems moved to the internet and came to handle large volumes of data and millions of requests. Most people interact with these systems using a browser or a mobile device. At the back end, there is not one powerful computer but generally a network of commodity computers. Both the processing and storage of data are spread across multiple computers. In this blog we discuss how large datasets can be stored using multiple commodity computers.

Partitioning is the process of breaking up a large dataset into parts so that each part can fit easily on the disk of one node and be efficiently managed by that node. For very large data sets that cannot fit on 1 machine, data needs to be broken up into parts (partitions or shards). Each partition is stored on a different machine. This is just natural horizontal scaling. But most important is that, when it is time to read the partitioned data, we need to be able to find (efficiently) which partition and node has the data we want to read.

Storage space is not the only benefit of partitioning. You are also spreading the compute required to read, write and process the data.

Partitioning is generally combined with replication to make the partitions highly available. But we do not discuss replication here. That is a topic for another blog.

2.0 Types of partitioning

There are 2 types of data that need to be considered: unstructured and structured.

Most discussions on partitioning discuss partitioning of data in databases ( structured data ) but not unstructured data which is outside databases in plain files. This blog discusses both unstructured data and structured data.

2.1 Structured data

The problem is more interesting for databases because it is not enough to break up the dataset into smaller parts. During reads you need to be able to find the data. And you need to do it fast. When the database receives a query - "Give me records for Customer X", How does it know which node hosts the data ? Does the database have to send the request to all the nodes ? That would be quite inefficient.

The goal is thus to partition data and query it efficiently. Another goal is to ensure that the distribution of data between partitions is even. You do not want a situation where partition 1 has 70% of the data and the other 5 partitions have the remaining 30%. This will overload partition 1 (known as a hot spot) and you lose the benefits of partitioning.

2 strategies are commonly used for database partitioning.

2.1.1 Range based partitioning

Database records are generally stored sorted based on the primary key.

Initially there is one partition with zero records.

As clients write to the database, the size of the partition increases. When it reaches a certain size say 10MB or 64MB, it is split into two partitions.

Each partition may be assigned to a different node.

This process is repeated as more data is added and partitions grow. If data is deleted and partitions shrink, then small partitions can be merged.

To efficiently query data, the database needs to do some book keeping:

-- which key range is in which partition

-- which partition is at which node

Starting with 1 partition and 1 node is not efficient, because all early load lands on a single node. Databases typically start with a configured number of partitions or a number of partitions proportional to the number of nodes.

To balance the load on nodes, partitions may need to be moved between nodes.

This is the strategy used by HBase, CockroachDb, MongoDb.
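The split-and-lookup behaviour described above can be sketched in a few lines. This toy splits on key count rather than bytes and leaves out the partition-to-node map and merging; the class name `RangePartitions` is hypothetical.

```python
import bisect


class RangePartitions:
    """Toy range partitioning: sorted keys, split when a partition gets too big."""

    def __init__(self, max_keys: int = 4):
        self.max_keys = max_keys
        self.lower_bounds = [""]  # lower_bounds[i] is the smallest key partition i may hold
        self.partitions = [[]]    # partitions[i] is a sorted list of keys

    def _index(self, key: str) -> int:
        # The last partition whose lower bound is <= key.
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def insert(self, key: str) -> None:
        i = self._index(key)
        part = self.partitions[i]
        bisect.insort(part, key)
        if len(part) > self.max_keys:
            # Split the oversized partition in the middle into two partitions.
            mid = len(part) // 2
            left, right = part[:mid], part[mid:]
            self.partitions[i] = left
            self.partitions.insert(i + 1, right)
            self.lower_bounds.insert(i + 1, right[0])

    def lookup(self, key: str) -> int:
        """Return the number of the partition that owns the key."""
        return self._index(key)


rp = RangePartitions(max_keys=4)
for k in ["apple", "banana", "cherry", "date", "elder", "fig", "grape"]:
    rp.insert(k)
```

The list of lower bounds is the "which key range is in which partition" book keeping. A binary search over it finds the owning partition without asking every node.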

2.1.2 Hash based partitioning

The hash value calculated from the key is used to determine the location where the record can be stored.

The wrong way to determine the node is by using hash mod n, where n is the number of nodes. The problem with this approach is that when nodes are added to or removed from the cluster, a very high percentage of the keys need to be moved.
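A short experiment makes the cost concrete. The sketch below assigns keys with hash mod n and counts how many change nodes when a cluster grows from 4 to 5 nodes. The key names and the choice of SHA-256 as the hash are arbitrary assumptions.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() of strings is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def node_for(key: str, num_nodes: int) -> int:
    return stable_hash(key) % num_nodes


keys = [f"user-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if node_for(k, 4) != node_for(k, 5))
fraction_moved = moved / len(keys)
```

A key stays put only when its hash gives the same remainder mod 4 and mod 5, which happens for 4 out of every 20 hash values. So in expectation about four out of five keys move, even though only one node was added.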

A better approach is to start with a fixed number of partitions, way more than the number of nodes the cluster will ever have, say for example 1000 or 10000. Partitions are logical. Hash ranges are assigned to partitions. Partitions are assigned to nodes either using partitionNumber mod numNodes or other algorithms. This is shown in the top half of Figure 4. The bottom half of Figure 4 visualizes the same thing as a ring, as is done in many articles that refer to this as consistent hashing. Think of partitions being placed on the ring. Each partition owns the key space from the position of the previous partition to its own position.

The cluster needs to maintain a mapping of partitions to nodes. When a new node is added, the cluster can take a few partitions from existing nodes and assign them to the new node. When a node is removed, the cluster assigns its partitions to other nodes. Looking up a key involves an extra level of indirection. The hash of the key maps to a partition. The partition node map tells you which node has the partition that has the key. Many studies have shown that this leads to less movement of keys between nodes as the cluster changes.

In the popular press, this goes by the poorly understood name "consistent hashing". It is just hash based partitioning. The name has nothing to do with data consistency.
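The fixed-partition scheme can be sketched as follows. The partition count of 1000 comes from the text; the rebalancing rule in `add_node` (take partitions from the most loaded nodes until the new node has a fair share) is one possible choice, not a description of any particular database.

```python
import hashlib

NUM_PARTITIONS = 1000  # fixed up front, far more than the expected node count


def partition_for(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def initial_assignment(nodes: list) -> dict:
    """Partition p goes to node p mod len(nodes)."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def add_node(assignment: dict, new_node: str) -> dict:
    """Give the new node its fair share by taking partitions from the most loaded nodes."""
    updated = dict(assignment)
    owned = {}
    for p, n in updated.items():
        owned.setdefault(n, []).append(p)
    owned[new_node] = []
    target = NUM_PARTITIONS // len(owned)
    while len(owned[new_node]) < target:
        donor = max((n for n in owned if n != new_node), key=lambda n: len(owned[n]))
        p = owned[donor].pop()
        owned[new_node].append(p)
        updated[p] = new_node
    return updated


before = initial_assignment(["n1", "n2", "n3", "n4"])
after = add_node(before, "n5")
partitions_moved = sum(1 for p in before if before[p] != after[p])
node = after[partition_for("customer-42")]  # key -> partition -> node
```

Going from 4 to 5 nodes moves 200 of the 1000 partitions, which is only the new node's fair share. The key-to-partition mapping never changes; only the partition-to-node map does.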

2.1.3 Secondary Indexes

So far we have been talking about partitioning by the primary key, also known as the primary index.

To speed up retrieval of records, databases also have secondary indexes which can be very large and might need to be partitioned.

One approach is to keep the secondary indexes local to the node that holds the primary index partition. The advantage of this approach is that since all related rows are on the same node, inserts/updates/deletes are all local. But a query on the secondary index requires sending the query to all nodes and aggregating the responses.

Another approach is to create a global secondary index and partition it as an independent entity. However, since the secondary index partitions might be on a different node from the primary partition, create/update/delete (CUD) operations are more expensive. Transactions might be distributed. On the other hand, range queries on secondary indexes are more efficient since records that are close in sort order are on the same partition.

2.2 Unstructured data

Unstructured data refers to ordinary files that contain text or binary data. Of course we are talking about large files, or many large files. This is the use case for a distributed file system such as HDFS (Hadoop Distributed File System) or GFS (Google File System). Logically, the implementation of a distributed file system is similar to, say, a linux filesystem. You view the file system as a list of blocks of fixed size. On a single node linux file system, all the disk blocks are on one node. In a distributed file system, the blocks are spread across multiple nodes. In HDFS, the name node maintains the metadata for the distributed filesystem: given a file, which blocks make up the file and which nodes have the blocks. To create a file, the name node assigns a block on a particular node, and the client talks directly to a service called the data node running on the target node to write to the block. To read a block, the name node directs the client to the data node that hosts the block and the client reads directly from that data node. The basic algorithm is simple: break up the file data into blocks and spread them across nodes.

Another example is the partitioning of topic logs in Apache Kafka. Kafka is a messaging system (they like to call it event streaming) where producers write messages to a topic and consumers read messages from the topic. The storage for a topic is a log file. New messages are appended to the end of the log file. They are read from the front. Obviously the logs can grow beyond what can fit on a node. So the log is broken into partitions that are distributed across multiple nodes. The broker serves producers and consumers. A Kafka cluster has multiple brokers, with each broker running on a separate node and managing multiple partitions.

3.0 Rebalancing

Rebalancing is the process of moving partitions between nodes to make the distribution of load even across all the nodes. This is necessary when nodes join or leave the cluster or if the cluster starts receiving more data for certain keys. Either way, rebalancing is an expensive operation that needs a lot of CPU, memory and network bandwidth. It can have an impact on the performance of regular CRUD processing. In an ideal world, we would like rebalancing to happen automatically behind the scenes, without end users knowing about it. But for the performance reasons listed above, that rarely works well in practice. Some databases require an admin to manually start a rebalance, which can be done during a period of low load and monitored.

4.0 Routing

How does a client know which partition to connect to? The short answer is that the database has to maintain a mapping of partition to node. In the case of hash based partitioning, the hash maps to a partition which maps to a node. In the case of range based partitioning, the key maps to a key range which maps to a partition which maps to a node. The partition node map is available to all nodes. If a client can connect to any node and that node does not have the partition needed to handle the client's request, the node can either redirect the client to the appropriate node or get the data from the target node and return it to the client.
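The second routing option, where the contacted node fetches the data on the client's behalf, can be sketched as below. The `Node` class and the in-memory "cluster" dictionary are hypothetical stand-ins for real servers and network calls.

```python
class Node:
    """Toy node that holds some partitions and knows the cluster-wide partition map."""

    def __init__(self, name, partition_map, cluster):
        self.name = name
        self.partition_map = partition_map  # partition id -> node name
        self.cluster = cluster              # node name -> Node (stands in for the network)
        self.store = {}                     # partition id -> {key: value}

    def get(self, partition, key):
        owner = self.partition_map[partition]
        if owner == self.name:
            return self.store.get(partition, {}).get(key)
        # Not ours: fetch from the owning node on the client's behalf.
        return self.cluster[owner].get(partition, key)


cluster = {}
partition_map = {0: "a", 1: "b"}
a = Node("a", partition_map, cluster)
b = Node("b", partition_map, cluster)
cluster.update({"a": a, "b": b})

b.store[1] = {"k": "v"}
value = a.get(1, "k")  # the client asked node a, which forwards to node b
```

The client only needs to know one node. The price is an extra network hop whenever the contacted node is not the owner.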

5.0 Use case

There are cases where you might not have a ready made database doing server side partitioning for you and you might need to do it yourself. Or even when the database does it for you, you still need to pick the right partition key for the partitioning to be optimal. Let us look at some large datasets and discuss how they might be logically partitioned.

Let us say you are building a twitter like system.

Say 100 million tweets of 140 characters per day

100M * (280 bytes + 20 bytes for id, timestamp), assuming 2 bytes per character

30 GB / day

About 10 TB / year

Need to store 5 years of data

Need to store and query about 50 TB
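The back-of-the-envelope estimate can be checked with a few lines of arithmetic. The inputs are the ones stated above; decimal units (1 GB = 10^9 bytes) and a 365 day year are assumptions.

```python
TWEETS_PER_DAY = 100_000_000
BYTES_PER_TWEET = 280 + 20  # text plus id and timestamp, as in the estimate above
YEARS = 5

bytes_per_day = TWEETS_PER_DAY * BYTES_PER_TWEET
gb_per_day = bytes_per_day / 1e9      # decimal gigabytes
tb_per_year = gb_per_day * 365 / 1e3  # decimal terabytes
tb_total = tb_per_year * YEARS
```

The exact figures come out slightly above the rounded ones: just under 11 TB per year and just under 55 TB over five years. That does not change the conclusion that the data needs to be spread over a few dozen nodes.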

How does twitter work ?

Users follow other users.

When a user connects, we need to show the most recent tweets from the users he follows.

So we need to store about 50 TB of tweets. Given a user, we need to query say the 50 most recent tweets from the users he follows.

Using commodity hardware, 50TB would need say 25 nodes. What key would you use to partition the data ?

Option 1 : hash based partitioning based on user.

To store tweets, a hash of the user is used to locate the node where the tweet is stored. To query, for each user that the user follows, use the hash to query the node for that user's tweets. A problem with this approach is that some users tweet way more than other users. Their nodes are going to be overloaded while others are idle. The load is unbalanced.

Option 2 : hash based on randomly generated tweet id

The problem with this approach is that for every query, you have to query every server and aggregate the results. This is inefficient for queries.

Option 3: hash based on timestamp

Timestamp is relevant because for each feed request we want the latest tweets. It would be good if tweets were sorted by timestamp. However, with such a hash, at any given point in time one server is overloaded, as all the writes for the current timestamp are going to that server.

Option 4:

Given the choices, an inefficient query (option 2) is more tolerable than unbalanced load (option 1), which could crash some of the servers and make the system unavailable. But we also want queries to return the most recent data (sorted by timestamp). So we can improve querying a little bit by combining options 2 and 3. Assume a timestamp in epoch time in increments of 1 sec; the tweet id could be timestamp + auto incremented seq. The sequence gives randomness to the tweet id and will give uniform distribution across nodes.

So given a epoch 1692547708 you will have tweet ids like

1692547708 1

1692547708 2

1692547708 3

1692547708 n
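A single-process sketch of this id scheme is shown below. It sidesteps the distributed uniqueness problem entirely, ignores clocks that move backwards, and the `node_for` rule that spreads ids across nodes is just one possible assumption.

```python
import itertools
import time


class TweetIdGenerator:
    """Single-process sketch: id = (epoch second, per-second sequence)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._second = None
        self._seq = itertools.count(1)

    def next_id(self):
        now = int(self._clock())
        if now != self._second:
            # New second: restart the sequence at 1.
            self._second = now
            self._seq = itertools.count(1)
        return (now, next(self._seq))


def node_for(tweet_id, num_nodes):
    # Use the sequence as well as the second, so that one second's
    # tweets do not all land on the same node.
    second, seq = tweet_id
    return (second + seq) % num_nodes


gen = TweetIdGenerator(clock=lambda: 1692547708)  # fixed clock for a repeatable example
ids = [gen.next_id() for _ in range(3)]
nodes = [node_for(i, 25) for i in ids]
```

Ids generated within the same second share the timestamp prefix, so they still sort by time, while the sequence part sends consecutive tweets to different nodes.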

Of course you are wondering how to generate unique tweet ids in a distributed system. That is a topic for another blog.

6.0 Summary

Partitioning data and spreading it across nodes is fundamental to distributed systems. Special thought needs to be given to how the partitioned data can be queried efficiently. Hash based partitioning and range based partitioning are two popular strategies. Nodes can fail, or additional nodes may need to be added to scale. To ensure that load is even across nodes, partitions may be moved between nodes in a process called rebalancing. For best results, design your partition keys so that load is distributed evenly and querying is efficient.

7.0 Related Content

CockroachDB Review

CockroachDb Review: Should I use CockroachDb ?

2024-06-30T16:26:00.000-07:00

Overview

CockroachDb is a modern distributed database that promises linear scalability with strict serializability.

Server side sharding is automatic. Nodes can be added easily as needed. Claims to provide the SERIALIZABLE isolation level.

Most distributed databases such as Cassandra, MongoDb, HBase etc sacrifice consistency to achieve high availability. CockroachDb distinguishes itself by claiming to be distributed and at the same time offer strong consistency that even single node databases do not offer.

This falls into a database category called NewSql or DistributedSQL as opposed to NoSql (Cassandra, MongoDb)

When to choose CockroachDb ?

You should choose CockroachDb if

Your data is of a global scale.

As data size increases, you need to scale horizontally to several nodes.

You need data to be distributed and localized in specific geographical regions. For example EU data resides in Europe while US data resides in US.

You need strong consistency. Serializable isolation level.

You need to keep the SQL / relational data model.

You need distributed transactions.

You may want to pass on it if

Your data size can easily fit on a node for the foreseeable future.

Your organization is more comfortable with a stable proven database. (CockroachDb is still maturing).

Your data model is heavily normalized and you do a lot of joins in your queries. While this database can support joins, they are still not recommended in a highly distributed environment.

Architecture

Architecture is based on Google's Spanner paper.

It is a key value store with a SQL interface on top of it.

Database is a cluster of nodes. All nodes are equal. Nodes may join and leave the cluster at any time.

Sorted map of key values pairs. Fully ordered monolithic key space. All tables/indexes go into the same key space by encoding tablename/indexname/key together.
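The idea of a single ordered key space can be sketched with a sorted list of encoded keys. The `/table/index/key` string encoding below is a simplified assumption for illustration and is not CockroachDb's actual key format.

```python
import bisect


def encode(table: str, index: str, key: str) -> str:
    """Hypothetical encoding: table, index and key joined into one sortable string."""
    return f"/{table}/{index}/{key}"


keyspace = []  # kept sorted; stands in for the single ordered key space

for k in [
    encode("orders", "primary", "0002"),
    encode("customers", "primary", "0001"),
    encode("orders", "primary", "0001"),
    encode("customers", "primary", "0002"),
]:
    bisect.insort(keyspace, k)


def scan(prefix: str) -> list:
    """Range scan: all keys that start with the prefix are contiguous."""
    lo = bisect.bisect_left(keyspace, prefix)
    hi = bisect.bisect_left(keyspace, prefix + "\xff")
    return keyspace[lo:hi]


orders = scan("/orders/primary/")
```

Because the table and index names are a prefix of every key, all rows of one table sit next to each other in the sorted order, and a table or index scan becomes a scan over one contiguous key range.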

Sharding

Key value pairs are broken up into contiguous ranges.

When a range reaches 512 MiB (2^29 bytes), it is split into 2 ranges.

Each range is assigned to a node and replicated.

If you have 1 node all the shards are in that node. To scale, you add more nodes and the shards get distributed across nodes. A minimum of 3 nodes is recommended.

Very easily spin up node(s) and add to cluster anywhere.

Btree like index structure used to locate shard that has a key.

Replication

Data in each range is replicated using the Raft consensus algorithm.

A minimum replication factor of 3 is needed.

This provides high availability. Data in a range remains available as long as a majority of that range's replicas are available.

Geo-partitioning

By adding a country or region to the primary key, you can limit storage of keys to a particular region. So European data can be made to reside in Europe, US data in the US and so on. This has 2 benefits:
There is a performance benefit, since data is local to its users.

It can satisfy legal requirements where data is not allowed to leave a country or region.

Read/Write

Reads

Any node can receive a request to read a key/value.

Request is forwarded to the node that is the raft leader for that table/range.

Leader returns the data to the node that requested it. Since leader returns the data, no consensus is required.

Node returns it to the client.

Writes

Any node can receive a request to write a key/value.

Request is forwarded to the node that is the raft leader for that table/range.

Leader writes the value to its log and initiates consensus with replicas for the range. When a majority acknowledges, the key/value is considered committed and the leader notifies the requesting node, which notifies the client.

Transactions

Supports transactions that spans multiple tables and rows.

Transactions can be distributed, that is span multiple nodes.

The supported isolation level is strict serializability which is the highest isolation level. Strict serializability means that not only are transactions ordered, but they are ordered as per wall clock time.
Transaction protocol is an improvement over two phase commit. In parallel, participants acquire locks and create write intents. The transaction is marked staged. When the client commits, if all locks are acquired and writes are replicated, the coordinator immediately returns success to client. In background the transaction is marked committed. This is one round trip between transaction coordinator and each participant - unlike two phase commit - which requires two round trips.

Hybrid logical clocks are used to timestamp each transaction. Timestamp is the version for MVCC.

Data Model

Clients see only the SQL row column relation model
Wire protocol is same as Postgresql wire protocol.

Performance

Efficient range scan.
Geo partitioning improves performance by locality.
Distributed SQL execution.
Distributed transactions will be slow.
Generally you do not want distributed transactions over large distances. If you build a 3 node CockroachDb cluster with 1 node in NewYork, 1 in London and 1 in San Francisco, the write latencies are going to be very high due to the round trips for RAFT and distributed transactions. The cluster topology needs to be designed appropriately to give you the lowest latency at the desired level of high availability.

Administration

Good command line tools and UI console make the administration easy.
Since all nodes are equals, number of moving parts that need to be administered is low.

Summary

If you need a globally distributed database with strict serializability, this is definitely a database to look at. It has good pedigree. However, remember that distributed databases are not a drop in replacement for your traditional RDBMSs. Distributed queries, especially joins, and distributed transactions can be slow. So some application redesign and some denormalization is usually required.

Note: Moved from heavydutysoftware.com

Quick Review: Mysql NDB cluster

2022-02-28T16:04:00.000-08:00

This is a quick 2 min overview of Mysql NDB Cluster. The goal is to help you decide within a minute or two, whether this is an appropriate solution for you.

Cluster of in-memory Mysql databases with a shared nothing architecture.

Consists of Mysql nodes and data nodes.

Mysql nodes are Mysql servers that get data from data nodes. Data nodes hold the data using the NDB storage engine. There are also admin nodes.

NDB nodes serve the data from memory. Data is persisted at checkpoints.

Data is partitioned and replicated.

Up to 48 data nodes and 2 replicas for each fragment of data.

ACID compliant.

READ_COMMITTED isolation level.

Sharding of data is done automatically. No involvement of user or application is required.

Data is replicated for high availability. Node failures are handled automatically.

Clients can access data using NDB Api. Both SQL and NOSQL styles are possible.

This is not a good general purpose database. It is suitable for certain specific use cases in telecom and gaming but not for general OLTP.

Feels like it has too many moving parts to manage.

High performance -- it is serving data from memory.

Summary

Not a general purpose distributed database. Unless you are in telecom or gaming or know for sure why this meets your use case, don't even think about it.

If you are on Mysql and want high availability, try Mysql InnoDb Cluster, which is much easier to understand and use.

Quick Review: Mysql InnoDb Cluster

2022-02-14T16:27:00.001-08:00

This is a quick 2 min overview of Mysql InnoDb Cluster. The goal is to help you decide within a minute or two, whether this is an appropriate solution for you.

Simple HA solution for Mysql.

Built on top of MySql group replication.

It has 3 Components:

Replication: Uses MySql group replication. The default is a single primary configuration. Writes go to the primary, which replicates to the secondaries. Secondaries can service reads.

Mysql router: Provides routing between your application and the cluster. Supports automatic failover. If the primary dies, the router will redirect writes to the secondary that takes over.

Mysql shell: This is an advanced shell that lets you code and configure the cluster.

Works best over a local area network. Performance degrades over wide area networks.

Easy to setup. Simple commands that are entered on the mysql shell.

var cluster = dba.createCluster('testCluster')

cluster.addInstance('server1@host1:3306')

cluster.addInstance('server2@host2:3306')

cluster.addInstance('server3@host3:3306')

Cluster elects the primary. If you want a particular server to be the primary, you can give it extra weight.

Clients do not connect directly to the servers. Rather, they connect to the Mysql router, which provides the routing as well as failover.

MySql InnoDB ClusterSet provides additional resiliency by replicating data from a primary cluster to a cluster in another datacenter or location. If the primary cluster becomes unavailable, one of the secondary clusters can become the primary.

Summary

Provides scalability for reads and some HA for Mysql deployments. Simple, easy to use solution. No sharding. There will be some consistency issues when you read from replicas that lag a little bit.

References:

https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-innodb-cluster.html

Building Globally Distributed Applications

2020-11-01T06:26:00.000-08:00

A globally distributed application is one where the services and data for the application are partitioned and replicated across multiple regions over the globe. Popular distributed applications that everyone is familiar with are Facebook, Amazon.com, Gmail, Twitter and Instagram. However, more and more enterprise applications are finding the need to become distributed because their user base is increasingly distributed around the globe. But not every company has the expertise of a Facebook or Amazon or Google. When going distributed, it is not enough to just spin up instances of your service on AWS or Google Cloud in various regions. There are issues related to data that must be addressed for the application to work correctly. While consumer centric social media applications can tolerate some correctness issues or lags in data, the same might not be true for enterprise applications. This blog discusses the data and database issues related to a globally distributed application. Lastly, we discuss 2 research papers that have been around since the early part of this decade, but whose relevance has been increasing in recent times.

Building globally distributed applications that are scalable, highly available and consistent can be challenging. Sharding has to be managed by the application. Keeping it highly available requires non database tools. When you have been on a single node database, whether it is Mysql or Postgresql, it is tempting to scale by manual sharding or by using one of the clustering solutions available for those databases. It might appear easy at the beginning, but the cost of managing the system grows steeply with scale. Additionally, sharding and replication lead to consistency issues and bugs that need to be addressed. Scaling with single node databases like Mysql beyond a certain point has extremely high operational overhead.

NoSql databases such as Cassandra, Riak, MongoDB etc offer scalability and high availability but at the expense of data consistency. That might be ok for some social media or consumer applications where the dollar value of individual transaction is very small. But not in enterprise applications where the correctness of each transaction is worth several thousands of dollars. In enterprise applications, we need distributed data to behave the same way that we are used to with single node databases.

Let us look at some common correctness issues that crop up with distributed data.

Example 1: A distributed online store with servers in San Francisco, New York and Paris.

Each server has 2 tables, products and inventory, with the following data.
Products (product):
widget1
widget2
Inventory (product, count):
widget1, 6
widget2, 1

Customer Jose connects to a server in San Francisco and buys widget2 at time t1. At time t2, customer Pierre connects to a server in Paris and also buys widget2. Assume t2 > t1 but t2-t1 is small.

Expected Behavior : Jose successfully completes transaction and gets the product. Since inventory of widget2 is now zero, Pierre’s transaction is aborted.
Actual Behavior (in an eventually consistent system): Both transactions complete. But only one of the customers gets the product. The other customer is later sent an apologetic email that widget2 is out of stock.
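The oversell in Example 1 can be reproduced with a small simulation. This is a toy Python sketch, not how any particular database replicates; the Replica class and sync function are hypothetical, and replication is modeled as a batch exchange that happens only after both purchases.

```python
class Replica:
    """One region's local copy of the inventory table (illustrative only)."""

    def __init__(self, name, inventory):
        self.name = name
        self.inventory = dict(inventory)
        self.pending = []   # decrements not yet shipped to the other replicas

    def buy(self, product):
        """Accept the order if the LOCAL copy shows stock."""
        if self.inventory.get(product, 0) <= 0:
            return False
        self.inventory[product] -= 1
        self.pending.append(product)
        return True


def sync(replicas):
    """Asynchronously exchange pending decrements between replicas."""
    shipped = [(r, list(r.pending)) for r in replicas]
    for source, updates in shipped:
        for target in replicas:
            if target is not source:
                for product in updates:
                    target.inventory[product] -= 1
        source.pending.clear()


stock = {"widget1": 6, "widget2": 1}
san_francisco = Replica("San Francisco", stock)
paris = Replica("Paris", stock)

jose_ok = san_francisco.buy("widget2")   # time t1
pierre_ok = paris.buy("widget2")         # time t2, before replication catches up
sync([san_francisco, paris])
```

Both purchases succeed because each server checked only its own copy. After the replicas exchange updates, the count for widget2 is -1 on both, which is the point at which the apologetic email gets sent.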

Example 2: A distributed document sharing system with servers in New York, London, Tokyo

Operation1: In London, User X creates a new empty document marked private.
Operation2: User X makes update 1 to document.
Operation3: User X deletes update 1.
Operation4: User X makes update 2.
Operation5: User X changes the document from private to public.
Due to network issues, only operations 1, 2 and 5 reach Tokyo. Operations 3 and 4 do not.
In Tokyo, User Y tries to read the shared document.

Expected behavior: The document status is private and Y cannot read the document.
Actual behavior: Y is able to read the document but an incorrect version. The document has update1 which is deleted and is missing update2 which needs to be there.

The problems above are known as consistency issues. Different clients are seeing different views of the data. What is the correct view ?

Consistency here refers to C in the CAP theorem, not the C in ACID. Here Consistency means every thread in a concurrent application correctly reads the most recent write at that point in time.

How do you fix the above issues? In a single node database, Example 1 can be fixed by locking the row in the inventory table during the update, and Example 2 is not even an issue because all the data is in one node. But in a distributed application, data might be split across shards and the shards replicated for high availability. Users of the system might connect to any shard/server and read/write data. With NoSql databases, the application has to handle any inconsistencies.

In traditional RDBMSs, database developers are given a knob called isolation level to control what concurrent threads can read. In this old blog I explain what isolation levels are. The safest isolation level is SERIALIZABLE, where the database behaves as if the transactions were executing in a serial order with no overlap, even though in reality they are executing concurrently. Most developers use the default isolation level, which is generally READ_COMMITTED or REPEATABLE_READ. In reality, these isolation levels are poorly documented and implemented differently by different vendors. The result is that in highly concurrent applications, there are consistency bugs even in traditional single node RDBMSs. In a distributed database with data spread across shards and replicated for read scalability, the problem is compounded further. Most NoSql vendors punt on the problem by claiming eventual consistency, meaning that if there are no writes for a while, eventually all reads on all nodes will read the last write.

Consistency is often confused with isolation, which describes how the database behaves under concurrent execution of transactions. At the safest isolation level, the database behaves as if the transactions were executing in serial order, even though in reality they are executing concurrently. At the safest consistency level, every thread in a concurrent application correctly reads the most recent write. But most database documentation is not clear on how to achieve this in an application.

The problems in examples 1 and 2 would not occur if those applications/databases had the notion of a global transaction order with respect to real time. In example 1, Pierre’s transaction at t2 should see the inventory as 0 because a transaction at t1 < t2 set it to zero. In example 2, Y should only be able to read up to operation 2. Y should not be able to read operation 5 without operations 3 and 4, which occurred before 5.
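For Example 2, the requirement that a replica never exposes operation 5 before operations 3 and 4 can be sketched as follows. This is a minimal Python illustration that assumes every operation carries a global sequence number; the OrderedReplica class is hypothetical and ignores how that sequence number is produced.

```python
class OrderedReplica:
    """Applies operations strictly in sequence order, buffering any that arrive early."""

    def __init__(self):
        self.applied = []     # operations visible to readers
        self.buffered = {}    # seq -> operation still waiting for its predecessors
        self.next_seq = 1

    def receive(self, seq, op):
        self.buffered[seq] = op
        # Apply for as long as the next expected operation is available.
        while self.next_seq in self.buffered:
            self.applied.append(self.buffered.pop(self.next_seq))
            self.next_seq += 1


tokyo = OrderedReplica()
for seq, op in [(1, "create private"), (2, "update 1"), (5, "make public")]:
    tokyo.receive(seq, op)
visible_before = list(tokyo.applied)

# The delayed operations finally arrive.
tokyo.receive(3, "delete update 1")
tokyo.receive(4, "update 2")
visible_after = list(tokyo.applied)
```

Until operations 3 and 4 arrive, a reader in Tokyo sees only operations 1 and 2, so the document is still private. Once the gap is filled, all five operations become visible in their original order.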

In database literature, the term for this requirement is “strict serializability” or sometimes “external consistency”. Since these technical definitions can be confusing, it is often referred to as strong consistency.

2 research papers that have been around for a while provide answers on how these problems might be fixed. The papers are the Spanner paper and the Calvin paper.

Their approach to solving the problem can be summarized as follows:
1. timestamp transactions with something that reflect their occurrence in real time
2. Order transactions based on timestamp
3. Commit transactions in the above order.

But the details of how they do it are significantly different. Let us look at how they do it.

Spanner paper from Google

Spanner is a database built at Google, and the paper describes the motivation and design of Spanner. Spanner's approach involves
1. The use of atomic clocks and GPS to synchronize clocks across hosts in different regions, and the TrueTime API to give accurate time across nodes, regions or continents.
2. For a read/write transaction, Spanner calls the TrueTime API to get a timestamp. To address overlaps between transactions that are close to each other, the timestamp is assigned after locks are acquired and before they are released.
3. The commit order equals timestamp order.
4. A read for a particular timestamp is sent to any shard/replica that has the data at that timestamp.
5. Reads without a timestamp (latest reads) are serviced by assigning a timestamp.
6. Writes that cross multiple shards use two phase commit.
And of course,
7. It can scale horizontally to 1000s of nodes by sharding.
8. Each shard is replicated.
And most importantly,
9. Even though it is a key value store, it provides SQL support to make it easy for application programmers.
CockroachDb and Yugabyte are 2 commercial databases based on Spanner.
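How a timestamp taken while locks are held can line up with real time is easier to see with a toy model. The Spanner paper describes a "commit wait" rule for this, which is not spelled out in the list above. The Python sketch below simulates it under stated assumptions: SimulatedTrueTime is a made-up stand-in for a TrueTime style API that returns an interval guaranteed to contain the true time, epsilon is an assumed uncertainty bound, and time is advanced manually instead of sleeping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Toy stand-in for a TrueTime style API (hypothetical, for illustration)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon   # assumed clock uncertainty bound
        self.true_time = 0.0     # advanced manually in this simulation

    def now(self):
        return TTInterval(self.true_time - self.epsilon, self.true_time + self.epsilon)

    def advance(self, dt):
        self.true_time += dt


def commit(tt, step=1.0):
    """Pick a commit timestamp, then wait until it is guaranteed to be in the past."""
    # Locks are assumed to be held at this point.
    commit_ts = tt.now().latest
    waited = 0.0
    while tt.now().earliest <= commit_ts:   # commit wait
        tt.advance(step)
        waited += step
    # Locks would be released here, after the wait.
    return commit_ts, waited


tt = SimulatedTrueTime(epsilon=4.0)
ts1, waited1 = commit(tt)
ts2, waited2 = commit(tt)
```

Because the first transaction does not finish until its timestamp is certainly in the past, any transaction that starts afterwards gets a strictly larger timestamp. The price is a wait of roughly twice the clock uncertainty on each commit, which is why tight clock synchronization matters so much in this design.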

Calvin Paper

The Calvin paper addresses the above problem using distributed consensus protocols like Raft or Paxos.
1. Every transaction has to first go through distributed consensus and secure a spot in a linear replication log.
2. One can view the index in the log as the timestamp.
3. The committed entries in the replication log are then executed in the exact same serial order by every node in the distributed database.
4. Since the transaction log is replicated to every shard, it does not need or use two phase commit. In a transaction involving multiple shards, if a shard dies before committing a particular transaction, then on restart it just has to execute the uncommitted transaction from its replication log.
5. No dependency on wall clocks or time API.
6. No two phase commit.
7. No mention of SQL support.

FaunaDb is an example of a database based on Calvin.
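The Calvin style of ordering first and executing later can be sketched in a few lines. This Python example is an illustration only: ReplicatedLog stands in for the log that Raft or Paxos would agree on (no consensus is actually run here), and the Node class and transaction format are hypothetical. It replays Example 1 through such a log.

```python
class ReplicatedLog:
    """Stand-in for the log that a consensus protocol would agree on."""

    def __init__(self):
        self.entries = []

    def append(self, txn):
        self.entries.append(txn)
        return len(self.entries) - 1   # the log index doubles as the timestamp


class Node:
    def __init__(self, inventory):
        self.inventory = dict(inventory)
        self.applied_upto = 0          # index of the next log entry to execute
        self.results = {}

    def catch_up(self, log):
        """Execute committed entries in log order; safe to call again after a restart."""
        for index in range(self.applied_upto, len(log.entries)):
            customer, product = log.entries[index]
            if self.inventory[product] > 0:
                self.inventory[product] -= 1
                self.results[customer] = "committed"
            else:
                self.results[customer] = "aborted"
            self.applied_upto = index + 1


log = ReplicatedLog()
log.append(("Jose", "widget2"))
log.append(("Pierre", "widget2"))

node_sf = Node({"widget2": 1})
node_paris = Node({"widget2": 1})
node_sf.catch_up(log)
node_paris.catch_up(log)
```

Every node executes the same entries in the same order with deterministic logic, so both nodes commit Jose's purchase and abort Pierre's without exchanging any further messages.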

This class of databases that offer horizontal scalability on a global scale without sacrificing consistency is also called NewSql.

In summary, if you are building a globally distributed application that needs strong consistency, doing it on your own with a SQL or NoSql database can be non trivial. Consistency is hard enough in a single node database. But on a distributed database, consistency bugs are harder to troubleshoot and even harder to fix. You might want to consider one of the NewSql databases to make life easier. Review the Spanner and Calvin papers to understand the architectural choices that are available. This will help you pick a database that is right for you. The Spanner and Calvin papers have been around for almost a decade. But they have become more relevant now as real databases based on them become more popular. Most importantly, understand what consistency is and apply it, because the lack of it can cause severe correctness bugs in your application.

References:

The Spanner paper

The Calvin paper

Consistency and Isolation

A Microservices Introduction

2019-01-20T18:31:00.000-08:00

Modern distributed applications are built as a suite of microservices. In this blog we discuss the characteristics of microservices. We will also compare microservices to its predecessors like SOA and monolithic applications. We point out the benefits and downsides of a microservices architecture.

1.0 Introduction

Let us start with a little bit of history and go back to late 90s or early 2000's. Web applications were monolithic. A single web container would serve the entire application. Even worse, a single web container would serve multiple applications. Not only was this not scalable, it was a development and maintenance nightmare. A single bug could bring multiple applications down. And there was an ownership issue. You had multiple teams/developers contributing code. When there was a bug, the ownership was not clear and bugs would bounce around among developers.

Around the mid 2000s the new buzz word was service oriented architecture (SOA). This was promoted by large web application server companies. See my blog on SOA written in 2010. SOA encouraged a number of good design philosophies such as interface based programming, loosely coupled applications and asynchronous interaction. REST, XML, JSON and messaging platforms enabled SOA. SOA was a big improvement, but the tools and deployment technologies were still heavyweight.

The microservices architecture is the next step in evolution further improving the ideas from SOA.

Many dismiss microservices as another buzz word. But having developed real world applications using tools listed in section 3.0, I see real value and benefit in this architecture.

2.0 Description

The main idea around microservices is that large complex systems are easier to build, maintain and scale using independently built and owned smaller services that work together.

Each microservice is a modular fine grained application providing a specific service. Let us say you have an application that has a UI , authentication, Apis for customer info, Apis for uploading documents, Apis for analytics. You may have a microservice for the UI, a microservice for customer apis, a document upload microservice, an analytics microservice.

A microservice is fully functional.

A microservice performs one specific business or IT function.

The development of the microservice can be done independently.

A microservice runs as its own process.

A microservice communicates using common protocols such as REST/Http.

A microservice offers services via its Apis. It can communicate with other microservices using their APIs.

A microservice is deployable to production independently.

When your application has multiple microservices, each can be developed in the programming language or framework most suitable for that service.

A microservice should scale horizontally by just running more instances of the microservice.

Testing, bug fixing, performance tuning etc on the microservice should happen independently without affecting other microservices.

The above listed characteristics make it easier to build large complex systems.

3.0 Enabling Technologies

A number of newer frameworks have made building microservices easier.

For Java programmers, Dropwizard and SpringBoot are very useful frameworks for building microservices. The old way was monolithic application servers like websphere, weblogic , jboss etc. Dropwizard and SpringBoot turn the table by embedding the web server within your java application. Development is much easier as you are developing a plain java application with a main method. The entire microservice is packaged in one jar and can be run with the java -jar command. For additional information, please read my blog comparing Dropwizard to Tomcat. For Javascript, python and other languages there are similar frameworks.

To start with microservices, a framework as mentioned above is all you need. Once you have developed and use several microservices, the following platforms may be useful.

Docker is a containerization technology that makes it easier to manage production deployments. This is of interest for a dev-ops person who has to roll out services to production.

Kubernetes is a platform for automating the deployment and scaling of containerized applications.

4.0 Disadvantages

For smaller businesses and smaller applications, the overhead of many microservices could be a problem. If your infrastructure is one or two $20 per month VMs on AWS (or another cloud provider), you will not have enough memory/cpu/disk for multiple microservices.

The increased network communication is a cost.

Each microservice is its own process. The remote calls have a serialization/deserialization cost.

5.0 Conclusion

Microservices are a logical next step in the evolution of the development of complex applications.
They are a best practice. But they are not a silver bullet that solves every problem.

Apache kafka Streams

2018-12-02T08:23:00.000-08:00

Apache Kafka is a popular distributed messaging and streaming open source system. A key differentiator for Kafka is that its distributed broker architecture makes it highly scalable. Earlier versions of Kafka were more about messaging. I have a number of blogs on Kafka messaging some of which are listed below in the related blogs section.

This blog introduces Kafka streams which builds on messaging.

1.0 Introduction

In a traditional Kafka producer/consumer application, producers write messages to a topic and consumers consume the messages. The consumer may process the message and then write it to a database , filesystem or even discard it. For a consumer to write the message back to another topic, it has to create a producer.

Kafka streams is a higher level library that lets you build a processing pipeline on streams of messages where each stream processor reads a message, does some analytics such as counting, categorizing , aggregation etc and then potentially writes a result back to another topic.

2.0 Use Cases

Analytics from e-commerce site usage.

Analytics from any distributed application.

Distributed processing of any kind of event or data streams.

Transforming from monolithic to micro-services architecture.

Moving away from database intensive architectures.

3.0 When to use Kafka streams ?

If yours is a traditional messaging application, where you need the broker to hold on to messages till they get processed by a consumer, then the producer/consumer framework might be suitable. Here kafka competes with ActiveMQ, Websphere MQ and other traditional message brokers. Here the processing of each message is independent of other messages.

If yours is an analytics style application, where you have to do different forms of counting, aggregation and slicing/dicing on a stream of data, then Kafka streams might be an appropriate library. Here the processing is for a set of messages in the stream. In this space, Kafka competes with analytics frameworks like Apache Spark, Storm, Splunk etc.

If you were to use producers/consumers for an analytics style application, you would end up creating many producers/consumers, you would probably have to read and write a database several times, you would need to maintain in memory state and you would probably use a third party library for analytics primitives. The Kafka streams library makes all this easier for you.

4.0 Features

Some key features of Kafka streams are:

Provides an API for stream processing primitives such as counting, aggregation, categorization etc. The API supports time windows.

Message processing is one at a time. In producer/consumer, messages are generally processed in batches.

Fault tolerant local state is provided by the library. In consumers, any state has to be managed by the application.

Supports exactly once (once and only once) message delivery. In producer/consumer, it is at least once delivery.

No need to deal with lower level messaging concepts like partitions, producers, consumers, polling.

Self contained complete library that handles both messaging and processing. No need for other third party libraries like Spark.
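To show what counting over time windows means, here is a small Python sketch of a tumbling (fixed size, non overlapping) window count. It illustrates the concept only and does not use the Kafka Streams API; the function name and the (timestamp, key) event format are assumptions made for this example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms):
    """Count events per (window_start, key) for fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp_ms, key in events:
        # Round the timestamp down to the start of the window it falls in.
        window_start = timestamp_ms - (timestamp_ms % window_ms)
        counts[(window_start, key)] += 1
    return dict(counts)


# (timestamp in milliseconds, product id) pairs, e.g. page views
events = [(1000, "widget1"), (4000, "widget1"), (5000, "widget2"), (9999, "widget1")]
result = tumbling_window_counts(events, window_ms=5000)
```

With 5 second windows, the first two widget1 views land in the window starting at 0, while the last widget1 view and the widget2 view land in the window starting at 5000.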

5.0 Concepts

A stream is an unbounded sequence of Kafka messages on a topic.

A stream processor is a piece of code that gets a message, does some processing on it, perhaps stores some in memory state and then writes to another topic for processing by another processor. This is also known as a node.

A stream application is a set of processors where the output of one processor is further processed by one or more other processors. A stream application can be depicted as a graph with the processors as vertices and the streams/topics as edges.

A source processor is the first node in the topology. It has no upstream processors and gets messages from a topic. A sink processor has no downstream processors and will typically write a result somewhere.
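The source, processor and sink roles can be sketched as a chain of Python generators. This is a conceptual illustration, not Kafka code: topics are plain lists, the function names are made up, and the page view line format "user product_id" is an assumption.

```python
from collections import Counter


def source(topic):
    """Source processor: no upstream, reads messages from a topic (here, a list)."""
    for message in topic:
        yield message


def extract_product(lines):
    """Stream processor: turn each page view line into a product id."""
    for line in lines:
        # Assumed line format: "<user> <product_id>"
        yield line.split()[1]


def count_by_product(product_ids):
    """Stream processor with local state: emit the running count per product."""
    counts = Counter()
    for product_id in product_ids:
        counts[product_id] += 1
        yield product_id, counts[product_id]


def sink(records, output_topic):
    """Sink processor: no downstream, writes results somewhere."""
    for record in records:
        output_topic.append(record)


page_views = ["u1 widget1", "u2 widget2", "u3 widget1"]
product_counts_topic = []
sink(count_by_product(extract_product(source(page_views))), product_counts_topic)
```

Each message flows through the whole chain one at a time, and the output topic receives an updated count every time a product is viewed.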

The figure below shows a sample application topology

6.0 Programming model

There are 2 core programming models. Below are some sample code snippets.

6.1 Streams DSL

This is a higher level API built on top of the processor API. Great for beginners.

KStream models the stream of messages. KTable is the in-memory store. You can convert from stream to table and vice versa.

Example: Simple analytics on a stream of pageviews from e-commerce site

// consume from a topic

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> pageViewLines = builder.stream("someTopic");

// From each line extract productid and create a table key=productid,value=count
// We get page view count by product
// getProduct is an application helper (not shown) that returns the product id for a line

KTable<String, Long> productCounts = pageViewLines
    .mapValues(value -> getProduct(value))
    .groupBy((key, value) -> value)
    .count();

// write the running counts to another topic or storage

productCounts.toStream().to("productCountsTopic", Produced.with(Serdes.String(), Serdes.Long()));

6.2 Processor API

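The same example using the processor API

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class ProductFromPageViewProcessor implements Processor<String, String> {

    private KeyValueStore<String, Long> pCountStore;

    // Do any initialization here
    // such as loading stores or scheduling punctuate
    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {

        // get the store that will store counts
        pCountStore = (KeyValueStore<String, Long>) context.getStateStore("pcounts");

        // schedule a punctuate to periodically send the product counts to a downstream processor
        // every 5 secs
        context.schedule(Duration.ofSeconds(5), PunctuationType.STREAM_TIME, (timestamp) -> {

            // iterate over all values in the pCountStore
            try (KeyValueIterator<String, Long> iter = pCountStore.all()) {
                while (iter.hasNext()) {
                    KeyValue<String, Long> val = iter.next();
                    context.forward(val.key, val.value);
                }
            }

            context.commit();
        });
    }

    // Called once for every line or message on the consumed topic
    // getProductId is an application helper (not shown) that extracts the product id
    @Override
    public void process(String k, String line) {
        String productId = getProductId(line);
        Long count = pCountStore.get(productId);
        if (count == null) {
            pCountStore.put(productId, 1L);
        } else {
            pCountStore.put(productId, count + 1);
        }
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}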

7.0 Conclusion

As you can see from both API examples, it is about processing streams of data, doing some analytics and producing results. No need to poll or deal with lower level details like partitions and consumers. The streams model moves you away from legacy database intensive architectures, where data is written to a database first and then slow inefficient queries try to do analytics.

Some disadvantages of Kafka Streams are:

You are tied to Kafka and have to go through a Kafka topic. Other Streaming libraries like Spark are more generic and might have more analytics features.

There are no ways to pause and resume a stream. If load suddenly increases or you want to pause the system for maintenance, there is no clean mechanism. In producer/consumer, there is explicit pause/resume API. Other streaming libraries also have some flow control mechanisms.

8.0 Related Blogs

8.1 Apache Kafka Introduction

8.2 Apache kafka Basic tutorial

8.3 Apache Kafka once and once delivery

ElasticSearch Tutorial

2018-08-26T16:02:00.000-07:00

ElasticSearch is a distributed , scalable, search and analytics engine.

It is similar to Apache Solr, with the difference that it is built to be scalable from the ground up.

Like Solr, ElasticSearch is built on top of Apache Lucene which is a full text search library.

What is difference between a database and a search engine ? Read this blog.

1.0 Key features

Based on very successful search library Apache Lucene.
Provides the ability to store and search documents.
Supports full text search.
Schema free.
Ability to analyze data - count , summarize ,aggregate etc.
Horizontally scalable and distributed architecture.
REST API support.
Easy to install and operate.
API support for several languages.

2.0 Concepts

An elasticsearch server process called a node is a single instance of a java process.

A key differentiator for elasticsearch is that it was built to be horizontally scalable from ground up.

In production environment, you generally run multiple nodes. A cluster is a collection of nodes that store your data.

A document is a unit of data that can be stored in elasticsearch. JSON is the format.

An index is a collection of documents of a particular type. For example, you might have one index for customer documents and another for product information. The index is the data structure that helps the search engine find the document fast. The document being stored is analyzed and broken into tokens based on rules. Each token is indexed, meaning that given the token, there is a pointer back to the document, just like the index at the back of a book. Full text search, or the ability to search on any token or partial token in the document, is what differentiates a search engine from a more traditional database.

Elasticsearch documentation sometimes uses the term inverted index to refer to its indexes. This author believes that the term "inverted index" is just confusing and that this is nothing but an index.
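The token to document mapping can be sketched in a few lines of Python. This is a simplified illustration and not how Lucene stores its index; the analyzer here only lowercases and splits on whitespace, and the sample documents are made up for the example.

```python
from collections import defaultdict


def analyze(text):
    """Very simple analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(documents, field):
    """Map each token to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in analyze(doc[field]):
            index[token].add(doc_id)
    return index


documents = {
    1: {"name": "Big Stalk", "interests": "Hiking Cooking Reading"},
    2: {"name": "Kelly Kidney", "interests": "Dancing Cooking Painting"},
    4: {"name": "Missy Ketchat", "interests": "Singing Cooking Dancing"},
}
index = build_index(documents, "interests")
dancers = index["dancing"]
```

Looking up a token is a single dictionary access that returns the ids of all matching documents, which is why a search on any word in a field is fast.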

In the real world, you never use just one node. You will use an elasticsearch cluster with multiple nodes. To scale horizontally, elasticsearch partitions the index into shards that get assigned to nodes. For redundancy, the shards are also replicated, so that they are available at multiple nodes.
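Partitioning an index into shards and replicating them can be sketched as follows. This is a simplified Python illustration: Elasticsearch routes a document by hashing a routing value (the document id by default) modulo the number of primary shards, but CRC32 here is only a stand-in hash, and the round robin placement and node names are assumptions, not the real allocation algorithm.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a shard from a stable hash of the document id (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_primary_shards


def place_shards(num_primary_shards, nodes):
    """Round-robin primaries across nodes and put one replica of each on the next node."""
    placement = {}
    for shard in range(num_primary_shards):
        primary = nodes[shard % len(nodes)]
        replica = nodes[(shard + 1) % len(nodes)]
        placement[shard] = {"primary": primary, "replica": replica}
    return placement


nodes = ["node-a", "node-b", "node-c"]
placement = place_shards(5, nodes)
shard = shard_for("42", 5)
owner = placement[shard]
```

The same id always maps to the same shard, and because a shard's replica lives on a different node than its primary, losing one node does not make any shard unavailable.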

3.0 Install ElasticSearch

Download from https://www.elastic.co/downloads/elasticsearch the latest version of elasticsearch. You will download elasticsearch-version.tar.gz.

Untar it to a directory of your choice.

4.0 Start ElasticSearch

For this tutorial we will use just a single node. The rest of the tutorial will use curl to send http requests to an elasticsearch node to demonstrate basic functions. Most of it is self explanatory.

To start elasticsearch type

install_dir/bin/elasticsearch

To confirm it is running

curl -X GET "localhost:9200/_cat/health?v"

5.0 Create an index

Let us create an index named person to store person information such as name, sex, age, interests etc.

curl -X PUT "localhost:9200/person"

{"acknowledged":true,"shards_acknowledged":true,"index":"person"}

List the indexes created so far

curl -X GET "localhost:9200/_cat/indices?v"

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

yellow open person AJCSCg0gTXaX6N5g6malnA 5 1 0 0 1.1kb 1.1kb

6.0 Add Documents

Let us add a few documents to the person index.
In the url, _doc is the type of document. It is a way to group documents of a particular type.
In /person/_doc/1, the number 1 is the id of the document we provided. If we do not provide an id, elasticsearch will generate an id.
You will notice that the data elasticsearch accepts is JSON.

curl -X PUT "localhost:9200/person/_doc/1" -H 'Content-Type: application/json' -d'
{
  "name": "Big Stalk",
  "sex":"male",
  "age":41,
  "interests":"Hiking Cooking Reading"
}'

curl -X PUT "localhost:9200/person/_doc/2" -H 'Content-Type: application/json' -d'
{
  "name": "Kelly Kidney",
  "sex":"female",
  "age":35,
  "interests":"Dancing Cooking Painting"
}'

curl -X PUT "localhost:9200/person/_doc/3" -H 'Content-Type: application/json' -d'
{
  "name": "Marco Dill",
  "sex":"male",
  "age":26,
  "interests":"Sports Reading Painting"
}'

curl -X PUT "localhost:9200/person/_doc/4" -H 'Content-Type: application/json' -d'
{
  "name": "Missy Ketchat",
  "sex":"female",
  "age":22,
  "interests":"Singing Cooking Dancing"
}'

curl -X PUT "localhost:9200/person/_doc/5" -H 'Content-Type: application/json' -d'
{
  "name": "Hal Spito",
  "sex":"male",
  "age":31,
  "interests":"Sports Singing Hiking"
}'

7.0 Search or Query

The query can be provided either as a query parameter or in the body of a GET. Yes, Elasticsearch accepts query data in the body of a GET request.

7.1 Query string example

To retrieve all documents:

curl -X GET "localhost:9200/person/_search?q=*"

Response is not shown to save space.

Exact match search as query string:

curl -X GET "localhost:9200/person/_search?q=sex:female"

{"took":14,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":2,"max_score":0.18232156,"hits":[
{"_index":"person","_type":"_doc","_id":"2","_score":0.18232156,"_source":{"name": "Kelly Kidney","sex":"female","age":35,"interests":"Dancing Cooking Painting"}},
{"_index":"person","_type":"_doc","_id":"4","_score":0.18232156,"_source":{"name": "Missy Ketchat","sex":"female","age":22,"interests":"Singing Cooking Dancing"}}
]}}

7.2 GET body examples

Query syntax when sent as body is much more expressive and rich. It merits a blog of its own.

This query finds persons with singing or dancing in the interests field. This is full text search on a field.

curl -X GET "localhost:9200/person/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        { "match": { "interests": "singing" } },
        { "match": { "interests": "dancing" } }
      ]
    }
  }
}'

{"took":15,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":3,"max_score":0.87546873,"hits":[
{"_index":"person","_type":"_doc","_id":"4","_score":0.87546873,"_source":{"name": "Missy Ketchat","sex":"female","age":22,"interests":"Singing Cooking Dancing"}},
{"_index":"person","_type":"_doc","_id":"5","_score":0.2876821,"_source":{"name": "Hal Spito","sex":"male","age":31,"interests":"Sports Singing Hiking"}},
{"_index":"person","_type":"_doc","_id":"2","_score":0.18232156,"_source":{"name": "Kelly Kidney","sex":"female","age":35,"interests":"Dancing Cooking Painting"}}
]}}

Below is a range query on a field.

curl -X GET "localhost:9200/person/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "age": { "gte": 30, "lte": 40 }
    }
  }
}'

{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[
{"_index":"person","_type":"_doc","_id":"5","_score":1.0,"_source":{"name": "Hal Spito","sex":"male","age":31,"interests":"Sports Singing Hiking"}},
{"_index":"person","_type":"_doc","_id":"2","_score":1.0,"_source":{"name": "Kelly Kidney","sex":"female","age":35,"interests":"Dancing Cooking Painting"}}
]}}

8.0 Update a document

curl -X POST "localhost:9200/person/_doc/5/_update" -H 'Content-Type: application/json' -d'
{
  "doc": { "name": "Hal Spito Jr" }
}
'

After executing the above update, do a search for "Jr". The above document will be returned.

9.0 Delete a document

curl -X DELETE "localhost:9200/person/_doc/1"

This will delete the document with id 1. Searches will no longer return this document.

10. Delete Index

curl -X DELETE "localhost:9200/person"

{"acknowledged":true}

That deletes the index we created.

11. Conclusion

This has been a brief introduction to elasticsearch, just enough to get you started. There are a lot more details in each category of APIs. We will explore them in subsequent blogs.

Search vs Database : Do I need a search engine ?

2018-06-23T15:30:00.000-07:00

Since the beginning of time, applications have been developed with a database at the backend to store application data.

Relational databases like Oracle, MySQL etc. took databases to the next level with the relational model, transactions and SQL. They have been hugely successful for the last 30+ years.

In the last 10+ years, big data databases like HBase, Cassandra, MongoDB etc. arrived to solve data-at-scale issues that relational databases did not handle. These databases handle scale, high availability and replication better than relational databases.

Also in the last 10 years, search engines like Apache Solr and ElasticSearch have become available. They store your data like a database, but offer much better search and analytics than a traditional database.

So when do you use a database and when do you use a search engine? Or do you need both? That is what this blog discusses.

Some differences between a database and search engine are :

1.0 Indexes

In a database, to search efficiently, you define indexes. But then you are required to search based on the index key. If you search on other fields, the index cannot be used and the search is inefficient.

A search engine by default will index by all fields. This gives tremendous flexibility. If you add a new type of search to your application, you do not need a new index.

2.0 Full text search

A search engine excels at full text search.

Say you have document one with the line "Hello from england".
And another document with the line "Hello from england and europe".

A search for the term "england" will return both documents. A search for the term "europe" will return only the second document.
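The behaviour above can be illustrated with a toy inverted index. This is only a sketch of the general idea, not how Solr or ElasticSearch are implemented; the names build_index and search are made up for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {
    1: "Hello from england",
    2: "Hello from england and europe",
}
index = build_index(docs)

def search(term):
    """Return the sorted ids of documents that contain the term."""
    return sorted(index.get(term.lower(), set()))

print(search("england"))  # [1, 2]
print(search("europe"))   # [2]
```

Because every token is indexed, a lookup is a dictionary access instead of a scan over all documents.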

Databases on the other hand are more convenient for exact value search.

3.0 Flexible document format

Databases are limited in the structure of data - such as rows and columns or key/value pairs.

Search engines generally consume a wider variety of documents. While json is the most popular format for documents that a search engine consumes, third party libraries are available to parse word docs, pdfs etc for consumption by search engines.

4.0 Analysis and Mapping

Every document stored in a search engine goes through a process of analysis and mapping.

So if you store a document "the Hello 21 from England on 2018-06-15 *", it may get tokenized on whitespace, certain tokens like * or "the" could get discarded, all the other tokens made lowercase, 21 recognized as an integer, and 2018-06-15 recognized as a date.

When you search, the search query goes through a similar process.

The benefit of this process is that whether you search for Hello or hello or hElLo, the document is found. With synonyms configured, whether you search for england or UK or Britain, the document is still found. Whether you search for 2018-06-15 or 15 June 2018, the document is still found.

5.0 Write once read many times

As mentioned above, a search engine is very efficient for search, or in other words, better for reading.

However, the analysis and indexing and storage process for a search engine can be expensive. Update to a document could lead to reindexing.

For this reason, search engines are better suited when your documents are written once, updated rarely, but need to be searched and read many times.

6.0 Database better at OLTP

For the reasons mentioned above, search engines become inefficient if the documents they store are updated frequently, as would be done in an online transaction processing system.

A traditional database is more suited for such usage scenarios.

A traditional database is also the better choice where ACID guarantees, or even weaker transactional integrity, are important.

7.0 Analytics

The popular open source search engines ElasticSearch and Apache Solr have done a great job making it easy to do analytics - from basic counting, aggregation and summarization to faceting.

Analytics on data is much easier and more powerful in a search engine than in a database.

8.0 Summary

If

your queries change frequently
you need to search on fields that change
you need to search on a large variety of fields
you have a variety of document formats
you need full text search
you need analytics
your data access pattern is write/update few times but read many many times

then, a search engine in your architecture will certainly help.

Note that it does not have to be one or the other. Most modern architectures use both a database and a search engine. Depending on the use case you may choose to store some data in a database and other data in a search engine. Or you may choose to store your data in both: a search engine for better querying and a database for transactional integrity.

Tomcat vs Dropwizard

2018-04-03T06:38:00.000-07:00

For the last 15 years, for Java web applications, Apache Tomcat has been the gold standard as a web application server.

More recently, for cloud and micro services architectures that require deployment of a large number of services, a number of newer frameworks are replacing traditional application servers like Tomcat.

One such framework is Dropwizard. Instead of giving your application to a complex application server, Dropwizard brings an embedded HTTP server Jetty into your plain Java application and significantly simplifies the development model.

While both enable you to achieve the same end goal of building Java web services and applications, they are different in many ways.

1. Infrastructure

With Tomcat, the web container infrastructure is separate from the application. Tomcat is packaged separately and runs as its own process. The application is developed and packaged separately as a war. It is then deployed to Tomcat.

Dropwizard on the other hand is like a library that you add as a dependency to your application. Dropwizard bundles the web server Jetty that will be embedded in your application.

2. Operating system processes

With Tomcat, there is one Java process for many applications. It is more difficult to tune the JVM for production for issues like garbage collection, since they depend on application characteristics.

With Dropwizard, there is one Java process for one application. Easier to tune the JVM. Process can be managed easily using linux tools.

3. Development model

With tomcat, you code classes as per Servlets or JAX-RS specifications, but in the end, you produce a war file.

With Dropwizard, the application you write is a normal Java application that starts from the main method. You still code JAX-RS web resource classes or Servlets (rare). But in the end you produce a simple jar and run the application by invoking the class that has the main method.

4. Monolithic vs Micro services

With Tomcat, you can deploy multiple application wars to the same JVM. This can lead to a monolithic process that is running multiple applications. Harder to manage in production as application characteristics vary.

With Dropwizard, the model is suited to building micro services. One process for one application or service. Since running is as simple as running a Java class with a main method, you run one for each micro service. Easier to manage in production.

5. Class loading

In addition to JVM provided bootstrap, extension and system class loaders, Tomcat has to have application class loaders to load classes from application wars and provide isolation between applications. While many tomcat developers never deal with this, it does sometimes lead to class loading issues.

Dropwizard based applications have only the JVM provided class loaders unless the developer writes additional classloaders. This reduces complexity.

6. Debugging and integration with IDE

Some IDEs claim to be able to run and debug Tomcat applications. But given the resources Tomcat takes, debugging by running Tomcat in the IDE is a real pain. Remote debugging is the only real option.

With Dropwizard, you are developing just a plain Java application. So it is really easy to run and debug the application from within the IDE.

7. Fringe benefits

In addition to Jetty, Dropwizard bundles a number of other libraries like Jersey, Jackson, Guava and Logback that are necessary for web services development. It also provides a very simple yaml based configuration model for your application.

For the reasons mentioned above, application server based technologies have been dying for the last few years and Tomcat is not immune to the paradigm shift. If you are developing REST based micro-services for the cloud, Dropwizard is a compelling choice.

MongoDb Query tutorial and cheatsheet

2017-12-03T09:30:00.000-08:00

Mongodb querying is easy and very powerful. But it is handy to have a cheatsheet around when digging for data. In this tutorial, we list and describe some simple useful MongoDB queries.

If you are new to Mongodb, you can read my mongodb introduction.

At the bottom of this page, there is some example json representing some customers.

Copy that to a file say customer.json.

Import into your mongodb database using the command

mongoimport --db yourtestdb --collection customer --file customer.json

1. Find all documents in a collection

> db.customer.find()

{ "_id" : ObjectId("5a22eae84427950fd314ccca"), "firstname" : "Dana", "lastname" : "Dealer", "age" : 60, "sex" : "F", "status" : "Y", "address" : { "city" : "Seattle", "state" : "WA" }, "favorites" : [ "yellow", "orange" ], "recent" : [ { "product" : "p5", "price" : 110 }, { "product" : "p2", "price" : 66 } ] }

{ "_id" : ObjectId("5a22eae84427950fd314cccb"), "firstname" : "Dan", "lastname" : "RunsFra", "age" : 23, "sex" : "M", "status" : "N", "address" : { "city" : "LOS Angeles", "state" : "CA" }, "favorites" : [ "red", "orange" ], "recent" : [ { "product" : "p1", "price" : 85 }, { "product" : "p4", "price" : 8 } ] }

{ "_id" : ObjectId("5a22eae84427950fd314cccc"), "firstname" : "Mike", "lastname" : "North", "age" : 45, "sex" : "M", "status" : "Y", "address" : { "city" : "burlingame", "state" : "CA" }, "favorites" : [ "red", "blue" ], "recent" : [ { "product" : "p1", "price" : 85 }, { "product" : "p2", "price" : 66 } ] }

2. Find all documents based on 1 field equality

> db.customer.find({"lastname":"Dealer"})

3. Find all documents based on multiple fields AND

AND is implicit

> db.customer.find({"firstname":"Dana","lastname":"Dealer"})

Same query with explicit $and operator

> db.customer.find({$and : [{"firstname":"Dana"},{"lastname":"Dealer"}]})

4. Multiple fields OR

db.customer.find({$or : [{"sex":"F"},{status:"N"}]})

5. Comparison operator

db.customer.find({"age":{$lt:30}} )

db.customer.find({"age":{$gt:50}} )

6. Embedded document nested field

db.customer.find({"address.state":"CA"})

{ "_id" : ObjectId("5a22eae84427950fd314cccb"), "firstname" : "Dan", "lastname" : "RunsFra", "age" : 23, "sex" : "M", "status" : "N", "address" : { "city" : "LOS Angeles", "state" : "CA" }, "favorites" : [ "red", "orange" ], "recent" : [ { "product" : "p1", "price" : 85 }, { "product" : "p4", "price" : 8 } ] }

7. Array element

db.customer.find({"favorites":"blue"})

8. Array of embedded docs

db.customer.find({"recent.price":{$gt:90}})

9. Project only certain fields - such as only lastname

db.customer.find({},{"lastname":1})

{ "_id" : ObjectId("5a22eae84427950fd314ccca"), "lastname" : "Dealer" }

{ "_id" : ObjectId("5a22eae84427950fd314cccb"), "lastname" : "RunsFra" }

{ "_id" : ObjectId("5a22eae84427950fd314cccc"), "lastname" : "North" }

10. Sort

Ascending by age

db.customer.find({}).sort({"age":1})

Descending by age

db.customer.find({}).sort({"age":-1})

Appendix 1 : Sample data

{
"firstname": "Mike",
"lastname": "North",
"age": 45,
"sex": "M",
"status": "Y",
"address": {
"city": "burlingame",
"state": "CA"
},
"favorites": ["red", "blue"],
"recent": [{
"product": "p1",
"price": 85
}, {
"product": "p2",
"price": 66
}]
}
{
"firstname": "Dan",
"lastname": "RunsFra",
"age": 23,
"sex": "M",
"status": "N",
"address": {
"city": "LOS Angeles",
"state": "CA"
},
"favorites": ["red", "orange"],
"recent": [{
"product": "p1",
"price": 85
}, {
"product": "p4",
"price": 8
}]
}
{
"firstname": "Dana",
"lastname": "Dealer",
"age": 60,
"sex": "F",
"status": "Y",
"address": {
"city": "Seattle",
"state": "WA"
},
"favorites": ["yellow", "orange"],
"recent": [{
"product": "p5",
"price": 110
}, {
"product": "p2",
"price": 66
}]
}

Related Blogs :

1. Mongo DB Introduction

Cloud service vs Software as a service

2017-09-30T14:24:00.001-07:00

Every day we use some awesome cloud services or applications like Gmail, WhatsApp, Waze etc.

If I write a web application and put it on a server that I rent from a hosting service at $3.99 a month, is it a cloud service or is it "software as a service"? Or is it just a plain vanilla web application?

Even if I write a modern application and host it on AWS or Google Cloud, does that automatically make it a "cloud" application?

Today, no software company says, we are "software as a service". Everyone says they have a cloud service.

In this blog, I describe the characteristics that make an application a real "cloud" application.

An example of a real cloud application is Gmail. As long as I have a connection to the internet, I am always able to access my mail. I can access it from any browser, any mail client, any phone, any device. I can access my email from any place in the world. A billion other people trying to access their emails at the same time does not affect me. I can still do my email stuff. If I try to get an email that I got 10 years ago, the server on the west coast that I am communicating with may not have that data, but Gmail will get the data from a server that has stored that email. If that server is down, Gmail will get it from another server in the same data center that has a replica of the data. If the entire data center is down, Gmail will get it from another data center in the same region. If the entire region is down, Gmail might get my email from a server in a data center in a completely different region, say Europe.

The characteristics of a real cloud service are :

(1) Location independence

A user of a cloud service must be able to use the service from any location without any degradation in service.

If the service has just one server in Mountain View, then when I travel to China, accessing it is going to be horribly slow.

The location independence comes from geographically distributing servers and replicating data to where it is served.

(2) Scale horizontally

As the service becomes popular and the number of users goes up, the number of requests goes up and the data size goes up, there should be no degradation in service. It should scale by adding more servers.
Load balancers will distribute requests to a cluster of servers.

(3) Highly available

The service should be available 24*7. You have data replication and redundancy built in. A failure of a server, or even of a data center, should not lead to stoppage of service.

(4) Device independence

You should be able to access the service from any device that can access the internet - browser, mobile device, IoT etc.

(5) Self healing

The service infrastructure should monitor itself and detect failures early, so that down times are minimal.

(6) Commodity hardware and open source software

Given the scale of a real cloud service, even for the large companies, it is affordable only using commodity hardware and software.

(7) Micro services

The software is generally built as micro services that communicate using simple protocols like REST. Monolithic applications are harder to maintain and fix.

Gmail, the Amazon shopping website, Waze, WhatsApp etc. are examples of real cloud applications. Under the hood they are powered by real cloud scale infrastructures.

The good news for the rest of us building cloud applications is that we do not have to build everything from scratch. There are 2 broad options.

Option 1 : Rent physical cloud but build software and data infrastructure

First there is the physical cloud : You need machines, either physical or virtual, on the internet, distributed across many regions. This part can be rented from cloud vendors like Amazon, Google, Microsoft and others. You will not want to build a physical cloud unless you are close to being another Google or Amazon.

Then there is the data and software part. These are the micro services you build and the distributed databases and message brokers you use. You do the management of data, the replication and the software scaling. There are many open source frameworks, databases, caches and message brokers to help.

A good approach is to build and test the software locally with characteristics listed above and then deploy to the physical cloud for production.

The advantage of this approach is that your service will work on a physical cloud from any vendor. It works even if you decide to run it off the internet, "on premise" or on an intranet.

Option 2: Rent platform as a service

If you prefer not to deal with infrastructure, cloud vendors have combined the physical cloud and software into "platform as a service". Google App Engine, AWS Lambda and RDS are examples of this.
Here the cloud vendor manages both the physical cloud and software infrastructure and you will write just the application code. The downside of this approach is vendor lock in. This is appropriate if you do not have the relevant expertise for option 1.

Summary

In summary, a "real" cloud application is one that scales horizontally and is highly available, with the same quality of service irrespective of where the user is, what device they use or how many users are using the service at a time. Simply writing a monolithic application and putting it on Amazon EC2 or Google Compute Engine does not make it a cloud service.

However if you design and build your application with the characteristics listed above, your application is "cloud" ready. You can deploy it to a physical cloud anytime.

Cache consistency issues in distributed applications

2017-09-16T15:53:00.000-07:00

Your typical enterprise web application is

Going to the database for every read or write is expensive. Developers try to improve read performance by storing values in a cache like memcached or redis.

Cache is in-memory storage. Reading from memory is much faster than going to secondary storage like disk, where databases and files live.

On reads, the application first checks the cache. If the value is found in cache, it is read from there. On a cache miss, the app will read from the database and then update the cache so that subsequent reads do not go to the database.

On writes, the application needs to write to the database and update the cache as well, so that subsequent reads get the updated value.
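The read and write paths just described can be sketched in a few lines. This is a single-threaded illustration only: plain dictionaries stand in for the database and for memcached or redis, and the Store class and its db_reads counter are hypothetical names introduced for the example.

```python
class Store:
    def __init__(self, db):
        self.db = db          # stands in for the database
        self.cache = {}       # stands in for memcached or redis
        self.db_reads = 0     # counts how often a read went to the database

    def read(self, key):
        if key in self.cache:             # cache hit
            return self.cache[key]
        self.db_reads += 1                # cache miss: go to the database
        value = self.db[key]
        self.cache[key] = value           # populate cache for later reads
        return value

    def write(self, key, value):
        self.db[key] = value              # write to the database
        self.cache[key] = value           # and keep the cache in step

store = Store({"x": 1})
store.read("x")       # miss, reads the database
store.read("x")       # hit, served from cache
store.write("x", 2)
print(store.read("x"), store.db_reads)  # 2 1
```

With a single thread this is always consistent. The problems discussed in the rest of this post appear once several threads run read and write concurrently.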

The approach of using a cache to improve read performance works very well when your reads greatly outnumber writes. That is to say, most requests are reads (say 80%) and few requests update the data.

Frequent writes or updates to data complicate matters. Any writes to the database need to be reflected in the cache.

1.0 Common mistakes with caches:

These problems are mostly caused by multiple client threads (improperly) updating the cache.

1.1. Race condition between reader / writer threads

Thread 1 wants to read a value.
It goes to cache and does not find it.
It reads the value from DB

Thread 2 updates the value in DB and updates the cache

Thread 1 sets the outdated value in cache.
Until there is another update to the same value, everyone is reading the outdated value.
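To make the interleaving concrete, here is the same sequence replayed step by step in one thread. It is a deterministic replay of the schedule above rather than real concurrent code, and the dictionaries are stand-ins for the database and the cache.

```python
db = {"x": "old"}
cache = {}

# Thread 1: misses in the cache, then reads the value from the database.
hit = "x" in cache          # False
t1_value = db["x"]          # "old"

# Thread 2: updates the database and then the cache.
db["x"] = "new"
cache["x"] = "new"

# Thread 1: resumes and stores the value it read earlier.
cache["x"] = t1_value

print(db["x"], cache["x"])  # new old
```

The database holds the new value while the cache serves the old one, and nothing corrects it until the next write to x.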

1.2 Race condition between writer threads

Minor variation of 1.1

At time t1, thread1 updates database value x to x1.

At time t2, thread2 updates database value x to x2.
thread2 updates cache value to x2.

thread1 then updates the cache, overwriting x2 with x1.

Subsequent readers read a stale value x1, while the database holds x2.

Solution : lock x in cache, update database, update cache, release lock on x.
Downside : locking in 2 places, cache and db, which hurts performance and risks deadlocks.

1.3 Cache not cleaned up on database rollback

This happens when cache is updated prior to database transaction commit.

thread 1 updates the value in db
before the transaction commits, it updates the cache
transaction rolls back
cache holds a value that was never committed

1.4 Reading before commit

This is a rare situation that could happen when cache is updated post database transaction commit.

Thread 1 is in the process of updating a value x.
x is uncommitted.
cache is not updated.

Other parts of the code in Thread 1 read the value from cache for other purposes. They are reading an outdated value.

Soln: A thread that needs to reuse values it changed should store values locally and use from local until the value is committed to both database and cache.

2.0 Strategies for eliminating cache race conditions :

2.1 Locking the value in cache

The strategy is

-- lock the value to be updated in cache
-- update in database
-- update in cache
-- unlock the cache lock

While this can work, the disadvantage of this approach is

-- locking twice. Database transaction does some locking. Now we have additional locking in cache. Negative for performance
-- Improper locking can lead to deadlocks

2.2 Checking timestamps and/or previous values

In the cache , in addition to value, store the update timestamp from db.
Before updating the cache, check the timestamp and only update if you have a later timestamp.

If you do not want the overhead of storing a timestamp in cache, another approach could be

-- 1 previous value = read the cache value before db operation
-- 2 do the database operation
-- 3 new value = get the latest db value
-- 4 compare and swap : set new value in cache, if current cache value == previous value
-- 5 if 4 succeeded , we are done
-- 6 previous value = current cache value. Goto 3
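A minimal sketch of the timestamp variant follows. It assumes the database hands back a monotonically increasing update timestamp with each write; the class TimestampedCache and its method put_if_newer are hypothetical names for this example, not the API of any real cache. The internal lock only protects the in-memory dictionary and is not held across the database operation.

```python
import threading

class TimestampedCache:
    def __init__(self):
        self._lock = threading.Lock()   # guards only the in-memory dict
        self._data = {}                 # key -> (timestamp, value)

    def put_if_newer(self, key, value, ts):
        """Store value only if ts is later than the timestamp already cached."""
        with self._lock:
            current = self._data.get(key)
            if current is not None and current[0] >= ts:
                return False            # a later write already reached the cache
            self._data[key] = (ts, value)
            return True

    def get(self, key):
        entry = self._data.get(key)
        return None if entry is None else entry[1]

cache = TimestampedCache()
first = cache.put_if_newer("x", "x2", ts=2)   # writer 2 reaches the cache first
stale = cache.put_if_newer("x", "x1", ts=1)   # writer 1 arrives late and is refused
print(first, stale, cache.get("x"))  # True False x2
```

This handles the writer race from 1.2: the late writer cannot overwrite the newer value, whatever order the two cache updates arrive in.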

2.3 Update cache using an updater thread

Any thread with a db operation like create , update, or even a read after a cache miss, does not directly update the cache.

Instead the request to update the cache is put on a queue. Another thread reads the messages one by one and updates the cache.

A disadvantage is that there is time delay before the updated value is available in cache. Also in the case of cache misses, you might see multiple messages in the queue for the same cache update.

This is the preferred solution. If you can tolerate the time delay, it can eliminate race conditions and is easy to implement.
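A rough sketch of the updater thread using Python's standard queue and threading modules is shown below. It assumes that requests are enqueued in the same order in which the database changes were committed, which the application has to guarantee. The None sentinel used to stop the thread is a choice made for this example.

```python
import queue
import threading

cache = {}
updates = queue.Queue()

def updater():
    """Single consumer: applies cache updates in the order they were queued."""
    while True:
        item = updates.get()
        if item is None:          # sentinel: stop the thread
            break
        key, value = item
        cache[key] = value

worker = threading.Thread(target=updater)
worker.start()

# Application threads enqueue requests instead of touching the cache directly.
updates.put(("x", "x1"))
updates.put(("x", "x2"))

updates.put(None)                 # shut down once the queue is drained
worker.join()
print(cache["x"])  # x2
```

Since only one thread ever writes to the cache, two cache writes can never interleave. The price is the delay between the database commit and the cache update.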

2.4 Versioning

We can steal ideas from MVCC, which is used in databases.

The locking strategy in 2.1 locks both readers and writers.

We can improve on this by not requiring readers to lock.

Readers read the latest snapshot value.
Writers lock not the value but a copy of the value. We allow only one copy, so additional writers will be blocked.
When the writer is done with the update (commit), the updated copy is copied to the snapshot.

You can reduce the locking on writers even further by giving each writer its own copy. Also assign a version or transaction id to each copy. When a transaction commits, copy its value to the snapshot.

3.0 Conclusion

In summary, consistency problems can arise due to multiple threads updating a cache and the backing database. Option 3, updating the cache using a single updater thread, can fix these issues. This is a simple solution that will work for most scenarios. Option 2 is a non-locking technique. Option 1, locking, is the least scalable. Option 4, versioning, is the most work to implement.

Distributed Consensus: Raft

2017-07-04T14:50:00.000-07:00

In the Paxos blog, we discussed the distributed consensus problem and how the Paxos algorithm describes a method for a cluster of servers to achieve consensus on a decision.

However, Paxos as a protocol is hard to understand and even harder to implement.

Examples of problems that need consensus are :

- Servers needing to agree on a value, such as whether a distributed lock is acquired or not.
- Servers needing to agree on order of events.
- Servers needing to agree on the state of a configuration value.
- Any problem where you need 100% consistency in a distributed environment that is also highly available.

The Raft algorithm, described at https://raft.github.io/, solves the same problem. It is easier to understand and implement.

Overview

The key elements of raft are:

Leader Election
Log Replication
Consistency ( in the spec they refer to this as safety)

A server in the cluster is the leader. All other servers are followers.

The leader is elected by majority vote.

Every consensus decision begins with the leader sending a value to followers.

If a majority of followers respond having received the value, the leader commits the value and then tells all servers to commit the value.

Clients communicate with leader only.

If a leader crashes, another leader is elected. Messages between leaders and followers enable the followers to determine if leader is still alive.

If a follower does not receive messages from the leader for a certain period, it can try to become a leader by soliciting votes.

If multiple candidates try to get elected at the same time, it is possible there is no majority. In such situations, the candidates try to get elected again after a random delay.

In Raft, time is divided into sequential terms, numbered with consecutive integers. Each term begins with an election, and leadership is for a term.

2 Main RPC messages between leader and followers :

Request Vote : sent when a candidate solicits vote.
Append Entry : sent by leader to replicate a log entry

Scenario 1 : leader election cluster start up

Let us say there are 5 servers, s1 to s5.

Every server is a follower.

No one is getting messages from a leader.

s1 and s3 decide to become leaders (called candidates) and send messages to the other servers to solicit votes.
Candidates always vote for themselves.
s2 and s4 respond to s1 with their votes.
With its own vote, s1 has 3 of 5 votes, a majority, and is elected leader.
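The vote count in this scenario can be sketched as follows. This only tallies votes for a single term and leaves out terms, timeouts and the network entirely. The function name run_election is made up for the example, and the vote of s5 for s3 is an assumption, since the scenario does not say how s5 voted.

```python
def run_election(servers, votes):
    """votes maps voter -> candidate. Return the candidate with a majority, else None."""
    majority = len(servers) // 2 + 1
    tally = {}
    for voter, candidate in votes.items():
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if count >= majority:
            return candidate
    return None                      # split vote: retry after a random delay

servers = ["s1", "s2", "s3", "s4", "s5"]

# s1 and s3 vote for themselves, s2 and s4 vote for s1, s5 (assumed) votes for s3.
votes = {"s1": "s1", "s3": "s3", "s2": "s1", "s4": "s1", "s5": "s3"}
leader = run_election(servers, votes)
print(leader)  # s1

# If s5 is down and the remaining votes split 2-2, nobody reaches 3 votes.
split = run_election(servers, {"s1": "s1", "s2": "s1", "s3": "s3", "s4": "s3"})
print(split)   # None
```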

Scenario 2 : log replication

A client connects to a leader s1 to set x = 3.

s1 writes x=3 to its log. But its state is unchanged.

At this point the change is uncommitted.

s1 sends an appendEntry message to all followers that x = 3. Each follower writes that entry to its log.

Followers respond to s1 that the change is written to log.

When a majority of followers respond, s1 commits the change. It applies the change to its state so x is now 3.

Followers are told to commit by piggybacking the index of the last committed entry on the next appendEntry message. Followers commit the entry by applying the change to their state.

When a change is committed, all previous changes are considered committed.

Because every appendEntry message from the leader carries the last committed entry, the servers can then commit any previous entries they have not yet committed.

The cluster of servers has consensus.
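Here is a small in-memory sketch of this scenario. It is a simplification under stated assumptions: no network, no failures, no terms, and every follower acknowledges. The Server class and its method names are hypothetical and not taken from any Raft implementation.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.log = []            # list of (key, value) entries
        self.commit_index = 0    # number of log entries applied to state
        self.state = {}

    def append(self, entry):
        """Write the entry to the log only. State is not touched yet."""
        self.log.append(entry)
        return True              # acknowledgement sent back to the leader

    def commit_up_to(self, index):
        """Apply all log entries before index that are not yet applied."""
        while self.commit_index < index:
            key, value = self.log[self.commit_index]
            self.state[key] = value
            self.commit_index += 1

leader = Server("s1")
followers = [Server(n) for n in ("s2", "s3", "s4", "s5")]

entry = ("x", 3)
leader.append(entry)
state_before_commit = dict(leader.state)     # still empty: change is uncommitted

# The leader counts itself plus every follower that acknowledged the entry.
acks = 1 + sum(f.append(entry) for f in followers)
if acks > (len(followers) + 1) // 2:
    leader.commit_up_to(len(leader.log))

# Followers learn the commit index from the leader's next message.
for f in followers:
    f.commit_up_to(leader.commit_index)

print(leader.state, followers[0].state)  # {'x': 3} {'x': 3}
```

Keeping the log write and the state change as separate steps is what lets the leader wait for a majority before anything becomes visible.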

Scenario 3 : Leader goes down

When a leader goes down, one or more of the followers can detect that there are no messages from the leader and decide to become a candidate by soliciting votes.

But the leader that just went down has the accurate record of committed entries, that some of the followers might not have.

If a follower that was behind on committed entries became a leader, it could force other servers with later committed entries to overwrite their entries. That should not be allowed. Committed entries should never change.

Raft prevents this situation by requiring a candidate to send, with the requestVote message, the term and index of the latest entry it accepted from the previous leader. Each follower rejects a requestVote with a term/index lower than its own highest term/index.

Since a leader only commits entries accepted by a majority of servers, and a majority of servers is required to get elected, any majority that could elect a new leader includes at least one server that has accepted the highest committed entry of the leader that went down. That server rejects a candidate that is behind. Thus a follower that does not have the highest committed entry from the previous leader can never get elected.
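The voting rule can be written as a single comparison. This sketch covers only the log comparison and ignores the other conditions on granting a vote, such as having already voted in the current term. The function name grants_vote is made up for the example.

```python
def grants_vote(voter_last, candidate_last):
    """Each argument is the (term, index) of the last entry in that server's log.

    The voter grants its vote only if the candidate's log is at least as
    up to date as its own. Tuples compare term first, then index, so a later
    term wins and, on equal terms, the longer log wins.
    """
    return candidate_last >= voter_last

print(grants_vote((2, 5), (2, 4)))  # False: same term, candidate's log is shorter
print(grants_vote((2, 5), (3, 1)))  # True: candidate's last entry is from a later term
print(grants_vote((2, 5), (2, 5)))  # True: logs are equally up to date
```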

Scenario 4 : Catch up for servers that have been down

It is possible for a follower to miss committed entries either because it went down or did not receive the message.

To ensure followers catch up and stay consistent with leaders, Raft has a consistency check.

Every appendEntry message also includes the term and index of the previous entry in the leader's log. If it does not match in the follower, the follower rejects the new message. When an appendEntry is accepted by a server, it means the leader and the server have identical entries up to that point.

When a follower rejects an appendEntry, the leader retries that follower with a previous entry. This continues until the follower accepts an entry. Once an entry is accepted, the leader will again send subsequent entries that will be accepted.

Leader maintains a nextIndex for each follower. This is the index in the log that the leader will send to the follower next.
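The back-off on nextIndex can be sketched with two lists. This is a simplified model: logs are plain lists of (term, command) tuples, indexes are 0-based, the whole retry loop runs in one function instead of over RPCs, and the name sync_follower is hypothetical. It compares whole entries where the real check compares only the term at a given index.

```python
def sync_follower(leader_log, follower_log):
    """Return the follower's log after it has caught up with the leader."""
    next_index = len(leader_log)              # leader's optimistic first guess
    while next_index > 0:
        prev = next_index - 1
        # Consistency check: the follower must hold the same entry just before next_index.
        if prev < len(follower_log) and follower_log[prev] == leader_log[prev]:
            break
        next_index -= 1                       # rejected: retry one entry earlier
    # Accepted: the follower drops anything after the match and takes the leader's entries.
    return follower_log[:next_index] + leader_log[next_index:]

leader_log = [(1, "x=1"), (1, "y=2"), (3, "x=3")]

behind = [(1, "x=1"), (1, "y=2")]             # missed the last entry
diverged = [(1, "x=1"), (2, "y=5")]           # holds an uncommitted entry from an old leader

print(sync_follower(leader_log, behind) == leader_log)    # True
print(sync_follower(leader_log, diverged) == leader_log)  # True
```

In both cases the follower ends up with exactly the leader's log, and a follower with an empty log simply receives every entry.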

Scenario 5 : Cluster membership changes

Cluster membership changes refer to a bunch of servers being added to or removed from the cluster.
Maybe even the current leader is no longer in the new configuration.

This needs some attention because it is possible we end up with 2 leaders and 2 majorities.

Raft takes a two phase approach

First switch to joint consensus
-- entries committed to servers in both configuration
-- a server from either configuration may serve as leader
-- majority from each configuration needs to approve stuff

Second switch to new configuration

Leader receives request to change configuration to new.
Leader uses appendEntry to send the (old,new) config pair to followers.
Once (old,new) is committed , we are in joint consensus period.
Leader then sends appendEntry for new configuration.
Once committed, new configuration is in effect.

Summary

Consensus is required when you need 100% consistency in a distributed environment that is also highly available. Raft simplifies distributed consensus by breaking the problem into leader election and log replication. Easier to understand means easier to implement and use to solve real world problems.

A future blog will go into an implementation.

References:

1. In Search of an Understandable Consensus Algorithm by Diego Ongaro and John Ousterhout Stanford University. https://raft.github.io/raft.pdf

Related Blogs:

1. Distributed Consensus : PAXOS

Distributed Systems : Basic Paxos

2016-11-12T15:19:00.000-08:00

1.0 Introduction

How do you build reliable, highly available distributed systems that are consistent? Paxos is a protocol that addresses this problem.

Paxos was authored by Leslie Lamport in his paper "The Part-Time Parliament" and explained better in his paper "Paxos Made Simple". It has been implemented and used in many of the modern distributed systems built by Google, Amazon, Microsoft etc.

Consider a banking system with clients c1,c2 and server s.

c1 can issue command to s : add $200 to account A
c2 can issue command: add 2% interest to A

A single server can easily determine the order c1,c2 and execute the commands.

But if s crashes, no client can do any work. The traditional way to solve this problem is to fail over using a standby. Another server s', identical to s, is standing around doing nothing. When s crashes, the system detects that s is no longer servicing requests and starts sending requests to s'. For reasons that merit a blog of its own, failover using a standby is hard to implement, hard to test and more expensive. That will not be discussed here.

A second limitation is that when the number of client requests increases, the server may not be able to keep up and respond in a reasonable time.

The way to solve scalability and failover issues is to have multiple servers, say s1 and s2, servicing clients with replication between s1 and s2, so the clients are presented with a single system view. Commands that execute on s1 due to its clients are also made to execute on s2 and vice versa, so that both s1 and s2 are in the same state.

In Figure 1 Both S1 and S2 are active and servicing clients.

Figure 1 : Multiple server cluster with replication

However a difference in the order of execution can lead to a consistency issue.

Assume account A has $1000
Say c1 connects to s1 and invokes the command add $200 to account A.
Say c2 connects to s2 and invokes the command add 5% interest to account A.

If the order of execution is c1,c2 then c1 increases A to 1200 and c2 increases it to 1260. If the order of execution is c2,c1, then c2 increases A to 1050 and c1 increases it to 1250. One server may have a value 1260, while the other may have a value 1250.
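
The arithmetic can be checked with a few lines of Java. This is only a sketch, assuming the two commands are "add 200" and "add 5% interest" applied to a starting balance of 1000 in whole dollars; the class and constant names are made up.

```java
import java.util.function.IntUnaryOperator;

class OrderMatters {
    // Hypothetical commands: c1 adds 200, c2 adds 5% interest (whole dollars for simplicity).
    static final IntUnaryOperator ADD_200 = balance -> balance + 200;
    static final IntUnaryOperator ADD_5_PERCENT = balance -> balance + balance * 5 / 100;

    /** Applies the two commands to the starting balance in the given order. */
    static int apply(int start, IntUnaryOperator first, IntUnaryOperator second) {
        return second.applyAsInt(first.applyAsInt(start));
    }

    public static void main(String[] args) {
        System.out.println(apply(1000, ADD_200, ADD_5_PERCENT)); // c1 then c2: prints 1260
        System.out.println(apply(1000, ADD_5_PERCENT, ADD_200)); // c2 then c1: prints 1250
    }
}
```

The two commands do not commute, so two servers that apply them in different orders end up in different states.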

To ensure consistent results, both servers need to agree on the order of execution. In other words, there needs to be consensus among the servers. You can have more than 2 servers and the same is true.

Paxos is a protocol for achieving consensus among a group of servers.

2.0 Paxos assumptions

A group of distributed servers communicating with each other.

Asynchronous communication

non byzantine : no devious unpredictable stuff

Messages can be lost.

A majority of servers is always available. Systems typically run an odd number of servers (3, 5, 7, ...) because going from an odd number to the next even number adds a server without tolerating any more failures. For any cluster size, two majorities always share at least one member. This observation is critical to the correctness of the protocol.
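
The majority arithmetic is easy to verify. The sketch below (names are made up) computes, for a cluster of n servers, the size of a majority, the minimum overlap between two majorities and the number of failures that still leave a majority available.

```java
class Quorum {
    /** Smallest number of servers that forms a strict majority of n. */
    static int majority(int n) {
        return n / 2 + 1;
    }

    /** Minimum number of servers that two majorities of an n server cluster must share. */
    static int minOverlap(int n) {
        return 2 * majority(n) - n;
    }

    /** Number of servers that can fail while a majority is still available. */
    static int tolerated(int n) {
        return n - majority(n);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.println("n=" + n + " majority=" + majority(n)
                    + " minOverlap=" + minOverlap(n) + " tolerated=" + tolerated(n));
        }
    }
}
```

The overlap is never below 1, and a 4 server cluster tolerates only 1 failure, the same as a 3 server cluster.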

Server can join or leave the system at any time.

3.0 What Paxos achieves

Reliable system with unreliable components.

Only one value may be chosen.

The value is chosen when it is accepted by a majority of the servers.

Once a value is chosen, it cannot be changed.

This "one value" concept can be hard to understand for a first time reader. How is choosing just one value useful in solving any real world problems ? In reality, the value is likely to be a command that needs to be executed on the server. It could be a command, data, or both. Value is a simplification.

For this to be useful, the servers probably need consensus on not just one value but several values. That can be achieved by a minor extension to basic Paxos and will be discussed in a subsequent blog.

4.0 Actors in Paxos

Proposers propose a value to be chosen. Proposers are generally the ones handling client requests.

Acceptors respond to proposers and can be part of the majority that lead to a value being chosen.

Learners learn the chosen value and may put it to some use.

In reality, a single server may function as all 3 and this is what we will assume

5.0 The protocol

(1) Proposer proposes a value (n,v) where n is a proposal number and v is a value.

(2) If an acceptor has not received any other proposal, it sends a response agreeing to not accept any other proposals with number less than n.

If proposal number is less than what it has accepted or agreed to accept, it can ignore the proposal.

If it has other lower number proposals accepted with value v or any other v', it responds with the accepted proposal number and value v'.

The acceptor continues to do this with any subsequent proposal it receives. It must remember the highest proposal number it has. This is important because as described in step 5, it should never accept any lower numbered proposals.

(3)The proposer examines responses to its proposal.

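If any acceptor in the responding majority reports an already accepted value, that value is either chosen or has a good chance of being chosen. The proposer must adopt v', the value of the highest numbered accepted proposal among the responses. If no response carries a value, it can stay with its original v.

(4) Proposer sends accept message (n,v) or (n,v'), depending on the outcome of step 3.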

(5) When an acceptor receives an accept message (n,v) or (n,v'), it must accept the value if n is still the highest proposal number it has.

Between steps 2 and 5, other proposers could have sent other proposals with numbers higher than n. If the acceptor has any such proposals, it cannot accept n.

In either case, it returns to the proposer the highest proposal number it has.

(6) The proposer can use the proposal number in the response from step 5 to determine if its accept message was accepted. If the proposal number is the same as n, then it knows that n is accepted.

Otherwise the returned number is that of a larger proposal number that is around. It has to go back to step 1 and start with a new proposal number greater than this.
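
The acceptor side of steps 2 and 5 can be sketched in a few lines. This is a minimal in-memory illustration and not a complete Paxos implementation: the class, field and method names are made up, values are plain strings, nothing is persisted or sent over a network, and a stale request gets an explicit negative reply instead of being ignored.

```java
/** Toy single-value Paxos acceptor (illustrative only). */
class PaxosAcceptor {

    /** Reply to a prepare (step 2) or accept (step 5) request. */
    static final class Reply {
        final boolean ok;            // true if the request was honoured
        final long highestSeen;      // highest proposal number the acceptor knows of
        final long acceptedNumber;   // -1 if nothing has been accepted yet
        final String acceptedValue;  // null if nothing has been accepted yet

        Reply(boolean ok, long highestSeen, long acceptedNumber, String acceptedValue) {
            this.ok = ok;
            this.highestSeen = highestSeen;
            this.acceptedNumber = acceptedNumber;
            this.acceptedValue = acceptedValue;
        }
    }

    private long promised = -1;        // highest proposal number promised or accepted
    private long acceptedNumber = -1;
    private String acceptedValue = null;

    /** Step 2: promise not to accept proposals numbered below n, and report any accepted value. */
    synchronized Reply prepare(long n) {
        if (n < promised) {
            return new Reply(false, promised, acceptedNumber, acceptedValue);
        }
        promised = n;
        return new Reply(true, promised, acceptedNumber, acceptedValue);
    }

    /** Step 5: accept (n, value) only if no higher numbered proposal has been seen. */
    synchronized Reply accept(long n, String value) {
        if (n < promised) {
            return new Reply(false, promised, acceptedNumber, acceptedValue);
        }
        promised = n;
        acceptedNumber = n;
        acceptedValue = value;
        return new Reply(true, promised, acceptedNumber, acceptedValue);
    }

    public static void main(String[] args) {
        PaxosAcceptor s3 = new PaxosAcceptor();
        s3.prepare(1);                          // proposal 1 arrives
        s3.prepare(2);                          // a competing proposal 2 arrives
        System.out.println(s3.accept(1, "X").ok);          // false: 2 is now the highest
        System.out.println(s3.accept(2, "Y").ok);          // true
        System.out.println(s3.prepare(3).acceptedValue);   // Y: a later proposer must adopt it
    }
}
```

A real acceptor would also write promised, acceptedNumber and acceptedValue to stable storage before replying, so that it keeps its promises across a crash and restart.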

6.0 Notes

Proposals have order as indicated by n. Newer proposals override older proposals. If an acceptor has received proposal n. It can ignore all proposals less than n.

Proposal numbers need to be unique across proposers.

Multiple rounds of propose and accept may be necessary before a majority for a chosen value is reached.

Once a value is chosen, future proposers will also have to choose that value. That is the only way we can get to one and only one value chosen.

Proposals are ordered. Older ones are ignored or rejected
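
One common way to keep proposal numbers unique across proposers is to reserve a separate "slot" for each server. The sketch below is one such scheme, not something prescribed by the protocol: it assumes every server has a distinct id between 0 and clusterSize - 1, and all names are made up.

```java
/** Generates proposal numbers that never collide across servers (illustrative scheme). */
class ProposalNumbers {
    private final int serverId;     // assumed unique per server, 0 .. clusterSize - 1
    private final int clusterSize;
    private long round = 0;

    ProposalNumbers(int serverId, int clusterSize) {
        this.serverId = serverId;
        this.clusterSize = clusterSize;
    }

    /**
     * Returns a number greater than highestSeen (which must be >= 0) and greater than
     * anything this server issued before. The remainder modulo clusterSize is always
     * serverId, so two servers can never produce the same number.
     */
    synchronized long next(long highestSeen) {
        round = Math.max(round, highestSeen / clusterSize) + 1;
        return round * clusterSize + serverId;
    }

    public static void main(String[] args) {
        ProposalNumbers s0 = new ProposalNumbers(0, 3);
        ProposalNumbers s2 = new ProposalNumbers(2, 3);
        System.out.println(s0.next(0));   // 3
        System.out.println(s2.next(4));   // 8, after learning that proposal 4 is around
        System.out.println(s0.next(8));   // 9
    }
}
```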

7.0 Examples

In this section we go through some scenarios of how the protocol works.

7.1 Case 1 : Value chosen for the first time

Figure 2 : Value chosen for first time

This is the most basic case of no value yet chosen and a value proposed for the first time.

3 servers s1,s2,s3. 2 is majority

1. s1 sends proposal 1 with value X to s2
2. s2 has no previous proposal or value , so it responds agreeing to not accept any proposals numbers less than 1
3. s1 has agreement from a majority. So it sends an accept message to s2, which accepts, and X is the chosen value.

7.2 Case 2 : Value proposed after one already chosen

Figure 3 : Value proposed after one chosen

s1 and s2 have agreed on value X.

s3 does not know of this. s3 sends proposal 2 with value Y to s2.

s2 responds that it has accepted proposal 1 with value X.

s3 has to update its value to X. s3 sends an accept message (2,X), which is accepted.

s1,s2,s3, all have value X.

7.3 Case 3: No value yet chosen two competing proposals 1 wins

Figure 4: Competing proposals

There are 5 servers s1,s2,s3,s4,s5. 3 is majority

s1 sends proposal 1 with value X to s2,s3
s2 agrees to accept 1,X
Before s3 receives accept message for (1,X) s5 sends (2,Y) to s3,s4
Now s3 cannot accept (1,x) because its highest proposal is 2.
s3,s4 respond agreement to (2,Y) to s5
s3 ignores proposal (1,X)
s5 sends accept(2,Y) to s3,s4 which accept
s1 sends a new proposal (3,X).
s3 responds (2,Y)
s1 sends accept(3,Y)
s1 and s2 also agree on Y

7.4 Case 4 : Not making progress or liveness

Figure 5 : Not making progress

s1 proposes values to s2 ,s3. s5 proposes values to s3,s4.

s1 proposes (1,X). s3 agrees to not accept proposal less than 1.
Before (1,X) can be accepted s5 proposes (2,Y). s3 now agrees not to accept less than 2
When s1 tries to get (1,X) accepted, it will not get accepted because there is a proposal 2. It sends out a proposal (3,X).
s5 will not be able to get (2,Y) accepted because there is a (3,X). It sends out a (4,Y).
s1 will not be able to get (3,X) accepted because there is a (4,Y). It sends out a (5,X).

This may go on and on. One way to avoid this is for each server to introduce a random delay before issuing the next proposal, thereby giving the others a chance to get their proposals accepted.

Another solution is to have a leader among the servers and have the leader be the only one that issues proposals.

8.0 Paxos usage in real world

The basic protocol enables servers to arrive at a consensus on one value. How does one value apply to a real world system ? To solve real world problems like the one described in the introduction, you have to run multiple instances or iterations of Paxos. For a group of servers to agree on the order of a set of commands, think of a list of commands 0 .. n. Each Paxos instance would pick the command at one index. This is multi Paxos and merits a blog or discussion of its own.

Some real world usages of Paxos have been to arrive at consensus on locks, configuration changes, counters.

9.0 References

"Part time parliament" by Leslie Lamport
"Paxos made Simple" by Leslie Lamport
"Time, clocks and the ordering of events in a distributed system" by Leslie Lamport

JAVA 8 : Lambdas tutorial

2015-10-20T18:11:00.000-07:00

Lambdas are the biggest addition to JAVA in not just release 8 but several releases. But when you look at the cryptic lambda syntax, like most regular programmers, you are left wondering why one should write code this way. The purpose of this tutorial is to introduce lambdas, so that you can start using them in real code.

Overview

Lambdas make it easy to define blocks of code, store them and pass them as parameters. They may be stored in variables for later use or passed as parameters to methods that may invoke the code. This style of programming is known as functional programming.

You might argue that JAVA already supported functional programming using anonymous classes. But that approach is considered verbose.

Example

Listing 1 shows the old way to pass executable code to a thread.

public void Listing1_oldWayRunnable() {
        Runnable r = new Runnable() {
            @Override
            public void run() {
                System.out.println("Hello Anonymous") ;
            }
        } ;
        Thread t = new Thread(r) ;
        t.start() ;
    }

Listing 2 shows the new way using lambdas.

public void Listing2() {

        Thread t = new Thread(()->System.out.println("Hello Lambdas")) ;
        t.start() ;
    }

Listing 2 has no anonymous class. It is much more compact.

()->System.out.println("Hello Lambdas") is the lambda.

Syntax

The syntax is

(parameter)->statement
where parameter is the parameter passed in; its type can usually be omitted because the compiler infers it. In our example, there was no parameter. Hence the syntax was ()->statement

If you had multiple parameters, the syntax would be
(parameter1,parameter2)->statement

If you had multiple statements, the syntax would be
(parameter) ->{statement1; statement2;}
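
The forms above can be tried out with the functional interfaces that ship with JDK 8. The snippet below is a small sketch; the class and constant names are made up.

```java
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.Supplier;

class LambdaForms {
    // no parameter
    static final Supplier<String> NO_ARGS = () -> "hello";
    // one parameter, type inferred by the compiler
    static final Function<Integer, Integer> ONE_ARG = x -> x * x;
    // two parameters
    static final BiFunction<Integer, Integer, Integer> TWO_ARGS = (a, b) -> a + b;
    // explicit parameter type and a block with multiple statements
    static final Function<String, String> BLOCK = (String s) -> {
        String trimmed = s.trim();
        return trimmed.toUpperCase();
    };

    public static void main(String[] args) {
        System.out.println(NO_ARGS.get());             // hello
        System.out.println(ONE_ARG.apply(7));          // 49
        System.out.println(TWO_ARGS.apply(2, 3));      // 5
        System.out.println(BLOCK.apply("  lambda "));  // LAMBDA
    }
}
```

A block body needs an explicit return statement when the lambda produces a value; a single expression body does not.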

Storing in a variable

The lambda expression can also be stored in variable and passed around as shown in listing 3.

public void Listing3() {
        Runnable r = ()->System.out.println("Hello functional interface") ;
        Thread t = new Thread(r) ;
        t.start() ;
    }

Functional interface

JAVA 8 introduces a new term functional interface. It is an interface with just one abstract method that needs to be implemented. The lambda expression provides the implementation for the method. For that reason, lambda expressions can be assigned to variables that are functional interfaces. In the example above Runnable is the functional interface.

You can create new functional interfaces. They are ordinary interfaces but with only one abstract method. @FunctionalInterface is an annotation that may be used to document the fact that an interface is functional.

Listing 5 shows the definition and usage of a functional interface.

@FunctionalInterface
    public interface Greeting {
        public void sayGreeting() ;
    }

    public static void greet(Greeting s) {
        s.sayGreeting();
    }

    @Test
    public void Listing5() {
        // old way
        greet(new Greeting() {
            @Override
            public void sayGreeting() {
                System.out.println("Hello old way") ;
            }
        }) ;

        // lambda new way
        greet(()->System.out.println("Hello lambdas")) ;
    }
}

Once again you can see that the code with lambdas is much more compact. Within an anonymous class, the "this" variable resolves to the anonymous class. But within a lambda, the this variable resolves to the enclosing class.
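
The difference in how "this" resolves is easy to see in a small experiment. In the sketch below (all names are made up) both the enclosing class and the anonymous class override toString, so the output shows which object "this" refers to.

```java
class ThisDemo {
    /** Hypothetical functional interface used only for this demo. */
    interface Namer {
        String name();
    }

    @Override
    public String toString() {
        return "enclosing ThisDemo";
    }

    Namer viaAnonymousClass() {
        return new Namer() {
            @Override
            public String toString() {
                return "anonymous Namer";
            }

            @Override
            public String name() {
                return this.toString();   // "this" is the anonymous class instance
            }
        };
    }

    Namer viaLambda() {
        return () -> this.toString();     // "this" is the enclosing ThisDemo instance
    }

    public static void main(String[] args) {
        ThisDemo demo = new ThisDemo();
        System.out.println(demo.viaAnonymousClass().name());  // anonymous Namer
        System.out.println(demo.viaLambda().name());          // enclosing ThisDemo
    }
}
```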

java.util.function

The java.util.function package in JDK 8 has several ready to use functional interfaces. For example the Consumer interface takes a single argument and returns no result. It is widely used by new methods in the collection classes of java.util. Listing 6 shows one such use with the forEach method added to the Iterable interface, which can be used to process all elements in a collection.

@Test
    public void Listing6() {
         List<Integer> l = Arrays.asList(1,2,3,4,5,6,7,8,9) ;
         l.forEach((i)->System.out.println(i*i)) ;
    }
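
Consumer is only one of the interfaces in java.util.function. The sketch below combines three of them, Predicate, Function and Consumer, with forEach; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

class FunctionalInterfaces {

    /** Returns the squares of the even numbers in the input, in input order. */
    static List<Integer> squaresOfEvens(List<Integer> input) {
        Predicate<Integer> isEven = i -> i % 2 == 0;      // takes a value, returns a boolean
        Function<Integer, Integer> square = i -> i * i;   // takes a value, returns a value
        List<Integer> out = new ArrayList<>();
        // Consumer takes a value and returns nothing; forEach calls it once per element
        Consumer<Integer> collect = i -> {
            if (isEven.test(i)) {
                out.add(square.apply(i));
            }
        };
        input.forEach(collect);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> result = squaresOfEvens(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
        result.forEach(i -> System.out.println(i));   // 4, 16, 36, 64 on separate lines
    }
}
```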

In summary, Java 8 lambdas introduce a new programming style to java. It attempts to bring JAVA up to par with other languages that claim to be superior because they support functional programming. It is not all just programming style. Lambdas do provide some performance advantages. I will examine them more in future blogs.

ConcurrentHashMap vs ConcurrentSkipListMap

2015-07-20T20:19:00.000-07:00

In the blog Map classes, we discussed the map classes in java.util package. In blog ConcurrentHashMap, we ventured into concurrent collections and discussed the features of ConcurrentHashMap, which offers much superior concurrency than a conventional HashMap.

In this blog we discuss another concurrent map, the ConcurrentSkipListMap and compare it with ConcurrentHashMap. Package java.util has a HashMap and TreeMap. Have you ever wondered why java.util.concurrent has a ConcurrentHashMap, but no ConcurrentTreeMap and why there is a ConcurrentSkipListMap ?

In the non concurrent Collections, there is a HashMap and a TreeMap. HashMap offers O(1) time complexity, while TreeMap maintains a sorted order but with O(log n) complexity. The implementation of a tree map is not an ordinary binary search tree (BST), because a BST that is not balanced degrades in performance to O(n) for input that is already sorted. TreeMap is implemented as a Red black tree, whose implementation is complex and involves balancing the tree (moving the nodes around) when nodes are added or removed. The complexity is even greater when you try to make the implementation concurrent (safe for concurrent use). For that reason there is no ConcurrentTreeMap in java.util.concurrent.

A concurrent implementation of SkipList is simpler. Hence, for a Map that is ordered and concurrent, the implementers chose SkipList.

What is a Skip List ?

A skiplist is an ordered linked list with O(log n) expected search time (the worst case is still O(n), but it is very unlikely). An ordinary linked list has O(n) worst case search time. A skip list provides faster search by maintaining layers of links, allowing the search to skip nodes. As shown in the figure, the lowest layer is an ordinary linked list. But each higher layer skips some (more) nodes.

level4 10-------------------------------------100-null
level3 10-----------------50-----------------100-null
level2 10-------30------ 50-----70---------100-null
level1 10 -20 -30 -40 -50 -60-70-80-90-100-null

Let us say you need to find 80 in the list.
Start at the highest level, level 4. Search linearly to find the node that is equal to 80 or whose next node is greater than 80. At level 4, 100 is greater than 80. So at node 10, move down to level 3.

At level 3, node 10, the next node 50 is less than 80. Move to node 50. The next node 100 is greater than 80. Move down to level 2 at node 50.

At level 2 node 50, next node is 70 which is less than 80. Move to node 70. Next node is 100 which is greater than 80. Move to level 1 at node 70.

At level 1, this is the last level. Keep going forward from 70 till you find 80 or reach end of the list.
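
The walk described above can be written down directly. The sketch below builds the four level list from the figure by hand (a real skip list picks node heights at random on insert) and counts how many forward moves a search takes. All names are made up.

```java
class SkipListSearch {
    static final class Node {
        final int key;
        final Node[] next;   // next[0] is level 1, next[3] is level 4

        Node(int key, int height) {
            this.key = key;
            this.next = new Node[height];
        }
    }

    /** Builds the list from the figure and returns its first node (10). */
    static Node build() {
        int[] keys    = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        int[] heights = { 4,  1,  2,  1,  3,  1,  2,  1,  1,   4};
        Node[] nodes = new Node[keys.length];
        for (int i = 0; i < keys.length; i++) {
            nodes[i] = new Node(keys[i], heights[i]);
        }
        // Link, level by level, every node that is tall enough to appear on that level.
        for (int level = 0; level < 4; level++) {
            Node prev = null;
            for (Node n : nodes) {
                if (n.next.length > level) {
                    if (prev != null) {
                        prev.next[level] = n;
                    }
                    prev = n;
                }
            }
        }
        return nodes[0];
    }

    /** Returns the number of forward moves needed to reach target, or -1 if it is absent. */
    static int search(Node head, int target) {
        Node cur = head;
        int moves = 0;
        for (int level = head.next.length - 1; level >= 0; level--) {
            // Move right while the next node on this level does not overshoot the target.
            while (cur.next[level] != null && cur.next[level].key <= target) {
                cur = cur.next[level];
                moves++;
            }
            // Otherwise drop down one level at the current node.
        }
        return cur.key == target ? moves : -1;
    }

    public static void main(String[] args) {
        Node head = build();
        System.out.println(search(head, 80));   // 3 moves: 10 -> 50 -> 70 -> 80
        System.out.println(search(head, 85));   // -1, not in the list
    }
}
```

Finding 80 takes 3 forward moves here, against 7 in a plain linked list that starts at 10.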

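Adding more levels can lead to faster search.

Skiplist has O(log n) expected performance for search, insert and delete. Depending on the number of levels, it does use some extra space. Expected space is O(n) when each level keeps about half the nodes of the level below, and at most O(n log n) if every node appeared on all log n levels.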

In general, you will use a ConcurrentHashMap, if you must have O(1) for both get and put operations, but do not care about the ordering in the collection. You will use a ConcurrentSkipListMap if you need an ordered collection (sorted), but can tolerate O(logn) performance for get and put.
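
The practical difference shows up in the API. ConcurrentSkipListMap keeps its keys sorted and offers navigation methods such as firstKey, ceilingKey and headMap, while ConcurrentHashMap offers only lookup by key with no ordering guarantee. The snippet below is a small sketch with made up class and method names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

class MapChoice {

    /** Fills a skip list map with a few keys inserted out of order. */
    static ConcurrentSkipListMap<Integer, String> sample() {
        ConcurrentSkipListMap<Integer, String> sorted = new ConcurrentSkipListMap<>();
        for (int key : new int[] {50, 10, 40, 20, 30}) {
            sorted.put(key, "v" + key);
        }
        return sorted;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Integer, String> sorted = sample();
        System.out.println(sorted.keySet());        // [10, 20, 30, 40, 50], always in key order
        System.out.println(sorted.firstKey());      // 10
        System.out.println(sorted.ceilingKey(25));  // 30, the smallest key >= 25
        System.out.println(sorted.headMap(30));     // {10=v10, 20=v20}, keys strictly below 30

        ConcurrentHashMap<Integer, String> hashed = new ConcurrentHashMap<>(sorted);
        System.out.println(hashed.get(40));         // v40; iteration order is not guaranteed
    }
}
```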

Lastly, SkipList is easier to implement than a balanced tree and has become the data structure of choice for an ordered concurrent Map.

Apache Cassandra: Compaction

2015-05-19T10:41:00.000-07:00

In Cassandra vs HBase, I provided an overview of Cassandra. In Cassandra data model, I covered data modeling in Cassandra. In this blog, I go a little bit into Cassandra internals and discuss Compaction, a topic that is a source of grief for many users. Very often you hear that during compaction, performance degrades. We will discuss what compaction is, why it is necessary and the different types of compaction.

Compaction is the process of merging multiple SSTables into larger tables. It removes data that has been marked for deletion and reduces fragmentation. Generally it happens automatically in the background, but can be started manually as well.

Why is compaction necessary ?

Cassandra is optimized for writes. A write is first written in memory to a table called Memtable. When the Memtable reaches a certain size it is written in its entirety to disk as a new SSTable. An SSTable has an index which consists of sorted keys, which point to the location in the file that has the columns. SSTables are immutable. They are never updated.

The high throughput for writes is achieved by always appending and never seeking before writing. Updates to existing keys are also written to the current Memtable and eventually written to a new SSTable. There are no disk seeks while writing.

Obviously, over time there are going to be several SSTables on disk. Not only that, but the latest column values for a single key might be spread over several SSTables.

How does this affect reads ?

Reading from one SSTable is easy. Find the key in the index. Keys are sorted. So a binary search would find the key. After that it is one disk seek to the location of the columns.

But as pointed out earlier, the updates for a single key might be spread over several SSTables. So for the latest values, Cassandra would need to read several SSTables and merge updates based on timestamps before returning columns.

Rather than do this for every read, it is worthwhile to merge SSTables in the background, so that when a read request arrives, Cassandra needs to just read from fewer SSTables ( one would be ideal).

Compaction

Compaction is the process of merging SSTables in order to

read columns for partition key from as few SSTables as possible
remove deleted data
reduce fragmentation

We did not talk about delete earlier. When Cassandra receives a request to delete a partition key, it merely marks it for deletion but does not actually remove the data associated with the key. The term used in Cassandra is "tombstone". A tombstone is created. During compaction, tombstones are supposed to be removed.
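
The merge itself can be illustrated with a toy model. The sketch below treats an SSTable as a sorted map from key to a single timestamped value, keeps the newest version of each key and drops tombstones. It is not Cassandra code: all names are made up, and real Cassandra keeps a tombstone until gc_grace_seconds has passed instead of dropping it immediately.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class ToyCompaction {

    /** One version of a value; value == null models a tombstone. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    /** Merges the tables, keeping the newest cell per key and then dropping tombstones. */
    static SortedMap<String, Cell> compact(List<SortedMap<String, Cell>> sstables) {
        SortedMap<String, Cell> merged = new TreeMap<>();
        for (SortedMap<String, Cell> table : sstables) {
            for (Map.Entry<String, Cell> e : table.entrySet()) {
                // the cell with the larger timestamp wins
                merged.merge(e.getKey(), e.getValue(),
                        (a, b) -> a.timestamp >= b.timestamp ? a : b);
            }
        }
        merged.values().removeIf(Cell::isTombstone);
        return merged;
    }

    public static void main(String[] args) {
        SortedMap<String, Cell> t1 = new TreeMap<>();
        t1.put("a", new Cell("a-old", 1));
        t1.put("b", new Cell("b-old", 1));
        SortedMap<String, Cell> t2 = new TreeMap<>();
        t2.put("a", new Cell("a-new", 2));   // update of key a
        SortedMap<String, Cell> t3 = new TreeMap<>();
        t3.put("b", new Cell(null, 3));      // delete of key b (tombstone)

        SortedMap<String, Cell> result = compact(Arrays.asList(t1, t2, t3));
        System.out.println(result.keySet());          // [a]
        System.out.println(result.get("a").value);    // a-new
    }
}
```

After the merge, a read for key a touches one table instead of two, and the deleted key b no longer takes up space.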

Types of Compaction

Size tiered compaction:

This is based on the number of SSTables and the size of the tables. A compaction is triggered when the number of tables and their size reach a certain threshold. Tables of similar size are grouped into buckets for compaction. Smaller tables are merged into a larger table.

Some disadvantages of size tiered compaction are that read performance can vary because the columns for a partition key can be spread over several SSTables. A lot of free space (double the current storage) is required during compaction, since the merge process is making a copy.
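
The bucketing idea can be sketched as follows. This is a simplified illustration and not Cassandra's actual code: a table joins a bucket when its size is within a factor of the bucket's average size, and a bucket becomes a compaction candidate once it holds a minimum number of tables. The factors 0.5 and 1.5 and the minimum of 4 used below are assumptions chosen for the example, and all names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SizeTieredBuckets {

    /** Groups table sizes into buckets of similar size (within [avg * low, avg * high]). */
    static List<List<Long>> buckets(List<Long> sizes, double low, double high) {
        List<Long> sorted = new ArrayList<>(sizes);
        Collections.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        for (long size : sorted) {
            List<Long> last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null) {
                double sum = 0;
                for (long s : last) {
                    sum += s;
                }
                double avg = sum / last.size();
                if (size >= avg * low && size <= avg * high) {
                    last.add(size);      // similar enough: same bucket
                    continue;
                }
            }
            List<Long> bucket = new ArrayList<>();
            bucket.add(size);            // too different: start a new bucket
            result.add(bucket);
        }
        return result;
    }

    /** Buckets that hold at least minThreshold tables are candidates for compaction. */
    static List<List<Long>> candidates(List<List<Long>> buckets, int minThreshold) {
        List<List<Long>> result = new ArrayList<>();
        for (List<Long> bucket : buckets) {
            if (bucket.size() >= minThreshold) {
                result.add(bucket);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> sizesMb = Arrays.asList(10L, 11L, 12L, 9L, 100L, 110L, 1000L);
        List<List<Long>> all = buckets(sizesMb, 0.5, 1.5);
        System.out.println(all);                  // [[9, 10, 11, 12], [100, 110], [1000]]
        System.out.println(candidates(all, 4));   // [[9, 10, 11, 12]]
    }
}
```

Only the four small tables are merged in this run; the two 100 MB sized tables wait until more tables of similar size accumulate.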

Leveled compaction:

There are multiple levels of SSTables. SSTables within a level are of the same size and non overlapping (Within each level, a partition key will be in one SSTable only) . SSTables in the higher levels are larger. Data from the lower levels is merged into SSTables of the higher levels.
Leveled compaction tries to ensure that most reads happen from 1 SSTable. The worst read performance is bound by the number of levels. This works well for read heavy workloads because Cassandra knows which SSTable within each level to check for the key. But more work needs to be done during compaction especially for write(insert) heavy workloads. Due to the extra work to ensure a fixed number of SSTables, there is a lot more IO.

Date tiered compaction:

Data written within a certain period of time say 1 hr is merged in one SSTable. This works well when you are writing time series data and querying based on timestamp. A query such as give me columns written in the last 1 hr can be serviced by reading just 1 SSTable. This also makes it easy to remove tombstones that are based on TTL. Data with the same TTL is likely to be in the same SSTable and the entire SSTable can be dropped.

Manual compaction:

This is compaction started manually using the nodetool compact command. A keyspace and table are specified. If you do not specify the table, the compaction will run on all tables. This is called a major compaction. It involves a lot of IO and is generally not done.

In summary, compaction is really fundamental to distributed databases like Cassandra. Without the append only architecture, write throughput would be much lower. And high write throughput is necessary for highly scalable systems. Stated another way, writes are much harder to scale and are generally the bottleneck. Reads can be scaled easily by de-normalization, replication and caching.

Even with relational databases, applications do not go to Oracle or MySql for every read. Typically there is cache like Memcached or Redis, that caches frequently read data. For predictable read performance consider fronting Cassandra with a fast cache. Another strategy is to use different Cassandra clusters for different workloads. Read requests can be sent to clusters optimized for read.

Lastly, Leveled compaction works better for read intensive loads, whereas Date tiered compaction is suited for time series data and when there is a steady write rate. Size tiered compaction is used with write intensive workloads. But there is no silver bullet. You have to try, measure and tune for optimal performance with your workload.

Related Blogs:

Cassandra vs HBase
Cassandra data model
Choosing Cassandra

Apache Kafka : New producer API in 0.8.2

2015-03-28T09:45:00.000-07:00

In Kafka version 0.8.2, there is a newer, better and faster version of the Producer API. You might recall from earlier blogs that the Producer is used to send messages to a topic. If you are new to Kafka, please read following blogs first.

Apache Kafka Introduction
Apache Kafka JAVA tutorial #1

Some features of the new producer are :

Asynchronously send messages to a topic.
Send returns immediately. Producer buffers messages and sends them to broker in the background.
Thanks to buffering, many messages sent to broker at one time without waiting for responses.
Send method returns a Future<RecordMetadata>. RecordMetadata has information on the record, like which partition it is stored in and what the offset is.
Caller may optionally provide a callback, which gets called when the message is acknowledged.
Buffer can at times fill up. Buffer size is configurable and can be configured using the buffer.memory configuration property.
If the buffer fills up, the Producer can either block or throw an exception. The behavior is controlled by the block.on.buffer.full configuration property.

In the rest of the blog we will use Producer API to rewrite the Producer we wrote in tutorial #1

For this tutorial you will need

(1) Apache Kafka 0.8.2
(2) JDK 7 or higher. An IDE of your choice is optional
(3) Apache Maven
(4) Source code for this sample from https://github.com/mdkhanga/my-blog-code so you can look at working code.

In this tutorial we take the Producer we wrote in Step 5 Kafka tutorial 1 and rewrite it using the new API. We will send messages to a topic on a Kafka Cluster and consume it with the consumer we wrote in that tutorial.

Step 1: Set up a Kafka cluster and create a topic

If you are new to Kafka, you can read and follow the instructions in my tutorial 1 to setup a cluster and create a topic.

Step 2: Get the source code for tutorial 1,2,3 from https://github.com/mdkhanga/my-blog-code

Copy KafkaProducer.java to KafkaProducer082.java. We will port KafkaProducer082 to the new producer API.

Step 3: Write the new Producer

Update the maven dependencies in pom.xml.

For the new producer you will need

<dependency>
          <groupId>org.apache.kafka</groupId>
          <artifactId>kafka-clients</artifactId>
          <version>0.8.2.0</version>
</dependency>

The rest of the client code also needs to be updated to 0.8.2.

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.8.2.0</version>
</dependency>

The new producer will not work if rest of the client uses 0.8.1 or lower versions.

Step 3.1: Imports

Remove the old imports and add these.

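import java.util.Properties;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;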

Note the packages.

Step 3.2: Create the producer

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
// the new producer uses "acks"; "request.required.acks" belongs to the old producer
props.put("acks", "1");

KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);

As in the past, you provide some configuration like which broker to connect to as Properties. The key and value serializers have to be provided. There are no default values.

Step 3.3: Send Messages

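String date = "04092014" ;
String topic = "mjtopic" ;

for (int i = 1 ; i <= 1000000 ; i++) {

    String msg = date + " This is message " + i ;
    ProducerRecord<String, String> data = new ProducerRecord<String, String>(topic,
            String.valueOf(i), msg);

    // send() is asynchronous; the callback runs when the broker acknowledges the record
    Future<RecordMetadata> rs = producer.send(data, new Callback() {
        @Override
        public void onCompletion(RecordMetadata recordMetadata, Exception e) {
            if (e != null) {
                // on failure recordMetadata is null, so report the error instead
                System.out.println("Send failed: " + e) ;
                return ;
            }
            System.out.println("Received ack for partition=" + recordMetadata.partition() +
                    " offset = " + recordMetadata.offset()) ;
        }
    });

    try {
        // get() blocks until this record is acknowledged; done here only to print the metadata
        RecordMetadata rm = rs.get();
        msg = msg + " partition = " + rm.partition() + " offset =" + rm.offset() ;
        System.out.println(msg) ;
    } catch(Exception e) {
        System.out.println(e) ;
    }

}

// send any buffered records and release resources
producer.close() ;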

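As mentioned earlier, the send is async and it will batch messages before sending to the broker. The send method immediately returns a Future that will hold the partition and the offset in the partition for the message sent. We provide a callback to the send method whose onCompletion method is called when an acknowledgement for the message is received. Note that calling get() on the Future after every send waits for each acknowledgement and so gives up most of the benefit of batching; it is done here only to print the metadata.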

Step 4: Start the Consumer

mvn exec:java -Dexec.mainClass="com.mj.KafkaConsumer"

Step 5: Start the Producer

mvn exec:java -Dexec.mainClass="com.mj.KafkaProducer082"

You should start seeing messages in the consumer.

In summary, the new producer API is asynchronous, scalable and returns useful metadata on the message sent.

Related Blogs:
Apache Kafka Introduction
Apache Kafka JAVA tutorial #1
Apache Kafka JAVA tutorial #2
Apache Kafka JAVA tutorial #3

MongoDB tutorial #1 : Introduction

2015-01-23T17:49:00.000-08:00

In the blog NoSQL, I provided an introduction to NoSql databases. We have discussed some NoSql databases such as HBase, Cassandra , Redis. In this blog, we discuss MongoDB, a document oriented database, which is in contrast to the key value stores we discussed earlier. MongoDB is currently one of the more popular NoSql databases, primarily due to its ease of use and simpler programming model. But there have been reports that it lags in scalability or performance compared to other NoSql databases. And it has more moving parts. But its ease of use and low learning curve makes it an attractive choice in many scenarios.

The key features of MongoDB are:

The unit of storage like a record in relational databases or key-value pair in key value stores, is a document or more precisely a JSON document.

{ "employee_id":"12345",
"name":"John doe",
"department": "database team",
"title":"architect",
"start_date":"1/1/2015" }

Documents are stored in collections.
Collection can be indexed by field.
Indexing support for faster queries.
No schema is required for the collection.
MongoDB is highly available using replication and automatic failover. Write happens to a primary server but can be replicated to multiple replicas. If the primary goes down, one of the replicas takes over as the primary.
Read operations can be scaled by sending the reads to the replicas as well.
Write operations are scaled by sharding.
Sharding is automatic, but has a couple of moving parts:

Sharding is based on a key which is an indexed field or an indexed compound field.
Sharding can be range based or hash based. With range based, partitioning is based on key range, so that values close to each other are together. With hash based, the partitioning is based on a hash of the key.
Data set is divided into chunks. Each shard manages some chunks
Query routers are used to send the request to the right shard.
Config servers hold meta data on which chunks are with which shard.
If a chunk grows too large, it is broken up. If some shards own more chunks than others, the cluster is automatically rebalanced by redistributing the chunks.

In the rest of the blog, let us fire up a mongodb instance, create some data and learn how to query it.

Step 1: Download Mongo

You can download the server from www.mongodb.org/downloads.
I like to download the generic linux version and untar it.

Untar/unzip it to a directory of your choice.

Step 2 : Start the server

Decide on a directory to store the data. Say ~/mongodata. Create the directory.

Change to the directory where you installed mongo. To start the server, type the command.

bin/mongod --dbpath ~/mongodata

Step 3: Start the mongo client

bin/mongo

Step 4: Create and insert some data into a collection

Create and use a database.
> use testDb ;

Create a employee document and insert into the employees collection.
> emp1 = { "employee_id":"12345", "name":"John doe", "department": "database team", "title":"architect", "start_date":"1/1/2015" }
> db.employees.insert(emp1)

Retrieve the document.
> db.employees.find()
{ "_id" : ObjectId("54c2de34426d3d4ea1226498"), "employee_id" : "12345", "name" : "John doe", "department" : "database team", "title" : "architect", "start_date" : "1/1/2015" }

Step 5 : Insert a few more employees

> emp2 = { "employee_id":"12346", "name":"Ste Curr", "department": "database team", "title":"developer1", "start_date":"12/1/2013" }
> db.employees.insert(emp2)

> emp3 = { "employee_id":"12347", "name":"Dre Grin", "department": "QA team", "title":"developer2", "start_date":"12/1/2011" }
> db.employees.insert(emp3)

> emp4 = { "employee_id":"12348", "name":"Daev Eel", "department": "Build team", "title":"developer3", "start_date":"12/1/2010" }
> db.employees.insert(emp4)

Step 6: Queries

Query by attribute equality
> db.employees.find({"name" : "Ste Curr"} )
{ "_id" : ObjectId("54c2e0de426d3d4ea1226499"), "employee_id" : "12346", "name" : "Ste Curr", "department" : "database team", "title" : "developer1", "start_date" : "12/1/2013" }

Query by attribute with regex condition
> db.employees.find({"department":{$regex : "data*"}})
{ "_id" : ObjectId("54c2de34426d3d4ea1226498"), "employee_id" : "12345", "name" : "John doe", "department" : "database team", "title" : "architect", "start_date" : "1/1/2015" }
{ "_id" : ObjectId("54c2e0de426d3d4ea1226499"), "employee_id" : "12346", "name" : "Ste Curr", "department" : "database team", "title" : "developer1", "start_date" : "12/1/2013" }

Query using less than , greater than conditions
> db.employees.find({"employee_id":{$gte : "12347"}})
{ "_id" : ObjectId("54c2e382426d3d4ea122649a"), "employee_id" : "12347", "name" : "Dre Grin", "department" : "QA team", "title" : "developer2", "start_date" : "12/1/2011" }
{ "_id" : ObjectId("54c2e3af426d3d4ea122649b"), "employee_id" : "12348", "name" : "Daev Eel", "department" : "Build team", "title" : "developer3", "start_date" : "12/1/2010" }

> db.employees.find({"employee_id":{$lte : "12346"}})
{ "_id" : ObjectId("54c2de34426d3d4ea1226498"), "employee_id" : "12345", "name" : "John doe", "department" : "database team", "title" : "architect", "start_date" : "1/1/2015" }
{ "_id" : ObjectId("54c2e0de426d3d4ea1226499"), "employee_id" : "12346", "name" : "Ste Curr", "department" : "database team", "title" : "developer1", "start_date" : "12/1/2013" }

Step 7: Cursors

Iterate through results.
> var techguys = db.employees.find()
> while ( techguys.hasNext() ) printjson( techguys.next() )
{
    "_id" : ObjectId("54c2de34426d3d4ea1226498"),
    "employee_id" : "12345",
    "name" : "John doe",
    "department" : "database team",
    "title" : "architect",
    "start_date" : "1/1/2015"
}
.
.
.

Step 8: Delete records

Delete one record
> db.employees.remove({"employee_id" : "12345"})
WriteResult({ "nRemoved" : 1 })

Delete all records
> db.employees.remove({})
WriteResult({ "nRemoved" : 3 })

As you can see, MongoDB is pretty easy to use. Download it and give it a try.

Apache Kafka JAVA tutorial #3: Once and only once delivery

2015-01-09T17:45:00.001-08:00

In Apache Kafka introduction, I provided an architectural overview of the internet scale messaging broker. In JAVA tutorial 1, we learnt how to send and receive messages using the high level consumer API. In JAVA tutorial 2, we examined partition leaders and metadata using the lower level Simple consumer API.

A key requirement of many real world messaging applications is that a message should be delivered once and only once to a consumer. If you have used the traditional JMS based message brokers, this is generally supported out of the box, with no additional work from the application programmer. But Kafka has a distributed architecture where the messages to a topic are partitioned for scalability and replicated for fault tolerance, and hence the application programmer has to do a little more to ensure once and only once delivery.

Some key features of the Simple Consumer API are:

To fetch a message, you need to know the partition and partition leader.
You can read messages in the partition several times.
You can read from the first message in the partition or from a known offset.
With each read, you are returned an offset where the next read can happen.
You can implement once and only once read, by storing the offsets with the message that was just read, thereby making the read transactional. In the event of a crash, you can recover because you know what message was last read and where the next one should be read.
Not covered in this tutorial, but the API lets you determine how many partitions there are for a topic and who the leader for each partition is. While fetching messages, you connect to the leader. Should a leader go down, you need to fail over by determining who the new leader is, connecting to it and continuing to consume messages.
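The idea of reading from a stored offset can be sketched without a broker. The following is a minimal simulation, not the Kafka API: the class ResumeFromOffsetSketch, its in-memory list standing in for a partition and the storedOffset field standing in for the offset file are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a single partition; this is not the Kafka API.
final class ResumeFromOffsetSketch {

    private final List<String> partition = new ArrayList<>();
    // Plays the role of the offset file: it survives consumer restarts.
    private long storedOffset = 0;

    ResumeFromOffsetSketch(int messageCount) {
        for (int i = 0; i < messageCount; i++) {
            partition.add("This is message " + (i + 1));
        }
    }

    // One consumer "life": start at the stored offset, read up to maxMessages,
    // checkpoint the next offset after every message, then stop (the "crash").
    List<String> runConsumerOnce(int maxMessages) {
        List<String> seen = new ArrayList<>();
        long offset = storedOffset;
        while (offset < partition.size() && seen.size() < maxMessages) {
            seen.add(offset + ": " + partition.get((int) offset));
            offset++;               // where the next read should happen
            storedOffset = offset;  // checkpoint the next read position
        }
        return seen;
    }

    public static void main(String[] args) {
        ResumeFromOffsetSketch sketch = new ResumeFromOffsetSketch(25);
        for (int run = 1; run <= 3; run++) {
            System.out.println("run " + run + " -> " + sketch.runConsumerOnce(10));
        }
    }
}
```

Each call to runConsumerOnce behaves like one run of the consumer: it picks up at the checkpoint left by the previous run, so no message is seen twice and none is skipped.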

For this tutorial you will need

(1) Apache Kafka 0.8.1
(2) Apache Zookeeper
(3) JDK 7 or higher. An IDE of your choice is optional
(4) Apache Maven
(5) Source code for this sample from https://github.com/mdkhanga/my-blog-code if you want to look at working code

In this tutorial, we will
(1) start a Kafka broker
(2) create a topic with 1 partition
(3) Send messages to the topic
(4) Write a consumer using Simple API to fetch messages.
(5) Crash the consumer and restart it ( several times). Each time you will see that it reads the next message after the last one that was read.

Since we are focusing on reading messages from a particular offset in a partition, we will keep other things simple by limiting ourselves to 1 broker and 1 partition.

Step 1: Start the broker

bin/kafka-server-start.sh config/server1.properties

For the purposes of this tutorial, one broker is sufficient as we are reading from just one partition.

Step 2: Create the topic

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic atopic

Again for the purposes of this tutorial we just need 1 partition.

Step 3: Send messages to the topic

Run the producer we wrote in tutorial 1 to send say 1000 messages to this topic.

Step 4: Write a consumer using SimpleConsumer API

The complete code is in the file KafkaOnceAndOnlyOnceRead.java.

Create a file to store the next read offset.

static {
    try {
      readoffset = new RandomAccessFile("readoffset", "rw");
    } catch (Exception e) {
      System.out.println(e);
    }
}

Create a SimpleConsumer.

SimpleConsumer consumer = new SimpleConsumer("localhost", 9092, 100000, 64 * 1024, clientname);

If there is an offset stored in the file, we will read from that offset. Otherwise, we read from the beginning of the partition -- EarliestTime.

long offset_in_partition = 0;
try {
    offset_in_partition = readoffset.readLong();
} catch (EOFException ef) {
    // The file is empty, so nothing has been read yet: start from the earliest offset.
    offset_in_partition = getOffset(consumer, topic, partition, kafka.api.OffsetRequest.EarliestTime(), clientname);
}

The rest of the code is in a

while (true) {

}

loop. We will keep reading messages or sleep if there are none.

Within the loop, we create a request and fetch messages from the offset.

FetchRequest req = new FetchRequestBuilder()
          .clientId(clientname)
          .addFetch(topic, partition, offset_in_partition, 100000).build();
FetchResponse fetchResponse = consumer.fetch(req);

Read messages from the response.

for (MessageAndOffset messageAndOffset : fetchResponse.messageSet(topic, partition)) {
        long currentOffset = messageAndOffset.offset();
        if (currentOffset < offset_in_partition) {
          continue;
        }
        offset_in_partition = messageAndOffset.nextOffset();
        ByteBuffer payload = messageAndOffset.message().payload();

        byte[] bytes = new byte[payload.limit()];
        payload.get(bytes);
        System.out.println(String.valueOf(messageAndOffset.offset()) + ": " + new String(bytes, "UTF-8"));
        readoffset.seek(0);
        readoffset.writeLong(offset_in_partition);
        numRead++;
        messages++ ;

        if (messages == 10) {
          System.out.println("Pretend a crash happened") ;
          System.exit(0);
        }
}

For each message that we read, we check that its offset is not less than the one we want to read from. If it is, we ignore the message. This can happen because Kafka batches messages for efficiency, so a fetch may return messages that were already read. For each valid message, we print it and write the next read offset to the file. If the consumer crashes, it resumes from the last saved offset when it is restarted.
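The offset file is the piece that has to survive a crash, so it is worth writing it carefully. The sketch below is one possible variation, not the code from the sample repository: the class OffsetFileSketch and its helper inTempDir are hypothetical names, and it swaps RandomAccessFile for java.nio.file so that the new offset is written to a temporary file and then moved over the old one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: keeps the next read offset in a small file.
final class OffsetFileSketch {

    private final Path file;
    private final Path tmp;

    OffsetFileSketch(Path file) {
        this.file = file;
        this.tmp = file.resolveSibling(file.getFileName() + ".tmp");
    }

    // Returns the saved offset, or fallback if nothing usable has been saved yet.
    long load(long fallback) {
        try {
            if (!Files.exists(file)) {
                return fallback;
            }
            byte[] bytes = Files.readAllBytes(file);
            return bytes.length < Long.BYTES ? fallback : ByteBuffer.wrap(bytes).getLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the offset to a temp file, then moves it over the real file.
    // ATOMIC_MOVE can fail on file systems that do not support it.
    void save(long nextOffset) {
        try {
            byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(nextOffset).array();
            Files.write(tmp, bytes);
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience for trying the sketch out: stores the offset in a fresh temp directory.
    static OffsetFileSketch inTempDir() {
        try {
            return new OffsetFileSketch(Files.createTempDirectory("offsets").resolve("readoffset"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        OffsetFileSketch store = inTempDir();
        System.out.println("before first save: " + store.load(-1L));
        store.save(220L);
        System.out.println("after save: " + store.load(-1L));
    }
}
```

Note that a crash between processing a message and saving the offset would still replay that one message on restart. To close that gap, the result of processing and the offset need to be committed together, for example in the same database transaction.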

For demo purposes, the code exits after 10 messages. If you run this program several times, you will see that it starts reading exactly from where it last stopped. You can change that value and experiment.

Step 5: Run the consumer several times.

mvn exec:java -Dexec.mainClass="com.mj.KafkaOnceAndOnlyOnceRead"

210: 04092014 This is message 211
211: 04092014 This is message 212
212: 04092014 This is message 213
213: 04092014 This is message 214
214: 04092014 This is message 215
215: 04092014 This is message 216
216: 04092014 This is message 217
217: 04092014 This is message 218
218: 04092014 This is message 219
219: 04092014 This is message 220

Run it again.

mvn exec:java -Dexec.mainClass="com.mj.KafkaOnceAndOnlyOnceRead"

220: 04092014 This is message 221
221: 04092014 This is message 222
222: 04092014 This is message 223
223: 04092014 This is message 224
224: 04092014 This is message 225
225: 04092014 This is message 226
226: 04092014 This is message 227
227: 04092014 This is message 228
228: 04092014 This is message 229
229: 04092014 This is message 230

In summary, it is possible to implement once and only once delivery of messages in Kafka by storing the read offset.

Related Blogs:

Apache Kafka Introduction
Apache Kafka JAVA tutorial #1
Apache Kafka JAVA tutorial #2
Apache Kafka 0.8.2 New Producer API

Apache Kafka Java tutorial #2

2014-11-20T18:04:00.000-08:00

In the blog Kafka introduction, I provided an overview of the features of Apache Kafka, an internet scale messaging broker. In Kafka tutorial #1, I provided a simple Java programming example for sending and receiving messages using the high level consumer API. Kafka also provides a Simple consumer API that gives the programmer greater control over reading messages and partitions. Simple is a misnomer; this is a complicated API. SimpleConsumer connects directly to the leader of a partition and is able to fetch messages from an offset. Knowing the leader for a partition is a preliminary step for this. And if the leader goes down, you can recover and connect to the new leader.

In the tutorial, we will use the "Simple" API to find the lead broker for a topic partition.

To recap some Kafka concepts

Kafka runs as a cluster of brokers
Messages are sent to and received from topics
Topics are partitioned across brokers
For each partition there is 1 leader broker and zero or more follower replicas
Ordering of messages is maintained only within a partition

Read positions within a topic have to be managed at the partition level, and you need to know the leader for that partition.

For this tutorial you will need

(1) Apache Kafka 0.8.1
(2) Apache Zookeeper
(3) JDK 7 or higher. An IDE of your choice is optional
(4) Apache Maven
(5) Source code for this sample from https://github.com/mdkhanga/my-blog-code if you want to look at working code

In this tutorial, we will
(1) create a 3 node kafka cluster
(2) create a topic with 12 partitions
(3) Write code to determine the leader of the partition
(4) Run the code to determine the leaders of each partition.
(5) Kill one broker and run again to determine the new leaders

Note that the kafka-topics.sh --describe command lets you do the same. But we are doing it programmatically for the sake of learning and because it is useful in some use cases.

Step 1 : Create a cluster

Follow the instructions in tutorial 1 to create a 3 node cluster.

Step 2 : Create a topic with 12 partitions

/usr/local/kafka/bin$ kafka-topics.sh --create --zookeeper host1:2181 --replication-factor 2 --partitions 12 --topic mjtopic

Step 3 : Write code to determine the leader for each partition

We use the SimpleConsumer API.
PartitionLeader.java

import java.util.Collections;
import java.util.List;

import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.consumer.SimpleConsumer;

// Metadata can be requested from any broker that is online.
SimpleConsumer consumer = new SimpleConsumer("localhost", 9092,
        100000, 64 * 1024, "leaderLookup");
List<String> topics = Collections.singletonList("mjtopic");
TopicMetadataRequest req = new TopicMetadataRequest(topics);
kafka.javaapi.TopicMetadataResponse resp = consumer.send(req);
List<TopicMetadata> metaData = resp.topicsMetadata();
// One slot per partition; mjtopic was created with 12 partitions.
int[] leaders = new int[12];
for (TopicMetadata item : metaData) {
    for (PartitionMetadata part : item.partitionsMetadata()) {
        leaders[part.partitionId()] = part.leader().id();
    }
}
for (int j = 0; j < 12; j++) {
    System.out.println("Leader for partition " + j + " " + leaders[j]);
}

SimpleConsumer can connect to any broker that is online. We construct a TopicMetadataRequest with the topic we are interested in and send it to the broker with the consumer.send call. A TopicMetadataResponse is returned, which contains a TopicMetadata for each requested topic. Each TopicMetadata holds a list of PartitionMetadata (one for each partition). Each PartitionMetadata has the leader and replicas for that partition.

Step 4 : Run the code

Leader for partition 0 1
Leader for partition 1 2
Leader for partition 2 3
Leader for partition 3 1
Leader for partition 4 2
Leader for partition 5 3
Leader for partition 6 1
Leader for partition 7 2
Leader for partition 8 3
Leader for partition 9 1
Leader for partition 10 2
Leader for partition 11 3

Step 5 : Kill node 3 and run the code again

Leader for partition 0 1
Leader for partition 1 2
Leader for partition 2 1
Leader for partition 3 1
Leader for partition 4 2
Leader for partition 5 1
Leader for partition 6 1
Leader for partition 7 2
Leader for partition 8 1
Leader for partition 9 1
Leader for partition 10 2
Leader for partition 11 1

You can see that broker 1 has assumed leadership for broker 3's partitions.
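If you want to check the failover without comparing the two listings by eye, a small comparison helps. This is a sketch with a hypothetical class LeaderChangeSketch; the two arrays are copied from the Step 4 and Step 5 output above, indexed by partition id.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which partitions changed leader between two runs.
final class LeaderChangeSketch {

    // before[p] and after[p] are the leader broker ids for partition p.
    static List<Integer> changedPartitions(int[] before, int[] after) {
        if (before.length != after.length) {
            throw new IllegalArgumentException("partition counts differ");
        }
        List<Integer> changed = new ArrayList<>();
        for (int p = 0; p < before.length; p++) {
            if (before[p] != after[p]) {
                changed.add(p);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        int[] before = {1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3}; // Step 4 output
        int[] after  = {1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1}; // Step 5 output
        System.out.println(changedPartitions(before, after)); // prints [2, 5, 8, 11]
    }
}
```

The partitions that changed are exactly the four that broker 3 was leading before it was killed.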

In summary, one of the things you can use the SimpleConsumer API for is examining topic partition metadata. We will use this code in future tutorials to determine the leader of a partition.

Related blogs:

Apache Kafka Introduction
Apache Kafka JAVA tutorial #1
Apache Kafka JAVA tutorial #3
Apache Kafka 0.8.2 New Producer API

ServletContainerInitializer : Discovering classes in your Web Application

2014-10-10T18:52:00.000-07:00

In my blog on java.util.ServiceLoader, we discussed how it can be used to discover third party implementations of your interfaces. This can be useful if your application is a container that executes code written by developers. In this blog, we discuss dynamic discovery and registration for Servlets.

All Java Web developers are already familiar with the javax.servlet.ServletContextListener interface. If you want to do initialization when the application starts or clean up when it is destroyed, you implement the contextInitialized and contextDestroyed methods of this interface.

The Servlet 3.0 specification added a couple of interesting features that help with dynamicity and that are particularly useful to developers of libraries or containers.

(1) javax.servlet.ServletContainerInitializer is another interface that can notify your code of application start.

Library or container developers typically provide an implementation of this interface. The implementation should be annotated with the HandlesTypes annotation. When the application starts, the Servlet container calls the onStartup method of this interface, passing in as a parameter a set of all classes that implement, extend or are annotated with the type(s) declared in the HandlesTypes annotation.

(2) The specification also adds a number of methods to dynamically register Servlets, filters and listeners. You will recall that previously, if you needed to add a new Servlet to your application, you needed to modify web.xml.

Combining (1) and (2), it should be possible to dynamically discover and add Servlets to a web application. This is a powerful feature that allows you to make the web application modular and spread development across teams without build dependencies. Note that this technique can be used to discover any interface, class or annotation. I am killing 2 birds with one stone by using this to discover servlets.

In the rest of the blog, we will build a simple web app, that illustrates the above concepts. For this tutorial you will need

(1) JDK 7.x or higher
(2) Apache Tomcat or any Servlet container
(3) Apache Maven

In this example we will

(1) We will implement a ServletContainerInitializer called WebContainerInitializer and package it in a jar, containerlib.jar.
(2) To make the example interesting, we will create a new annotation @MyServlet, which will act like the @WebServlet annotation in the servlet specification. WebContainerInitializer will handle types that are annotated with @MyServlet.
(3) We will write a simple web app that has a Servlet annotated with @MyServlet and has containerlib.jar in the lib directory. No entries in web.xml.
(4) When the app starts, the servlet is discovered and registered. You can go to a browser and invoke it.

Before we proceed any further, you may download the code from my github repository, so you can look at the code as I explain. The code for this example is in the dynamicservlets directory.

Step 0: Get the code

git clone https://github.com/mdkhanga/my-blog-code.git

dynamicservlets has 2 subdirectories: containerlib and dynamichello.

The containerlib project has the MyServlet annotation and the WebContainerInitializer which implements ServletContainerInitializer.

dynamichello is a web application that uses the containerlib jar.

Step 1: The MyServlet annotation
MyServlet.java
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface MyServlet {
    String path() ;
}

The annotation applies to classes and is used as
@MyServlet(path = "/someuri")

Step 2: A Demo servlet
HelloWorldServlet.java
@MyServlet(path = "/greeting")
public class HelloWorldServlet extends HttpServlet {

    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        PrintWriter p = response.getWriter() ;
        p.write(" hello world ");
        p.close();
    }

}

This is a simple hello servlet that we discover and register. Nothing needs to be added to web.xml.

Step 3: WebContainerInitializer
WebContainerInitializer.java
This is the implementation of ServletContainerInitializer.

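import java.util.Set;
import javax.servlet.Servlet;
import javax.servlet.ServletContainerInitializer;
import javax.servlet.ServletContext;
import javax.servlet.ServletException;
import javax.servlet.ServletRegistration;
import javax.servlet.annotation.HandlesTypes;

@HandlesTypes({MyServlet.class})
public class WebContainerInitializer implements ServletContainerInitializer {

    @Override
    public void onStartup(Set<Class<?>> classes, ServletContext ctx)
            throws ServletException {

        // The container passes in every class annotated with @MyServlet.
        for (Class<?> c : classes) {
            MyServlet ann = c.getAnnotation(MyServlet.class);
            // Register the class as a servlet and map it to the annotated path.
            ServletRegistration.Dynamic d = ctx.addServlet("hello", c.asSubclass(Servlet.class));
            d.addMapping(ann.path());
        }

    }

}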

The implementation needs to be in a separate jar that is included in the lib directory of the application war. WebContainerInitializer is annotated with @HandlesTypes, which takes MyServlet.class as a parameter. When the application starts, the servlet container finds all classes that are annotated with MyServlet and passes them to the onStartup method. In the onStartup method, we go through each class found by the container, get the value of the path attribute from the annotation and register the servlet.
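The core of onStartup is plain reflection, so it can be tried outside a servlet container. The sketch below is an illustration only: AnnotationScanSketch, its nested copy of the MyServlet annotation and the demo classes are hypothetical, and the set of classes that the container would normally supply is built by hand.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-alone model of what onStartup does with the classes it receives.
final class AnnotationScanSketch {

    // Local stand-in for the MyServlet annotation from Step 1.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface MyServlet {
        String path();
    }

    @MyServlet(path = "/greeting")
    static class Hello {}

    @MyServlet(path = "/status")
    static class Status {}

    static class NotAnnotated {}

    // Maps each annotated path to a registration name derived from the class.
    static Map<String, String> routes(Set<Class<?>> classes) {
        Map<String, String> byPath = new TreeMap<>();
        for (Class<?> c : classes) {
            MyServlet ann = c.getAnnotation(MyServlet.class);
            if (ann == null) {
                continue; // not one of ours, skip it
            }
            byPath.put(ann.path(), c.getSimpleName());
        }
        return byPath;
    }

    // Stand-in for the set of classes the servlet container would pass in.
    static Set<Class<?>> demoClasses() {
        Set<Class<?>> classes = new LinkedHashSet<>();
        classes.add(Hello.class);
        classes.add(Status.class);
        classes.add(NotAnnotated.class);
        return classes;
    }

    public static void main(String[] args) {
        System.out.println(routes(demoClasses()));
    }
}
```

Deriving the registration name from the class, rather than using the fixed name "hello", is a design choice worth considering if you expect more than one annotated servlet, since each registration needs its own name.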

To make this work, we need one more thing: a file in the META-INF/services directory whose name is javax.servlet.ServletContainerInitializer and which contains 1 line, com.mj.WebContainerInitializer. If you are wondering why this is required, please see my blog on java.util.ServiceLoader.

Step 4: Build and run the app

To build,
cd containerlib
mvn clean install
cd ../dynamichello
mvn clean install

This builds dynamichello/target/dynamichello.war that can be deployed to tomcat or any servlet container.
When the application starts, you will see the following messages in the log

Initializing container app .....
Found ...com.mj.servlets.HelloWorldServlet
path = /greeting

Point your browser to http://localhost:8080/hello/greeting.

The servlet will respond with a hello message.

In summary, this technique can be used to dynamically discover classes during application startup. This is typically used to implement libraries or containers such as a JAX-RS implementation. It allows implementations to be provided by different developers. There is no hard wiring.

Discovering third party API/SPI implementations using java.util.ServiceLoader

2014-09-20T17:00:00.000-07:00

One interface, many implementations is a very well known object oriented programming paradigm. If you write the implementations yourself then you know what those implementations are and you can write a factory class or method that creates and returns the right implementation. You might also make this config driven and inject the correct implementation based on configuration.

What if third parties are providing implementations of your interface? If you know those implementations in advance, then you could do the same as in the case above. But one downside is that a code change is required to add, use or remove implementations. You could come up with a configuration file where implementations are listed, and your code uses the list to determine what is available. The downside is that the configuration has to be updated by you, and this is a non-standard approach in that every API developer could come up with their own format for the configuration. Fortunately Java has a solution.

In JDK6, they introduced java.util.ServiceLoader, a class for discovering and loading classes.

It has a static load method that can be used to create a ServiceLoader that will find and load all of a particular Type.

public static<T> ServiceLoader<T> load(Class<T> service)

You would use it as
ServiceLoader<SortProvider> sl = ServiceLoader.load(SortProvider.class) ;
This creates a ServiceLoader that can find and load every SortProvider in the classpath.

The iterator method returns an Iterator over the implementations found; they are loaded lazily.
Iterator<SortProvider> its = sl.iterator() ;

You can iterate over what is found and store it in a Map or somewhere else in memory.
while (its.hasNext()) {
            SortProvider sp = its.next() ;
            log("Found provider " + sp.getProviderName()) ;
            sMap.put(sp.getProviderName(),sp) ;
}

How does ServiceLoader know where to look ?

Implementors package their implementation in a jar
jar should have a META-INF/services directory
services directory should have a file whose name is the fully qualified name of the Type
file has a list of fully qualified name of implementations of type
jar is installed to the classpath
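The lookup rules above can be tried out end to end in a single file. This is a minimal sketch under some assumptions: ServiceLoaderSketch, Greeter and EnglishGreeter are hypothetical names, and instead of building a jar it creates the META-INF/services layout in a temporary directory and adds that directory to a URLClassLoader.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical demo: a temp directory plays the role of the provider's jar.
final class ServiceLoaderSketch {

    // The service type that implementors provide.
    public interface Greeter {
        String greet();
    }

    // A provider: public, with a public no-argument constructor.
    public static class EnglishGreeter implements Greeter {
        public EnglishGreeter() {}

        @Override
        public String greet() {
            return "hello";
        }
    }

    static List<String> discover() {
        try {
            Path root = Files.createTempDirectory("spi-demo");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            // File name: fully qualified name of the service type.
            // File content: one implementation class name per line.
            Files.write(services.resolve(Greeter.class.getName()),
                    (EnglishGreeter.class.getName() + "\n").getBytes(StandardCharsets.UTF_8));

            URL[] urls = { root.toUri().toURL() };
            List<String> greetings = new ArrayList<>();
            try (URLClassLoader loader =
                     new URLClassLoader(urls, ServiceLoaderSketch.class.getClassLoader())) {
                // ServiceLoader reads the services file and instantiates each provider lazily.
                for (Greeter g : ServiceLoader.load(Greeter.class, loader)) {
                    greetings.add(g.greet());
                }
            }
            return greetings;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

In a real project the services file is a static resource packaged inside the provider's jar; it is generated at run time here only to keep the example in one file.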

I have a complete API/SPI example for a Sort interface below that you can download at https://github.com/mdkhanga/my-blog-code. This sample is in msort directory. You should download the code first, so that you can look at code while reading the text below. This example illustrates how ServiceLoader is used to discover implementations from third party service providers. Sort interface can be used for sorting data. Service providers can provide implementations of various Sort algorithms. In the example,

1. com.mj.msort.Sort is the main Sort API. It has 2 sort methods: one for arrays and one for collections. 2 implementations are provided - bubblesort and mergesort. But anybody can write additional implementations.

2. com.mj.msort.spi.SortProvider is the SPI. Third party implementors of Sort must also implement the SortProvider interface. The SPI provides another layer of encapsulation. We don't want to know the implementation details. We just want an instance of the implementation.

3. SPI providers need to implement Sort and SortProvider.

4. com.mj.msort.SortServices is a class that can discover and load SPI implementations and make them available to API users. It uses java.util.ServiceLoader to load SortProviders. Hence SortProvider also needs to be packaged as required by java.util.ServiceLoader for it to be discovered.

This is the class that brings everything together. It uses ServiceLoader to find all implementations of SortProviders and stores them in a Map. It has a getSort method that programmers can call to get a specific implementation or whatever is there.
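The shape of such a class can be sketched as follows. This is not the SortServices code from the repository: ProviderRegistrySketch and NamedProvider are hypothetical names, and the providers are passed in as an Iterable, which in real code would be the result of ServiceLoader.load.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical registry: stores discovered providers by name.
final class ProviderRegistrySketch {

    // Stand-in for the SPI; only the name lookup is modelled here.
    interface NamedProvider {
        String getProviderName();
    }

    private final Map<String, NamedProvider> byName = new LinkedHashMap<>();

    // In real code, discovered would come from ServiceLoader.load(...).
    ProviderRegistrySketch(Iterable<NamedProvider> discovered) {
        for (NamedProvider p : discovered) {
            byName.put(p.getProviderName(), p);
        }
    }

    // Returns the named provider, or the first one found if the name is null or unknown.
    NamedProvider get(String name) {
        NamedProvider p = (name == null) ? null : byName.get(name);
        if (p != null) {
            return p;
        }
        if (byName.isEmpty()) {
            throw new IllegalStateException("no providers found");
        }
        return byName.values().iterator().next();
    }

    // Builds a registry with two hand-made providers for trying the sketch out.
    static ProviderRegistrySketch demo() {
        NamedProvider bubble = () -> "bubblesort";
        NamedProvider merge = () -> "mergesort";
        return new ProviderRegistrySketch(Arrays.asList(bubble, merge));
    }

    public static void main(String[] args) {
        ProviderRegistrySketch registry = demo();
        System.out.println(registry.get("mergesort").getProviderName());
        System.out.println(registry.get(null).getProviderName());
    }
}
```

Falling back to the first provider found is one way to read "whatever is there"; throwing an exception for an unknown name would be an equally reasonable choice.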

5. Sample Usage

Sort s = SortServices.getSort(...
s.sort(...

In summary, ServiceLoader is a powerful mechanism to find and load classes of a type. It can be used to build highly extensible and dynamic services. As an additional exercise, you can create your own implementation of SortProvider in your own jar and SortServices will find it as long as it is on the classpath.