<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Franck Pachot on Medium]]></title>
        <description><![CDATA[Stories by Franck Pachot on Medium]]></description>
        <link>https://medium.com/@franckpachot?source=rss-e38b355b06c8------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*--EG__VvjjbGnRxPFMWuMQ.png</url>
            <title>Stories by Franck Pachot on Medium</title>
            <link>https://medium.com/@franckpachot?source=rss-e38b355b06c8------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 26 Apr 2026 16:43:51 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@franckpachot/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[JSONB vs. BSON: Tracing PostgreSQL and MongoDB Wire Protocols]]></title>
            <link>https://franckpachot.medium.com/jsonb-vs-bson-tracing-postgresql-and-mongodb-wire-protocols-07593e8b58e6?source=rss-e38b355b06c8------2</link>
            <guid isPermaLink="false">https://medium.com/p/07593e8b58e6</guid>
            <category><![CDATA[mongodb]]></category>
            <category><![CDATA[postgresql]]></category>
            <category><![CDATA[json]]></category>
            <dc:creator><![CDATA[Franck Pachot]]></dc:creator>
            <pubDate>Sun, 21 Dec 2025 20:27:51 GMT</pubDate>
            <atom:updated>2025-12-22T14:51:31.406Z</atom:updated>
<content:encoded><![CDATA[<p>There is an essential difference between MongoDB’s BSON and PostgreSQL’s JSONB. Both are binary JSON formats, but they serve different roles. JSONB is purely an internal storage format for JSON data in PostgreSQL. BSON, on the other hand, is MongoDB’s native data format: it is used by application drivers, over the network, in memory, and on disk.</p><h3>JSONB: PostgreSQL internal storage format</h3><p>JSONB is a storage format, as defined by the <a href="https://www.postgresql.org/docs/18/datatype-json.html">PostgreSQL documentation</a>:</p><blockquote><em>PostgreSQL offers two types for storing JSON data: json and jsonb</em></blockquote><p>PostgreSQL uses JSONB solely for internal storage, requiring the entire structure to be read to access a field, as observed in <a href="https://dev.to/mongodb/jsonb-detoasting-read-amplification-4ikj">JSONB DeTOASTing (read amplification)</a>.</p><h3>BSON: MongoDB storage and exchange format</h3><p>BSON is used for storage and also as an exchange format between the application and the database, as defined in the <a href="https://bsonspec.org/">BSON specification</a>:</p><blockquote><em>BSON can be compared to binary interchange formats, like Protocol Buffers. BSON is more “schema-less” than Protocol Buffers</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/0*53_i_XQewdYhLs1Z.png" /></figure><p>On the application side, the MongoDB driver converts application objects to BSON, which supports more data types than JSON or JSONB, including datetime and binary. This BSON is sent and received over the network and stored and manipulated on the server as-is, with no extra parsing. Both the driver and the database can efficiently access fields via the binary structure because BSON includes metadata such as field length prefixes and explicit type information, even for large or nested documents.</p><h3>PostgreSQL protocol is JSON (text), not JSONB</h3><p>To illustrate this, I’ve written a small Python program that inserts a document into a PostgreSQL table with a JSONB column, and queries that table to retrieve the document:</p><pre>from sqlalchemy import Column, Integer, create_engine<br>from sqlalchemy.dialects.postgresql import JSONB<br>from sqlalchemy.orm import declarative_base, sessionmaker<br><br>Base = declarative_base()<br><br>class Item(Base):<br>    __tablename__ = &#39;items&#39;<br>    id = Column(Integer, primary_key=True)<br>    data = Column(JSONB)  # our JSONB column<br><br># Connect to Postgres<br>engine = create_engine(&#39;postgresql+psycopg2://&#39;, echo=True)<br>Session = sessionmaker(bind=engine)<br>session = Session()<br><br># Create table<br>Base.metadata.create_all(engine)<br><br># Insert an object into JSONB column<br>obj = {&quot;name&quot;: &quot;widget&quot;, &quot;price&quot;: 9.99, &quot;tags&quot;: [&quot;new&quot;, &quot;sale&quot;]}<br>session.add(Item(data=obj))<br>session.commit()<br><br># Read back the table<br>for row in session.query(Item).all():<br>    print(row.id, row.data)</pre><p>The program uses SQLAlchemy to send and retrieve Python objects to and from PostgreSQL via the Psycopg2 driver.
I’ve stored it in demo.py.</p><p>When I run the program with python demo.py, it logs all SQL statements before displaying the final result:</p><pre>2025-12-21 12:50:22,484 INFO sqlalchemy.engine.Engine select pg_catalog.version()<br>2025-12-21 12:50:22,485 INFO sqlalchemy.engine.Engine [raw sql] {}<br>2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine select current_schema()<br>2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine [raw sql] {}<br>2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine show standard_conforming_strings<br>2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine [raw sql] {}<br>2025-12-21 12:50:22,487 INFO sqlalchemy.engine.Engine BEGIN (implicit)<br>2025-12-21 12:50:22,488 INFO sqlalchemy.engine.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s<br>2025-12-21 12:50:22,488 INFO sqlalchemy.engine.Engine [generated in 0.00015s] {&#39;name&#39;: &#39;items&#39;}<br>2025-12-21 12:50:22,489 INFO sqlalchemy.engine.Engine<br>CREATE TABLE items (<br>        id SERIAL NOT NULL,<br>        data JSONB,<br>        PRIMARY KEY (id)<br>)<br>2025-12-21 12:50:22,489 INFO sqlalchemy.engine.Engine [no key 0.00011s] {}<br>2025-12-21 12:50:22,491 INFO sqlalchemy.engine.Engine COMMIT<br>2025-12-21 12:50:22,493 INFO sqlalchemy.engine.Engine BEGIN (implicit)<br>2025-12-21 12:50:22,494 INFO sqlalchemy.engine.Engine INSERT INTO items (data) VALUES (%(data)s) RETURNING items.id<br>2025-12-21 12:50:22,494 INFO sqlalchemy.engine.Engine [generated in 0.00018s] {&#39;data&#39;: &#39;{&quot;name&quot;: &quot;widget&quot;, &quot;price&quot;: 9.99, &quot;tags&quot;: [&quot;new&quot;, &quot;sale&quot;]}&#39;}<br>2025-12-21 12:50:22,495 INFO sqlalchemy.engine.Engine COMMIT<br>2025-12-21 12:50:22,497 INFO sqlalchemy.engine.Engine BEGIN (implicit)<br>2025-12-21 12:50:22,498 INFO sqlalchemy.engine.Engine SELECT items.id AS items_id, items.data AS items_data<br>FROM items<br>2025-12-21 12:50:22,498 INFO sqlalchemy.engine.Engine [generated in 0.00013s] {}<br><br>1 {&#39;name&#39;: &#39;widget&#39;, &#39;tags&#39;: [&#39;new&#39;, &#39;sale&#39;], &#39;price&#39;: 9.99}</pre><p>To see what is sent and received over the network by the PostgreSQL protocol, I run the program with strace, showing the sendto and recvfrom system calls with their arguments: strace -e trace=sendto,recvfrom -yy -s 1000 python demo.py.</p><p>Like most SQL database drivers, the protocol is basic: send SQL commands as text, and fetch a tabular result set.
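</p><p>Each protocol message is a one-byte type, a four-byte big-endian length that counts itself, and the payload. As a rough illustration, here is a minimal Python sketch (my own, not part of demo.py) that builds a Simple Query message the way the driver’s transport does:</p><pre>import struct<br><br>def simple_query_message(sql):<br>    # &#39;Q&#39; = Simple Query; the length covers itself (4 bytes) plus the payload<br>    payload = sql.encode() + b&#39;\x00&#39;  # SQL text, NUL-terminated<br>    return b&#39;Q&#39; + struct.pack(&#39;!I&#39;, len(payload) + 4) + payload<br><br>print(simple_query_message(&#39;BEGIN&#39;))  # b&#39;Q\x00\x00\x00\nBEGIN\x00&#39;</pre><p>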
In the PostgreSQL protocol’s messages, the first letter is the message type ( Q for a Simple Query message, followed by the message length and the query text, X to terminate the session, C for the command completion status, T and D for the result set).</p><p>Below is the output. The lines starting with a timestamp are the logs from SQLAlchemy; those starting with sendto and recvfrom are the network system calls, with the message sent to the database and the result received from it.</p><p>Here is the trace when inserting one document:</p><pre>2025-12-21 16:52:20,278 INFO sqlalchemy.engine.Engine BEGIN (implicit)<br>2025-12-21 16:52:20,279 INFO sqlalchemy.engine.Engine INSERT INTO items (data) VALUES (%(data)s) RETURNING items.id<br>2025-12-21 16:52:20,279 INFO sqlalchemy.engine.Engine [generated in 0.00029s] {&#39;data&#39;: &#39;{&quot;name&quot;: &quot;widget&quot;, &quot;price&quot;: 9.99, &quot;tags&quot;: [&quot;new&quot;, &quot;sale&quot;]}&#39;}<br><br>sendto(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;Q\0\0\0\nBEGIN\0&quot;, 11, MSG_NOSIGNAL, NULL, 0) = 11<br><br>recvfrom(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;C\0\0\0\nBEGIN\0Z\0\0\0\5T&quot;, 16384, 0, NULL, NULL) = 17<br><br>sendto(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;Q\0\0\0vINSERT INTO items (data) VALUES (&#39;{\&quot;name\&quot;: \&quot;widget\&quot;, \&quot;price\&quot;: 9.99, \&quot;tags\&quot;: [\&quot;new\&quot;, \&quot;sale\&quot;]}&#39;) RETURNING items.id\0&quot;, 119, MSG_NOSIGNAL, NULL, 0) = 119<br><br>recvfrom(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;T\0\0\0\33\0\1id\0\0\0@\310\0\1\0\0\0\27\0\4\377\377\377\377\0\0D\0\0\0\v\0\1\0\0\0\0011C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T&quot;, 16384, 0, NULL, NULL) = 62<br>2025-12-21 16:52:20,281 INFO sqlalchemy.engine.Engine COMMIT<br><br>sendto(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;Q\0\0\0\vCOMMIT\0&quot;, 12, MSG_NOSIGNAL, NULL, 0) = 12<br><br>recvfrom(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;C\0\0\0\vCOMMIT\0Z\0\0\0\5I&quot;, 16384, 0, NULL, NULL) = 18</pre><p>It started a transaction ( Q\0\0\0\nBEGIN), received command completion ( C\0\0\0\nBEGIN), then sent the full text of the INSERT command, including the JSON payload ( Q\0\0\0vINSERT INTO items (data) VALUES (&#39;{\&quot;name\&quot;: \&quot;widget\&quot;, \&quot;price\&quot;: 9.99, \&quot;tags\&quot;: [\&quot;new\&quot;, \&quot;sale\&quot;]}).
It subsequently received command completion ( INSERT 0 1) and the returned ID ( T\0\0\0\33\0\1id, D\0\0\0\v\0\1\0\0\0\001).</p><p>Here is the trace when I query and fetch the document:</p><pre>2025-12-21 16:52:20,283 INFO sqlalchemy.engine.Engine BEGIN (implicit)<br>2025-12-21 16:52:20,285 INFO sqlalchemy.engine.Engine SELECT items.id AS items_id, items.data AS items_data<br>FROM items<br>2025-12-21 16:52:20,285 INFO sqlalchemy.engine.Engine [generated in 0.00024s] {}<br><br>sendto(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;Q\0\0\0\nBEGIN\0&quot;, 11, MSG_NOSIGNAL, NULL, 0) = 11<br><br>recvfrom(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;C\0\0\0\nBEGIN\0Z\0\0\0\5T&quot;, 16384, 0, NULL, NULL) = 17<br><br>sendto(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;Q\0\0\0FSELECT items.id AS items_id, items.data AS items_data \nFROM items\0&quot;, 71, MSG_NOSIGNAL, NULL, 0) = 71<br><br>recvfrom(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;T\0\0\0&gt;\0\2items_id\0\0\0@\310\0\1\0\0\0\27\0\4\377\377\377\377\0\0items_data\0\0\0@\310\0\2\0\0\16\332\377\377\377\377\377\377\0\0D\0\0\0I\0\2\0\0\0\0011\0\0\0:{\&quot;name\&quot;: \&quot;widget\&quot;, \&quot;tags\&quot;: [\&quot;new\&quot;, \&quot;sale\&quot;], \&quot;price\&quot;: 9.99}C\0\0\0\rSELECT 1\0Z\0\0\0\5T&quot;, 16384, 0, NULL, NULL) = 157</pre><p>It started another transaction, sent the SELECT statement as text and received the result as JSON text ( D\0\0\0I\0\2\0\0\0\0011\0\0\0:{\&quot;name\&quot;: \&quot;widget\&quot;, \&quot;tags\&quot;: [\&quot;new\&quot;, \&quot;sale\&quot;], \&quot;price\&quot;: 9.99}).</p><p>Finally, the transaction ends, and the session is disconnected:</p><pre>sendto(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;Q\0\0\0\rROLLBACK\0&quot;, 14, MSG_NOSIGNAL, NULL, 0) = 14<br><br>recvfrom(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;C\0\0\0\rROLLBACK\0Z\0\0\0\5I&quot;, 16384, 0, NULL, NULL) = 20<br><br>sendto(3&lt;TCPv6:[[::1]:41858-&gt;[::1]:5432]&gt;, &quot;X\0\0\0\4&quot;, 5, MSG_NOSIGNAL, NULL, 0) = 5</pre><p>If you want to dig into the code, the server-side parsing is in <a href="https://github.com/postgres/postgres/blob/REL_18_0/src/backend/utils/adt/jsonb.c#L124C1-L124C11">jsonb_send</a> and <a href="https://github.com/postgres/postgres/blob/REL_18_0/src/backend/utils/adt/jsonb.c#L89C1-L89C11">jsonb_recv</a> (&quot;The type is sent as text in binary mode&quot;), and while it tests the version before converting to text, there&#39;s only one version.
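</p><p>On the client side, the JSONB value arrives as JSON text that the driver parses back into Python objects. A quick check with plain Psycopg2 (a hypothetical snippet, assuming the demo table and environment above):</p><pre>import psycopg2<br><br>conn = psycopg2.connect(&#39;&#39;)  # connection settings from the environment, as in demo.py<br>cur = conn.cursor()<br>cur.execute(&#39;SELECT data FROM items&#39;)<br>value = cur.fetchone()[0]<br># The server sent JSON text; the driver has already applied json.loads to it<br>print(type(value))  # &lt;class &#39;dict&#39;&gt;</pre><p>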
The client-side for Psycopg2 shows that <a href="https://github.com/psycopg/psycopg2/blob/2.9.11/lib/_json.py#L141">register_default_jsonb</a> is the same as <a href="https://github.com/psycopg/psycopg2/blob/2.9.11/lib/_json.py#L128">register_default_json</a>.</p><h3>Comparing with MongoDB (BSON from end-to-end)</h3><p>To compare with MongoDB, I created the following demo-mongodb.py:</p><pre>from pymongo import MongoClient<br>client = MongoClient(&quot;mongodb://127.0.0.1:27017&quot;)<br>db = client.my_database<br>insert_result = db.items.insert_one({&quot;name&quot;: &quot;widget&quot;, &quot;price&quot;: 9.99, &quot;tags&quot;: [&quot;new&quot;, &quot;sale&quot;]})<br>print(&quot;Inserted document ID:&quot;, insert_result.inserted_id)<br>for doc in db.items.find():<br>    print(doc[&quot;_id&quot;], doc)</pre><p>I used the same strace command, but displaying all characters as hexadecimal to be able to decode them with bsondump:</p><pre>$ strace -e trace=sendto,recvfrom -xx -yy -s 1000 python demo-mongodb.py 2&gt;&amp;1</pre><p>Here is the network request for the insert:</p><pre>sendto(5&lt;TCP:[127.0.0.1:44570-&gt;127.0.0.1:27017]&gt;, &quot;\xd6\x00\x00\x00\x51\xdc\xb0\x74\x00\x00\x00\x00\xdd\x07\x00\x00\x00\x00\x00\x00\x00\x5a\x00\x00\x00\x02\x69\x6e\x73\x65\x72\x74\x00\x06\x00\x00\x00\x69\x74\x65\x6d\x73\x00\x08\x6f\x72\x64\x65\x72\x65\x64\x00\x01\x03\x6c\x73\x69\x64\x00\x1e\x00\x00\x00\x05\x69\x64\x00\x10\x00\x00\x00\x04\x31\xb8\x9a\x81\xfd\x35\x42\x1a\x88\x44\xa8\x69\xe8\xba\x6f\x30\x00\x02\x24\x64\x62\x00\x0c\x00\x00\x00\x6d\x79\x5f\x64\x61\x74\x61\x62\x61\x73\x65\x00\x00\x01\x66\x00\x00\x00\x64\x6f\x63\x75\x6d\x65\x6e\x74\x73\x00\x58\x00\x00\x00\x07\x5f\x69\x64\x00\x69\x48\x3f\x7f\x87\x46\xd5\x2e\xe2\x0b\xbc\x0b\x02\x6e\x61\x6d\x65\x00\x07\x00\x00\x00\x77\x69\x64\x67\x65\x74\x00\x01\x70\x72\x69\x63\x65\x00\x7b\x14\xae\x47\xe1\xfa\x23\x40\x04\x74\x61\x67\x73\x00\x1c\x00\x00\x00\x02\x30\x00\x04\x00\x00\x00\x6e\x65\x77\x00\x02\x31\x00\x05\x00\x00\x00\x73\x61\x6c\x65\x00\x00\x00&quot;, 214, 0, NULL, 0) = 214<br><br>recvfrom(5&lt;TCP:[127.0.0.1:44570-&gt;127.0.0.1:27017]&gt;, &quot;\x2d\x00\x00\x00\x06\x00\x00\x00\x51\xdc\xb0\x74\xdd\x07\x00\x00&quot;, 16, 0, NULL, NULL) = 16<br><br>recvfrom(5&lt;TCP:[127.0.0.1:44570-&gt;127.0.0.1:27017]&gt;, &quot;\x00\x00\x00\x00\x00\x18\x00\x00\x00\x10\x6e\x00\x01\x00\x00\x00\x01\x6f\x6b\x00\x00\x00\x00\x00\x00\x00\xf0\x3f\x00&quot;, 29, 0, NULL, NULL) = 29<br><br>Inserted document ID: 69483f7f8746d52ee20bbc0b</pre><p>Here is the fetch query that receives the document:</p><pre>sendto(5&lt;TCP:[127.0.0.1:44570-&gt;127.0.0.1:27017]&gt;, &quot;\x70\x00\x00\x00\xff\x5c\x49\x19\x00\x00\x00\x00\xdd\x07\x00\x00\x00\x00\x00\x00\x00\x5b\x00\x00\x00\x02\x66\x69\x6e\x64\x00\x06\x00\x00\x00\x69\x74\x65\x6d\x73\x00\x03\x66\x69\x6c\x74\x65\x72\x00\x05\x00\x00\x00\x00\x03\x6c\x73\x69\x64\x00\x1e\x00\x00\x00\x05\x69\x64\x00\x10\x00\x00\x00\x04\x31\xb8\x9a\x81\xfd\x35\x42\x1a\x88\x44\xa8\x69\xe8\xba\x6f\x30\x00\x02\x24\x64\x62\x00\x0c\x00\x00\x00\x6d\x79\x5f\x64\x61\x74\x61\x62\x61\x73\x65\x00\x00&quot;, 112, 0, NULL, 0) = 112<br><br>recvfrom(5&lt;TCP:[127.0.0.1:44570-&gt;127.0.0.1:27017]&gt;, &quot;\xc5\x00\x00\x00\x07\x00\x00\x00\xff\x5c\x49\x19\xdd\x07\x00\x00&quot;, 16, 0, NULL, NULL) = 16<br><br>recvfrom(5&lt;TCP:[127.0.0.1:44570-&gt;127.0.0.1:27017]&gt;,
&quot;\x00\x00\x00\x00\x00\xb0\x00\x00\x00\x03\x63\x75\x72\x73\x6f\x72\x00\x97\x00\x00\x00\x04\x66\x69\x72\x73\x74\x42\x61\x74\x63\x68\x00\x60\x00\x00\x00\x03\x30\x00\x58\x00\x00\x00\x07\x5f\x69\x64\x00\x69\x48\x3f\x7f\x87\x46\xd5\x2e\xe2\x0b\xbc\x0b\x02\x6e\x61\x6d\x65\x00\x07\x00\x00\x00\x77\x69\x64\x67\x65\x74\x00\x01\x70\x72\x69\x63\x65\x00\x7b\x14\xae\x47\xe1\xfa\x23\x40\x04\x74\x61\x67\x73\x00\x1c\x00\x00\x00\x02\x30\x00\x04\x00\x00\x00\x6e\x65\x77\x00\x02\x31\x00\x05\x00\x00\x00\x73\x61\x6c\x65\x00\x00\x00\x00\x12\x69\x64\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x6e\x73\x00\x12\x00\x00\x00\x6d\x79\x5f\x64\x61\x74\x61\x62\x61\x73\x65\x2e\x69\x74\x65\x6d\x73\x00\x00\x01\x6f\x6b\x00\x00\x00\x00\x00\x00\x00\xf0\x3f\x00&quot;, 181, 0, NULL, NULL) = 181<br><br>69483f7f8746d52ee20bbc0b {&#39;_id&#39;: ObjectId(&#39;69483f7f8746d52ee20bbc0b&#39;), &#39;name&#39;: &#39;widget&#39;, &#39;price&#39;: 9.99, &#39;tags&#39;: [&#39;new&#39;, &#39;sale&#39;]}</pre><p>I use bsondump, available in the MongoDB container, to decode the messages.</p><p>The insert starts with the standard 16-byte message header: total message size in little-endian = 0xd6 = 214 bytes \xd6\x00\x00\x00, requestID \x51\xdc\xb0\x74, responseTo (0 for client-&gt;server) \x00\x00\x00\x00, and opCode = 2013 (OP_MSG) \xdd\x07\x00\x00. It is followed by the OP_MSG flagBits \x00\x00\x00\x00 and a section kind byte \x00, and then starts BSON:</p><pre>root@9574ecd2d248:/# bsondump &lt;(echo -ne &#39;\x5a\x00\x00\x00\x02\x69\x6e\x73\x65\x72\x74\x00\x06\x00\x00\x00\x69\x74\x65\x6d\x73\x00\x08\x6f\x72\x64\x65\x72\x65\x64\x00\x01\x03\x6c\x73\x69\x64\x00\x1e\x00\x00\x00\x05\x69\x64\x00\x10\x00\x00\x00\x04\x31\xb8\x9a\x81\xfd\x35\x42\x1a\x88\x44\xa8\x69\xe8\xba\x6f\x30\x00\x02\x24\x64\x62\x00\x0c\x00\x00\x00\x6d\x79\x5f\x64\x61\x74\x61\x62\x61\x73\x65\x00\x00\x01\x66\x00\x00\x00\x64\x6f\x63\x75\x6d\x65\x6e\x74\x73\x00\x58\x00\x00\x00&#39;)<br>{  <br>  &quot;insert&quot;: &quot;items&quot;,  <br>  &quot;ordered&quot;: true,  <br>  &quot;lsid&quot;: {  <br>    &quot;id&quot;: {  <br>      &quot;$binary&quot;: {  <br>        &quot;base64&quot;: &quot;Mbiagf01QhqIRKhp6LpvMA==&quot;,  <br>        &quot;subType&quot;: &quot;04&quot;  <br>      }  <br>    }  <br>  },  <br>  &quot;$db&quot;: &quot;my_database&quot;  <br>}  <br>2025-12-21T19:09:39.214+0000    1 objects found<br>2025-12-21T19:09:39.214+0000    unexpected EOF<br>root@9574ecd2d248:/#</pre><p>This shows unexpected EOF because the &quot;documents&quot; array is actually sent in the next section of the OP_MSG, not embedded here.
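</p><p>To make the section layout concrete, here is a minimal Python sketch (my own illustration, not from the MongoDB driver) that walks the sections of an OP_MSG body, meaning everything after the flagBits:</p><pre>import struct<br><br>def op_msg_sections(body):<br>    # body = the OP_MSG payload, after the 4-byte flagBits<br>    i = 0<br>    while i &lt; len(body):<br>        kind = body[i]<br>        i += 1<br>        if kind == 0:  # kind 0: a single BSON document (the command)<br>            size, = struct.unpack_from(&#39;&lt;i&#39;, body, i)<br>            yield &#39;command&#39;, body[i:i + size]<br>            i += size<br>        elif kind == 1:  # kind 1: int32 size, cstring identifier, BSON documents<br>            size, = struct.unpack_from(&#39;&lt;i&#39;, body, i)<br>            end = i + size<br>            name_end = body.index(b&#39;\x00&#39;, i + 4)<br>            yield body[i + 4:name_end].decode(), body[name_end + 1:end]<br>            i = end</pre><p>Applied to the insert message above, this yields the 90-byte command document and a &quot;documents&quot; sequence containing the 88-byte document to insert.</p><p>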
The second BSON section starts with its own length field (\x58\x00\x00\x00 = 88 bytes) and contains the actual document to be inserted:</p><pre>root@9574ecd2d248:/# bsondump &lt;(echo -ne &#39;\x58\x00\x00\x00\x07\x5f\x69\x64\x00\x69\x48\x3f\x7f\x87\x46\xd5\x2e\xe2\x0b\xbc\x0b\x02\x6e\x61\x6d\x65\x00\x07\x00\x00\x00\x77\x69\x64\x67\x65\x74\x00\x01\x70\x72\x69\x63\x65\x00\x7b\x14\xae\x47\xe1\xfa\x23\x40\x04\x74\x61\x67\x73\x00\x1c\x00\x00\x00\x02\x30\x00\x04\x00\x00\x00\x6e\x65\x77\x00\x02\x31\x00\x05\x00\x00\x00\x73\x61\x6c\x65\x00\x00\x00&#39;)<br>{  <br>  &quot;_id&quot;: {  <br>    &quot;$oid&quot;: &quot;69483f7f8746d52ee20bbc0b&quot;  <br>  },  <br>  &quot;name&quot;: &quot;widget&quot;,  <br>  &quot;price&quot;: {  <br>    &quot;$numberDouble&quot;: &quot;9.99&quot;  <br>  },  <br>  &quot;tags&quot;: [  <br>    &quot;new&quot;,  <br>    &quot;sale&quot;  <br>  ]  <br>}  <br>2025-12-21T19:09:49.278+0000    1 objects found<br>root@9574ecd2d248:/#</pre><p>BSON holds the document in a flexible binary format, including all field names, datatypes, and values, which is what is exchanged between the application driver and the database server.</p><p>I can do the same with the query result:</p><pre>root@9574ecd2d248:/# bsondump &lt;(echo -ne &#39;\xb0\x00\x00\x00\x03\x63\x75\x72\x73\x6f\x72\x00\x97\x00\x00\x00\x04\x66\x69\x72\x73\x74\x42\x61\x74\x63\x68\x00\x60\x00\x00\x00\x03\x30\x00\x58\x00\x00\x00\x07\x5f\x69\x64\x00\x69\x48\x3f\x7f\x87\x46\xd5\x2e\xe2\x0b\xbc\x0b\x02\x6e\x61\x6d\x65\x00\x07\x00\x00\x00\x77\x69\x64\x67\x65\x74\x00\x01\x70\x72\x69\x63\x65\x00\x7b\x14\xae\x47\xe1\xfa\x23\x40\x04\x74\x61\x67\x73\x00\x1c\x00\x00\x00\x02\x30\x00\x04\x00\x00\x00\x6e\x65\x77\x00\x02\x31\x00\x05\x00\x00\x00\x73\x61\x6c\x65\x00\x00\x00\x00\x12\x69\x64\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x6e\x73\x00\x12\x00\x00\x00\x6d\x79\x5f\x64\x61\x74\x61\x62\x61\x73\x65\x2e\x69\x74\x65\x6d\x73\x00\x00\x01\x6f\x6b\x00\x00\x00\x00\x00\x00\x00\xf0\x3f\x00&#39;)<br>{  <br>  &quot;cursor&quot;: {  <br>    &quot;firstBatch&quot;: [  <br>      {  <br>        &quot;_id&quot;: {  <br>          &quot;$oid&quot;: &quot;69483f7f8746d52ee20bbc0b&quot;  <br>        },  <br>        &quot;name&quot;: &quot;widget&quot;,  <br>        &quot;price&quot;: {  <br>          &quot;$numberDouble&quot;: &quot;9.99&quot;  <br>        },  <br>        &quot;tags&quot;: [  <br>          &quot;new&quot;,  <br>          &quot;sale&quot;  <br>        ]  <br>      }  <br>    ],  <br>    &quot;id&quot;: {  <br>      &quot;$numberLong&quot;: &quot;0&quot;  <br>    },  <br>    &quot;ns&quot;: &quot;my_database.items&quot;  <br>  },  <br>  &quot;ok&quot;: {  <br>    &quot;$numberDouble&quot;: &quot;1.0&quot;  <br>  }  <br>}  <br>2025-12-21T18:44:08.110+0000    1 objects found</pre><p>Again, the document is received in BSON format, which stores binary values with the correct datatypes.</p><h3>Conclusion: no JSONB in the application</h3><p>With PostgreSQL, the JSON text is visible in the network messages, even when it comes from a JSONB column:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/0*8wbSGTOEj719BTJT.png" /></figure><p>In PostgreSQL, storing as TEXT, JSON, or JSONB affects storage and indexing, but the wire protocol still sends and receives plain JSON text. Every query requires the client and server to parse and serialize it, adding CPU overhead and risking a loss of type fidelity for large or complex documents.</p><p>MongoDB uses BSON from end to end — in storage and on the wire.
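</p><p>One practical consequence is type fidelity. A quick check with PyMongo (a hypothetical snippet, reusing the collection from demo-mongodb.py):</p><pre>import datetime<br>from pymongo import MongoClient<br><br>col = MongoClient(&#39;mongodb://127.0.0.1:27017&#39;).my_database.items<br>col.insert_one({&#39;at&#39;: datetime.datetime.now(datetime.timezone.utc), &#39;blob&#39;: b&#39;\x00\x01&#39;})<br>doc = col.find_one(sort=[(&#39;_id&#39;, -1)])<br># BSON kept the native types; no text round-trip happened<br>print(type(doc[&#39;at&#39;]), type(doc[&#39;blob&#39;]))  # datetime.datetime and bytes</pre><p>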
Drivers map BSON types directly to application objects, preserving types like dates and binary fields without extra parsing. This reduces CPU cost on both sides, improves scalability, and makes large‑document handling more efficient.</p><p><em>Originally published at </em><a href="https://dev.to/franckpachot/jsonb-vs-bson-tracing-postgresql-and-mongodb-wire-protocols-1m51"><em>https://dev.to</em></a><em> on December 21, 2025.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=07593e8b58e6" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data Locality vs. Independence: Which Should Your Database Prioritize?]]></title>
            <link>https://franckpachot.medium.com/data-locality-vs-independence-which-should-your-database-prioritize-333175ceff60?source=rss-e38b355b06c8------2</link>
            <guid isPermaLink="false">https://medium.com/p/333175ceff60</guid>
            <category><![CDATA[mongodb]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[nosql]]></category>
            <category><![CDATA[relational]]></category>
            <dc:creator><![CDATA[Franck Pachot]]></dc:creator>
            <pubDate>Sun, 23 Nov 2025 14:56:24 GMT</pubDate>
            <atom:updated>2025-11-23T15:01:30.340Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*QU9fzJrMv52u8Hvi.jpg" /></figure><p>When your application needs several pieces of data at once, the fastest approach is to read them from a single location in a single call. In a document database, developers can decide what is stored together, both logically and physically.</p><p>Fragmentation has never been beneficial for performance. In databases, the proximity of data — on disk, in memory or across the network — is crucial for scalability. Keeping related data together allows a single operation to fetch everything needed, reducing disk I/O, memory cache misses and network round-trips, thereby making performance more predictable.</p><p>The principle “store together what is accessed together” is central to modeling in document databases. Yet its purpose is to allow developers to control the physical storage layout, even with flexible data structures.</p><p>In contrast, SQL databases were designed for data independence — allowing users to interact with a logical model separate from the physical implementation managed by a database administrator.</p><p>Today, the trend is not to separate development and operations, allowing faster development cycles without the complexity of coordinating multiple teams or shared schemas. Avoiding the separation into logical and physical models further simplifies the process.</p><p>Understanding the core principle of data locality is essential today, especially as many databases emulate document databases or offer similar syntax on top of SQL. To qualify as a document database, it’s not enough to accept JSON documents with a developer-friendly syntax.</p><p>The database must also preserve those documents intact in storage so that accessing them has predictable performance. Whether they expose a relational or document API, it is essential to know if your objective is data independence or data locality.</p><h3>Why Locality Still Matters in Modern Infrastructure</h3><p>Modern hardware still suffers from penalties for scattered access. Hard disk drives (HDDs) highlighted the importance of locality because seek and rotational latency are more impactful than transfer speed, especially for online transactional processing (OTLP) workloads.</p><p>While solid state drives (SSDs) remove mechanical delays, random writes remain expensive, and cloud storage adds latency due to network access to storage. Even in-memory access isn’t immune: on multisocket servers, non-uniform memory access (NUMA) causes varying access times depending on where the data was loaded into memory by the first access, relative to the CPU core that processes it later.</p><p>Scale-out architecture further increases complexity. Vertical scaling — keeping all reads and writes on a single instance with shared disks and memory — has capacity limits. Large instances are expensive, and scaling them down or up often requires downtime, which is risky for always-on applications.</p><p>For example, you might need your maximum instance size for Black Friday but would have to scale up progressively in the lead-up, incurring downtime as usage increases. Without horizontal scalability, you end up provisioning well above your average load “just in case,” as in on-premises infrastructures sized years in advance for occasional peaks — something that can be prohibitively costly in the cloud.</p><p>Horizontal scaling allows adding or removing nodes without downtime. 
However, more nodes increase the likelihood of distributed queries, in which operations that once hit local memory must now traverse the network, introducing unpredictable latency. Data locality becomes critical with scale-out databases.</p><p>To create scalable database applications, developers should understand storage organization and prioritize single-document operations for performance-critical transactions. CRUD functions (insert, find, update, delete) targeting a single document in MongoDB are always handled by a single node, even in a sharded deployment. If that document isn’t in memory, it can be read from disk in a single I/O operation. Modifications are applied to the in-memory copy and written back as a single document during asynchronous checkpoints, avoiding on-disk fragmentation.</p><p>In MongoDB, the WiredTiger storage engine stores each document’s fields together in contiguous storage blocks, allowing developers to follow the principle “store together what is accessed together.” By avoiding cross-document joins, such as the $lookup operation in queries, this design helps prevent scatter-gather operations internally, which promotes consistent performance. This supports predictable performance regardless of document size, update frequency or cluster scale.</p><h3>The Relational Promise: Physical Data Independence</h3><p>For developers working with NoSQL databases, what I described above seems obvious: There is a single data model — the domain model — defined in the application, and the database stores exactly that model.</p><p>The MongoDB data modeling workshop defines a database schema as the physical model that describes how the data is organized in the database. In relational databases, the logical model is typically independent of the physical storage model, regardless of the data type used, because they serve different purposes.</p><p>SQL developers work with a relational model that is mapped to their object model via object relational mapping (ORM) tooling or hand-coded SQL joins. The models and schemas are normalized for generality, not necessarily optimized for specific application access patterns.</p><p>The goal of the relational model was to serve online interactive use by non-programmers and casual users by providing an abstraction that hides physical concerns. This includes avoiding data anomalies through normalization and enabling declarative query access without procedural code. Physical optimizations, like indexes, are considered implementation details. You will not find CREATE INDEX in the SQL standard.</p><p>In practice, a SQL query planner chooses access paths based on statistics. When writing JOIN clauses, the order of tables in the FROM clause should not matter. The SQL query planner reorders based on cost estimates. The database guarantees logical consistency, at least in theory, even with concurrent users and internal replication. The SQL approach is database-centric: rules, constraints and transactional guarantees are defined in the relational database, independent of specific use cases or table sizes.</p><p>Today, most relational databases sit behind applications. End users rarely interact with them directly, except in analytical or data science contexts. Applications can enforce data integrity and handle code anomalies, and developers understand data structures and algorithms.
Nonetheless, relational database experts still advise keeping constraints, stored procedures, transactions, and joins within the database.</p><p>The physical storage remains abstracted — indexes, clustering, and partitions are administrator-level, not application-level, concepts, as if the application developers were like the non-programmer casual users described in the early papers about relational databases.</p><h3>How Codd’s Rules Apply to SQL/JSON Documents</h3><p>Because data locality matters, some relational databases have mechanisms to enforce it internally. For example, Oracle has long supported “clustered tables” for co-locating related rows from multiple tables, and more recently offers a choice for JSON storage as either binary JSON (OSON, Oracle’s native binary JSON) or decomposed relational rows (JSON-relational duality views). However, those physical attributes are declared and deployed in the database using a specific data definition language (DDL) and are not exposed to the application developers. This reflects Codd’s “independence” rules:</p><ul><li>Rule 8: Physical data independence</li><li>Rule 9: Logical data independence</li><li>Rule 10: Integrity independence</li><li>Rule 11: Distribution independence</li></ul><p>Rules 8 and 11 relate directly to data locality: The user is not supposed to care whether data is physically together or distributed. The database is open to users who do not know the physical data model, access paths, and algorithms. Developers do not know what is replicated, sharded or distributed across multiple data centers.</p><h3>Where the SQL Abstraction Begins to Weaken</h3><p>In practice, no relational database perfectly achieves these rules. Performance tuning often requires looking at execution plans and physical data layouts. Serializable isolation is rarely used due to scalability limitations of two-phase locking, leading developers to fall back to weaker isolation levels or to explicit locking (SELECT … FOR UPDATE). Physical co-location mechanisms — hash clusters, attribute clustering — exist, but are difficult to size and maintain optimally without precise knowledge of access patterns. They often require regular data reorganization as updates can fragment it again.</p><p>The normalized model is inherently application-agnostic, so optimizing for locality often means breaking data independence (denormalizing, maintaining materialized views, accepting stale reads from replicas, disabling referential integrity). With sharding, constraints like foreign keys and unique indexes generally cannot be enforced across shards. Transactions must be carefully ordered to avoid long waits and deadlocks. Even with an abstraction layer, applications must be aware of the physical distribution for some operations.</p><h3>The NoSQL Approach: Modeling for Access Patterns</h3><p>As data volumes and latency expectations grow, a different paradigm has emerged: give developers complete control rather than an abstraction with some exceptions.</p><p>NoSQL databases adopt an application-first approach: The physical model matches the access patterns, and the responsibility for maintaining integrity and transactional scope is pushed to the application. Initially, many NoSQL stores delegated all responsibility, including consistency, to developers, acting as “dumb” key-value or document stores. Most lacked ACID (atomicity, consistency, isolation and durability) transactions or query planners.
If secondary indexes were present, they needed to be queried explicitly.</p><p>This NoSQL approach was the opposite of the relational database world: Instead of one shared, normalized database, there were many purpose-built data stores per application. It reduces performance and scalability surprises, but at the price of more complexity.</p><h3>MongoDB’s Middle Road for Flexible Schemas</h3><p>MongoDB evolved by adding essential relational database capabilities — indexes, query planning, multidocument ACID transactions — while keeping the application-first document model. When you insert a document, it is stored as a single unit.</p><p>In WiredTiger, the MongoDB storage engine, BSON documents (binary JSON with additional datatypes and indexing capabilities) are stored in B-trees with variable-sized leaf pages, allowing large documents to remain contiguous, which differs from the fixed-size page structures used by many relational databases. This avoids splitting a business object across multiple blocks and ensures consistent latency for operations that appear as a single operation to developers.</p><p>Updates in MongoDB are applied in memory. Committing them as in-place changes on disk would fragment pages. Instead, WiredTiger uses reconciliation to write a complete new version at checkpoints — similar to copy-on-write filesystems, but with a flexible block size. This may cause write amplification, but preserves document locality. With appropriately sized instances, these writes occur in the background and do not affect in-memory write latency.</p><p>Locality defined at the application’s document schema flows all the way down to the storage layer, something that relational database engines typically cannot match with their goal of physical data independence.</p><h3>How Data Locality Improves Application Performance</h3><p>Designing for locality simplifies development and operations in several ways:</p><ul><li>Transactions: A business change affecting a single aggregate (in the domain-driven design sense) becomes a single atomic read-modify-write on one document — no multiple roundtrips like BEGIN, SELECT … FOR UPDATE, multiple updates and COMMIT.</li><li>Queries and indexing: Related data in one document avoids SQL joins and ORM lazy/eager mapping. A single compound index can cover filters and projections across fields that would otherwise be in separate tables, ensuring predictable plans without join-order uncertainty.</li><li>Development: The same domain model in the application is used directly as the database schema. Developers can reason about access patterns without mapping to a separate model, making latency and plan stability predictable.</li><li>Scalability: Most operations targeting a single aggregate, with shard keys chosen accordingly, can be routed to one node, avoiding scatter-gather fan-out for critical use cases.</li></ul><p>MongoDB’s optimistic concurrency control avoids locks, though it requires retry logic on write conflict errors. For single-document calls, retries are handled transparently by the database, which has a complete view of the transaction intent, making it simpler and faster.</p><h3>Embedding vs. Referencing in Document Data Modeling</h3><p>Locality doesn’t mean “embed everything.” It means: Embed what you consistently access together. Bounded one-to-many relationships (such as an order and its line items) are candidates for embedding. Rarely updated references and dimensions can also be duplicated and embedded.
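</p><p>For example, here is a minimal PyMongo sketch (an illustration with hypothetical names, not from the MongoDB documentation) of an order aggregate that embeds its line items:</p><pre>from pymongo import MongoClient<br><br>db = MongoClient(&#39;mongodb://localhost:27017&#39;).shop<br># The whole aggregate is one document: the order and its bounded list of line items<br>db.orders.insert_one({<br>    &#39;_id&#39;: 1001,<br>    &#39;customer&#39;: &#39;C42&#39;,<br>    &#39;status&#39;: &#39;shipped&#39;,<br>    &#39;items&#39;: [<br>        {&#39;sku&#39;: &#39;A1&#39;, &#39;qty&#39;: 2, &#39;price&#39;: 9.99},<br>        {&#39;sku&#39;: &#39;B7&#39;, &#39;qty&#39;: 1, &#39;price&#39;: 24.50},<br>    ],<br>})<br># A compound, multikey index on embedded fields keeps access selective without joins<br>db.orders.create_index([(&#39;items.sku&#39;, 1), (&#39;status&#39;, 1)])</pre><p>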
High-cardinality or unbounded-growth relationships, or independently updated entities, are better represented as separate documents and can be co-located via shard keys.</p><p>MongoDB’s compound and multikey indexes support embedded fields, maintaining predictable, selective access without joins. Embedding within the same document is the only way to guarantee co-location at the block level. Multiple documents in a single collection are not stored close together, except for small documents inserted at the same time, as they might share the same block. In sharding, the shard key ensures co-location on the same node but not within the same block.</p><p>In MongoDB, locality is an explicit design choice in domain-driven design:</p><ul><li>Identify aggregates that change and are read together.</li><li>Store them in one document when appropriate.</li><li>Use indexes aligned with access paths.</li><li>Choose shard keys so related operations route to one node.</li></ul><h3>What MongoDB Emulations Miss About Locality</h3><p>Given the popularity of the document model, some cloud services offer MongoDB-like APIs on top of SQL databases. These systems may expose a MongoDB-like API while retaining a relational storage model, which typically does not maintain the same level of physical locality.</p><p>Relational databases store rows in fixed-size blocks (often 8 KB). Large documents must be split across multiple blocks. Here are some examples in popular SQL databases:</p><ul><li>PostgreSQL JSONB: Stores JSON in heap tables and large documents in many chunks, using TOAST, the oversized attribute storage technique. The document is compressed and split into chunks stored in another table, accessed via an index. Reading a large document is like a nested loop join between the row and its TOAST table.</li><li>Oracle JSON-Relational Duality Views: Map JSON documents to relational tables, preserving data independence rather than physical locality. Elements accessed together may be scattered across blocks, requiring internal joins, multiple I/Os and possibly network calls in distributed setups.</li></ul><p>In both scenarios, the documents are divided into either binary chunks or normalized tables. Although the API resembles MongoDB, it remains a SQL database that lacks data locality. Instead, it provides an abstraction that keeps the developer unaware of internal processes until they inspect the execution plan and understand the database internals.</p><h3>Conclusion</h3><p>“Store together what is accessed together” reflects realities across sharding, I/O patterns, transactions, and memory cache efficiency. Relational database engines abstract away physical layout, which works well for centralized, normalized databases serving multiple applications in a single monolithic server. At a larger scale, especially in elastic cloud environments, horizontal sharding is essential — and often incompatible with pure data independence. Developers must account for locality.</p><p>In SQL databases, this means denormalizing, duplicating reference data, and avoiding cross-shard constraints. The document model, when the database truly enforces locality down to storage, offers an alternative to this abstraction and its exceptions.</p><p>In MongoDB, locality can be explicitly defined at the application level while still providing indexing, query planning and transactional features.
When assessing “MongoDB-compatible” systems on relational engines, it is helpful to determine whether the engine stores aggregates contiguously on disk and routes them to a single node by design. If not, the performance characteristics may differ from those of a document database that maintains physical locality.</p><p>Both approaches are valid. In database-first deployment, developers depend on in-database declarations to ensure performance, working alongside the database administrator and using tools like execution plans for troubleshooting. In contrast, application-first deployment shifts more responsibility to developers, who must validate both the application’s functionality and its performance.</p><p><em>Originally published at </em><a href="https://thenewstack.io/why-store-together-access-together-matters-for-your-database/"><em>https://thenewstack.io/why-store-together-access-together-matters-for-your-database/</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=333175ceff60" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[MongoDB High Availability: Replica Set in a Docker Lab]]></title>
            <link>https://franckpachot.medium.com/mongodb-high-availability-replica-set-in-a-docker-lab-14b75534847a?source=rss-e38b355b06c8------2</link>
            <guid isPermaLink="false">https://medium.com/p/14b75534847a</guid>
            <category><![CDATA[docker]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[replication]]></category>
            <category><![CDATA[mongodb]]></category>
            <dc:creator><![CDATA[Franck Pachot]]></dc:creator>
            <pubDate>Sat, 02 Aug 2025 18:48:03 GMT</pubDate>
            <atom:updated>2025-08-02T18:50:38.812Z</atom:updated>
<content:encoded><![CDATA[<p><em>Originally published on: </em><a href="https://dev.to/franckpachot/mongodb-high-availability-replicaset-in-a-docker-lab-4jlc"><em>https://dev.to/franckpachot/mongodb-high-availability-replicaset-in-a-docker-lab-4jlc</em></a></p><p>MongoDB guarantees consistent and durable write operations through write-ahead logging, which protects data from instance crashes by flushing the journal to disk upon commit. It also protects against network partitions and storage failures with synchronous replication to a quorum of replicas. Replication and failover are built-in and do not require external tools or extensions. To set up a replica set, start three mongod instances as members of the same replica set using the --replSet option with the same name. To initiate the replica set, connect to one of the nodes and specify all members along with their priorities to become primary for the Raft election.</p><p>To experiment with replication, I run it in a lab using Docker Compose, where each node is a container. However, the network and disk latencies are too small compared to real deployments. I use the Linux utilities tc and strace to inject some artificial latencies and test the setup in terms of latency, consistency, and resilience.</p><p>For this post, I write to the primary and read from each node to explain the write concern and its consequences for latency. Take this as an introduction. The examples don’t show all the details, which also depend on read concerns, sharding, and resilience to failures.</p><h3>Replica Set</h3><p>I use the following Dockerfile to add some utilities to the MongoDB image:</p><pre>FROM mongodb/mongodb-community-server<br>USER root<br>RUN apt-get update &amp;&amp; apt-get install -y iproute2 strace</pre><p>I start 3 replicas with the following Docker Compose service:</p><pre>mongo:<br>    build: .<br>    volumes:<br>      - .:/scripts:ro<br>    # inject 100ms network latency and 50ms disk sync latency <br>    cap_add:<br>      - NET_ADMIN   # for tc<br>      - SYS_PTRACE  # for strace<br>    command: |<br>     bash -xc &#39;<br>     tc qdisc add dev eth0 root netem delay 100ms ;<br>     strace -e inject=fdatasync:delay_enter=50000 -f -Te trace=fdatasync -o /dev/null mongod --bind_ip_all --replSet rs0 --logpath /var/log/mongod<br>     &#39;<br>    deploy:<br>      replicas: 3</pre><p>The command injects a 100ms network latency with tc qdisc add dev eth0 root netem delay 100ms (it requires the NET_ADMIN capability).
The MongoDB server is started with strace (it requires the SYS_PTRACE capability), which injects a delay of 50000 microseconds (delay_enter=50000) on each call to fdatasync.</p><p>I declared a service to initiate the replica set:</p><pre>init-replica-set:<br>    build: .<br>    depends_on:<br>      mongo:<br>        condition: service_started<br>    entrypoint: |<br>      bash -xc &#39;<br>        sleep 3 ; <br>        mongosh --host mongo --eval &quot;<br>         rs.initiate( {_id: \&quot;rs0\&quot;, members: [<br>          {_id: 0, priority: 3, host: \&quot;${COMPOSE_PROJECT_NAME}-mongo-1:27017\&quot;},<br>          {_id: 1, priority: 2, host: \&quot;${COMPOSE_PROJECT_NAME}-mongo-2:27017\&quot;},<br>          {_id: 2, priority: 1, host: \&quot;${COMPOSE_PROJECT_NAME}-mongo-3:27017\&quot;}]<br>         });<br>        &quot;;<br>        sleep 1<br>      &#39;</pre><h3>Read after Write application</h3><p>I use a service to run the client application:</p><pre>client:<br>    build: .<br>    depends_on:<br>      init-replica-set:<br>        condition: service_completed_successfully<br>    volumes:<br>      - .:/scripts:ro<br>    entrypoint: |<br>      bash -xc &#39;<br>        mongosh --host mongo -f /scripts/read-and-write.js<br>      &#39;</pre><p>The read-and-write.js script connects to each node with a direct connection, labeled 1️⃣, 2️⃣, and 3️⃣, and also connects to the replica set, labeled 🔢, which writes to the primary and can read from secondary nodes:</p><pre>const connections = {    <br>  &quot;🔢&quot;: &#39;mongodb://rs-mongo-1:27017,rs-mongo-2:27017,rs-mongo-3:27017/test?replicaSet=rs0&amp;readPreference=secondaryPreferred&amp;retryWrites=true&amp;w=majority&amp;journal=true&#39;,    <br>  &quot;1️⃣&quot;: &#39;mongodb://rs-mongo-1:27017/test?directConnection=true&amp;connectTimeoutMS=900&amp;serverSelectionTimeoutMS=500&amp;socketTimeoutMS=300&#39;,    <br>  &quot;2️⃣&quot;: &#39;mongodb://rs-mongo-2:27017/test?directConnection=true&amp;connectTimeoutMS=900&amp;serverSelectionTimeoutMS=500&amp;socketTimeoutMS=300&#39;,    <br>  &quot;3️⃣&quot;: &#39;mongodb://rs-mongo-3:27017/test?directConnection=true&amp;connectTimeoutMS=900&amp;serverSelectionTimeoutMS=500&amp;socketTimeoutMS=300&#39;,    <br>};</pre><p>After defining the connection strings, the script attempts to establish separate connections to each MongoDB node in the replica set, as well as a connection using the replica set URI that can send reads to secondaries. It continuously retries connections until at least one node responds and a primary is detected. The script keeps references to all active connections.</p><p>Once the environment is ready, the script enters an infinite loop to perform and monitor read and write operations. On each loop iteration, it first determines the current primary node. It then writes a counter value, which is a simple incrementing integer, to the primary node by updating a document identified by the client’s hostname. After performing the write call, it reads the same document from all nodes — primary, secondaries, and the replica set URI — recording the value retrieved from each and the time it takes for the read to return.</p><p>For every read and write, the script logs details, including the value read or written, the node that handled the operation, the time it took, and whether the results match expectations. It uses checkmarks to indicate success and issues mismatch warnings if a value is stale.
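</p><p>The write concern used by the script comes from the connection string; a driver can equivalently set it per collection or per operation. Here is a minimal PyMongo sketch (my own illustration, not part of the repo):</p><pre>from pymongo import MongoClient, WriteConcern<br><br>uri = &#39;mongodb://rs-mongo-1:27017,rs-mongo-2:27017,rs-mongo-3:27017/?replicaSet=rs0&#39;<br>client = MongoClient(uri)<br>items = client.test.get_collection(&#39;items&#39;, write_concern=WriteConcern(w=&#39;majority&#39;, j=True))<br># Acknowledged only once a majority of members have journaled the write<br>items.update_one({&#39;_id&#39;: &#39;demo&#39;}, {&#39;$inc&#39;: {&#39;counter&#39;: 1}}, upsert=True)</pre><p>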
If an operation fails (such as when a node is temporarily unavailable), the script automatically attempts to reconnect to that node in the background for future operations.</p><p>I made all this available in the following repo:</p><p><a href="https://github.com/FranckPachot/lab-mongodb-replicaset/tree/blog-202507-mongodb-high-availability-replicaset-in-a-docker-lab">https://github.com/FranckPachot/lab-mongodb-replicaset/tree/blog-202507-mongodb-high-availability-replicaset-in-a-docker-lab</a></p><p>Just start it with:</p><pre>docker compose up --build</pre><h3>Write Concern majority — wait for network and disk</h3><p>The connection string specifies w=majority.</p><p>Once initialized, each line shows the value that is written to the replica set connection 🔢 and read from each connection 🔢, 1️⃣, 2️⃣, 3️⃣:</p><p>Screenshot:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/0*ESjzZE50EDwS87tR.png" /></figure><p>Here is a sample output:</p><pre>client-1            | 2025-07-08T20:19:01.044Z Write 19 to 🔢 ✅(  358ms) Read 19 from 🔢 ✅(  104ms) 19 from 1️⃣ ✅(  105ms) 19 from 2️⃣ ✅(  105ms) 19 from 3️⃣ ✅(  105ms) client e0edde683498<br>client-1            | 2025-07-08T20:19:02.111Z Write 20 to 🔢 ✅(  357ms) Read 20 from 🔢 ✅(  104ms) 20 from 1️⃣ ✅(  104ms) 20 from 2️⃣ ✅(  105ms) 20 from 3️⃣ ✅(  104ms) client e0edde683498<br>client-1            | 2025-07-08T20:19:03.179Z Write 21 to 🔢 ✅(  357ms) Read 21 from 🔢 ✅(  103ms) 21 from 1️⃣ ✅(  104ms) 21 from 2️⃣ ✅(  103ms) 21 from 3️⃣ ✅(  104ms) client e0edde683498<br>client-1            | 2025-07-08T20:19:04.244Z Write 22 to 🔢 ✅(  357ms) Read 22 from 🔢 ✅(  103ms) 22 from 1️⃣ ✅(  103ms) 22 from 2️⃣ ✅(  104ms) 22 from 3️⃣ ✅(  104ms) client e0edde683498<br>client-1            | 2025-07-08T20:19:05.310Z Write 23 to 🔢 ✅(  357ms) Read 23 from 🔢 ✅(  105ms) 23 from 1️⃣ ✅(  105ms) 23 from 2️⃣ ✅(  104ms) 23 from 3️⃣ ✅(  104ms) client e0edde683498<br>client-1            | 2025-07-08T20:19:06.377Z Write 24 to 🔢 ✅(  357ms) Read 24 from 🔢 ✅(  105ms) 24 from 1️⃣ ✅(  105ms) 24 from 2️⃣ ✅(  104ms) 24 from 3️⃣ ✅(  104ms) client e0edde683498<br>client-1            | 2025-07-08T20:19:07.443Z Write 25 to 🔢 ✅(  357ms) Read 25 from 🔢 ✅(  104ms) 25 from 1️⃣ ✅(  104ms) 25 from 2️⃣ ✅(  104ms) 25 from 3️⃣ ✅(  104ms) client e0edde683498<br>client-1            | 2025-07-08T20:19:08.508Z Write 26 to 🔢 ✅(  357ms) Read 26 from 🔢 ✅(  104ms) 26 from 1️⃣ ✅(  104ms) 26 from 2️⃣ ✅(  104ms) 26 from 3️⃣ ✅(  105ms) client e0edde683498</pre><p>The program verifies that the read gets the latest write (✅), but keep in mind that this is not guaranteed. The default write concern is ‘majority’, which serves as a durability guarantee. It ensures that a write operation is saved, in the journal, to persistent storage on a majority of replicas. However, it does not wait for the write to be applied to the database and to be visible to reads. The goal is to measure the latency involved in acknowledging durability.</p><p>With an artificial latency of 100ms on the network and 50ms on the disk, we observe a connection time of 100ms to a node for both read and write operations.<br>For writes, it adds 250ms for the majority write concern:</p><ul><li>100ms for a secondary to pull the write operation (oplog)</li><li>50ms to sync the journal to disk on the secondary</li><li>100ms for the secondary to update the sync state to the primary</li></ul><p>The total duration is 350ms.
It also includes syncing to disk on the primary, which occurs in parallel with the replication.</p><p>MongoDB replication differs from many databases in that it employs a mechanism similar to Raft to achieve consistency across multiple nodes. However, changes are pulled by the secondary nodes rather than pushed by the primary. The primary node waits for a commit state, indicated by a Hybrid Logical Clock timestamp, sent by the secondary.</p><h3>Write Concern: 0 — do not wait for durability</h3><p>Another difference when comparing with traditional databases is that the client driver is part of the consensus protocol. To demonstrate this, I changed w=majority to w=0, so that the driver does not wait for any acknowledgment of the write call, and restarted the client with five replicas of it:</p><pre>docker compose up --scale client=5</pre><p>The write is faster, not waiting on the network or disk, but the value that is read is stale:</p><pre>client-5            | 2025-07-08T20:48:50.823Z Write 113 to 🔢 🚫(    1ms) Read 112 from 🔢 🚫(  103ms) 113 from 1️⃣ ✅(  103ms) 112 from 2️⃣ 🚫(  103ms) 112 from 3️⃣ 🚫(  103ms) client e0e3c8b1bafd<br>client-3            | 2025-07-08T20:48:50.824Z Write 113 to 🔢 🚫(    1ms) Read 112 from 🔢 🚫(  104ms) 113 from 1️⃣ ✅(  104ms) 112 from 2️⃣ 🚫(  104ms) 112 from 3️⃣ 🚫(  104ms) client 787c2676d17e<br>client-2            | 2025-07-08T20:48:51.459Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  105ms) 114 from 1️⃣ ✅(  104ms) 113 from 2️⃣ 🚫(  105ms) 113 from 3️⃣ 🚫(  104ms) client 9fd577504268<br>client-1            | 2025-07-08T20:48:51.520Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  105ms) 114 from 1️⃣ ✅(  105ms) 113 from 2️⃣ 🚫(  104ms) 113 from 3️⃣ 🚫(  104ms) client e0edde683498<br>client-4            | 2025-07-08T20:48:51.522Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  103ms) 114 from 1️⃣ ✅(  103ms) 113 from 2️⃣ 🚫(  103ms) 113 from 3️⃣ 🚫(  103ms) client a6c1eaab69a7<br>client-5            | 2025-07-08T20:48:51.530Z Write 114 to 🔢 🚫(    0ms) Read 113 from 🔢 🚫(  103ms) 114 from 1️⃣ ✅(  103ms) 113 from 2️⃣ 🚫(  103ms) 113 from 3️⃣ 🚫(  103ms) client e0e3c8b1bafd<br>client-3            | 2025-07-08T20:48:51.532Z Write 114 to 🔢 🚫(    1ms) Read 113 from 🔢 🚫(  104ms) 114 from 1️⃣ ✅(  103ms) 113 from 2️⃣ 🚫(  103ms) 113 from 3️⃣ 🚫(  103ms) client 787c2676d17e<br>client-2            | 2025-07-08T20:48:52.168Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client 9fd577504268<br>client-4            | 2025-07-08T20:48:52.230Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client a6c1eaab69a7<br>client-1            | 2025-07-08T20:48:52.229Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  104ms) 115 from 1️⃣ ✅(  104ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client e0edde683498<br>client-5            | 2025-07-08T20:48:52.237Z Write 115 to 🔢 🚫(    2ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client e0e3c8b1bafd<br>client-3            | 2025-07-08T20:48:52.240Z Write 115 to 🔢 🚫(    1ms) Read 114 from 🔢 🚫(  103ms) 115 from 1️⃣ ✅(  103ms) 114 from 2️⃣ 🚫(  103ms) 114 from 3️⃣ 🚫(  103ms) client 787c2676d17e<br>client-2            | 2025-07-08T20:48:52.876Z Write 116 to 🔢 🚫(    1ms) Read 115 from 🔢 🚫(  103ms) 116 from 1️⃣ ✅(  104ms) 115 from 2️⃣ 🚫(  104ms) 115 from 3️⃣ 🚫(  103ms) client 9fd577504268<br>client-4            | 2025-07-08T20:48:52.936Z Write 116 to 🔢 🚫(    1ms) Read 115 from 🔢 
🚫(  103ms) 116 from 1️⃣ ✅(  104ms) 115 from 2️⃣ 🚫(  103ms) 115 from 3️⃣ 🚫(  103ms) client a6c1eaab69a7</pre><p>The write occurs immediately, succeeding as soon as it is buffered on the driver. While this doesn’t guarantee the durability of the acknowledged writes, it does avoid the costs associated with any network latency. In scenarios such as IoT, prioritizing throughput is crucial, even if it means accepting potential data loss during failures.</p><p>Because the write is acknowledged immediately, but has to be replicated and applied on other nodes, I read stale values (indicated by 🚫) except when the read took longer than the replication and apply, but there is no guarantee of that.</p><h3>Write Concern: 1 journal: false</h3><p>I adjusted the write concern to w=1, which means that the system will wait for acknowledgment from the primary node. By default, this acknowledgment ensures that the journal recording the write operation is saved to persistent storage. However, I disabled it by setting journal=false, allowing the write latency to be reduced to just the network time to the primary, which is approximately 100ms:</p><pre>client-2            | 2025-07-08T20:50:08.756Z Write 10 to 🔢 ✅(  104ms) Read 10 from 🔢 ✅(  105ms) 10 from 1️⃣ ✅(  105ms) 10 from 2️⃣ ✅(  104ms) 10 from 3️⃣ ✅(  104ms) client 9fd577504268<br>client-4            | 2025-07-08T20:50:08.949Z Write 10 to 🔢 ✅(  103ms) Read 10 from 🔢 ✅(  105ms) 10 from 1️⃣ ✅(  105ms) 10 from 2️⃣ ✅(  106ms) 10 from 3️⃣ ✅(  105ms) client a6c1eaab69a7<br>client-1            | 2025-07-08T20:50:08.952Z Write 10 to 🔢 ✅(  103ms) Read 10 from 🔢 ✅(  104ms) 10 from 1️⃣ ✅(  104ms) 10 from 2️⃣ ✅(  104ms) 10 from 3️⃣ ✅(  105ms) client e0edde683498<br>client-3            | 2025-07-08T20:50:08.966Z Write 10 to 🔢 ✅(  103ms) Read 10 from 🔢 ✅(  104ms) 10 from 1️⃣ ✅(  105ms) 10 from 2️⃣ ✅(  104ms) 10 from 3️⃣ ✅(  104ms) client 787c2676d17e<br>client-5            | 2025-07-08T20:50:08.970Z Write 10 to 🔢 ✅(  103ms) Read 10 from 🔢 ✅(  105ms) 10 from 1️⃣ ✅(  105ms) 10 from 2️⃣ ✅(  105ms) 10 from 3️⃣ ✅(  105ms) client e0e3c8b1bafd<br>client-2            | 2025-07-08T20:50:09.569Z Write 11 to 🔢 ✅(  103ms) Read 11 from 🔢 ✅(  104ms) 11 from 1️⃣ ✅(  104ms) 11 from 2️⃣ ✅(  104ms) 11 from 3️⃣ ✅(  104ms) client 9fd577504268<br>client-4            | 2025-07-08T20:50:09.762Z Write 11 to 🔢 ✅(  104ms) Read 10 from 🔢 🚫(  105ms) 11 from 1️⃣ ✅(  106ms) 11 from 2️⃣ ✅(  105ms) 11 from 3️⃣ ✅(  105ms) client a6c1eaab69a7<br>client-1            | 2025-07-08T20:50:09.765Z Write 11 to 🔢 ✅(  103ms) Read 11 from 🔢 ✅(  107ms) 10 from 1️⃣ 🚫(  104ms) 11 from 2️⃣ ✅(  105ms) 11 from 3️⃣ ✅(  106ms) client e0edde683498<br>client-3            | 2025-07-08T20:50:09.778Z Write 11 to 🔢 ✅(  105ms) Read 11 from 🔢 ✅(  104ms) 11 from 1️⃣ ✅(  105ms) 11 from 2️⃣ ✅(  105ms) 11 from 3️⃣ ✅(  104ms) client 787c2676d17e<br>client-5            | 2025-07-08T20:50:09.782Z Write 11 to 🔢 ✅(  103ms) Read 11 from 🔢 ✅(  105ms) 11 from 1️⃣ ✅(  104ms) 11 from 2️⃣ ✅(  105ms) 11 from 3️⃣ ✅(  105ms) client e0e3c8b1bafd<br>client-2            | 2025-07-08T20:50:10.381Z Write 12 to 🔢 ✅(  103ms) Read 11 from 🔢 🚫(  105ms) 11 from 1️⃣ 🚫(  105ms) 12 from 2️⃣ ✅(  105ms) 12 from 3️⃣ ✅(  105ms) client 9fd577504268<br>client-1            | 2025-07-08T20:50:10.578Z Write 12 to 🔢 ✅(  104ms) Read 12 from 🔢 ✅(  106ms) 12 from 1️⃣ ✅(  105ms) 12 from 2️⃣ ✅(  105ms) 12 from 3️⃣ ✅(  106ms) client e0edde683498<br>client-4            | 2025-07-08T20:50:10.579Z Write 12 to 🔢 ✅(  104ms) Read 12 from 🔢 ✅(  106ms) 12 from 1️⃣ ✅(  
106ms) 12 from 2️⃣ ✅(  105ms) 12 from 3️⃣ ✅(  105ms) client a6c1eaab69a7<br>client-5            | 2025-07-08T20:50:10.594Z Write 12 to 🔢 ✅(11751ms) Read 11 from 🔢 🚫(  106ms) 12 from 1️⃣ ✅(  106ms) 11 from 2️⃣ 🚫(  106ms) 11 from 3️⃣ 🚫(  105ms) client e0e3c8b1bafd<br>client-3            | 2025-07-08T20:50:10.592Z Write 12 to 🔢 ✅(11753ms) Read 11 from 🔢 🚫(  105ms) 12 from 1️⃣ ✅(  105ms) 11 from 2️⃣ 🚫(  105ms) 11 from 3️⃣ 🚫(  105ms) client 787c2676d17e</pre><p>It is important to understand the consequences of failure. The change is written to the filesystem buffers, but may not have been fully committed to disk since fdatasync() is called asynchronously every 100 milliseconds. This means that if the Linux instance crashes, up to 100 milliseconds of acknowledged transactions could be lost. However, if the MongoDB instance fails, there is no data loss, as the filesystem buffers remain intact.</p><h3>Write Concern: 1 journal: true</h3><p>Still with w=1, but the default journal=true, an fdatasync() is run before the acknowledgment of the write, to guarantee durability on that node. With my injected latency, it adds 50 milliseconds:</p><pre>client-1            | 2025-07-08T20:52:34.922Z Write 48 to 🔢 ✅(  155ms) Read 48 from 🔢 ✅(  105ms) 48 from 1️⃣ ✅(  105ms) 47 from 2️⃣ 🚫(  105ms) 48 from 3️⃣ ✅(  105ms) client e0edde683498<br>client-3            | 2025-07-08T20:52:35.223Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  104ms) 50 from 1️⃣ ✅(  105ms) 49 from 2️⃣ 🚫(  105ms) 50 from 3️⃣ ✅(  105ms) client 787c2676d17e<br>client-2            | 2025-07-08T20:52:35.276Z Write 49 to 🔢 ✅(  155ms) Read 49 from 🔢 ✅(  104ms) 49 from 1️⃣ ✅(  105ms) 48 from 2️⃣ 🚫(  105ms) 49 from 3️⃣ ✅(  105ms) client 9fd577504268<br>client-5            | 2025-07-08T20:52:35.377Z Write 49 to 🔢 ✅(  155ms) Read 49 from 🔢 ✅(  105ms) 49 from 1️⃣ ✅(  104ms) 48 from 2️⃣ 🚫(  105ms) 49 from 3️⃣ ✅(  104ms) client e0e3c8b1bafd<br>client-4            | 2025-07-08T20:52:35.430Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  104ms) 50 from 1️⃣ ✅(  105ms) 49 from 2️⃣ 🚫(  105ms) 50 from 3️⃣ ✅(  105ms) client a6c1eaab69a7<br>client-1            | 2025-07-08T20:52:35.785Z Write 49 to 🔢 ✅(  154ms) Read 49 from 🔢 ✅(  103ms) 49 from 1️⃣ ✅(  103ms) 48 from 2️⃣ 🚫(  103ms) 49 from 3️⃣ ✅(  103ms) client e0edde683498<br>client-3            | 2025-07-08T20:52:36.086Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  104ms) 51 from 1️⃣ ✅(  105ms) 50 from 2️⃣ 🚫(  104ms) 51 from 3️⃣ ✅(  104ms) client 787c2676d17e<br>client-2            | 2025-07-08T20:52:36.140Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  105ms) 50 from 1️⃣ ✅(  104ms) 49 from 2️⃣ 🚫(  104ms) 50 from 3️⃣ ✅(  105ms) client 9fd577504268<br>client-5            | 2025-07-08T20:52:36.241Z Write 50 to 🔢 ✅(  155ms) Read 50 from 🔢 ✅(  104ms) 50 from 1️⃣ ✅(  103ms) 49 from 2️⃣ 🚫(  103ms) 50 from 3️⃣ ✅(  104ms) client e0e3c8b1bafd<br>client-4            | 2025-07-08T20:52:36.294Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  102ms) 51 from 1️⃣ ✅(  103ms) 50 from 2️⃣ 🚫(  103ms) 51 from 3️⃣ ✅(  103ms) client a6c1eaab69a7<br>client-1            | 2025-07-08T20:52:36.645Z Write 50 to 🔢 ✅(  154ms) Read 50 from 🔢 ✅(  103ms) 50 from 1️⃣ ✅(  103ms) 49 from 2️⃣ 🚫(  103ms) 50 from 3️⃣ ✅(  103ms) client e0edde683498<br>client-3            | 2025-07-08T20:52:36.950Z Write 52 to 🔢 ✅(  154ms) Read 52 from 🔢 ✅(  104ms) 52 from 1️⃣ ✅(  103ms) 51 from 2️⃣ 🚫(  103ms) 52 from 3️⃣ ✅(  104ms) client 787c2676d17e<br>client-2            | 2025-07-08T20:52:37.003Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  105ms) 51 from 1️⃣ ✅(  105ms) 50 
from 2️⃣ 🚫(  105ms) 51 from 3️⃣ ✅(  104ms) client 9fd577504268<br>client-5            | 2025-07-08T20:52:37.103Z Write 51 to 🔢 ✅(  155ms) Read 51 from 🔢 ✅(  103ms) 51 from 1️⃣ ✅(  104ms) 50 from 2️⃣ 🚫(  104ms) 51 from 3️⃣ ✅(  104ms) client e0e3c8b1bafd<br>client-4            | 2025-07-08T20:52:37.155Z Write 52 to 🔢 ✅(  155ms) Read 52 from 🔢 ✅(  104ms) 52 from 1️⃣ ✅(  104ms) 51 from 2️⃣ 🚫(  104ms) 52 from 3️⃣ ✅(  103ms) client a6c1eaab69a7<br>client-1            | 2025-07-08T20:52:37.508Z Write 51 to 🔢 ✅(  154ms) Read 51 from 🔢 ✅(  104ms) 51 from 1️⃣ ✅(  104ms) 50 from 2️⃣ 🚫(  104ms) 51 from 3️⃣ ✅(  104ms) client e0edde683498</pre><p>In summary, MongoDB allows applications to balance performance (lower latency) and durability (resilience to failures) rather than relying on a one-size-fits-all configuration that waits even when it is not necessary according to business requirements. For any given setup, the choice must consider the business requirements as well as the infrastructure: the resilience of compute and storage services, local or remote storage, and the network latency between nodes. In a lab, injecting network and disk latency can help simulate scenarios that illustrate the consequences of reading from secondary nodes or recovering from a failure.</p><p>To fully understand how this works, I recommend reading the documentation on <a href="https://www.mongodb.com/docs/manual/reference/write-concern?utm_campaign=devrel&amp;utm_source=third-party-content&amp;utm_term=franck_pachot&amp;utm_medium=devto&amp;utm_content=mongodb-high-availability-replicaset-in-a-docker-lab">Write Concern</a> and checking your understanding by practicing in a lab. The defaults may vary per driver and version, and the consequences may not be visible without a high load or a failure. In current versions, MongoDB favors data protection, with the write concern defaulting to “majority” and journaling to true (writeConcernMajorityJournalDefault), but if you set w:1, journaling defaults to false.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=14b75534847a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unique Index on NULL Values in SQL & NoSQL databases — an example]]></title>
            <link>https://franckpachot.medium.com/unique-index-on-null-values-in-sql-nosql-databases-an-example-0f6f6bd5fc9e?source=rss-e38b355b06c8------2</link>
            <guid isPermaLink="false">https://medium.com/p/0f6f6bd5fc9e</guid>
            <category><![CDATA[yugabytedb]]></category>
            <category><![CDATA[postgresql]]></category>
            <category><![CDATA[oracle]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[mongodb]]></category>
            <dc:creator><![CDATA[Franck Pachot]]></dc:creator>
            <pubDate>Wed, 15 Jan 2025 08:06:02 GMT</pubDate>
            <atom:updated>2025-01-15T08:07:55.290Z</atom:updated>
            <content:encoded><![CDATA[<h3>Unique Index on NULL Values in SQL &amp; NoSQL databases — an example</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5f6lMvvpraegFZFnpmmrlw.png" /><figcaption><a href="https://dev.to/franckpachot/unique-index-on-null-values-in-sql-nosql-34ej">https://dev.to/franckpachot/unique-index-on-null-values-in-sql-nosql-34ej</a></figcaption></figure><p>You can create a <strong>unique index</strong> explicitly or implicitly with a unique constraint to ensure that a <strong>group of columns</strong> has <strong>no duplicates</strong>.<br>However, how do you handle columns that are NULL or have no value? Is the lack of a value a simple indicator that you don’t want to duplicate, or should it be treated as unknown and potentially equivalent to any value, thus not violating the unique constraint?</p><p>I will consider the following three scenarios:</p><ul><li>NoSQL absence of value in a document, like in <strong>MongoDB</strong></li><li>SQL standard for NULL, as seen in <strong>PostgreSQL</strong> or <strong>YugabyteDB</strong></li><li>Oracle Database implementation, which varies from the <strong>SQL standard</strong></li></ul><p>Let’s take a telco example. I record calls from a caller to a callee at a specific time. I create a unique index to protect my database and prevent duplicates, as phone network devices may send a call record twice. In SQL, I would declare such table and index:</p><pre>create table calls (<br>    id int generated always as identity primary key,<br>    time timestamp,<br>    callee varchar(11),<br>    caller varchar(11)<br>);<br><br>create unique index  calls_time_callee_caller_idx<br> on calls (time, callee, caller)<br>;</pre><p>Although it may not be the best choice of datatype, it is runnable on all SQL databases.</p><p>I’ve declared no NOT NULL columns. Usually, the call time, caller, and callee are known, but there may be No Caller ID. I’ll use this to expose the behavior with NULL values in a UNIQUE INDEX.</p><p>NULL’s three-valued logic in SQL can be confusing (read <a href="https://dev.to/rtukpe">@rtukpe</a>’s <a href="https://jirevwe.github.io/sql-nulls-are-weird.html">SQL NULLs are Weird!</a>). To make it easier, I will start with a NoSQL database that prioritizes developer experience.</p><h3>NoSQL behavior: MongoDB</h3><p>A call record is a document sent from a telco network device. I use only the key attributes for this demo, but it usually includes many nested data structures. 
A document database may be a good fit for ingesting it.</p><p>In NoSQL, I don’t need to create the collection beforehand, but I create the unique index to protect my database integrity:</p><pre>db.calls.createIndex( <br> { &quot;time&quot;: 1 , &quot;callee&quot;: 1, &quot;caller&quot; : 1 }, { unique: true } <br>) ;</pre><p>I insert a first record:</p><pre>mongodb&gt; db.calls.insertOne( <br> { time: new Date(&quot;2025-02-10T08:00:00Z&quot;),<br>   callee: &quot;+0000000000&quot;,<br>   caller: &quot;+1111111111&quot; <br>} );</pre><pre>{<br>  acknowledged: true,<br>  insertedId: ObjectId(&#39;677d64a4e74b535fe6d4b0c2&#39;)<br>}</pre><p>I insert a second record from a different caller:</p><pre>mongodb&gt; db.calls.insertOne( <br> { time: new Date(&quot;2025-02-10T08:00:00Z&quot;),<br>   callee: &quot;+0000000000&quot;,<br>   caller: &quot;+2222222222&quot; <br>} );</pre><pre>{<br>  acknowledged: true,<br>  insertedId: ObjectId(&#39;677d64a4e74b535fe6d4b0c2&#39;)<br>}</pre><p>I receive the first record again:</p><pre>mongodb&gt; db.calls.insertOne( <br> { time: new Date(&quot;2025-02-10T08:00:00Z&quot;),<br>   callee: &quot;+0000000000&quot;,<br>   caller: &quot;+1111111111&quot; <br>} );<br></pre><pre>MongoServerError: E11000 duplicate key error collection: mongodb.calls index: time_1_callee_1_caller_1 dup key: { time: new Date(1739174400000), callee: &quot;+0000000000&quot;, caller: &quot;+1111111111&quot; }</pre><p>MongoDB raises an error to guarantee the integrity of my database. That was my goal when creating a unique index.</p><p>I received a record where the caller is unknown:</p><pre>mongodb&gt; db.calls.insertOne( <br> { time: new Date(&quot;2025-02-10T08:00:00Z&quot;),<br>   callee: &quot;+0000000000&quot;,<br>   caller: null <br>} );</pre><pre>{<br>  acknowledged: true,<br>  insertedId: ObjectId(&#39;677d6a4be74b535fe6d4b0cb&#39;)<br>}</pre><p>I receive the same record, and once again, MongoDB prevents the insertion of a duplicate record:</p><pre>mongodb&gt; db.calls.insertOne( <br> { time: new Date(&quot;2025-02-10T08:00:00Z&quot;),<br>   callee: &quot;+0000000000&quot;,<br>   caller: null <br>} );</pre><pre>MongoServerError: E11000 duplicate key error collection: mongodb.calls index: time_1_callee_1_caller_1 dup key: { time: new Date(1739174400000), callee: &quot;+0000000000&quot;, caller: null }</pre><p>I get the same with a record that has no caller attribute, which is similar to caller: null because null in MongoDB is an absence of value:</p><pre>mongodb&gt; db.calls.insertOne( <br> { time: new Date(&quot;2025-02-10T08:00:00Z&quot;),<br>   callee: &quot;+0000000000&quot;<br>} );</pre><pre>MongoServerError: E11000 duplicate key error collection: mongodb.calls index: time_1_callee_1_caller_1 dup key: { time: new Date(1739174400000), callee: &quot;+0000000000&quot;, caller: null }</pre><p>Here are the three records that were inserted — no duplicates:</p><pre>mongodb&gt; db.calls.find();<br>[<br>  {<br>    _id: ObjectId(&#39;677d65a6e74b535fe6d4b0c5&#39;),<br>    time: ISODate(&#39;2025-02-10T08:00:00.000Z&#39;),<br>    callee: &#39;+0000000000&#39;,<br>    caller: &#39;+1111111111&#39;<br>  },<br>  {<br>    _id: ObjectId(&#39;677d65ade74b535fe6d4b0c6&#39;),<br>    time: ISODate(&#39;2025-02-10T08:00:00.000Z&#39;),<br>    callee: &#39;+0000000000&#39;,<br>    caller: &#39;+2222222222&#39;<br>  },<br>  {<br>    _id: ObjectId(&#39;677d6a4be74b535fe6d4b0cb&#39;),<br>    time: ISODate(&#39;2025-02-10T08:00:00.000Z&#39;),<br>    callee: &#39;+0000000000&#39;,<br>    caller: 
null<br>  }<br>]</pre><p>Nulls represent the explicit absence of value in MongoDB. Two documents with the same lack of value in a key are considered to have the same key, raising a duplicate key error in a unique index.</p><p>This contrasts with SQL, which has a fixed structure requiring all columns to be present in every record. In SQL, NULL signifies unknown values rather than an absence, and two unknown values are not the same until they are known to have the same value.</p><p>It is easy to get the SQL behavior in MongoDB. I re-create my index as a partial index that ignores a null or nonexistent caller, so that the unique constraint concerns only documents where all key attributes are present:</p><pre>mongodb&gt; db.calls.dropIndex(<br> { time: 1, callee: 1, caller: 1 }<br>);</pre><pre>{ nIndexesWas: 2, ok: 1 }</pre><pre>mongodb&gt; db.calls.createIndex( <br>  { time: 1, callee: 1, caller: 1 },<br>  { <br>    unique: true, <br>    partialFilterExpression: { <br>      caller: { $type: &quot;string&quot; }  <br>    }<br>  }<br>);</pre><pre>time_1_callee_1_caller_1</pre><p>Such an index considers only entries with a caller of type string, ignoring the absence of a value or null, which is of type null.</p><p>I can insert documents with an absent caller, or a null caller:</p><pre>mongodb&gt; db.calls.insertMany( [<br> { time: new Date(&quot;2025-02-10T08:00:00Z&quot;),<br>   callee: &quot;+0000000000&quot;,<br>   caller: null <br> } ,<br> { time: new Date(&quot;2025-02-10T08:00:00Z&quot;),<br>   callee: &quot;+0000000000&quot;<br> } <br>] );</pre><pre>{<br>  acknowledged: true,<br>  insertedIds: {<br>    &#39;0&#39;: ObjectId(&#39;677d6f3ae74b535fe6d4b0d0&#39;),<br>    &#39;1&#39;: ObjectId(&#39;677d6f3ae74b535fe6d4b0d1&#39;)<br>  }<br>}</pre><p>Still, it detects duplicates when the caller is present:</p><pre>mongodb&gt; db.calls.insertOne( <br> { time: new Date(&quot;2025-02-10T08:00:00Z&quot;),<br>   callee: &quot;+0000000000&quot;,<br>   caller: &quot;+1111111111&quot; <br>} );</pre><pre>MongoServerError: E11000 duplicate key error collection: mongodb.calls index: time_1_callee_1_caller_1 dup key: { time: new Date(1739174400000), callee: &quot;+0000000000&quot;, caller: &quot;+1111111111&quot; }</pre><p>I have five calls, some having no caller or a null caller:</p><pre>mongodb&gt; db.calls.find();<br>[<br>  {<br>    _id: ObjectId(&#39;677d65a6e74b535fe6d4b0c5&#39;),<br>    time: ISODate(&#39;2025-02-10T08:00:00.000Z&#39;),<br>    callee: &#39;+0000000000&#39;,<br>    caller: &#39;+1111111111&#39;<br>  },<br>  {<br>    _id: ObjectId(&#39;677d65ade74b535fe6d4b0c6&#39;),<br>    time: ISODate(&#39;2025-02-10T08:00:00.000Z&#39;),<br>    callee: &#39;+0000000000&#39;,<br>    caller: &#39;+2222222222&#39;<br>  },<br>  {<br>    _id: ObjectId(&#39;677d6a4be74b535fe6d4b0cb&#39;),<br>    time: ISODate(&#39;2025-02-10T08:00:00.000Z&#39;),<br>    callee: &#39;+0000000000&#39;,<br>    caller: null<br>  },<br>  {<br>    _id: ObjectId(&#39;677d6f3ae74b535fe6d4b0d0&#39;),<br>    time: ISODate(&#39;2025-02-10T08:00:00.000Z&#39;),<br>    callee: &#39;+0000000000&#39;,<br>    caller: null<br>  },<br>  {<br>    _id: ObjectId(&#39;677d6f3ae74b535fe6d4b0d1&#39;),<br>    time: ISODate(&#39;2025-02-10T08:00:00.000Z&#39;),<br>    callee: &#39;+0000000000&#39;<br>  }<br>]</pre><p>This mimics the SQL behavior, where the unique constraint is raised only when the value is known:</p><pre>mongodb&gt; db.calls.updateOne( <br> { caller: null },<br> { $set: { caller: &quot;+3333333333&quot; } } <br>);</pre><pre>{<br>  
acknowledged: true,<br>  insertedId: null,<br>  matchedCount: 1,<br>  modifiedCount: 1,<br>  upsertedCount: 0<br>}</pre><pre>mongodb&gt; db.calls.updateOne( <br> { caller: null }, <br> { $set: { caller: &quot;+3333333333&quot; } } <br>);</pre><pre>MongoServerError: E11000 duplicate key error collection: mongodb.calls index: time_1_callee_1_caller_1 dup key: { time: new Date(1739174400000), callee: &quot;+0000000000&quot;, caller: &quot;+3333333333&quot; }</pre><p>With MongoDB, two document keys with a null attribute or without this attribute are the same index key. You can create a partial index to ignore them.</p><h3>SQL behavior: PostgreSQL (or YugabyteDB)</h3><p>SQL is different: NULL is not the absence of a column value. If a column is absent, you must create a different table to store your row without it. A NULL in SQL is an unknown value, possibly because it is not known at the insert time and is expected to be updated later.<br>A unique constraint is raised only when duplicates exist among the known values, ignoring the null ones.</p><p>I’ve created a table with the description above in YugabyteDB 2.25, which is compatible with PostgreSQL 15:</p><pre>yugabyte=&gt; select version();<br>                                                                                            version                                                                     <br>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br> PostgreSQL 15.2-YB-2.25.0.0-b0 on aarch64-unknown-linux-gnu, compiled by clang version 17.0.6 (https://github.com/yugabyte/llvm-project.git 9b881774e40024e901fc6f3d313607b071c08631), 64-bit<br>(1 row)</pre><pre>yugabyte=# \d calls<br>                                    Table &quot;public.calls&quot;<br> Column |            Type             | Collation | Nullable |           Default<br>--------+-----------------------------+-----------+----------+------------------------------<br> id     | integer                     |           | not null | generated always as identity<br> time   | timestamp without time zone |           |          |<br> callee | character varying(11)       |           |          |<br> caller | character varying(11)       |           |          |<br>Indexes:<br>    &quot;calls_pkey&quot; PRIMARY KEY, lsm (id ASC)<br>    &quot;calls_time_callee_caller_idx&quot; UNIQUE, lsm (&quot;time&quot; ASC, callee ASC, caller ASC)</pre><p>Duplicate keys are detected:</p><pre>yugabyte=# insert into calls (time, callee, caller)<br> values (&#39;2025-02-10T08:00:00Z&#39;, &#39;+0000000000&#39;, &#39;+1111111111&#39;)<br>;<br>INSERT 0 1<br>yugabyte=# insert into calls (time, callee, caller)<br> values (&#39;2025-02-10T08:00:00Z&#39;, &#39;+0000000000&#39;, &#39;+2222222222&#39;)<br>;<br>INSERT 0 1<br>yugabyte=# insert into calls (time, callee, caller)<br> values (&#39;2025-02-10T08:00:00Z&#39;, &#39;+0000000000&#39;, &#39;+1111111111&#39;)<br>;<br><br>ERROR:  duplicate key value violates unique constraint &quot;calls_time_callee_caller_idx&quot;</pre><p>I’m able to insert multiple similar rows when one of the columns in key is null:</p><pre>yugabyte=# insert into calls (time, callee, caller)<br> values (&#39;2025-02-10T08:00:00Z&#39;, &#39;+0000000000&#39;, null)<br>;<br>INSERT 0 1<br>yugabyte=# insert into calls (time, callee, caller)<br> values (&#39;2025-02-10T08:00:00Z&#39;, &#39;+0000000000&#39;, null)<br>;<br>INSERT 0 1</pre><p>You 
must think about it like this:</p><ul><li>a SQL constraint raises an error only when its condition evaluates to false</li><li>equality between unknown values (nulls) is neither true nor false. It is unknown</li><li>the unknown condition result is null, not false, and therefore doesn’t raise an error (see the sketch after this list)</li><li>this is different from a SELECT DISTINCT or SELECT UNIQUE, which shows only what is known to be unique, ignoring the unknown ones</li></ul>
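<p>You can verify this three-valued logic from any client. A quick sketch in Python with psycopg2 (assuming the connection parameters come from the environment, as in my other examples):</p><pre>import psycopg2<br><br># Assumes PG* environment variables point to a PostgreSQL or YugabyteDB cluster<br>conn = psycopg2.connect(&quot;&quot;)<br>cur = conn.cursor()<br><br># NULL = NULL is unknown (fetched as None), so it cannot violate a constraint.<br># IS NOT DISTINCT FROM uses two-valued logic and returns true:<br>cur.execute(&quot;select null = null, null is not distinct from null&quot;)<br>print(cur.fetchone())  # (None, True)</pre>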
<p>In PostgreSQL 15 (as well as YugabyteDB 2.25 and 2025.1, which will be released soon), you can change this behavior with the NULLS NOT DISTINCT clause of CREATE INDEX. It will detect existing duplicates and reject future duplicate insertions:</p><pre>yugabyte=# create unique index calls_unique_index <br> on calls (time, callee, caller)<br> NULLS NOT DISTINCT<br>;</pre><pre>ERROR:  could not create unique index &quot;calls_unique_index&quot;</pre><p>Newer versions of PostgreSQL show which row failed:</p><pre>DETAIL:  Key (&quot;time&quot;, callee, caller)=(2025-02-10 08:00:00, +0000000000, null) is duplicated.</pre><p>I check the duplicates and delete one of them:</p><pre>yugabyte=# select * from calls;<br> id |        time         |   callee    |   caller<br>----+---------------------+-------------+-------------<br>  1 | 2025-02-10 08:00:00 | +0000000000 | +1111111111<br>  2 | 2025-02-10 08:00:00 | +0000000000 | +2222222222<br>  4 | 2025-02-10 08:00:00 | +0000000000 |<br>  5 | 2025-02-10 08:00:00 | +0000000000 |<br>(4 rows)</pre><pre>yugabyte=# delete from calls where id=5;<br>DELETE 1</pre><p>When a YugabyteDB CREATE INDEX fails during backfilling, without transactional DDL, the index may remain INVALID and has to be dropped:</p><pre>yugabyte=&gt; \d calls<br>                                    Table &quot;public.calls&quot;<br> Column |            Type             | Collation | Nullable |           Default<br>--------+-----------------------------+-----------+----------+------------------------------<br> id     | integer                     |           | not null | generated always as identity<br> time   | timestamp without time zone |           |          |<br> callee | character varying(11)       |           |          |<br> caller | character varying(11)       |           |          |<br>Indexes:<br>    &quot;calls_pkey&quot; PRIMARY KEY, lsm (id ASC)<br>    &quot;calls_time_callee_caller_idx&quot; UNIQUE, lsm (&quot;time&quot; ASC, callee ASC, caller ASC)<br>    &quot;calls_unique_index&quot; UNIQUE, lsm (&quot;time&quot; ASC, callee ASC, caller ASC) NULLS NOT DISTINCT INVALID</pre><pre>yugabyte=&gt; drop index calls_unique_index;<br>DROP INDEX</pre><p>As I removed the duplicates, creating the NULLS NOT DISTINCT index is now successful:</p><pre>yugabyte=# create unique index calls_unique_index <br> on calls (time, callee, caller)<br> NULLS NOT DISTINCT<br>;<br>CREATE INDEX</pre><pre>                                    Table &quot;public.calls&quot;<br> Column |            Type             | Collation | Nullable |           Default<br>--------+-----------------------------+-----------+----------+------------------------------<br> id     | integer                     |           | not null | generated always as identity<br> time   | timestamp without time zone |           |          |<br> callee | character varying(11)       |           |          |<br> caller | character varying(11)       |           |          |<br>Indexes:<br>    &quot;calls_pkey&quot; PRIMARY KEY, lsm (id ASC)<br>    &quot;calls_time_callee_caller_idx&quot; UNIQUE, lsm (&quot;time&quot; ASC, callee ASC, caller ASC)<br>    &quot;calls_unique_index&quot; UNIQUE, lsm (&quot;time&quot; ASC, callee ASC, caller ASC) NULLS NOT DISTINCT</pre><p>With such an index, two-valued logic is used, and identical keys are detected as duplicates, even if they contain nulls:</p><pre>yugabyte=# insert into calls (time, callee, caller)<br> values (&#39;2025-02-10T08:00:00Z&#39;, &#39;+0000000000&#39;, null)<br>;<br>ERROR:  duplicate key value violates unique constraint &quot;calls_unique_index&quot;</pre><p>In SQL, all-null index entries follow the same rule. With NULLS NOT DISTINCT, only one can be inserted:</p><pre>yugabyte=# insert into calls (time, callee, caller)<br> values (null, null, null)<br>;<br>INSERT 0 1</pre><pre>yugabyte=# insert into calls (time, callee, caller)<br> values (null, null, null)<br>;<br>ERROR:  duplicate key value violates unique constraint &quot;calls_unique_index&quot;</pre><p>PostgreSQL follows the SQL standard, and YugabyteDB follows PostgreSQL behavior. Markus Winand references the few databases that implement <a href="https://modern-sql.com/caniuse/unique-nulls-not-distinct">NULLS NOT DISTINCT</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/0*GQVNz3w7qWA5CRaP.png" /></figure><p>YugabyteDB doesn’t need a dedicated line because it is PostgreSQL compatible: <a href="https://github.com/yugabyte/yugabyte-db/commit/5c011c86fe6fb913c1d0e7bf1c0bfd7914034794">Support NULLS NOT DISTINCT on unique Index</a></p><h3>Oracle Database behavior</h3><p>I’ve created a table with the description above:</p><pre>Oracle 23ai&gt; info calls<br>TABLE: CALLS<br>         LAST ANALYZED:<br>         ROWS         :<br>         SAMPLE SIZE  :<br>         INMEMORY     :DISABLED<br>         COMMENTS     :</pre><pre>Columns<br>NAME         DATA TYPE           NULL  DEFAULT    COMMENTS<br>*ID          NUMBER(38,0)        No    &quot;ADMIN&quot;.&quot;ISEQ$$_149713&quot;.nextval<br> TIME        TIMESTAMP(6)        Yes<br> CALLEE      VARCHAR2(11 BYTE)   Yes<br> CALLER      VARCHAR2(11 BYTE)   Yes</pre><pre>Indexes<br>INDEX_NAME                            UNIQUENESS    STATUS    FUNCIDX_STATUS    COLUMNS<br>_____________________________________ _____________ _________ _________________ _______________________<br>ADMIN.SYS_C0024911                    UNIQUE        VALID                       ID<br>ADMIN.CALLS_TIME_CALLEE_CALLER_IDX    UNIQUE        VALID                       TIME, CALLEE, CALLER</pre><p>Duplicate keys are detected:</p><pre>Oracle 23ai&gt; insert into calls (time, callee, caller)<br> values (timestamp &#39;2025-02-10 08:00:00&#39;, &#39;+0000000000&#39;, &#39;+1111111111&#39;)<br>;</pre><pre>1 row inserted.</pre><pre>Oracle 23ai&gt; insert into calls (time, callee, caller)<br> values (timestamp &#39;2025-02-10 08:00:00&#39;, &#39;+0000000000&#39;, &#39;+2222222222&#39;)<br>;</pre><pre>1 row inserted.</pre><pre>Oracle 23ai&gt; insert into calls (time, callee, caller)<br> values (timestamp &#39;2025-02-10 08:00:00&#39;, &#39;+0000000000&#39;, &#39;+1111111111&#39;)<br>;</pre><pre>Error starting at line : 1 in command -<br>insert into calls (time, callee, caller)<br> values (timestamp &#39;2025-02-10 08:00:00&#39;, &#39;+0000000000&#39;, &#39;+1111111111&#39;)<br>Error report -<br>ORA-00001: unique constraint (ADMIN.CALLS_TIME_CALLEE_CALLER_IDX) violated on table ADMIN.CALLS columns (TIME, CALLEE, CALLER)<br>ORA-03301: (ORA-00001 details) row with column values (TIME:10-FEB-25 08.00.00.000000 AM, CALLEE:&#39;+0000000000&#39;, CALLER:&#39;+1111111111&#39;) already exists</pre><p>I try to
insert multiple similar rows when one of the columns in the key is null, which should be allowed by the SQL standard:</p><pre>Oracle 23ai&gt; insert into calls (time, callee, caller)<br> values (timestamp &#39;2025-02-10 08:00:00&#39;, &#39;+0000000000&#39;, null)<br>;</pre><pre>1 row inserted.</pre><pre>Oracle 23ai&gt; insert into calls (time, callee, caller)<br> values (timestamp &#39;2025-02-10 08:00:00&#39;, &#39;+0000000000&#39;, null)<br>;</pre><pre>Error starting at line : 1 in command -<br>insert into calls (time, callee, caller)<br> values (timestamp &#39;2025-02-10 08:00:00&#39;, &#39;+0000000000&#39;, null)<br>Error report -<br>ORA-00001: unique constraint (ADMIN.CALLS_TIME_CALLEE_CALLER_IDX) violated on table ADMIN.CALLS columns (TIME, CALLEE, CALLER)<br>ORA-03301: (ORA-00001 details) row with column values (TIME:10-FEB-25 08.00.00.000000 AM, CALLEE:&#39;+0000000000&#39;, CALLER:NULL) already exists</pre><p>The Oracle Database behaves like my MongoDB example rather than like the SQL standard, or like NULLS NOT DISTINCT, even though it doesn’t support this clause. But there’s a difference:</p><pre>Oracle 23ai&gt; insert into calls (time, callee, caller)<br>      values (null, null, null)<br>     ;</pre><pre>1 row inserted.</pre><pre>Oracle 23ai&gt; insert into calls (time, callee, caller)<br>      values (null, null, null)<br>     ;</pre><pre>1 row inserted.</pre><pre>Oracle 23ai&gt; insert into calls (time, callee, caller)<br>      values (null, null, null)<br>     ;</pre><pre>1 row inserted.</pre><pre>Oracle 23ai&gt; select * from calls;</pre><pre>   ID TIME                               CALLEE         CALLER<br>_____ __________________________________ ______________ ______________<br>    7 10-FEB-25 08.00.00.000000000 AM    +0000000000    +1111111111<br>    8 10-FEB-25 08.00.00.000000000 AM    +0000000000    +2222222222<br>   10 10-FEB-25 08.00.00.000000000 AM    +0000000000<br>   12<br>   13<br>   14</pre><pre>6 rows selected.</pre><p>With Oracle, all NULLs are distinct, but some NULLs are more distinct than others 😉</p><ul><li>two NULLs are considered NOT DISTINCT when the other columns in the index key are equal</li><li>two NULLs are considered DISTINCT when all columns in the index key are NULL</li></ul><p>The reason is the singular implementation of indexes in Oracle, where a zero-length value in the key represents a null, and entirely null index entries are not indexed (all indexes on nullable values are actually partial indexes). Note that an empty string is a zero-length value, and behaves like a null (similar to another NoSQL database, DynamoDB). This differs from the SQL standard but must remain compatible with how it has always worked.</p><p>Oracle 23ai supports neither the NULLS [NOT] DISTINCT nor the partial index clauses.
However, you can use the fact that all indexes are partial and apply NVL, COALESCE, or CASE to replace a NULL with a unique value, like the primary key:</p><pre>Oracle 23ai&gt; drop index calls_time_callee_caller_idx;</pre><pre>Oracle 23ai&gt; create unique index  calls_time_callee_caller_idx<br> on calls ( time, callee, nvl(caller,id) )<br>;</pre><p>This index doesn’t consider nulls as violating the unique constraint because the constraint is created with an index on different values:</p><pre>Oracle 23ai&gt; insert into calls (time, callee, caller)<br> values (timestamp &#39;2025-02-10 08:00:00&#39;, &#39;+0000000000&#39;, null)<br>;</pre><pre>1 row inserted.</pre><pre>Oracle 23ai&gt; select * from calls;</pre><pre>   ID TIME                               CALLEE         CALLER<br>_____ __________________________________ ______________ ______________<br>    7 10-FEB-25 08.00.00.000000000 AM    +0000000000    +1111111111<br>    8 10-FEB-25 08.00.00.000000000 AM    +0000000000    +2222222222<br>   10 10-FEB-25 08.00.00.000000000 AM    +0000000000<br>   12<br>   13<br>   14<br>   15 10-FEB-25 08.00.00.000000000 AM    +0000000000</pre><pre>7 rows selected.</pre><p>Seeing this, you may wonder how Oracle does when using the <a href="https://docs.oracle.com/en/database/oracle/mongodb-api/index.html">Oracle Database API for MongoDB</a> when inserting all null values, and it is compatible with how MongoDB works:</p><pre>Connecting to:          mongodb://&lt;credentials&gt;@XXX.adb.us-ashburn-1.oraclecloudapps.com:27017/ora_mdb?authMechanism=PLAIN&amp;authSource=%24external&amp;ssl=true&amp;retryWrites=false&amp;loadBalanced=true&amp;appName=mongosh+2.3.7<br>Using MongoDB:          4.2.14<br>Using Mongosh:          2.3.7</pre><pre>ora_mdb&gt; db.calls.createIndex(<br>...  { &quot;time&quot;: 1 , &quot;callee&quot;: 1, &quot;caller&quot; : 1 }, { unique: true }<br>... ) ;<br>time_1_callee_1_caller_1<br>ora_mdb&gt; db.calls.insertOne(  { } );<br>{<br>  acknowledged: true,<br>  insertedId: ObjectId(&#39;677d97ff5c88a98fedd4b0ca&#39;)<br>}<br>ora_mdb&gt; db.calls.insertOne(  { } );<br>Uncaught:<br>MongoServerError[MONGO-11000]: ORA-00001: unique constraint (ORA.$ora:calls.time_1_callee_1_caller_1) violated on table ORA.calls columns (SYS_NC00005$, SYS_NC00006$, SYS_NC00007$)<br>ORA-03301: (ORA-00001 details) row with column values (SYS_NC00005$:&#39;01&#39;, SYS_NC00006$:&#39;01&#39;, SYS_NC00007$:&#39;01&#39;) already exists</pre><p>The behavior looks correct, with the same documents not being considered distinct and raising the duplicate key error. From the message, it seems that the values in the Oracle index are (‘01’,’01&#39;,’01&#39;), which explains why it doesn’t behave like an all NULL entry. 
Out of curiosity, here is the index that Oracle creates in the SQL schema for a MongoDB index:</p><pre>ora_sql&gt; set ddl storage off<br>DDL Option STORAGE was set to OFF</pre><pre>ora_sql&gt; ddl ora.&quot;$ora:calls.time_1_callee_1_caller_1&quot;</pre><pre>  CREATE UNIQUE MULTIVALUE INDEX &quot;ORA&quot;.&quot;$ora:calls.time_1_callee_1_caller_1&quot; ON &quot;ORA&quot;.&quot;calls&quot; (<br>JSON_MKMVI(JSON_TABLE( &quot;DATA&quot;, &#39;$&#39; PRESENT ON EMPTY MINIMAL CROSS PRODUCT WITH ERROR ON PARALLEL ARRAYS COLUMNS( NESTED PATH &#39;$.&quot;time&quot;[*]&#39; COLUMNS( &quot;K0&quot; ANY ORA_RAWCOMPARE PATH &#39;$&#39; ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) , NESTED PATH &#39;$.&quot;callee&quot;[*]&#39; COLUMNS( &quot;K1&quot; ANY ORA_RAWCOMPARE PATH &#39;$&#39; ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) , NESTED PATH &#39;$.&quot;caller&quot;[*]&#39; COLUMNS( &quot;K2&quot; ANY ORA_RAWCOMPARE PATH &#39;$&#39; ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH ) ) )  AS &quot;K0&quot;,&quot;K1&quot;,&quot;K2&quot;), <br>JSON_QUERY(&quot;DATA&quot; FORMAT OSON , &#39;$.&quot;callee&quot;[*]&#39; RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE), <br>JSON_QUERY(&quot;DATA&quot; FORMAT OSON , &#39;$.&quot;caller&quot;[*]&#39; RETURNING ANY ORA_RAWCOMPARE ASIS  WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)  MULTIVALUE))<br>  PCTFREE 10 INITRANS 20 MAXTRANS 68 COMPUTE STATISTICS<br>  TABLESPACE &quot;DATA&quot; ;</pre><p>Oracle implements a MongoDB-compatible API on top of the Oracle Database through many transformations in the SQL queries. Storing a document in an OSON column is easy, but indexing documents requires more complexity to emulate MongoDB behavior.</p><p>This long post explored an elementary example: a table with nullable columns used in a unique index. All databases behave differently except when they are genuinely compatible, such as PostgreSQL and YugabyteDB.<br>There are a few things to remember:</p><ul><li>Null in <strong>MongoDB</strong> indicates the absence of a value. Two null or absent keys are considered not distinct and may raise a duplicate key error. Partial indexes can be used to behave differently. In a future blog post, we will see how to differentiate an explicit null from the absence of the attribute.</li><li>Null in <strong>SQL</strong> indicates that the value is unknown, and three-valued logic is applied when comparing values to check for duplicates. Comparing with unknown is unknown and does not raise a duplicate error.</li><li><strong>PostgreSQL</strong> and <strong>YugabyteDB</strong> are compatible with the SQL standard and allow you to choose the behavior of unique indexes with an <strong>SQL:2016</strong> clause.</li><li><strong>Oracle</strong> (and <strong>SQL Server</strong>) behave differently, but expression-based indexes can provide workarounds.</li><li>In SQL, most columns should be declared <strong>NOT NULL</strong> to avoid problems. Unfortunately, this is not the default and may require more tables and joins.</li><li>I didn’t mention <strong>DynamoDB</strong>, where nulls indicate an empty value, like in Oracle Database, because it cannot be part of the primary key, and there are no unique secondary indexes.</li></ul><p>No database is inherently better or worse than the others. The biggest mistake is neglecting to understand your database’s behavior.
When migrating from Oracle to a different database or vice versa, understanding the distinct handling of NULL values is crucial.</p><p>I love SQL and consider NULL to mean “exists but not yet known” (except for the behavior described earlier in Oracle and some other cases, like in an outer join result). I have never had trouble using NULL only for existing attributes with unknown values. However, SQL’s differences from other programming languages can lead to frequent misuse of NULL, and a lousy developer experience when used in conditions. In contrast, MongoDB provides a more intuitive API for developers who are used to different languages, and its NULL handling adheres to the same principles, avoiding the confusing three-valued logic.</p><p>When I refer to the SQL standard, I’m referring to the definition of UNIQUE CONSTRAINT rather than UNIQUE INDEX. Indexes are implementation details that help find values, including duplicates. SQL defines uniqueness as:</p><blockquote>A unique constraint is satisfied if and only if no two rows in a table have the same non-null values in the unique columns.</blockquote><p>Read <a href="https://vldb.org/pvldb/vol15/p2613-guagliardo.pdf">Troubles with Nulls, Views from the Users</a> if you doubt that two-valued logic is more popular than the SQL behavior. Choose the DB that works best for you, but I recommend you learn how it handles nulls in all cases.</p><p>Some interesting links on the topic:</p><iframe src="https://cdn.embedly.com/widgets/media.html?type=text%2Fhtml&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;schema=twitter&amp;url=https%3A//x.com/TanelPoder/status/1877947198131880039%3Fref_src%3Dtwsrc%255Etfw%257Ctwcamp%255Etweetembed%257Ctwterm%255E1877947198131880039%257Ctwgr%255E%257Ctwcon%255Es1_%26ref_url%3D&amp;image=" width="500" height="281" frameborder="0" scrolling="no"><a href="https://medium.com/media/ca7d6706f689f21e563086132321bbb8/href">https://medium.com/media/ca7d6706f689f21e563086132321bbb8/href</a></iframe><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=0f6f6bd5fc9e" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Log Is (not) The Database]]></title>
            <link>https://franckpachot.medium.com/the-log-is-the-database-fc6443666ee7?source=rss-e38b355b06c8------2</link>
            <guid isPermaLink="false">https://medium.com/p/fc6443666ee7</guid>
            <category><![CDATA[cloud-native]]></category>
            <category><![CDATA[sql]]></category>
            <category><![CDATA[database]]></category>
            <category><![CDATA[postgresql]]></category>
            <category><![CDATA[yugabytedb]]></category>
            <dc:creator><![CDATA[Franck Pachot]]></dc:creator>
            <pubDate>Wed, 05 Jun 2024 10:34:27 GMT</pubDate>
            <atom:updated>2024-06-05T10:35:53.476Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://dev.to/yugabyte/the-log-is-the-database-in2">The Log Is (not) The Database</a></p><p>There is a common saying, “The Log Is The Database”, to explain some database innovations in cloud-native environments. For example, Kafka stores the events that modify data, Amazon Aurora’s database instances send only the Write-Ahead Log to the storage, and modern databases use LSM Trees to append all changes to a log. I dislike this saying for two reasons:</p><ul><li>First, there’s no real innovation. In traditional databases, your changes are first saved in the redo log, transactional log, or write-ahead log (WAL), and once they are saved, they are considered durable. “Write-Ahead Logging” means exactly that: it is written first. It’s not new; it was described in the ARIES paper back in 1992 (see the toy sketch after this list).</li><li>Second, if your database consists only of a log, then it’s not really a database but more of a sequentially written file. The primary function of databases is to process data, and a log file isn’t efficient for much else besides disaster recovery. Additionally, a true database allows multiple users to read and write at the same time, managing concurrency, which often requires memory structures that aren’t written to the log (except for higher resilience to failure, like YugabyteDB does with <a href="https://docs.yugabyte.com/preview/architecture/docdb-replication/raft">Raft</a>).</li></ul>
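<p>To make the “written first” principle concrete, here is a toy sketch in Python. It is only an illustration of the idea, not how any real database implements it: a change is appended to the log and fsync’ed before it is applied to the in-memory structures, so an acknowledged change can always be replayed after a crash:</p><pre>import json, os<br><br>class ToyWAL:<br>    def __init__(self, path):<br>        self.path = path<br>        self.log = open(path, &quot;a&quot;)  # append-only log file<br>        self.memtable = {}           # in-memory state, lost on crash<br>    def put(self, key, value):<br>        # 1. Write ahead: append the change to the log and make it durable<br>        self.log.write(json.dumps({&quot;k&quot;: key, &quot;v&quot;: value}) + &quot;\n&quot;)<br>        self.log.flush()<br>        os.fsync(self.log.fileno())<br>        # 2. Only then apply it to the memory structure and acknowledge<br>        self.memtable[key] = value<br>    def recover(self):<br>        # After a crash, replay the log to rebuild the in-memory state<br>        with open(self.path) as f:<br>            for line in f:<br>                change = json.loads(line)<br>                self.memtable[change[&quot;k&quot;]] = change[&quot;v&quot;]<br><br>wal = ToyWAL(&quot;/tmp/toy.wal&quot;)<br>wal.put(&quot;id1&quot;, {&quot;a&quot;: 1})  # durable in the log before being visible in memory</pre>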
<p>Those memory structures cannot scale out, so distributed SQL databases like YugabyteDB are closer to “The Log Is The Database”. However, this would still ignore the fact that the log must be compacted to avoid read amplification when processing data.</p><p>I selected this title because when you consider the write-path only, the distributed log serves as the database for YugabyteDB. All data manipulation (including reading and writing intents, managing locks for consistency, and transaction control) is distributed to LSM Trees. Additionally, they are distributed over the network using a shared-nothing architecture, which is why I illustrated it with a mousepad from the previous century that I found in a drawer when working at CERN: “The Network Is The Computer.”</p><p>YugabyteDB utilizes PostgreSQL code for the query layer. This layer is stateless and operates on all nodes. Write operations, and some read intents, are converted into a log of write requests that are sharded based on their key (the primary key for the tables, the indexed columns for secondary indexes), sent in batches to the storage layer, replicated as a Raft log, and then stored in LSM Trees.</p><p>Enough talking, let’s look at it. I created a table on a YugabyteDB cluster:</p><pre>yugabyte=# create table demo (<br>  id bigserial primary key, a int, b int, c int<br>) split into 1 tablets;<br>CREATE TABLE</pre><p>I forced it to a single tablet to make it easier to look at.</p><p>I query the YB-Master Web Console /dump-entities endpoint to get the table UUID (more details <a href="https://dev.to/yugabyte/yugabytedb-tableid-uuid-for-postgresql-tables-2f8c">here</a>):</p><pre># curl -sL http://yb0.pachot.net:7000/dump-entities |<br>    jq -r &#39;.tables[] | select(.table_name == &quot;demo&quot;) &#39;<br>{<br>  &quot;table_id&quot;: &quot;000033c000003000800000000000408e&quot;,<br>  &quot;keyspace_id&quot;: &quot;000033c0000030008000000000000000&quot;,<br>  &quot;table_name&quot;: &quot;demo&quot;,<br>  &quot;state&quot;: &quot;RUNNING&quot;<br>}</pre><p>The same endpoint exposes the details about the tablets.<br>I extend my JQ script to find the tablet leader identifier:</p><pre>curl -sL http://yb0.pachot.net:7000/dump-entities |<br> jq -r --arg table &quot;demo&quot; -r &#39;<br>  . as $input |<br>  (<br>    $input.tables[] |<br>    select(.table_name == $table ) |<br>    .table_id<br>  ) as $table_id |<br>  $input.tablets[] |<br>  select(.table_id == $table_id and .state == &quot;RUNNING&quot; ) |<br>  .leader as $leader |<br>  {<br>    $table, table_id, tablet_id,<br>    leader: (<br>      .replicas[] |<br>      select(.server_uuid == $leader)<br>    )<br>  }<br>&#39;<br><br>{<br>  &quot;table&quot;: &quot;demo&quot;,<br>  &quot;table_id&quot;: &quot;000033c000003000800000000000408e&quot;,<br>  &quot;tablet_id&quot;: &quot;4bccaaaa0fc3486ea565ccc18e325122&quot;,<br>  &quot;leader&quot;: {<br>    &quot;type&quot;: &quot;VOTER&quot;,<br>    &quot;server_uuid&quot;: &quot;104130d300d64c9f9ed5df25823cd121&quot;,<br>    &quot;addr&quot;: &quot;10.0.0.39:9100&quot;<br>  }<br>}</pre><p>With this information, I connect to the node (10.0.0.39) which stores this tablet peer so that I can look at the files.</p><p>I find the WAL (Write Ahead Log) for this tablet:</p><pre># ls -t $(find / -regex &#39;.*/yb-data/tserver/wals/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/wal-[0-9]+&#39;)<br><br>/home/opc/10.0.0.39/var/data/yb-data/tserver/wals/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/wal-000000001</pre><p>As my database is not encrypted (it’s a lab), I can look at the content of the WAL:</p><pre># log-dump /home/opc/10.0.0.39/var/data/yb-data/tserver/wals/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/wal-000000001<br><br>replicate {<br>  id {<br>    term: 1<br>    index: 1<br>  }<br>  hybrid_time: HT{ days: 19869 time: 20:46:57.683178 }<br>  op_type: NO_OP<br>  size: 26<br>  id { term: 1 index: 1 } hybrid_time: 7031834286830297088 op_type: NO_OP committed_op_id { term: 0 index: 0 } noop_request { }<br>}<br>replicate {<br>  id {<br>    term: 2<br>    index: 2<br>  }<br>  hybrid_time: HT{ days: 19869 time: 20:46:58.648734 }<br>  op_type: NO_OP<br>  size: 26<br>  id { term: 2 index: 2 } hybrid_time: 7031834290785214464 op_type: NO_OP committed_op_id { term: 1 index: 1 } noop_request { }<br>}</pre><p>There is no data because my table is empty.</p><h3>INSERT</h3><p>I insert one row:</p><pre>yugabyte=# insert into demo ( a, b, c ) values (1, 1, 1);<br>INSERT 0 1</pre><p>I look at the WAL again and see one write:</p><pre>replicate {<br>  id {<br>    term: 2<br>    index: 3<br>  }<br>  hybrid_time: HT{ days: 19869 time: 21:39:37.331124 
}<br>  op_type: WRITE_OP<br>  size: 134<br>  write {<br>    unused_tablet_id:<br>    write_batch {<br>      write_pairs_size: 1<br>      write_pairs {<br>        Key: SubDocKey(DocKey(0xeda9, [1], []), [])<br>        Value: Not found (yb/docdb/kv_debug.cc:114): No packing information available<br>      }<br>    }<br>  }<br>}</pre><p>The log is structured as key-value pairs, where the values are documents (for table rows and index entries) with sub-documents (for the column values).<br>Here, the key contains my primary key value, the 1 that I inserted into id. Because it is hash sharded (by default, YugabyteDB sets the first column of the primary key as HASH), the hash code is added in front of the key (you can run select to_hex(yb_hash_code(1::bigint)) to verify that it is 0xeda9).</p><p>The sub-document value is not visible here because it is packed, and I didn’t provide the metadata. Packed rows are an optimization that stores all column values in a single SubDocument, for faster INSERTs.</p><h3>UPDATE</h3><p>I update two columns:</p><pre>yugabyte=# update demo set a=2, b=2;<br>UPDATE 1</pre><p>The WAL shows a new write with two sub-documents, one for each column value (this is better than PostgreSQL, which copies the whole row when you update a single bit):</p><pre>replicate {<br>  id {<br>    term: 2<br>    index: 4<br>  }<br>  hybrid_time: HT{ days: 19869 time: 21:41:01.749055 }<br>  op_type: WRITE_OP<br>  size: 238<br>  write {<br>    unused_tablet_id:<br>    write_batch {<br>      write_pairs_size: 2<br>      write_pairs {<br>        Key: SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(1)])<br>        Value: 2<br>      }<br>      write_pairs {<br>        Key: SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(2)])<br>        Value: 2<br>      }<br>    }<br>  }<br>}<br>replicate {<br>  id {<br>    term: 2<br>    index: 5<br>  }<br>  hybrid_time: HT{ days: 19869 time: 21:41:01.752862 }<br>  op_type: UPDATE_TRANSACTION_OP<br>  size: 94<br>  update_transaction {<br>    transaction_id: 23074c91-3f86-4f28-b8c3-70295392c63b<br>    status: APPLYING<br>    tablets: ea42dda9f4634e9bb5193382ce41bf74<br>    commit_hybrid_time: HT{ days: 19869 time: 21:41:01.752154 }<br>    sealed: 0<br>  }<br>}</pre><p>The log also holds information about the transaction because it’s a shared-nothing architecture: all states must go through the network.</p><h3>DELETE</h3><p>I delete this row:</p><pre>yugabyte=# delete from demo;<br>DELETE 1</pre><p>There’s a new write to the log, marking the end of life of the row with the DEL marker, often called a &quot;tombstone&quot;:</p><pre>replicate {<br>  id {<br>    term: 2<br>    index: 6<br>  }<br>  hybrid_time: HT{ days: 19869 time: 21:42:13.005451 }<br>  op_type: WRITE_OP<br>  size: 183<br>  write {<br>    unused_tablet_id:<br>    write_batch {<br>      write_pairs_size: 1<br>      write_pairs {<br>        Key: SubDocKey(DocKey(0xeda9, [1], []), [])<br>        Value: DEL<br>      }<br>    }<br>  }<br>}<br>replicate {<br>  id {<br>    term: 2<br>    index: 7<br>  }<br>  hybrid_time: HT{ days: 19869 time: 21:42:13.007993 }<br>  op_type: UPDATE_TRANSACTION_OP<br>  size: 94<br>  update_transaction {<br>    transaction_id: 1e334b0c-d9c2-4ea0-a55e-212a3282e011<br>    status: APPLYING<br>    tablets: b1298c45b85a4475be2123b270655d82<br>    commit_hybrid_time: HT{ days: 19869 time: 21:42:13.007542 }<br>    sealed: 0<br>  }<br>}</pre><p>This log can be used to reconstruct all versions of the table’s rows at any point in time for the MVCC (Multi-Version Concurrency Control) retention.
The same content is stored in memory, the LSM Tree MemTable, as soon as the replicated log gets consensus from the quorum, and processing data from it is efficient. The WAL file is used only to recover it, send the changes to a lagging replica, or for asynchronous replication or change data capture. At this point, “The Log Is The Database”, and there’s no presence of my table’s data elsewhere on disk.</p><p>As the table grows, it may no longer fit entirely in memory, and processing data from the WAL file would be inefficient. To address this, the MemTable is flushed to an SST File where the values are ordered by key (instead of by time in the WAL), allowing for efficient point or range queries. This marks the end of “The Log Is The Database,” and the database is now stored in files optimized for data retrieval. The WAL can be discarded as soon as it has passed the retention for asynchronous replication or follower’s gap resolution.</p><h3>Flush</h3><p>I want to keep my table small for this demo, but I can force a flush:</p><pre>$ yb-ts-cli --server-address 10.0.0.39:9100 flush_tablet 4bccaaaa0fc3486ea565ccc18e325122<br>Successfully flushed tablet &lt;4bccaaaa0fc3486ea565ccc18e325122&gt;</pre><p>I find the SST file:</p><pre>$ ls -t $(find / -regex &#39;.*/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/[0-9]+.sst&#39;)<br><br>/home/opc/10.0.0.39/var/data/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/000010.sst</pre><p>I can read it with sst_dump:</p><pre>$ sst_dump --command=scan --file=/home/opc/10.0.0.39/var/data/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/000010.sst --output_format=decoded_regulardb<br><br>from [] to []<br>Process /home/opc/10.0.0.39/var/data/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/000010.sst<br>Sst file format: block-based<br>SubDocKey(DocKey(0xeda9, [1], []), [HT{ physical: 1716759733007542 }]) -&gt; DEL<br>SubDocKey(DocKey(0xeda9, [1], []), [HT{ physical: 1716759577331124 }]) -&gt; PACKED_ROW[0](050000000A0000000F000000480000000148000000014800000001)<br>SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(1); HT{ physical: 1716759661752154 }]) -&gt; 2<br>SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(2); HT{ physical: 1716759661752154 w: 1 }]) -&gt; 2</pre><p>I still see all versions, with four sub-documents: the tombstone (DEL), the two columns with new values, and the initial packed row. This is still a log of all changes, but it is ordered by the key before the time. When the table grows, you will have multiple flushes and multiple SST files, which will be merged on read. LSM Tree means Log-Structured Merge-Tree: it is a log structure of sorted runs that can be merged when iterating on the key.</p>
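<p>The merge of sorted runs is the same idea as merging sorted lists. A toy illustration in Python (my own sketch, not YugabyteDB code): each SST file is a run sorted by key and reverse timestamp, and an iterator merges them lazily, yielding all versions of each key together, newest first:</p><pre>import heapq<br><br># Each SST file is a run sorted by (key, -timestamp): newest version first<br>sst_1 = [(&quot;id1&quot;, -1716759733, &quot;DEL&quot;),<br>         (&quot;id1&quot;, -1716759577, &quot;PACKED_ROW{1,1,1}&quot;)]<br>sst_2 = [(&quot;id1&quot;, -1716759661, &quot;ColumnId(1)=2&quot;),<br>         (&quot;id2&quot;, -1716759600, &quot;PACKED_ROW{3,3,3}&quot;)]<br><br># heapq.merge iterates over all runs in order without loading them entirely<br>for key, neg_ts, value in heapq.merge(sst_1, sst_2):<br>    print(key, -neg_ts, value)</pre>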
<p>To see what is inside the packed row, I can provide the metadata (description of the columns) to the SST Dump tool. The metadata is stored in another directory:</p><pre>ls -t $(find / -regex &#39;.*/yb-data/tserver/tablet-meta/4bccaaaa0fc3486ea565ccc18e325122&#39;)<br><br>/home/opc/10.0.0.39/var/data/yb-data/tserver/tablet-meta/4bccaaaa0fc3486ea565ccc18e325122</pre><p>I pass it as a --formatter_tablet_metadata option:</p><pre>$ sst_dump --command=scan --file=/home/opc/10.0.0.39/var/data/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/000010.sst --output_format=decoded_regulardb --formatter_tablet_metadata=/home/opc/10.0.0.39/var/data/yb-data/tserver/tablet-meta/4bccaaaa0fc3486ea565ccc18e325122<br><br>WARNING: Logging before InitGoogleLogging() is written to STDERR<br>I0526 21:56:39.871042 321471 kv_formatter.cc:35] Found info for table ID 000033c000003000800000000000408e (namespace yugabyte, table_type PGSQL_TABLE_TYPE, name demo, cotable_id 00000000-0000-0000-0000-000000000000, colocation_id 0) in superblock<br>from [] to []<br>Process /home/opc/10.0.0.39/var/data/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/000010.sst<br>Sst file format: block-based<br>SubDocKey(DocKey(0xeda9, [1], []), [HT{ physical: 1716759733007542 }]) -&gt; DEL<br>SubDocKey(DocKey(0xeda9, [1], []), [HT{ physical: 1716759577331124 }]) -&gt; { 1: 1 2: 1 3: 1 }<br>SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(1); HT{ physical: 1716759661752154 }]) -&gt; 2<br>SubDocKey(DocKey(0xeda9, [1], []), [ColumnId(2); HT{ physical: 1716759661752154 w: 1 }]) -&gt; 2</pre><p>The sub-document with the lowest time shows all the column values at the time of the insert. To allow consistent reads, the versions are ordered on the Hybrid Logical Clock, the cluster time, rather than the RocksDB sequence.</p><h3>Compact</h3><p>Merging from too many SST files would lower the read performance, but this read amplification is limited by background compaction. In addition to merging, the compaction can remove the intermediate versions when they are beyond the MVCC retention, to lower the space amplification.</p><p>I waited 15 minutes, the default MVCC retention (set by --timestamp_history_retention_interval_sec=900), and forced a compaction:</p><pre>$ yb-ts-cli --server-address 10.0.0.39:9100 compact_tablet 4bccaaaa0fc3486ea565ccc18e325122<br>Successfully compacted tablet &lt;4bccaaaa0fc3486ea565ccc18e325122&gt;</pre><p>This writes new SST files and discards the old ones (except if they are used by an active snapshot for Point In Time Recovery or Thin Clones). I look for the new SST files:</p><pre>$ ls -t $(find / -regex &#39;.*/yb-data/tserver/data/rocksdb/table-000033c000003000800000000000408e/tablet-4bccaaaa0fc3486ea565ccc18e325122/[0-9]+.sst&#39;)</pre><p>In my special case, where I deleted all the rows, there are no remaining SST files. They will be created when new data is inserted, logged, and flushed.</p><h3>Where is the database?</h3><p>No database should only serve as a log. There are two reasons for writing to the log first:</p><ul><li>it is fast to persist to disk, with sequential writes rather than random writes</li><li>it allows recovery of the database files by re-playing the changes</li></ul><p>Thanks to these two properties, additional structures can be maintained in memory (Shared Buffer Pool in monolithic databases, MemTable in distributed LSM Trees) to sort them on the key before the timestamp, for faster retrieval.
However, a large part of the database will be written to disk, with a checkpoint from the shared buffer pool, or a flush from the distributed MemTables. When Amazon Aurora says that there’s no checkpoint happening, they refer to what happens on the single writer instance, but the WAL is applied to the blocks in the storage servers. All databases apply the log to materialize the current state, or a recent state.</p><p>In a cloud-native database, the network is the computer, and the log is the database, but that applies only to the <a href="https://docs.yugabyte.com/preview/architecture/docdb-replication/raft/#replication-of-the-write-operation">write path</a>. For efficient SQL processing, the data files act as the database, with the cache above them for faster access to the frequently read dataset and for transaction control. Once written, the log is used to protect the memory structures and to roll forward to a consistent point after a point-in-time recovery. Note that all SQL DML needs to read before writing, to detect duplicate keys and locked rows, so even SQL writes use more than the log. Finally, the log is not the database, but only the safety net to avoid data loss in case of failure.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fc6443666ee7" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Oracle Sharding methods compared to YugabyteDB]]></title>
            <link>https://franckpachot.medium.com/oracle-sharding-methods-compared-to-yugabytedb-ef76aaccc2a0?source=rss-e38b355b06c8------2</link>
            <guid isPermaLink="false">https://medium.com/p/ef76aaccc2a0</guid>
            <category><![CDATA[database]]></category>
            <category><![CDATA[postgresql]]></category>
            <category><![CDATA[distributed]]></category>
            <category><![CDATA[oracle]]></category>
            <category><![CDATA[yugabytedb]]></category>
            <dc:creator><![CDATA[Franck Pachot]]></dc:creator>
            <pubDate>Tue, 20 Feb 2024 07:04:37 GMT</pubDate>
            <atom:updated>2024-02-20T07:04:37.155Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*CduHd_GdaiJOpGWe" /></figure><p>Oracle has long been a <strong>leader</strong> in partitioning, distributing, and replicating databases. It offers shared-storage RAC for High Availability and log-streaming Data Guard for Disaster Recovery. However, on built-in shared-nothing replication, Oracle is now a <strong>follower</strong> of the Distributed SQL innovation initiated by Spanner and pursued by CockroachDB, TiDB, and YugabyteDB with <strong>Raft</strong> consensus replication. In 23c, Oracle added Raft replication, following its competitors, as an alternative to primary-standby configurations. However, there is still a great deal of innovation in providing this on top of Oracle’s existing partitioning schemes without building a new database architecture, and that’s what I’ll describe here.</p><p>Adding Raft replication does not turn a traditional monolithic database into a cloud-native distributed SQL database. The database becomes distributed only when the table rows, index entries, and transaction intents are distributed to multiple Raft groups. This involves <a href="https://www.yugabyte.com/blog/distributed-sql-essentials-sharding-and-partitioning-in-yugabytedb/">sharding/partitioning</a>. The Oracle documentation for Globally Distributed Databases 23c defines four <a href="https://docs.oracle.com/en/database/oracle/oracle-database/23/shard/data-distribution-methods.html#GUID-3B07D91C-CEAA-4170-A94B-ACF47BEE617B">Sharded Data Distribution Methods</a>: <strong>System-Managed</strong> (consistent HASH), <strong>User-Defined</strong> (LIST or RANGE), <strong>Composite Sharding</strong> (combining System and User-Defined), and <strong>Directory-Based</strong> (User-Defined with a mapping table). This provides multiple sharding methods to cover various use cases. So, how does this compare to Distributed SQL databases?</p><p>In short, <strong>Oracle Globally Distributed Database</strong> and <strong>YugabyteDB</strong> can be used for all kinds of scenarios, but in different ways. To remain compatible with existing features, Oracle introduced many new concepts, such as chunks, tablespace sets, table families, shards, shardspaces, and partition sets. This adds to already-known concepts like tablespaces, partitions, and subpartitions. It involves SQL commands and GDS (Global Data Services) commands. You can imagine the operational complexity of such a deployment, as well as the many possibilities.</p><p><strong>YugabyteDB</strong> has a <strong>two-layer</strong> architecture that simplifies data distribution. The storage and transaction layer provides <strong>automatic sharding</strong>, which uses HASH or RANGE to distribute data to Raft groups for high availability and elasticity. The second layer is PostgreSQL <strong>declarative partitioning</strong>, which allows users to partition by HASH, RANGE, or LIST and define data placement through tablespaces. One layer distributes automatically; the other adds user-defined data placement preferences or constraints.
Before comparing them with Oracle’s sharding methods, let’s describe the YugabyteDB methods first, as they are easier to understand.</p><h3>YugabyteDB Distribution Methods</h3><h3>Range Sharding (system-managed)</h3><p>The straightforward method to distribute table rows and index entries across multiple Raft groups (tablets in YugabyteDB) is <strong>splitting the range of their key values</strong> (the primary key for a table, the indexed columns for an index). This is a must for all Distributed SQL databases because SQL applications query data on ranges, with ‘&gt;’, ‘&lt;’ or ‘between’ in WHERE clauses, or to get a sorted result for ORDER BY. All Distributed SQL databases provide this (Spanner, CockroachDB, TiDB, YugabyteDB), and some provide only this.</p><p>Because it is a distribution method, <strong>sharding must be automatic</strong> so data can be re-balanced automatically when scaling horizontally. We can pre-split a table on specific values, especially when we know the values we will bulk-load, but small tables start with one shard and are automatically split when growing. The YugabyteDB auto-split thresholds are described in the <a href="https://docs.yugabyte.com/preview/architecture/docdb-sharding/tablet-splitting/#automatic-tablet-splitting">documentation</a>.</p><p>The syntax for range sharding is easy in YugabyteDB: you define ASC or DESC as in any SQL index definition. For example, an index on (timestamp ASC, name ASC) will be ordered by timestamp and name, and split in the middle when it grows.</p><h3>Hash Sharding (system-managed)</h3><p>There are two issues with range sharding. First, it isn’t easy to distribute data before knowing the values. Second, it can create a hotspot when inserting rows in the same range, such as with timestamps or sequences. A hash value can be used when a key only serves point queries (an equality predicate on the key). YugabyteDB does this automatically when you define a hash-key part: it applies a <strong>hash function</strong> to get a value in the range 0–65535 that is added as a prefix to the internal key. Then, <strong>range sharding is applied to this hash value</strong>. YugabyteDB extends the PostgreSQL syntax by adding HASH, as in a primary key defined as ( id HASH ). One advantage of using hash values is that the distribution is predetermined, allowing the database to split the data into multiple tablets automatically. This helps <strong>avoid the hotspots</strong> that can arise with traditional indexes. For example, an identifier generated from a sequence will be distributed across multiple tablets, ensuring a more even data distribution. Not all Distributed SQL databases provide Hash Sharding in addition to Range Sharding. YugabyteDB offers both methods, the default being HASH on the first column of the key.</p><h3>Hash Sharding + Clustering key</h3><p>HASH can be combined with ASC/DESC so that a key has two components: a hash key (also called a partition key in some NoSQL databases) to distribute rows to multiple shards, and a range key (also called a clustering key or sort key in some NoSQL databases) to group the values that are queried together. With YugabyteDB, you may declare a multi-column primary key like ( device_id HASH, timestamp ASC ).</p>
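<p>To make the syntax concrete, here is a minimal YSQL sketch of the three key definitions described above (the tables and columns are my own examples):</p><pre>-- range sharding: ASC/DESC keys, ordered and automatically split when growing<br>create table events (ts timestamptz, name text, detail text,<br>  primary key (ts asc, name asc));<br>-- hash sharding: a consistent hash of id prefixes the key (the default)<br>create table items (id bigint, price numeric, primary key (id hash));<br>-- hash key + clustering key: distribute by device, sort by time within it<br>create table readings (device_id bigint, ts timestamptz, value float,<br>  primary key (device_id hash, ts asc));</pre>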
<p>Note that the hash function here is known as <strong>linear</strong> or <strong>consistent hashing</strong> (a good definition of the common sharding methods is <a href="https://dzone.com/articles/four-data-sharding-strategies-for-distributed-sql">here</a>) and differs from the hash function used in SQL partitioning by hash. It is ideal for scaling out and rebalancing because the range of hash values can be split further. This method makes sense for high-cardinality values. For low-cardinality ones, you may prefer to add your own bucket number (example <a href="https://dev.to/yugabyte/avoiding-hotspots-in-pgbench-on-or-38lk">here</a>).</p><h3>Range, List, Hash Partitioning (user-defined)</h3><p>The sharding methods above are automatic and applied at the storage level. The database manages the distribution over the cluster according to the key definition and the global settings for fault tolerance. When you want more user control over data placement, for <strong>lifecycle management, latency, or data residency reasons</strong>, YugabyteDB offers all the PostgreSQL partitioning methods on top of the automatic sharding. Typically, you may partition <strong>by range</strong> on a date to purge old data quickly, or <strong>by list</strong> of countries to store them in a specific region for regulatory or performance reasons.</p><p>This relies on <strong>PostgreSQL tablespaces</strong>, to which YugabyteDB adds placement information. Tablespaces in PostgreSQL define a location on a single node, as a filesystem directory. YugabyteDB tablespaces are global and define a geographical part of a cluster that spans multiple data centers. Each tablespace can set its own replication factor and multiple <strong>placement blocks</strong> mapped to <strong>cloud providers, regions, and zones</strong> in the cloud, or to racks and data centers on-premises. It can additionally define a preference for the Raft leader placement to reduce latency. Here is an <a href="https://dev.to/yugabyte/distributing-data-across-distant-locations-with-table-geo-partitioning-d0d">example</a>, and a sketch follows below.</p>
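<p>As a hedged sketch of such a geo-placement tablespace, with assumed cloud, region, and zone names, attached to a list partition of the people table shown in the next example:</p><pre>create tablespace eu_tablespace with (replica_placement=&#39;{<br>  &quot;num_replicas&quot;: 3, &quot;placement_blocks&quot;: [<br>    {&quot;cloud&quot;:&quot;aws&quot;,&quot;region&quot;:&quot;eu-west-1&quot;,&quot;zone&quot;:&quot;eu-west-1a&quot;,&quot;min_num_replicas&quot;:1},<br>    {&quot;cloud&quot;:&quot;aws&quot;,&quot;region&quot;:&quot;eu-west-1&quot;,&quot;zone&quot;:&quot;eu-west-1b&quot;,&quot;min_num_replicas&quot;:1},<br>    {&quot;cloud&quot;:&quot;aws&quot;,&quot;region&quot;:&quot;eu-west-1&quot;,&quot;zone&quot;:&quot;eu-west-1c&quot;,&quot;min_num_replicas&quot;:1}]}&#39;);<br>create table people_eu partition of people<br>  for values in (&#39;FR&#39;,&#39;DE&#39;) tablespace eu_tablespace;</pre>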
<p>As all PostgreSQL partitioning methods are available, partitioning by hash can also add <strong>modulo-based hashing</strong> on top of the consistent hash from automatic sharding. I described it <a href="https://dev.to/yugabyte/hashhash-partitioningsharding-256a">here</a>, but it is rarely needed.</p><p>With YugabyteDB, all partitioning methods can be combined with all sharding techniques. Here is an example where a table of people is partitioned by country to store them in specific regions, and each partition is distributed by hash across the availability zones of its region:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Qd_IzKBB-KnjEUgk" /><figcaption>create table people (primary key(id hash, country asc), id uuid, country char(2), name text) partition by list (country);</figcaption></figure><h3>Oracle Distribution Methods</h3><p>There are four methods described in the <a href="https://docs.oracle.com/en/database/oracle/oracle-database/23/shard/data-distribution-methods.html#GUID-3B07D91C-CEAA-4170-A94B-ACF47BEE617B">23c documentation</a>. They were added through 12c, 19c, and 23c on top of the existing partitioning features and the Global Data Services coordinator.</p><h3>System-Managed Sharding</h3><p>Oracle’s System-Managed Sharding is the equivalent of YugabyteDB hash sharding. With YugabyteDB, defining the HASH function in the primary key or index key definition is sufficient because the distribution is built into the key-value distributed storage (DocDB).</p><p>With Oracle, you define the partition key with PARTITION BY CONSISTENT HASH. Even if sharding is automatic, it has to map to the traditional storage attributes (databases, tablespaces, extents, and blocks), and you must additionally create a TABLESPACE SET to create a tablespace on each node. Each node is a complete Oracle Database.</p><p>As far as I know, this is the only automatic method in 23c: there is no equivalent of Range Sharding that can be split automatically when the table grows. System-Managed Sharding is hash-only and can be used only for high-cardinality columns that are not queried by range.</p><h3>User-Defined Sharding</h3><p>Oracle’s User-Defined Sharding is the equivalent of YugabyteDB range or list partitioning. Partitions are assigned to tablespaces that define their location in a subset of the cluster. With YugabyteDB, this location is a set of placement blocks defining the replication factor and the nodes (cloud provider, cloud region, availability zone) where Raft leaders and followers can be placed.</p><p>With Oracle, you define each tablespace with a SHARDSPACE that you must configure in GDSCTL to map to the nodes (shards), because each of those nodes is a monolithic CDB (Container Database).</p><h3>Directory-Based Sharding</h3><p>Oracle’s Directory-Based Sharding has no direct equivalent in YugabyteDB because it requires a directory table to store the mapping between column values and partitions. To scale linearly, YugabyteDB avoids such a central table. The use cases fall into other YugabyteDB methods (range sharding for uneven data distribution, list partitioning to group multiple key values, range sharding on additional columns for a custom policy). If you use Directory-Based Sharding in Oracle and move to YugabyteDB, you should look at what you wanted to achieve with it. There’s a good chance that automatic Range Sharding is the solution.</p>
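<p>For illustration, here is a hedged sketch of what the first two Oracle methods can look like in DDL, simplified from the documented patterns (the tables, the tablespace set, the per-partition tablespaces, and their shardspace mapping in GDSCTL are my assumptions):</p><pre>-- System-managed: consistent hash, chunks spread across all shards<br>CREATE SHARDED TABLE customers<br>( cust_id NUMBER NOT NULL,<br>  name    VARCHAR2(100),<br>  CONSTRAINT customers_pk PRIMARY KEY (cust_id) )<br>PARTITION BY CONSISTENT HASH (cust_id)<br>PARTITIONS AUTO<br>TABLESPACE SET ts_set_1;<br><br>-- User-defined: list partitions pinned to tablespaces, mapped to shardspaces<br>CREATE SHARDED TABLE accounts<br>( account_id NUMBER NOT NULL,<br>  country    VARCHAR2(2) NOT NULL,<br>  CONSTRAINT accounts_pk PRIMARY KEY (country, account_id) )<br>PARTITION BY LIST (country)<br>( PARTITION p_europe  VALUES (&#39;FR&#39;,&#39;DE&#39;) TABLESPACE ts_eu,<br>  PARTITION p_america VALUES (&#39;US&#39;,&#39;CA&#39;) TABLESPACE ts_us );</pre>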
<h3>Composite Sharding</h3><p>Oracle’s Composite Sharding is the equivalent of combining YugabyteDB partitioning and sharding. With YugabyteDB, each partition declared with range, list, or hash partitioning in the query layer (PostgreSQL) is like a table for the storage layer, where sharding applies to the key, so all combinations are possible.</p><p>With Oracle, you have to declare PARTITION with SHARDSPACE for system-managed partitioning and PARTITION SET with TABLESPACE SET for user-defined partitioning.</p><p>This is different from sub-partitioning, which combines multiple user-defined partitioning methods. In YugabyteDB, as in PostgreSQL, because a partition is like a table, you can do the same by partitioning a partition, but this should rarely be needed given that automatic sharding provides both Hash and Range.</p><h3>Quick Comparison</h3><p>It isn’t easy to compare the sharding methods between two different architectures.</p><ul><li><strong>Oracle Globally Distributed Database</strong> adds distribution and replication on top of a set of <strong>monolithic databases</strong>.</li><li><strong>YugabyteDB</strong> was designed with built-in sharding in the transaction and storage layer and <strong>PostgreSQL on top</strong> of it to add all SQL features.</li></ul><p>When comparing current versions, Oracle has more possibilities in its legacy partitioning, like operations to merge and split user-defined partitions. Some can be used to work around the lack of automatic range sharding, which is a must for SQL applications with range queries and is implemented in all Distributed SQL databases.</p><p>For migrations, you should look at the requirements for sharding (to scale data storage and processing) and partitioning (for geo-distribution). Both databases have their solutions, with different operational complexity. You do more with legacy partitioning methods in Oracle and more with automatic sharding methods in Distributed SQL databases. You can do both in YugabyteDB.</p><p>Oracle Database Sharding uses monolithic databases to store parts of a global one, with its well-known proprietary RDBMS and the coordination of Global Data Services. YugabyteDB is a new database that is horizontally scalable, open-source, and PostgreSQL-compatible. It also uses proven technology (PostgreSQL, RocksDB, Apache Kudu) but with a different architecture (Distributed SQL).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Isolation Levels — part XII: To go further]]></title>
            <link>https://franckpachot.medium.com/isolation-levels-part-xii-to-go-further-2617e43bf73d?source=rss-e38b355b06c8------2</link>
            <guid isPermaLink="false">https://medium.com/p/2617e43bf73d</guid>
            <dc:creator><![CDATA[Franck Pachot]]></dc:creator>
            <pubDate>Mon, 18 Dec 2023 18:32:50 GMT</pubDate>
            <atom:updated>2023-12-22T09:22:53.511Z</atom:updated>
            <content:encoded><![CDATA[<h3>Isolation Levels — part XII: To go further</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*Wff-fgqaqZzMB9SR.jpg" /></figure><p>SQL isolation levels are typically characterized by their effects, such as anomalies or phenomena, or by their implementation, such as lock duration. However, this approach doesn’t provide much guidance to developers on when to use each level, and that’s what I tried to address in this series.</p><p>ANSI SQL does not describe this topic accurately. <br> Here’s a more detailed explanation of the issue:<br><a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf">A Critique of ANSI SQL Isolation Levels</a></p><p>Once correctly described, those anomalies can be tested, and Martin Kleppmann has created a testing suite for them: <a href="https://github.com/ept/hermitage">https://github.com/ept/hermitage</a>.</p><p>The complete description of isolation levels is more complex than what has been presented here. <br> Here is a comprehensive description:<br><a href="https://www.kunxi.org/notes/Distributed_System/Consistency_Models/">Consistency Models — Kun Xi</a></p><p>You may wonder how Oracle can enforce referential integrity without row share locks or a serializable isolation level. The magic relies on using the index on the foreign key as a range lock. <br> I put many details in this old presentation:<br><a href="https://prezi.com/uzdd5ttg4cu0/indexing-foreign-keys-in-oracle/">Indexing Foreign Keys in Oracle</a></p><p>YugabyteDB has one of the most extensive implementations available, with all levels like PostgreSQL, but it additionally solves the Read Committed inconsistency with statement restarts, like Oracle. <br> Here is the documentation:<br><a href="https://docs.yugabyte.com/preview/explore/transactions/isolation-levels/">Isolation levels | YugabyteDB Docs</a></p><p>In this final post of the series, I want to make it clear that my aim when comparing different database implementations is to understand them better. I am not trying to determine which database is better or worse. All of the databases mentioned in this series are used to run critical OLTP applications. For instance, some people like to make jokes about Oracle, but the lack of true Serializable doesn’t affect the consistency of existing applications in any way. To avoid conflicts, applications written for Oracle use explicit locking such as SELECT FOR UPDATE, LOCK TABLE, and DBMS_LOCK. Remember that explicit locking was ignored by the original description of transaction isolation.</p><p><em>Originally published at </em><a href="https://dev.to/franckpachot/isolation-levels-part-xii-to-go-further-n89"><em>https://dev.to</em></a><em> on December 18, 2023.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Isolation Levels — part XI: Read Uncommitted]]></title>
            <link>https://franckpachot.medium.com/isolation-levels-part-xi-read-uncommitted-4da93e37b037?source=rss-e38b355b06c8------2</link>
            <guid isPermaLink="false">https://medium.com/p/4da93e37b037</guid>
            <dc:creator><![CDATA[Franck Pachot]]></dc:creator>
            <pubDate>Wed, 13 Dec 2023 21:56:48 GMT</pubDate>
            <atom:updated>2023-12-22T09:22:14.928Z</atom:updated>
            <content:encoded><![CDATA[<h3>Isolation Levels — part XI: Read Uncommitted</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*ZRiVPkgwbksEVqVp.jpg" /></figure><p>The <strong>Read Uncommitted</strong> isolation level is the one that <strong>allows dirty reads</strong>: reads of changes made by other transactions that have not yet been committed, and that should normally not be visible to other transactions. To prevent dirty reads, <strong>non-MVCC</strong> databases are forced to lock the rows being read, to ensure that uncommitted transactions are not currently modifying them. These read locks can potentially block the application, for instance when a DBA is counting the rows of a large table. Non-MVCC databases had to allow dirty reads for such operations, even if they returned <strong>inconsistent results</strong>.</p><p>With modern databases that use MVCC, you can safely ignore this isolation level. <strong>MVCC provides a consistent read time without relying on read locks</strong>. When a read encounters an uncommitted change, held with a write lock, it simply reads the last committed version before the read time. The Read Committed isolation level offers the same level of concurrency while avoiding the exposure of uncommitted changes. In PostgreSQL and YugabyteDB, Read Committed is used when you set Read Uncommitted. It exists for SQL compatibility, but you should never have to set it.</p><p><em>Originally published at </em><a href="https://dev.to/franckpachot/isolation-levels-part-xi-read-uncommitted-36fi"><em>https://dev.to</em></a><em> on December 13, 2023.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Isolation Levels — part X: Non-Transactional Writes]]></title>
            <link>https://franckpachot.medium.com/isolation-levels-part-x-non-transactional-writes-c90e6a7397d3?source=rss-e38b355b06c8------2</link>
            <guid isPermaLink="false">https://medium.com/p/c90e6a7397d3</guid>
            <dc:creator><![CDATA[Franck Pachot]]></dc:creator>
            <pubDate>Wed, 13 Dec 2023 21:45:29 GMT</pubDate>
            <atom:updated>2023-12-22T09:21:40.951Z</atom:updated>
            <content:encoded><![CDATA[<h3>Isolation Levels — part X: Non-Transactional Writes</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*bQAKL7QNQCfB3O_b.jpg" /></figure><p>The previous <strong>isolation levels</strong> described in this series focused on ensuring the consistency of <strong>read operations</strong> and maintaining the <strong>read state</strong> from the time of reading to the commit time. Although modern databases with MVCC (Multi-Version Concurrency Control) allow for some level of consistency in read operations, the modified rows must still be locked until the end of the transaction. Successful transactions must appear as if all reads and writes happened instantaneously and atomically at the commit time.</p><p>In <strong>YugabyteDB</strong>, locks for modified rows are stored in the <strong>IntentsDB</strong> with the new version. This atomicity is achieved through a single status change in the distributed transaction table. All sessions filter the committed changes when reading the IntentsDB by checking the commit status of their transaction. The committed changes are applied later, asynchronously, to the <strong>RegularDB</strong> in the background and are then deleted from the IntentsDB. This process eliminates the need to keep reading the intents and transaction statuses in addition to the versions stored in the RegularDB.</p><p>In some scenarios, such as when you are uploading large amounts of data into a table that is not yet accessed by the application, you might not need this visibility atomicity. In this case, you can choose to consider each row visible as soon as it is written, even before the commit. Those writes escape the current transaction’s visibility and are <strong>non-transactional</strong>. By doing this, <strong>bulk loading</strong> becomes faster as it can write directly to the RegularDB. This behavior is activated at the session level with yb_disable_transactional_writes and effectively modifies the write time. With this optimization, the SQL database can be as fast as a NoSQL database for fast data ingest, with all ACID guarantees once the load is completed.</p><p>This optimization does not define an isolation level but affects write visibility. It is important to mention it when discussing transaction isolation and race conditions because, as with isolation levels, performance can be higher when the application is aware of possible concurrent-transaction anomalies.</p><p>I explained how to use non-transactional writes in YugabyteDB, but did you know that all databases employ non-transactional writes? For example, Oracle and PostgreSQL use them to update sequences. A sequence stores its last value in a table. When you read the next value, it updates it to a higher value. If this were transactional, a rollback would also roll back this update. However, this is not how it works: for higher concurrency, the update of the sequence is non-transactional. The update is immediately visible to other sessions, and no lock is held, even while the transaction continues.</p>
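<p>Here is a minimal sketch of this behavior, in PostgreSQL syntax with an assumed sequence name (the returned values are for illustration):</p><pre>create sequence order_seq;<br>begin;<br>select nextval(&#39;order_seq&#39;);  -- returns 1<br>rollback;<br>select nextval(&#39;order_seq&#39;);  -- returns 2: the sequence update survived the rollback</pre>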
<p>YugabyteDB extends this possibility and allows users to disable transactional writes to speed up the operations that can bypass ACID isolation.</p><p><em>Originally published at </em><a href="https://dev.to/yugabyte/isolation-levels-part-x-non-transactional-writes-1497"><em>https://dev.to</em></a><em> on December 13, 2023.</em></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Isolation Levels — part IX: Read Committed]]></title>
            <link>https://franckpachot.medium.com/isolation-levels-part-ix-read-committed-d48f62f7940f?source=rss-e38b355b06c8------2</link>
            <guid isPermaLink="false">https://medium.com/p/d48f62f7940f</guid>
            <dc:creator><![CDATA[Franck Pachot]]></dc:creator>
            <pubDate>Sun, 10 Dec 2023 17:40:51 GMT</pubDate>
            <atom:updated>2023-12-22T09:21:07.086Z</atom:updated>
            <content:encoded><![CDATA[<h3>Isolation Levels — part IX: Read Committed</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*JKH8WdhvDpE2SBGn.jpg" /></figure><p>The lowest isolation level in MVCC databases is Read Committed, which is commonly the default setting. However, it is also possibly the least understood and the least database-agnostic. As the name suggests, it only reads committed data but allows <strong>all types of anomalies</strong> except dirty reads.</p><p>So, does using Read Committed corrupt your database? Not if you understand it and manage race conditions yourself. MVCC databases typically allow concurrent reads and writes without locking the data for reads by default. However, in certain scenarios, it may be necessary to use <strong>explicit locking</strong> to ensure data consistency. For example, if you are concerned about lost updates, you can use the SELECT FOR SHARE or SELECT FOR UPDATE commands to lock the rows you’ve read. This provides protection similar to the Cursor Stability or Repeatable Read isolation levels, as it prevents UPDATE or DELETE operations on the read set, but with a reduced scope, on a statement-by-statement basis. To prevent other anomalies, such as phantom reads, you can use LOCK TABLE to prevent new insertions from altering the read state, since you cannot lock a row that doesn’t exist yet. Some databases also provide an API for custom locks, like PostgreSQL advisory locks.</p><p>What is the advantage of Read Committed over Repeatable Read? An MVCC database can roll back and restart a statement at the statement level, avoiding the need for the application to handle serialization errors.</p><p>Every database is unique when it comes to transparent restarts and explicit locking.</p><p><strong>Oracle</strong> doesn’t offer a SELECT FOR SHARE option that blocks writers while allowing other readers to access the data. Instead, it uses SELECT FOR UPDATE, which has a lower level of concurrency as readers can block each other. On the other hand, <strong>PostgreSQL</strong> and <strong>YugabyteDB</strong> provide shared and exclusive row locks, which enable more efficient data access and better concurrency control.</p><p>In case of a conflict between the read state (using MVCC) and the write state (the current state), <strong>Oracle</strong> or <strong>YugabyteDB</strong> can roll back the statement to an implicit savepoint and restart it to ensure a consistent result based on a more recent read time, all done seamlessly and transparently.</p><p>In the same situation, <strong>SQL Server</strong> with READ_COMMITTED_SNAPSHOT implements MVCC for Read Committed, but it locks the read state instead of restarting, which means that readers still block writers. More details can be found at <a href="https://www.dbi-services.com/blog/how-sql-server-mvcc-compares-to-oracle-and-postgresql/">https://www.dbi-services.com/blog/how-sql-server-mvcc-compares-to-oracle-and-postgresql/</a>.</p><p>When using Read Committed in <strong>PostgreSQL</strong>, inconsistencies can arise when there is a conflict during a write operation. In such cases, if a row has been modified since it was last read, PostgreSQL re-reads the row to avoid corrupting it. However, this re-read is based on a new time, which can be inconsistent with the previous reads. I think the main reason why PostgreSQL doesn’t roll back and restart the statement is that it would require a savepoint before each statement, and savepoints do not scale well in PostgreSQL.</p><p>To ensure result consistency, <strong>YugabyteDB</strong> and <strong>Oracle</strong> follow a different approach. Instead of re-reading the row, they roll back and restart the entire statement. This ensures that the entire dataset reflects the same state from the new read time.</p><p><strong>YugabyteDB</strong> implements a read restart to ensure statement-level consistency without blocking writes, plus SELECT FOR SHARE/UPDATE for explicit locking, providing a powerful Read Committed isolation level.</p><p>The main difference between Read Committed and Repeatable Read in MVCC databases lies in the read time. In Read Committed, the read time is the start of the statement, while in Repeatable Read and higher levels, it is the same for the whole transaction. Having a different read time for each statement doesn’t protect against anomalies in complex transactions, but it allows more transparent statement restarts: the database can roll back a statement (to an implicit savepoint taken before it) and restart it transparently with a different read time.</p><p>In higher levels, where the read time must be the beginning of the transaction, the entire transaction must be rolled back and restarted. The database cannot do this on its own, as it lacks knowledge of what else the application has done during the transaction. Therefore, to protect against anomalies with higher isolation levels, an MVCC database must raise a serialization error when a conflict is detected. This allows the application to retry the transaction itself.</p><p>This provides a clue for optimizing Read Committed transactions: run the entire business transaction as a single statement with WITH and RETURNING clauses instead of multiple statements.</p>
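<p>For example, here is a minimal sketch of a money transfer written as one statement (the accounts table is my assumption), so the whole business transaction runs with a single read time and can be restarted transparently:</p><pre>with debit as (<br>  update accounts set balance = balance - 100 where id = 1<br>  returning id, balance<br>), credit as (<br>  update accounts set balance = balance + 100 where id = 2<br>  returning id, balance<br>)<br>select * from debit union all select * from credit;</pre>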
<p>Here are the characteristics of the Read Committed isolation level in YugabyteDB (when --yb_enable_read_committed_isolation=true):</p><ul><li><strong>Read time</strong>: the start of the statement</li><li><strong>Possible anomalies</strong>: all (except dirty reads)</li><li><strong>Performance overhead</strong>: none, except when using explicit locking</li><li><strong>Development constraint</strong>: explicit locking when repeatable reads are necessary</li><li><strong>Default in</strong>: PostgreSQL, Oracle, YugabyteDB</li></ul><p><em>Originally published at </em><a href="https://dev.to/franckpachot/isolation-levels-part-ix-read-committed-3lll"><em>https://dev.to</em></a><em> on December 10, 2023.</em></p>]]></content:encoded>
        </item>
    </channel>
</rss>