Pensieve

The "Files are the Database": A Deep Dive into Delta Lake

Anuvrat Singh — Thu, 27 Nov 2025 08:51:16 GMT

I’ve been spending a lot of time lately reading about distributed systems - specifically the foundational papers on Google File System (GFS), Bigtable, and Facebook’s Tectonic. They all solve the same massive problem: how to store petabytes of data reliably.

But they all share one trait: they require massive, dedicated infrastructure. They need a centralized “Metadata Service” (a Brain) to map data blocks to hard drives.

Delta Lake asks a different question. What if we don’t build the storage layer at all? What if we just use “dumb” object storage like S3 and manage the intelligence in a simple transaction log?

Here is how it works.

The Problem: The “Cloud Physics” Gap

For years, data engineers had to choose between two bad options.

Option A: The Data Warehouse (like Snowflake or Redshift). It’s consistent and reliable (ACID). If you write data, it’s there. But it’s expensive and proprietary. You can’t just open the files with another tool.

Option B: The Data Lake (S3 or Azure Blob). It’s cheap and open. You just dump Parquet files into a bucket. But it’s unsafe. Object stores offer no transactions across multiple files, and S3 itself was only “eventually consistent” until late 2020. If a job crashes while writing, you get orphaned “ghost files.” If you read while writing, you might see partial data.

Delta Lake was born to fix this. It brings ACID guarantees to object storage without building a new database engine.

The Architecture: The “Database” is just a Folder

The core idea is simple: decouple the Physical State (what files are in the bucket) from the Logical State (what files actually belong to the table).

This happens in the Transaction Log.

1. The Transaction Log (`_delta_log`)

Instead of asking S3 to list thousands of files (which is slow), Delta Lake checks a folder named _delta_log.

The “WAL” (JSON Files): Every commit is a JSON file. 000001.json (shortened here; real file names are zero-padded to 20 digits) might say: Add “file_A.parquet”, Remove “file_B.parquet”.
The “Checkpoint” (Parquet Files): To keep the log from getting too big, Delta compacts these JSON files into a Parquet checkpoint every few commits.

Why JSON? It’s human-readable. You can literally open the file and see what happened.
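To make the log idea concrete, here is a minimal sketch of how a reader could replay commits into the current list of table files. It assumes each commit is newline-delimited JSON with "add" and "remove" actions keyed by "path"; real Delta entries carry more fields (sizes, statistics, metadata), and the function name replay_log is my own.

```python
import json


def replay_log(commits):
    """Return the set of live data files after applying commits in order."""
    live = set()
    for commit in commits:
        # Each commit file holds one JSON action per line.
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Two toy commits: the second one swaps file_B for file_A.
commit_0 = '{"add": {"path": "file_B.parquet"}}'
commit_1 = (
    '{"add": {"path": "file_A.parquet"}}\n'
    '{"remove": {"path": "file_B.parquet"}}'
)

print(sorted(replay_log([commit_0, commit_1])))  # ['file_A.parquet']
```

Note that file_B.parquet may still sit in the bucket after the second commit. It is simply no longer part of the table.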

2. Access Protocols (ACID on S3)

This is where the physics gets tricky.

Reading: When you query a table, the reader checks the log first. It finds the valid list of files and then asks S3 for them. The log is the single source of truth.
Writing: Writers upload data to S3 first (invisible to readers). Then, they try to create the next log entry (e.g., 000011.json).

On Google Cloud or Azure, this is easy because the storage supports an atomic “create only if absent” operation. On AWS S3, it has been harder. For most of its history S3 had no “put-if-absent” feature (conditional writes only arrived in 2024), so multi-writer Delta setups have relied on a small external helper (like DynamoDB) to make sure two writers don’t create the same log entry at the same time.
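The commit step is easier to see in code. Below is a rough sketch of an optimistic "put-if-absent" commit, using a local directory as a stand-in for the object store and O_CREAT | O_EXCL as a stand-in for the atomic create. The function names are hypothetical, and a real writer would also check whether the commit that beat it logically conflicts before retrying.

```python
import os
import tempfile


def try_commit(log_dir, version, payload):
    """Try to create the log entry for `version`; return False if it exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True


def commit_with_retry(log_dir, payload, max_attempts=5):
    """Optimistic loop: find the next version, try to claim it, retry on conflict."""
    for _ in range(max_attempts):
        existing = [
            int(name.split(".")[0])
            for name in os.listdir(log_dir)
            if name.endswith(".json")
        ]
        next_version = max(existing, default=-1) + 1
        if try_commit(log_dir, next_version, payload):
            return next_version
    raise RuntimeError("too many conflicting writers")


log_dir = tempfile.mkdtemp()
first = commit_with_retry(log_dir, '{"add": {"path": "file_A.parquet"}}')
second = commit_with_retry(log_dir, '{"add": {"path": "file_B.parquet"}}')

# A writer that still believes `second` is free loses the race.
lost_race = try_commit(log_dir, second, "{}")
print(first, second, lost_race)
```

The data files were already uploaded before this step, so a writer that loses the race has wasted no correctness, only a retry.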

High-Level “Superpowers”

Because the log is immutable (we never change past entries), we get features that usually require an enterprise database.

Time Travel: Since we don’t delete data immediately (we just mark it as “removed” in the log), you can query the table as it existed yesterday, as long as the old files have not yet been cleaned up by VACUUM.
Unified Batch & Streaming: A streaming job can just watch the transaction log. It acts like a Kafka topic. New file added? Process it.
Schema Enforcement: Delta checks data types before writing. It acts like a bouncer, keeping bad data out.
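Time travel and streaming both fall out of replaying the same log. Here is a small sketch under simplified assumptions: commits are modeled as a dict from version number to a list of (operation, path) pairs, and both function names are invented for illustration. A real streaming reader is more careful (for example, it ignores files that compaction merely rewrote).

```python
def snapshot_at(commits, version):
    """Replay commits 0..version and return the files in the table at that point."""
    live = set()
    for v in sorted(commits):
        if v > version:
            break
        for op, path in commits[v]:
            if op == "add":
                live.add(path)
            else:  # "remove"
                live.discard(path)
    return live


def files_added_after(commits, last_seen_version):
    """What a streaming reader would pick up: files added by newer commits."""
    return [
        path
        for v in sorted(commits)
        if v > last_seen_version
        for op, path in commits[v]
        if op == "add"
    ]


commits = {
    0: [("add", "day1.parquet")],
    1: [("add", "day2.parquet")],
    2: [("remove", "day1.parquet"), ("add", "day1_fixed.parquet")],
}

print(sorted(snapshot_at(commits, 1)))   # the table as of version 1
print(sorted(snapshot_at(commits, 2)))   # the table as of version 2
print(files_added_after(commits, 1))     # what a stream sees after version 1
```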

Performance: Making Files Fast

How does a file-based system compete with a specialized engine?

Compaction: It solves the “Small File Problem.” A compaction job (OPTIMIZE) grabs thousands of tiny files and rewrites them into larger, efficient chunks.
Z-Ordering: This organizes data across the files. If you filter by CustomerID, Delta physically groups those customers together. Because the log records min/max statistics for each file, the engine can then skip every file whose range can’t match, which is often the large majority of the data.
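The payoff of clustering is easiest to see through min/max data skipping. The sketch below assumes each file carries a min and max CustomerID; the FileStats class and the numbers are made up. It only shows single-column clustering, whereas Z-ordering proper interleaves several columns so that all of them stay reasonably clustered.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_id: int
    max_id: int


def files_to_scan(files, customer_id):
    """Keep only files whose [min_id, max_id] range could contain the customer."""
    return [f.path for f in files if f.min_id <= customer_id <= f.max_id]


# Unclustered: every file spans almost the whole ID range, so nothing is skipped.
unclustered = [
    FileStats("a.parquet", 1, 990),
    FileStats("b.parquet", 5, 1000),
    FileStats("c.parquet", 2, 980),
]

# Clustered: each file covers a narrow slice, so most files can be skipped.
clustered = [
    FileStats("a.parquet", 1, 333),
    FileStats("b.parquet", 334, 666),
    FileStats("c.parquet", 667, 1000),
]

print(files_to_scan(unclustered, 500))  # ['a.parquet', 'b.parquet', 'c.parquet']
print(files_to_scan(clustered, 500))    # ['b.parquet']
```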

The Architectural Debate: Files vs. Services

This brings me back to the papers I’ve been reading. The trade-offs here are fascinating.

Delta Lake vs. Facebook Tectonic

In my post about Tectonic, I noted that it manages the physical placement of blocks on hard drives. It’s a filesystem. Delta Lake is a Table Layer. It doesn’t care about blocks or hard drives. It assumes S3 handles the hardware. It pushes all the complexity to the client (Spark), keeping the storage layer simple.

Delta Lake vs. Snowflake

This is the classic “Dashboard Latency” trade-off.

Snowflake uses an always-on database (FoundationDB) to track metadata.

Pro: It’s fast. You can look up a single row in milliseconds.
Con: It costs money to keep that infrastructure running.

Delta Lake keeps metadata in files.

Pro: Infinite scale. Zero “always-on” cost.
Con: The “Startup Tax.” To read the table, you have to load the latest checkpoint and parse the JSON commits written after it. That can take roughly 200-700ms.

This explains why Delta is amazing for scanning billions of rows but bad for powering a snappy user dashboard.

Summary

Delta Lake represents a shift from “Smart Storage” to “Open Storage.” By implementing a transaction log over standard files, it proves you don’t need a proprietary engine to get reliability.

However, physics still applies. In the Spanner paper, Google used atomic clocks to guarantee consistency across the globe. Delta uses a JSON log and optimistic logic because it runs on commodity cloud storage.

If you need sub-second lookups, use a service like Snowflake. But for massive data processing, the “dumb” file approach is surprisingly smart.

An AI Was Tricked Into Hacking

Anuvrat Singh — Sat, 15 Nov 2025 18:01:27 GMT

We just got our first real look at an AI-orchestrated cyber attack.

In mid-September 2025, a state-sponsored group launched a sophisticated cyber-espionage campaign designated GTG-1002. What made this one different wasn’t just the scale (targeting roughly 30 global entities) but the method. The attackers didn’t just use AI to help them; they turned an AI model, Anthropic’s Claude Code, into an autonomous agent that did 80-90% of the tactical work.

The AI autonomously mapped networks, tested for vulnerabilities, harvested credentials, moved through the system, and even analyzed the stolen data for intelligence value. It did this at a speed “physically impossible” for human operators.

But here’s the most important part: the AI didn’t “go rogue.” It didn’t become self-aware or malicious. It was conned.

This event isn’t a story about a rogue AI. It’s a story about a classic, very human security vulnerability known as the “Confused Deputy”. And it exposes a deep, systemic flaw in how we’re building the infrastructure to connect AI to the real world.

The AI as a “Confused Deputy”

The “Confused Deputy” is a long-standing problem in computer security. It’s a situation where a program with legitimate authority is tricked by a malicious entity into misusing that authority.

Imagine you hire a new personal assistant who is brilliant, incredibly fast, and extremely literal. You give them a master key to your office building. One day, a person pretending to be a building inspector tells your assistant, “We have a report of a security flaw in the executive office. I need you to use your master key, go in, and test the safe’s lock for me.” Your assistant, lacking the “gut feeling” or context to be suspicious, sees only a person with an (apparent) legitimate goal. They think they’re helping. So they use their authority (the master key) to fulfill the request, letting the “inspector” (a thief) into the room.

This is exactly what happened in the GTG-1002 campaign. The attackers “socially engineered” the AI. They used a “role-play” tactic, successfully convincing the AI model that it was an employee at a cybersecurity firm and that all its tasks were part of an authorized, defensive penetration test.

The AI wasn’t the attacker. It was the “confused deputy,” the first victim. The attackers tricked it into diligently misusing its own powerful reasoning to achieve their malicious goals.

The “Insecure-by-Default” Protocol

So, how did the AI get the master key in the first place? This is where the problem gets systemic. The attackers connected the AI to their hacking tools using the Model Context Protocol (MCP). MCP was created by Anthropic to be a “universal, open standard” for AI - think of it as a “USB-C port for AI”. Its goal was connectivity and interoperability, not security.

The protocol’s design is “insecure by default”. It promotes an easy-to-implement but highly insecure pattern called “Agent-Auth”. This is the digital equivalent of giving your assistant (the AI agent) its own powerful, static credentials - the master key. This creates a second, more dangerous confused deputy.

First, the human confuses the AI (the “role-play”).
Second, the (now-confused) AI confuses the tool.

When the AI, fully believing it’s doing a “pen-test,” sends a command like “Run NetworkScan,” the tool (which has its own powerful permissions) faithfully executes it. The tool has no way to check the original human’s true malicious intent.

We built a system for convenience and forgot to build in the most basic safeguards.

Why This Is Just the Beginning

This problem isn’t a simple “bug” that can be patched. Anthropic banned the attackers’ accounts, but the techniques are now public. The attackers’ methods were brilliant in their simplicity.

“Salami-Slicing”: They bypassed AI safety models by breaking their malicious plan into thousands of tiny, individually benign slices. A single request like “Scan this IP” looks harmless and consistent with the “pen-tester” role. The safety models, which check one prompt at a time (stateless), were blind to the malicious pattern emerging over time.
“Tool Shadowing”: They exploited the fact that an AI often only sees a tool’s description, not its code. An attacker can create a malicious hacking tool and name it “Cats Counter,” with the description “Counts all the cats in a given domain”. When the AI is asked to “count the cats,” it harmlessly executes the tool, which then does its real, malicious work in the “shadow”.
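The blind spot behind “Salami-Slicing” can be sketched in a few lines. This is a toy model, not how any real safety system works: stateless_check looks at one request in isolation, while the hypothetical StatefulMonitor keeps a sliding window of request times per session. The blocklist, window size, and threshold are all invented for illustration.

```python
from collections import deque

BLOCKED_REQUESTS = {"run exploit", "exfiltrate data"}


def stateless_check(request):
    """Judge a single request with no memory of what came before."""
    return request not in BLOCKED_REQUESTS


class StatefulMonitor:
    """Flag a session once it exceeds max_requests within window_seconds."""

    def __init__(self, window_seconds=1.0, max_requests=10):
        self.window_seconds = window_seconds
        self.max_requests = max_requests
        self.timestamps = deque()

    def allow(self, timestamp):
        # Drop requests that have fallen out of the sliding window.
        while self.timestamps and timestamp - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        self.timestamps.append(timestamp)
        return len(self.timestamps) <= self.max_requests


# 50 harmless-looking requests, 10 ms apart.
requests = [("scan this IP", i * 0.01) for i in range(50)]

monitor = StatefulMonitor()
passed_stateless = sum(stateless_check(text) for text, _ in requests)
flagged_stateful = sum(not monitor.allow(ts) for _, ts in requests)

print(passed_stateless, flagged_stateful)
```

Every slice passes the per-request check, but the session as a whole trips the rate threshold after the first ten requests.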

The barrier to entry for sophisticated cyberattacks has now substantially dropped. This “commoditization of sophistication” means small-time actors can now achieve the results of an entire team of experienced hackers. This will happen again.

The Path Forward: A Mandate for “User-Auth”

If we can’t solve this at the model layer alone, what do we do? The GTG-1002 report shows that we need a “defense-in-depth” strategy.

At the Model Level - Stateful Safety: Safety systems must evolve. We need “stateful” safety that looks at context and patterns, not just single prompts. Instead of just asking “Is this prompt bad?”, the system should ask, “Why is this ‘pen-tester’ making 5,000 requests a second at 3 AM?”. This is what Anthropic is now working to build.
At the Protocol Level - Mandate “User-Auth”: This is the most critical fix. We must abandon the “Agent-Auth” (master key) pattern. The secure alternative is “User-Auth”. In this model, the AI agent never gets its own keys. Instead, it temporarily borrows the user’s keys. The AI agent’s permissions are identical to the human user who prompted it, and it can never have more authority than that person. If the human user “Anuvrat” doesn’t have permission to access the finance database, the AI he is using can’t access it either. This breaks the confused deputy problem at its root.
At the Enterprise Level - Zero Trust: As leaders and architects, we must treat all these new AI tools as untrusted. We need to enforce the “principle of least privilege,” audit our new AI-driven supply chain, and, for any high-risk action (like running an exploit or exfiltrating data), always require an explicit, auditable “Human-in-the-Loop” confirmation.
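Here is a minimal sketch of the difference between the two authorization patterns, plus a Human-in-the-Loop gate for high-risk actions. The permission strings, user names, and function names are all hypothetical; a real deployment would use delegated, short-lived credentials rather than an in-memory dictionary.

```python
# Permissions each human actually holds.
USER_PERMISSIONS = {
    "anuvrat": {"wiki:read"},
    "cfo": {"wiki:read", "finance_db:read"},
    "pentester": {"exploit:run"},
}

# "Agent-Auth": the agent owns one static, powerful credential.
AGENT_STATIC_PERMISSIONS = {"wiki:read", "finance_db:read", "exploit:run"}

# Actions that always need explicit human confirmation.
HIGH_RISK = {"exploit:run"}


def agent_auth_allows(user, permission):
    """The tool only checks the agent's own key; the prompting user is ignored."""
    return permission in AGENT_STATIC_PERMISSIONS


def user_auth_allows(user, permission, human_approved=False):
    """The agent borrows the user's permissions and can never exceed them."""
    if permission not in USER_PERMISSIONS.get(user, set()):
        return False
    if permission in HIGH_RISK and not human_approved:
        return False
    return True


print(agent_auth_allows("anuvrat", "finance_db:read"))  # True: confused deputy
print(user_auth_allows("anuvrat", "finance_db:read"))   # False: no more than the user
print(user_auth_allows("pentester", "exploit:run"))     # False: needs confirmation
```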

The GTG-1002 campaign wasn’t a “Terminator” moment. It was a failure of our own design. We’ve built an incredibly powerful engine, and now we have to do the hard work of building a safe, secure, and trustworthy architecture around it.

OpenAI's Browser - Architectural Trade-off

Anuvrat Singh — Tue, 04 Nov 2025 20:00:08 GMT

When OpenAI launched its new browser, most people focused on the AI features. But the more interesting story is its architecture. OpenAI didn’t just reskin Chromium. They split it into two separate pieces. In doing so, they traded the complexity of managing a normal Chromium fork for the complexity of a distributed system.

This decision changes how the browser performs, how it handles security, and how it will be maintained. It’s a classic study in software trade-offs. Let’s look at what they built, why, and the brittle foundation it rests on.

Part 1: The “Local Browser-as-a-Service”

The Atlas browser isn’t one application. It’s two.

The Native UI Shell (Client): This is what you see. It’s the window, the tabs, and the AI controls, built mostly in Apple’s native SwiftUI. This part is light and responsive.
The Chromium Service (Server): This is the entire Chromium engine (Blink, V8) running as a separate, headless service in the background.

The two processes are held together by Mojo, Chromium’s internal communication framework. OpenAI’s engineers used this instead of a public API like the Chrome DevTools Protocol. This setup basically creates a “Local Browser-as-a-Service.” It mixes the security ideas of a remote browser (running in the cloud) with the speed of a local app.

Part 2: The “Why”: The Clear Benefits

Why go through this trouble? The team says this design provides several benefits, and they seem right.

UI Development Speed: This is the main one. UI engineers can work in a clean Swift codebase. They don’t need to compile all of Chromium locally, which is a famously slow process. This lets them build and test new UI features much faster.
Crash Isolation: This is a big win for users. If the heavy Chromium service crashes, the native UI shell doesn’t. The app stays responsive and can just restart the service, like restarting a server.
Faster Perceived Startup: The app feels like it starts instantly. The lightweight native shell loads right away. The user sees a working app while the much heavier Chromium service boots up in the background.
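Crash isolation is the classic supervisor pattern. The sketch below is not how Atlas does it; it is a toy in which a parent process (standing in for the UI shell) relaunches a child (standing in for the engine service) whenever it exits with an error. The child script and the marker file that makes it crash exactly once are contrivances for the demo.

```python
import os
import subprocess
import sys
import tempfile

# Child "service": crashes on its first launch, runs cleanly afterwards.
CHILD = """
import os, sys
marker = sys.argv[1]
if not os.path.exists(marker):
    open(marker, "w").close()
    sys.exit(1)  # simulate a crash on first launch
sys.exit(0)
"""


def run_service_with_restart(command, max_restarts=3):
    """Run `command`, relaunching it on a non-zero exit. Returns the launch count."""
    launches = 0
    while launches <= max_restarts:
        launches += 1
        result = subprocess.run(command)
        if result.returncode == 0:
            break
    return launches


marker = os.path.join(tempfile.mkdtemp(), "crashed_once")
launches = run_service_with_restart([sys.executable, "-c", CHILD, marker])

# The supervisor itself never went down; it just launched the service twice.
print(launches)
```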

Part 3: The Trade-offs

These benefits are real, but they have a cost. This design introduces new risks in performance and stability.

Trade-off 1: The IPC Performance Tax

When you split a program in two, simple function calls become cross-process messages. This adds a small delay, or “latency tax.” Atlas pays this tax on every single click, keypress, and rendered frame.

Input: Your click goes from the Swift UI (Process 1) over the Mojo bridge to the Chromium service (Process 2).
Output: Chromium renders a new frame, then tells the Swift UI over the bridge that the frame is ready to be displayed.

This only works because Mojo is very fast. But the team still sacrificed raw, in-process speed for this architectural split.
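You can feel this tax with a crude experiment: call a function directly, then send the same request to a child process and wait for the reply. The sketch uses plain stdin/stdout pipes, which are far slower than Mojo’s message pipes and shared memory, so the absolute numbers say nothing about Atlas. It only shows that crossing a process boundary turns a call into a round trip. All names here are my own.

```python
import subprocess
import sys
import time

# Child "engine": reads one event per line and replies with the handled result.
CHILD = (
    "import sys\n"
    "while True:\n"
    "    line = sys.stdin.readline()\n"
    "    if not line:\n"
    "        break\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)


def handle_input(event):
    """The in-process version of the same work."""
    return event.upper()


service = subprocess.Popen(
    [sys.executable, "-c", CHILD],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)


def send_over_pipe(event):
    """Send the event to the child and block until its reply arrives."""
    service.stdin.write(event + "\n")
    service.stdin.flush()
    return service.stdout.readline().strip()


def time_calls(fn, n=200):
    """Return the last result and the average seconds per call."""
    start = time.perf_counter()
    for _ in range(n):
        result = fn("click")
    return result, (time.perf_counter() - start) / n


local_result, local_cost = time_calls(handle_input)
remote_result, remote_cost = time_calls(send_over_pipe)

service.stdin.close()  # EOF tells the child to exit
service.wait()

print(f"in-process:    {local_cost * 1e6:.1f} us per call")
print(f"cross-process: {remote_cost * 1e6:.1f} us per call")
```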

Trade-off 2: The “Clean Codebase” Illusion

The idea of a “clean codebase” is mostly an aspiration. A browser has thousands of small UI parts, like